Parallel Execution of Prolog Programs: A Survey
GOPAL GUPTA
University of Texas at Dallas
ENRICO PONTELLI
New Mexico State University
KHAYRI A.M. ALI and MATS CARLSSON
Swedish Institute of Computer Science
and
MANUEL V. HERMENEGILDO
Technical University of Madrid (UPM)
Since the early days of logic programming, researchers in the field realized the potential for ex-
ploitation of parallelism present in the execution of logic programs. Their high-level nature, the
presence of nondeterminism, and their referential transparency, among other characteristics, make
logic programs interesting candidates for obtaining speedups through parallel execution. At the
same time, the fact that the typical applications of logic programming frequently involve irregu-
lar computations, make heavy use of dynamic data structures with logical variables, and involve
search and speculation, makes the techniques used in the corresponding parallelizing compilers
and run-time systems potentially interesting even outside the field. The objective of this article is to
provide a comprehensive survey of the issues arising in parallel execution of logic programming lan-
guages along with the most relevant approaches explored to date in the field. Focus is mostly given
to the challenges emerging from the parallel execution of Prolog programs. The article describes
the major techniques used for shared memory implementation of Or-parallelism, And-parallelism,
and combinations of the two. We also explore some related issues, such as memory management,
compile-time analysis, and execution visualization.
The work of G. Gupta and E. Pontelli is partially supported by NSF Grants CCR 98-75279, CCR
98-20852, CCR 99-00320, CDA 97-29848, EIA 98-10732, CCR 96-25358, and HRD 99-06130.
M. Hermenegildo is partially funded by Spanish Ministry of Science and Technology Grant TIC99-
1151 “EDIPIA” and EU RTD 25562 “Radio Web.”
G. Gupta, E. Pontelli, and M. Hermenegildo are all partially funded by the US-Spain Research
Commission MCyT/Fulbright grant 98059 ECCOSIC.
Authors’ addresses: G. Gupta, Department of Computer Science, University of Texas at Dallas, Box
830688/EC31, Richardson, TX 75083-0688, e-mail: gupta@utdallas.edu; E. Pontelli, Department
of Computer Science, New Mexico State University, Box 30001/CS, Las Cruces, NM 88003, e-mail:
epontell@cs.nmsu.edu; K. A. M. Ali and M. Carlsson, Swedish Institute of Computer Science, Box
1263, SE-164 29, Kista, Sweden, e-mail: {khayri, matsc}@sics.se; M. V. Hermenegildo, Facultad
de Informática, Universidad Politécnica de Madrid, 28660-Boadilla del Monte, Madrid, Spain;
e-mail: herme@fi.upm.es.
Permission to make digital/hard copy of all or part of this material without fee for personal or class-
room use provided that the copies are not made or distributed for profit or commercial advantage,
the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given
that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers,
or to redistribute to lists requires prior specific permission and/or a fee.
© 2001 ACM 0164-0925/01/0700-0472 $5.00
ACM Transactions on Programming Languages and Systems, Vol. 23, No. 4, July 2001, Pages 472–602.
Categories and Subject Descriptors: A.1 [Introductory and Survey]; D.3.4 [Programming Lan-
guages]: Processors; D.3.2 [Programming Languages]: Language Classification—constraint
and logic languages
General Terms: Design, Languages, Performance
Additional Key Words and Phrases: Automatic parallelization, constraint programming, logic pro-
gramming, parallelism, prolog
1. INTRODUCTION
The technology for sequential implementation of logic programming languages
has evolved considerably in the last two decades. In recent years, it has reached
a notable state of maturity and efficiency. Today, a wide variety of commercial
logic programming systems and excellent public-domain implementations are
available that are being used to develop large real-life applications. An excellent
survey of the sequential implementation technology that has been developed
for Prolog is presented by Van Roy [1994].
For years logic programming has been considered well suited for execution
on multiprocessor architectures. Indeed research in parallel logic programming
is vast and dates back to the inception of logic programming itself—one of the
earliest published works being Pollard’s [1981] Ph.D. dissertation. Kowalski [1979]
already mentions the possibility of executing logic programs in parallel in his
seminal book Logic for Problem Solving. There has been a healthy interest
in parallel logic programming ever since, as is obvious from the number of
papers that have been published in proceedings and journals devoted to logic
programming and parallel processing, and the number of advanced tutorials
and workshops organized on this topic in various conferences.
This interest in parallel execution of logic programs arises from these
perspectives:
(1) Continuous research in simple, efficient, and practical ways to make paral-
lel and distributed architectures easily programmable drew the attention to
logic programming, since, at least in principle, parallelism can be exploited
implicitly from logic programs (i.e., parallelism can be extracted from logic
programs automatically without any user intervention). Logic languages
allow the programmer to express the desired algorithm in a way that re-
flects the structure of the problem more directly (i.e., staying closer to the
specifications). This makes the parallelism available in the problem more
accessible to the compiler and run-time system. The relatively clean se-
mantics of these languages also makes it comparatively easy to use formal
methods and prove the transformations performed by the parallelizing com-
piler or run-time system both correct (in terms of computed outputs) and
efficient (in terms of computational cost).1 At the same time, parallelizing
logic programs implies having to deal with challenges such as highly irreg-
ular computations and dynamic control flow (due to the symbolic nature
of many of their applications), the presence of dynamically allocated, com-
plex data structures containing logical variables, and having to deal with
speculation, all of which lead to nontrivial notions of independence and
interesting scheduling and memory management solutions. However, the
high-level nature of the paradigm also implies that the study of paralleliza-
tion issues happens in a better-behaved environment. For example, logical
variables are in fact a very “well-behaved” version of pointers.
(2) There is an everlasting myth that logic programming languages have low
execution efficiency. While it is now clear that modern compilers for logic
programs produce executables with very competitive time and memory per-
formance, this early belief also prompted researchers to use parallelism as
an alternative way of achieving speed. As we show, some of the results ob-
tained fortunately combine well with sequential compilation techniques re-
sulting in real speedups over even the most competitive sequential systems.
(1) Those that add explicit message passing primitives to Prolog, for example,
Delta Prolog [Pereira et al. 1986] and CS-Prolog [Futó 1993]. Multiple
Prolog processes are run in parallel and they communicate with each other
via explicit message passing or other rendezvous mechanisms.
(2) Those that add blackboard primitives to Prolog, for example, Shared Prolog
[Ciancarini 1990]. These primitives are used by multiple Prolog processes
Head :- B1, B2, ..., Bn.
where Head, B1, ..., Bn are atomic formulae (atoms) and n ≥ 0.2 Each clause
represents a logical implication of the form:
∀vi (B1 ∧ · · · ∧ Bn → Head ),
where vi are all the variables that appear in the clause. A separate type of
clause is where Head is the atom false, which is simply written as
:- B1, ..., Bn.
These types of clauses are called goals (or queries). Each atom in a goal is called
a subgoal.
Each atomic formula is composed of a predicate applied to a number of arguments
(terms), denoted p(t1, ..., tn), where p is the predicate name and t1, ..., tn are
the terms used as arguments. Each term can be either a constant (c), a variable (X),
or a complex term (f(s1, ..., sm), where s1, ..., sm are themselves terms and f is
the functor of the term).
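For instance, the following small program (shown here purely for illustration)
consists of two facts and one rule, followed by a goal:

parent(tom, bob).
parent(bob, ann).
grandparent(X, Z) :- parent(X, Y), parent(Y, Z).

:- grandparent(tom, W).

Here parent and grandparent are predicate names, tom, bob, and ann are constants,
and X, Y, Z, and W are variables; the goal asks whether W can be instantiated so that
grandparent(tom, W) follows from the program.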
Execution in logic programming typically involves a logic program P and
a goal :- G1, ..., Gn, and the objective is to verify whether there exists an
assignment σ of terms to the variables in the goal such that (G1 ∧ · · · ∧ Gn)σ
is a logical consequence of P.3 σ is called a substitution: a substitution is an
assignment of terms to a set of variables (the domain of the substitution). If a
variable X is assigned a term t by a substitution, then X is said to be bound and
t is the (run-time) binding for the variable X . The process of assigning values
to the variables in t according to a substitution σ is called binding application.
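Continuing the illustrative example above, the goal :- grandparent(tom, W) succeeds
with the substitution σ = {W/ann}: applying σ (binding application) yields
grandparent(tom, ann), which is a logical consequence of the program; W is thus bound,
and ann is its (run-time) binding.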
Prolog, as well as many other logic programming systems, makes use of SLD-
resolution to carry out the execution of a program. The theoretical view of the
execution of a program P with respect to a goal G is a series of transformations
of a resolvent using a sequence of resolution steps.4 Each resolvent represents
a conjunction of subgoals. The initial resolvent corresponds to the goal G. Each
resolution step proceeds as follows.
—Let us assume that :- A1, ..., Ak is the current resolvent. An element Ai of
the resolvent is selected (selected subgoal) according to a predefined compu-
tation rule. In the case of Prolog, the computation rule selects the leftmost
element of the resolvent.
—If Ai is the selected subgoal, then the program is searched for a renamed
clause (i.e., with “fresh variables”)
Head :- B1, ..., Bh
whose head successfully unifies with Ai . Unification is the process that de-
termines the existence of a substitution σ such that Head σ = Ai σ . If there
2 If n = 0, then the formula is simply written as Head and called a fact.
3 Following standard practice, the notation eσ denotes the application of the substitution σ to the
expression e–that is, each variable X in e is replaced by σ (X ).
4 In fact, the actual execution, as we show later, is very similar to that of standard procedu-
ral languages, involving a sequence of procedure calls, returns, etc., and stack-based memory
management.
are rules satisfying this property then one is selected (according to a selection
rule) and a new resolvent is computed by replacing Ai with the body of the
rule and properly instantiating the variables in the resolvent:
:- (A1, ..., Ai−1, B1, ..., Bh, Ai+1, ..., Ak)σ.
In the case of Prolog, the clause selected is the first one in the program whose
head unifies with the selected subgoal.
—If no clause satisfies the above property, then a failure occurs. Failures cause
backtracking. Backtracking explores alternative execution paths by reducing
one of the preceding resolvents with a different clause.
—The computation stops either when a solution is determined–that is, the
resolvent contains zero subgoals–or when all alternatives have been explored
without any success.
An intuitive procedural description of this process is represented in
Figure 2. The operational semantics of a logic-based language is determined
by the choice of computation rule (selection of the subgoal in the resolvent,
called selectliteral in Figure 2) and the choice of selection rule (selection of the
clause to compute the new resolvent, called selectclause ). In the case of Prolog,
the computation rule selects the leftmost subgoal in the resolvent, while the
selection rule selects the first clause in the program that successfully unifies
with the selected subgoal.
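As a simple illustration of these rules, consider the familiar append/3 predicate
(shown here only as an example) and the goal :- append([1,2], [3], Xs).

append([], L, L).
append([H|T], L, [H|R]) :- append(T, L, R).

Under Prolog's computation and selection rules, each step rewrites the leftmost
subgoal using the first clause whose head unifies with it:

:- append([1,2], [3], Xs)     second clause selected, binding Xs = [1|R]
:- append([2], [3], R)        second clause selected, binding R = [2|R1]
:- append([], [3], R1)        first clause selected, binding R1 = [3]
(empty resolvent)             success, with Xs = [1,2,3]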
Many logic languages (e.g., Prolog) introduce a number of extralogical pred-
icates, which are used, for example, to
(1) perform input/output (e.g., read and write files);
(2) add a limited form of control to the execution (e.g., the cut (!) operator, used
to remove some unexplored alternatives from the computation);
(3) perform metaprogramming operations; these are used to modify the struc-
ture of the program (e.g., assert and retract, add or remove clauses from
the program), or query the status of the execution (e.g., var and nonvar,
used to test the binding status of a variable).
An important aspect of many of these extralogical predicates is that their be-
havior is order-sensitive, meaning that they can produce a different outcome
depending on when they are executed. In particular, this means that they can
potentially produce a different result if a different selection rule or a different
computation rule is adopted.
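As a small illustrative example of order sensitivity, consider the (hypothetical)
program

p(1).
p(2).
main :- p(X), write(X), fail.
main.

A sequential Prolog execution of the goal :- main prints 1 followed by 2, since the
clauses of p are tried in textual order and backtracking drives the failure-driven
loop. An execution adopting a different selection rule (for instance, one trying the
second clause of p first) could print 2 followed by 1, even though the set of computed
solutions is unchanged; this is precisely the sense in which write is order-sensitive.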
In the rest of this work we focus on execution of Prolog programs (unless
explicitly stated otherwise); this means that we assume that programs are
executed according to the computation and selection rule of Prolog. We also
frequently use the term observable semantics to indicate the overall observable
behavior of an execution, that is, the order in which all visible activities of a pro-
gram execution take place (order of input/output, order in which solutions are
obtained, etc.). If a computation respects the observable Prolog semantics, then
this means that the user does not see any difference between such computation
and a sequential Prolog execution of the same program.
(2) External State: It is described by the content of the logical data areas of the
machine:
(a) Heap: Data areas in which complex data structures (lists and Prolog’s
compound terms) are allocated.
(b) Local Stack: (also known as Control Stack). Serves the same purpose
as the control stack in the implementation of imperative languages;
it contains control frames, called environments (akin to the activation
records used in the implementation of imperative languages), which
are created upon entering a new clause (i.e., a new “procedure”) and are
used to store the local variables of the clause and the control information
required for “returning” from the clause.
(c) Choice Point Stack: Choice points encapsulate the execution state for
backtracking purposes. A choice point is created whenever a call having
multiple possible solution paths (i.e., more than one clause successfully
matches the call) is encountered. Each choice point should contain suf-
ficient information to restore the status of the execution at the time
of creation of the choice point, and should keep track of the remaining
unexplored alternatives.
(d) Trail Stack: During an execution variables can be instantiated (they
can receive bindings). Nevertheless, during backtracking these bind-
ings need to be undone, to restore the previous state of execution. In
order to make this possible, bindings that can be affected by this op-
eration are registered in the trail stack. Each choice point records the
point of the trail where the undoing activity needs to stop.
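As a small illustration of the interplay between choice points and the trail, consider
the (hypothetical) program p(a). p(b). and the goal :- p(X). When p(X) is called,
two clauses match, so a choice point recording the remaining alternative is created;
head unification with the first clause binds X to a and, since X was created before
the choice point, the binding is recorded on the trail stack. If the execution later
backtracks to this choice point, the trail entry is used to undo the binding of X
before the second clause is tried, which then yields X = b.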
Prolog is a dynamically typed language; hence it requires type information to
be associated with each data object. In the WAM, Prolog terms are represented
as tagged words; each word contains:
(1) a tag describing the type of the term (atom, number, list, compound struc-
ture, unbound variable); and
(2) a value whose interpretation depends on the tag of the word; for example,
if the tag indicates that the word represents a list, then the value field will
be a pointer to the first node of the list.5
Prolog programs are compiled in the WAM into a series of abstract instruc-
tions operating on the previously described memory areas. In a typical execu-
tion, whenever a new subgoal is selected (i.e., a new “procedure call” is per-
formed), the following steps are taken.
—The arguments of the call are prepared and loaded into the temporary reg-
isters X1, ..., Xn; the instruction set contains a family of instructions, the
“put” instructions, for this purpose.
—The clauses matching the subgoal are detected and, if more than one is avail-
able, a choice point is allocated (using the “try” instructions);
5 Lists in Prolog, as in Lisp, are composed of nodes, where each node contains a pointer to an element
of the list (the head) and a pointer to the rest of the list (the tail).
—The first clause is started: after creating (if needed) the environment for the
clause (“allocate”), the execution requires head unification (i.e., unification
between the head of the clause and the subgoal to be solved) to be performed
(using “get/unify” instructions). If head unification is successful (and as-
suming that the rule contains some user-defined subgoals), then the body of
the clause is executed, otherwise backtracking to the last choice point created
takes place.
—Backtracking involves extracting a new alternative from the topmost choice
point (“retry” will extract the next alternative, assuming this is not the last
one, while “trust” will extract the last alternative and remove the exhausted
choice point), restoring the state of execution associated with such choice
point (in particular, the content of the topmost part of the trail stack is used
to remove bindings performed after the creation of the choice point), and
restarting the execution with the new alternative.
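As a rough illustration of how these instruction families fit together, consider the
(hypothetical) clause below, annotated with the kind of WAM code a compiler would
typically emit for each of its parts; operands and optimizations are omitted from this
sketch.

p(X) :- q(X), r(X).
%  clause selection -> "try"/"retry"/"trust" instructions, emitted only if p/1
%                      has several clauses (choice point management)
%  clause entry     -> "allocate", creating an environment for the clause body
%  head p(X)        -> "get"/"unify" instructions performing head unification
%  call q(X)        -> "put" instructions loading X into the first argument
%                      register (X1), followed by a call to q/1
%  last call r(X)   -> "put" instructions; the environment is then released and
%                      control transfers directly to r/1 (last call optimization)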
The WAM has been designed in order to optimize the use of resources during
execution, improving speed and memory consumption. Optimizations that are
worth mentioning are:
—Last Call Optimization [Warren 1980]: Represents an instance of the well-
known tail-recursion optimization commonly used in the implementation of
many programming languages. Last call optimization allows reuse of the en-
vironment of a clause for the execution of the last subgoal of the clause itself;
—Environment Trimming [Warren 1983; Aı̈t-Kaci 1991]: Allows a progressive
reduction of the size of the environment of a clause during the execution of
the clause itself, by removing the local variables that are not needed in the
rest of the computation.
—Shallow Backtracking [Carlsson 1989]: The principle of procrastination
[Gupta and Pontelli 1997]–postponing work until it is strictly required by
the computation–is applied to the allocation of choice points in the WAM:
the allocation of a choice point is delayed until a successful head unification
has been detected. On many occasions this makes it possible to avoid allocating the
choice point altogether, for instance, when head unification fails or when the clause
that succeeds is the last one defining the predicate.
—Indexing: This technique is used to guide the analysis of the possible clauses
that can be used to solve the current subgoal. The values of the arguments
can be used to prune the search space at run-time. The original WAM
supplies some instructions (“switch” instructions) to analyze the functor of
the first argument and select different clusters of clauses depending on its
value. Since many programs cannot profit from first-argument selection,
more powerful indexing techniques have been proposed, taking into account
more arguments and generating more complex decision trees [Hickey and
Mudambi 1989; Van Roy and Despain 1992; Taylor 1991; Ramesh et al. 1990].
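For instance, for a purely illustrative predicate such as

color(red).
color(green).
color(blue).

a call color(green) can be resolved without creating a choice point if the compiler
emits a first-argument "switch": the switch transfers control directly to the second
clause, the only one whose first argument can match. A call color(X) with X unbound
cannot profit from the switch and must still consider all three clauses through the
usual "try"/"retry"/"trust" sequence.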
birth(day(12),month(1),year(99)) = birth(day(X),month(1),Y)
address(street(hills),number(2),city(cruces)) = address(Z,W,city(cruces))
than one subgoal is present in the resolvent, and (some of) these goals are
executed in parallel. And-parallelism thus permits exploitation of parallelism
within the computation of a single solution to the original goal.
And-parallelism arises in most applications, but is particularly relevant
in divide-and-conquer applications, list-processing applications, various con-
straint solving problems, and system applications.
In the literature it is common to distinguish two forms of and-parallelism
(the descriptions of these types of parallelism are clarified later in the
article).
—Independent and-parallelism (IAP) arises when, given two or more subgoals,
the run-time bindings for the variables in these goals prior to their execution
are such that each goal has no influence on the outcome of the other goals.
Such goals are said to be independent and their parallel execution gives rise
to independent and-parallelism. The typical example of independent goals
is represented by goals that, at run-time, do not share any unbound vari-
able; that is, the intersection of the sets of variables accessible by each goal
is empty. More refined notions of independence, for example, nonstrict in-
dependence, have also been proposed [Hermenegildo and Rossi 1995] where
the goals may share a variable but “cooperate” in creating the binding for the
common variable.
—Dependent and-parallelism arises when, at run-time, two or more goals in
the body of a clause have a common variable and are executed in paral-
lel, “competing” in the creation of bindings for the common variable (or
“cooperating,” if the goals share the task of creating the binding for the
common variable). Dependent and-parallelism can be exploited in varying
degrees, ranging from models that faithfully reproduce Prolog’s observable
semantics to models that use specialized forms of dependent and-parallelism
(e.g., stream parallelism) to support coroutining and other alternative se-
mantics, as in the various committed choice languages [Shapiro 1987; Tick
1995].
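To make the distinction concrete, consider a (hypothetical) conjunction such as
q(X, Y), r(X, Z). If, at run-time, X is bound to a ground term when the conjunction
is reached, the two goals share no unbound variable: they are independent, and their
parallel execution is an instance of independent and-parallelism. If instead X is
still unbound, both goals may contribute bindings for X, and executing them in
parallel is an instance of dependent and-parallelism; a typical specialized form is
stream parallelism, in which one goal acts as a producer that incrementally binds a
shared list while the other consumes the bindings as they appear.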
It has been noted that independent and dependent and-parallelism are sim-
ply the application of the same principle, independence, at different levels of
granularity in the computation model. In fact, parallelism is always obtained
by executing two (or more) operations in parallel if those two operations do not
influence each other in any way (i.e., they are independent); otherwise, parallel
execution would not be able to guarantee correctness and/or efficiency. For inde-
pendent and-parallelism, entire subgoals have to be independent of each other
to be executed in parallel. On the other hand, in dependent and-parallelism
the steps inside execution of each goal are examined, and steps in each goal
that do not interfere with each other are executed in parallel. Thus, indepen-
dent and-parallelism could be considered as macro level and-parallelism, while
dependent and-parallelism could be considered as micro level and-parallelism.
Dependent and-parallelism is typically harder to exploit for Prolog, unless ad-
equate changes to the operational semantics are introduced, as in the case of
committed choice languages [Shapiro 1987].
2.4 Discussion
Or-parallelism and and-parallelism identify opportunities for transforming cer-
tain sequential components of the operational semantics of logic programming
into concurrent operations. In the case of or-parallelism, the exploration of the
different alternatives in a choice point is parallelized, while in the case of and-
parallelism the resolution of distinct subgoals is parallelized. In both cases, we
expect the system to provide a number of computing resources that are capable
of carrying out the execution of the different instances of parallel work (i.e.,
clauses from a choice point or subgoals from a resolvent). These computing
resources can be seen as different Prolog engines that are cooperating in the
parallel execution of the program. We often refer to these computing entities
as workers [Lusk et al. 1990] or agents [Hermenegildo and Greene 1991]. The
term, process, has also been frequently used in the literature to indicate these
computing resources, as workers are typically implemented as separate pro-
cesses. The complexity and capabilities of each agent vary across the different
models proposed. Certain models view agents as processes that are created for
the specific execution of an instance of parallel work (e.g., an agent is created
to specifically execute a particular subgoal), while other models view agents as
representing individual processors, which have to be repeatedly scheduled to ex-
ecute different instances of parallel work during the execution of the program.
We return to this distinction in Section 9.1.
Intuitively, or- and and-parallelism are largely orthogonal to each other, as
they parallelize independent points of nondeterminism in the operational se-
mantics of the language. Thus, one would expect that the exploitation of one
form of parallelism does not affect the exploitation of the other, and it should
be feasible to exploit both of them simultaneously. However, practical experi-
ence has demonstrated that this orthogonality does not easily translate at the
implementation level. For various reasons (e.g., conflicting memory manage-
ment requirements) combined and/or-parallel systems have turned out to be
extremely complicated, and so far no efficient parallel system has been built
that achieves this ideal goal. At the implementation level, there is consider-
able interaction between and- and or-parallelism and most proposed systems
have been forced into restrictions on both forms of parallelism (these issues are
discussed at length in Section 6).
On the other hand, one of the ultimate aims of researchers in parallel logic
programming has been to extract the best execution performance from a given
logic program. Reaching this goal of maximum performance entails exploiting
multiple forms of parallelism to achieve best performance on arbitrary appli-
cations. Indeed, various experimental studies (e.g., Shen and Hermenegildo
[1991, 1996b] and Pontelli et al. [1998]) seem to suggest that there are large
classes of applications that are rich in either one of the two forms of parallelism,
while others offer modest quantities of both. In these situations, the ability to
concurrently exploit multiple forms of parallelism in a general-purpose system
becomes essential.
It is important to underline that the overall goal of research in parallel logic
programming is the achievement of higher performance through parallelism.
and the query ?- f. The calls to t, p, and q are nondeterministic and lead to
the creation of choice points. In turn, the execution of p leads to the call to
the subgoal s(L,M), which leads to the creation of another choice point. The
multiple alternatives in these choice points can be executed in parallel.
A convenient way to visualize or-parallelism is through the or-parallel search
tree. Informally, an or-parallel search tree (or simply an or-parallel tree or a
search tree) for a query Q and logic program L P is a tree of nodes, each with
an associated goal-list, such that:
(1) the root node of the tree has Q as its associated goal-list;
(2) each nonroot node n is created as a result of successful unification of the
first goal in (the goal-list of) n’s parent node with the head of a clause in
LP,
H :- B1, B2, ..., Bn.
The goal-list of node n is (B1, B2, ..., Bn, L2, ..., Lm)θ, if the goal-list of the
parent of n is L1, L2, ..., Lm and θ = mgu(H, L1).
Figure 3 shows the or-parallel tree for the simple program presented above.
Note that, since we are considering execution of Prolog programs, the construc-
tion of the or-parallel tree follows the operational semantics of Prolog: at each
node we consider clauses applicable to the first subgoal, and the children of
a node are considered ordered from left to right according to the order of the
corresponding clauses in the program. That is, during sequential execution the
or-parallel tree of Figure 3 is searched in a depth-first manner. However, if mul-
tiple agents are available, then multiple branches of the tree can be searched
simultaneously.
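As a further illustration of the definition, consider the following small
(hypothetical) program and the query :- p(Z).

p(X) :- q(X).
p(X) :- r(X).
q(a).
q(b).
r(c).

The root of the or-parallel tree has goal-list p(Z) and two children, one per clause
of p/1, with goal-lists q(Z) and r(Z), respectively. The node labeled q(Z) has two
children with empty goal-lists, corresponding to the solutions Z = a and Z = b, while
the node labeled r(Z) has one such child (Z = c). A sequential execution visits these
branches depth-first and left to right, whereas an or-parallel execution may assign
the branches below the root to different workers.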
Or-parallelism manifests itself in a number of applications [Kluźniak 1990;
Shen 1992b; Shen and Hermenegildo 1996b]. It arises while exercising rules of
queens([],_,[]).
queens([X|Xs], Placed, Values):-
delete(X, Values, New_values),
noattack(X, Placed),
queens(Xs,[X|Placed],New_values).
6 That is, the cost associated with updating the state of a worker when it switches from one node of
the tree to another.
a pointer to the linked list of bindings that map to that bucket. When a new
binding is inserted, a new entry is created and inserted at the beginning of the
linked list of that bucket as follows: (i) The next pointer field of the new entry
records the old value of the pointer in the bucket. (ii) The bucket now points
to this new entry. At a branch point each new node is given a new copy of the
buckets (but not a new copy of the lists pointed to by the buckets).
When a favored branch has to look up the value of a conditional variable
it can find it in-place in the value-cell. However, when a nonfavored branch
accesses a variable value it computes the hash value using the address of the
variable and locates the proper bucket in the hash table. It then traverses the
linked list until it finds the correct value. Notice how separate environments
are maintained by sharing the linked list of bindings in the hash tables.
7 Note that the description that follows is largely based on Warren [1987c] rather than on
Warren [1984]. The binding arrays technique in Warren [1984] is not primarily concerned with
or-parallelism but rather with (primarily sequential) non-depth-first search.
8 Most systems, for example, Aurora, initially treat all the variables as conditional, thus placing
0 and 1 of the binding array. The entries stored in the trail in nodes are shown
in square brackets in the figure. Suppose the value of variables M is needed in
node n1; M’s offset stored in the memory location allocated to it is then obtained.
This offset is 1, and is used by worker P1 to index into the binding array, and
obtain M’s binding. Observe that the variable L is unconditionally aliased to X,
and for this reason L is made to point to X. The unconditional nature of the
binding does not require allocation of an entry in the binding array for L.9
To ensure consistency, when a worker switches from one branch (say bi ) of
the or-tree to another (say b j ), it has to update its binding array by deinstalling
bindings from the trail of the nodes that are in bi and installing the correct
bindings from the trail of the nodes in b j . For example, suppose worker P1
finishes work along the current branch and decides to migrate to node n2 to
finish work that remains there. To be able to do so, it will have to update
its binding array so that the state which exists along the branch from the
root node to node n2 is reflected in its environment. This is accomplished by
9 Aurora allocates an entry in the array for each variable, but stores unconditional bindings directly
in the stacks.
making P1 travel up along the branch from node n1 towards the least common
ancestor node of n1 and n2, and removing those conditional bindings from its
binding array that it made on the way down. The variables whose bindings
need to be removed are found in the trail entries of intervening nodes. Once
the least common ancestor node is reached, P1 will move towards node n2, this
time installing conditional bindings found in the trail entries of nodes passed
along the way. This can be seen in Figure 5. In the example, while moving up,
worker P1 untrails the bindings for X and M, since the trail contains references
to these two variables. When moving down to node n2, worker P1 will retrieve
the new bindings for X and M from the trail and install them in the binding
array.
The binding arrays method has been used in the Aurora or-parallel sys-
tem, which is described in more detail in Section 3.5. Other systems have also
adopted the binding arrays method (e.g., the Andorra-I system [Santos Costa
et al. 1991a]). Furthermore, a number of variations on the idea of binding arrays
have been proposed–for example, Paged Binding Arrays and Sparse Binding
Arrays–mostly aimed at providing better support for combined exploitation of
and-parallelism and or-parallelism. These are discussed in Sections 6.3.6 and
6.3.7.
area easily accessible by each worker. This allows the system to maintain a sin-
gle list of unexplored alternatives for each choice point, which is accessed in mu-
tual exclusion by the different workers. A frame is created for each shared choice
point and is used to maintain various scheduling information (e.g., bitmaps
keeping track of workers working below each choice point). This is illustrated
in Figure 6. Each choice point shared by multiple workers has a correspond-
ing frame in the separate shared space. Access to the unexplored alternatives
(which are now located in these frames) will be performed in mutual exclusion,
thus guaranteeing that each alternative is executed by exactly one worker.
The copying of stacks can be made more efficient through the technique of
incremental copying. The idea of incremental copying is based on the fact that
the idle worker could have already traversed a part of the path from the root
node of the or-parallel tree to the least common ancestor node, thus it does not
need to copy this part of the stacks. This is illustrated with an example in Figure 7.
In Figure 7(i) we have two workers immediately after a sharing operation that
has transferred three choice points from worker P1 to P2. In Figure 7(ii), worker
P1 has generated two new (private) choice points while P2 has failed in its
alternative. Figure 7(iii), shows the resulting situation after another sharing
between the two workers; incremental copying has been applied, leading to the
copy of only the two new choice points.
Incremental copying has been proved to have some drawbacks with respect
to management of combined and-parallelism and or-parallelism as well as
management of special types of variables (e.g., attributed variables). Recent
schemes, such as the COWL models (described in Section 6.3.5) overcome many
of these problems.
This model is an evolution of the work on the BC-machine by Ali [1988],
a model where different workers concurrently start the computation of the
query and automatically select different alternatives when choice points are
created. The idea was already present in the Kabu Wake model [Masuzawa
et al. 1986]. In this method, idle workers request work from busy ones, and work
is transmitted by copying environments between workers. The main difference
with respect to the previously described approach is that the source worker
(i.e., the busy worker from where work is taken) is required to “temporarily”
backtrack to the choice point to be split in order to undo bindings before copying
takes place.
Stack copying has found efficient implementation in a variety of sys-
tems, such as MUSE [Ali and Karlsson 1990b] (discussed in more detail in
Section 3.5.2), ECLiPSe [Wallace et al. 1997], and YAP [Rocha et al. 1999b].
Stack copying has also been adopted in a number of distributed memory imple-
mentations of Prolog, such as OPERA [Briat et al. 1992] and PALS [Villaverde
et al. 2000].
exchange of work between workers boils down to the transfer of an oracle from
the busy worker to the idle one. An oracle contains identifiers which describe
the path in the or-tree that the worker needs to follow to reach the unexplored
alternative. A centralized controller is in charge of allocating oracles to idle
agents. The method has attracted considerable attention, but has provided
relatively modest parallel performances on arbitrary Prolog programs. Vari-
ations of this method have been effectively used to parallelize specialized types
of logic programming computations (e.g., in the parallelization of stable logic
programming computations [Pontelli and El-Khatib 2001]). The recomputa-
tion method has also found applications in the parallelization of constraint
logic programming [Mudambi and Schimpf 1994].
An issue that arises in the presence of pruning operators such as cuts and
commits during or-parallel execution is that of speculative work [Hausman
1989, 1990; Ali and Karlsson 1992b; Beaumont and Warren 1993; Sindaha
1992]. Consider the following program,
p(X, Y) :- q(X), !, r(Y).
p(X, Y) :- g(X), h(Y).
...
and the goal,
?- p(A, B).
Executing both branches in parallel, corresponding to the two clauses that
match this goal, may result in unnecessary work, because sequential Prolog
semantics entail that if q(X) succeeds then the second clause for p shall never
be tried. Thus, in or-parallel execution, execution of the second clause is specu-
lative, in the sense that its usefulness depends on the success/failure outcome
of goal q.
It is a good idea for a scheduler designed for an or-parallel system that
supports sequential Prolog semantics to take speculative work into account.
Essentially, such a scheduler should bias all the workers to pick work that is
within the scope of a cut from branches to the left in the corresponding subtree
rather than from branches to the right [Ali and Karlsson 1992b; Beaumont
1991; Beaumont and Warren 1993; Sindaha 1992].
A detailed survey on scheduling and handling of speculative work for
or-parallelism is beyond the scope of this article, and can be found in
Ciepielewski [1992]. One must note that the efficiency and the design of the
scheduler have the biggest bearing on the overall efficiency of an or-parallel sys-
tem (or any parallel system for that matter). We describe two such systems in
Section 3.5, where a significant amount of effort has been invested in designing
and fine-tuning the or-parallel system and its schedulers.
These three operations are assumed to be the only ones available to modify the
“physical structure” of this abstract tree.
The abstraction of an or-parallel execution should account for the various
issues present in or-parallelism (e.g., management of variables and of their
bindings, creation of tasks, etc.). Variables that arise during execution, whose
multiple bindings have to be correctly maintained, can be modeled as attributes
of the nodes in the tree. We consider a set of M variables; if the computation tree
has size N, then it is possible to assume M = O(N). At each node u, three
operations are possible:
—assign a variable X to a node u;
—dereference a variable X at node u; that is, identify the ancestor v of u
(if any) that has been assigned X; and
—alias two variables X1 and X2 at node u; this means that for every node v
ancestor of u, every reference to X1 in v will produce the same result as X2
and vice versa.
The previous abstraction assumed the presence of one variable binding per
node. This restriction can be made without loss of generality; it is always pos-
sible to assume that the number of bindings in each node is bounded by a program-
dependent constant. The problem of supporting these dynamic tree operations
has been referred to as the OP problem [Ranjan et al. 1999].
Lower Bound for OP. It has been shown that any implementation scheme for
the OP problem will take in the worst case an amount of time which is at least
as large as lg N (where N is the number of choice points in the computation
tree).
It is also interesting to point out that the result does not depend on the
presence of the alias operation; this means that the presence of aliasing between
unbound conditional variables during an or-parallel execution does not create
any serious concern (note that this is not the case for other forms of parallelism,
where aliasing is a major source of complexity).
The result essentially states that, no matter how smart the implementation
scheme selected is, there will be cases that will lead to a nonconstant time cost.
This proof confirms the result put forward in Gupta and Jayaraman [1993a].
This nonconstant time nature is also evident in all the implementation schemes
presented in the literature, for example, the creation of the shared frames and
the copying of the choice points in MUSE [Ali and Karlsson 1990b], the instal-
lation of the bindings in Aurora [Lusk et al. 1990], and the management of
timestamps in various other models [Gupta 1994].
Upper Bound for OP. The relevant research on complexity of the OP problem
has been limited to showing that a constant time cost per operation cannot
be achieved in any implementation scheme. Limited effort has been devoted to
supplying a tight upper bound for this problem. Most of the implementation schemes
proposed in the literature can be shown to have a worst-case complexity of O(N)
per operation. Currently, the best result achieved is that the OP problem with no
aliasing can be solved on a pointer machine with a single-operation worst-case time
complexity of O(N^(1/3) (lg N)^k) for a small k.
The lower bound produced, O(lg N ) per operation, is a confirmation and
refinement of the results proposed by Gupta and Jayaraman [1993a], and a
further proof that an ideal or-parallel system (where all the basic operations are
realized with constant-time overhead) cannot be realized. The upper bound, Õ(N^(1/3)),11
even if far from the lower bound, is of great importance, as it in-
dicates that (at least theoretically) there are implementation schemes which
have a worst-case time complexity better than that of the existing models.
Table I compares the worst-case time complexity of performing a sequence of
K operations, on an N node tree, for some of the most well-known schemes for
or-parallelism [Gupta 1994]. The proof of the upper bound result indeed pro-
vides one such model, although it is still an open issue whether the theoretical
superiority of such model can be translated into a practical implementation
scheme.
12 The porting, however, did not involve modifications of the system structure to take full advantage
that branch. Bindings of shared variables must of course be kept private, and
are recorded in the worker’s private binding array. The basic Prolog operations
of binding, unbinding, and dereferencing are performed with an overhead of
about 25% relative to sequential execution (and remain fast constant-time op-
erations). However, during task switching the worker has to update its binding
array by deinstalling bindings as it moves up the tree and installing bindings
as it moves down another branch. This incurred overhead, called migration
cost (or task-switching cost), is proportional to the number of bindings that
are deinstalled and installed. Aurora divides the or-parallel search tree into a
public region and a private region. The public region consists of those nodes
from which other workers can pick up untried alternatives. The private region
consists of nodes private to a worker that cannot be accessed by other workers.
Execution within the private region is exactly like sequential Prolog execution.
Nodes are transferred from the private region of a worker P to the public region
by the scheduler, which does so when another idle worker Q requests work from
worker P .
One of the principal goals of Aurora has been the support of the full
Prolog language. Preserving the semantics of built-in predicates with side
effects is achieved by synchronization: whenever a nonleftmost branch of ex-
ecution reaches an order-sensitive predicate, the given branch is suspended
until it becomes leftmost [Hausman 1990]. This technique ensures that the
order-sensitive predicates are executed in the same left-to-right order as in
a sequential implementation, thus preserving compatibility with these imple-
mentations.
It is often the case that this strict form of synchronization is unnecessary,
and slows down parallel execution. Aurora therefore provides nonsynchronized
variants for most order-sensitive predicates that come in two flavors: the asyn-
chronous form respecting the cut pruning operator, and the completely relaxed
cavalier form. Notably, nonsynchronized variants are available for the dynamic
database update predicates (assert, retract, etc.) [Szeredi 1991].
A systematic treatment of pruning operators (cut and commit) and of spec-
ulative work has proved to be of tremendous importance in or-parallel imple-
mentations. Algorithms for these aspects have been investigated by Hausman
[1989, 1990] and incorporated into the interface and schedulers.
Graphical tracing packages have turned out to be essential for understanding
the behavior of schedulers and parallel programs and finding performance bugs
in them [Disz and Lusk 1987; Herrarte and Lusk 1991; Carro et al. 1993].
Several or-parallel applications for Aurora were studied in Kluźniak [1990]
and Lusk et al. [1993]. The nonsynchronized dynamic database features have
been exploited in the implementation of a general algorithm for solving opti-
mization problems [Szeredi 1991, 1992].
Three schedulers are currently operational. Two older schedulers were writ-
ten [Butler et al. 1988; Brand 1988], but have not been updated to comply with
the scheduler–engine interface.
The speedups obtained by all schedulers of Aurora for a diverse set of bench-
mark programs have been very encouraging. Some of the benchmark programs
contain a significant amount of speculative work, in which speedups are mea-
sured for finding the first (leftmost) solution. The degree of speedup obtained
for such benchmark programs depends on where in the Prolog search tree the
first solution is, and on the frequency of workers moving from right to left to-
wards less speculative work. There are other benchmark programs that have
little or no speculative work because they produce all solutions. The degree of
speedup for such benchmark programs depends on the amount of parallelism
present and on the granularity of parallelism.
More on the Aurora system, and a detailed discussion of its performance re-
sults, can be found in Calderwood and Szeredi [1989], Szeredi [1989], Beaumont
et al. [1991], Beaumont and Warren [1993], and Sindaha [1992]. The binding
array model has also been adapted for distributed shared memory architectures
and implemented in the Dorpp system [Silva and Watson 2000].
3.5.2 The MUSE Or-Parallel Prolog System. The MUSE or-parallel Prolog
system has been designed and implemented on a number of UMA and NUMA
computers (Sequent Symmetry, Sun Galaxy, BBN Butterfly II, etc.) [Ali and
Karlsson 1990b, 1992a,b; Ali et al. 1992; Karlsson 1992]. It supports the full
Prolog language and programs run on it with almost no user annotations. It is
based on a simple extension of the state of the art sequential Prolog implemen-
tation (SICStus WAM [Carlsson et al. 1995]).
The MUSE model assumes a number of extended WAMs (called workers, as in
Aurora), each with its own local address space, and some global space shared by
all workers. The model requires copying parts of the WAM stacks when a worker
runs out of work or suspends its current branch. The copying operation is made
efficient by utilizing the stack organization of the WAM. To allow copying of
memory between workers without the need of any pointer relocation operation,
MUSE makes use of a sophisticated memory mapping scheme. The memory
is partitioned among the different workers; each worker is implemented as a
separate process, and each process maps its own local partition to the same
range of memory addresses, which allows for copying without pointer reloca-
tions. The partitions belonging to other processes are instead locally mapped to
different address ranges. This is illustrated in Figure 9. The partition of worker
1 is mapped at different address ranges in different workers; the local partition
resides at the same address range in each worker.
Workers make a number of choice points sharable, and they get work from
those shared choice points (nodes) by the normal backtracking of Prolog. As
in Aurora, the Muse system has two components: the engine and the sched-
uler. The engine performs the actual Prolog work while the schedulers, work-
ing together, schedule the work between engines and support the sequential
semantics of Prolog.
The first MUSE engine has been produced by extending the SICStus Prolog
version 0.6 [Carlsson et al. 1995]. Extensions are carefully added to preserve the
high efficiency of SICStus leading to a negligible overhead which is significantly
lower than in other or-parallel models.
The MUSE scheduler supports efficient scheduling of speculative and
nonspeculative work [Ali and Karlsson 1992b]. For purposes of scheduling, the
Prolog tree is divided into two sections: the right section contains voluntar-
ily suspended work and the left section contains active work. Voluntarily sus-
pended work refers to the work that was suspended because the worker doing
it found other work to the left of the current branch that was less speculative.
Active work is work that is nonspeculative and is actively pursued by work-
ers. The available workers concentrate on the available nonspeculative work
in the left section. When the amount of work in the left section is not enough
for the workers, some of the leftmost part of the voluntarily suspended section
(i.e., speculative work) will be resumed. A worker doing speculative work will
always suspend its current work and migrate to another node to its left if that
node has less speculative work.
The scheduling strategy for nonspeculative work, in general, is based on
the principle that when a worker is idle, its next piece of work will be taken
from the bottommost (i.e., youngest) node in the richest branch (i.e., the branch
with maximum or-parallel work) of a set of active nonspeculative branches.
When the work at the youngest node is exhausted, that worker will find more
work by backtracking to the next youngest node. If the idle worker cannot
find nonspeculative work in the system, it will resume the leftmost part of the
voluntarily suspended section of the tree.
The MUSE system controls the granularity of jobs at run-time by avoiding
sharing very small tasks. The idea is that when a busy worker reaches a situa-
tion in which it has only one private parallel node, it will make its private load
visible to the other workers only when that node is still alive after a certain
number of Prolog procedure calls. Without such a mechanism, the gains due to
parallel execution can be lost as the number of workers is increased.
A clean interface between the MUSE engine and the MUSE scheduler has
been designed and implemented. It has improved the modularity of the system
and preserved its high efficiency.
Tools for debugging and evaluating the MUSE system have been developed.
The evaluation of the system on Sequent Symmetry and on BBN Butterfly
machines I and II shows very promising results in absolute speed and also in
comparison with results of the other similar systems. The speedups obtained
are near linear for programs with large amounts of or-parallelism. For programs
that do not have enough or-parallelism to keep all available workers busy the
speedups are (near) linear up to the point where all parallelism is exploited.
The speedup neither increases nor decreases thereafter as the number of workers
grows. For programs with no or very low or-parallelism, the speedups
obtained are close to 1 due to very low parallel overheads. More details of the
MUSE system and a discussion of its performance results can be found in ref-
erences cited earlier [Ali and Karlsson 1992a, 1992b; Ali et al. 1992; Karlsson
1992].
MUSE can be considered one of the first commercial parallel logic pro-
gramming systems ever to be developed; MUSE was included for a number
of years as part of the standard distribution of SICStus Prolog [Carlsson et al.
1995].13
4. INDEPENDENT AND-PARALLELISM
Independent and-parallelism refers to the parallel execution of goals that have
no “data dependencies” and thus do not affect each other. To take a simple
example, consider the naïve Fibonacci program shown below.
fib(0, 1).
fib(1, 1).
fib(M, N) :- [ M1 is M - 1, fib(M1, N1) ],
[ M2 is M - 2, fib(M2, N2) ],
N is N1 + N2.
Assuming the execution of this program by supplying the first argument as
input, the two lists of goals, each enclosed within square brackets above, have no
data dependencies among themselves and hence can be executed independently
in parallel with each other. But the last subgoal N is N1 + N2 depends on the
outcomes of the two and-parallel subgoals, and should start execution only after
N1 and N2 get bound.
Similarly to the case of or-parallelism, development of an and-parallel com-
putation can be depicted using a tree structure (and-tree). In this case, each
node in the tree is labeled by a conjunction of subgoals and it contains as many
children as subgoals in the conjunction. Figure 10 illustrates a simple and-tree
for the execution of fib(2,X) with respect to the above program. The dashed
line in Figure 10 is used to denote the fact that it is irrelevant whether the
subgoal X is N1 + N2 is a child of either of the two nodes above.
Independent and-parallelism manifests itself in a number of applications,
those in which a given problem can be divided into a number of independent
subproblems. For example, it appears in divide-and-conquer algorithms, where
the independent recursive calls can be executed in parallel (e.g., matrix multi-
plication, quicksort).
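For instance, in systems that express independent and-parallelism through an explicit
parallel conjunction operator, such as the & operator of &-Prolog (discussed later in
the article), the Fibonacci program above could be written approximately as follows;
the test M > 1 is added here only to keep the sketch self-contained.

fib(0, 1).
fib(1, 1).
fib(M, N) :-
    M > 1,
    M1 is M - 1, M2 is M - 2,
    ( fib(M1, N1) & fib(M2, N2) ),   % independent recursive calls run in parallel
    N is N1 + N2.                    % executed only after both N1 and N2 are bound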
(3) Backward Execution Phase: Deals with steps to be taken when a goal fails,
that is, the operation of backtracking.
(a)                      (b)                      (c)
s1  Y := W+2;            (+ (+ W 2) Z)            Y = W+2,
s2  X := Y+Z;                                     X = Y+Z,

(d)  main :-                      p(X) :- X=a.
         s1  p(X),
         s2  q(X),                q(X) :- X=b, large computation.
         ...                      q(X) :- X=a.
14 To complete the discussion above, note that output-dependencies do not appear in functional or
logic and constraint programs because single assignment is generally used; we consider this a minor
point of difference since one of the standard techniques for parallelizing imperative programs is to
perform a transformation to a single assignment program before performing the parallelization.
in a logic program. The fact that (at least in pure segments of programs) the
order of statements in logic programming does not affect the result15 led in
early models to the proposal of execution strategies where parallelism was ex-
ploited “fully” (i.e., all statements were eligible for parallelization). However,
the problem is that such parallelization often violates the principle of efficiency:
for a finite number of processors, the parallelized program can be arbitrarily
slower than the sequential program, even under ideal assumptions regarding
run-time overheads. For instance, in the last example, reversing the order of the
calls to p and q in the body of main implies that the call q(X) (X at this point is
free, i.e., a pointer to an empty cell) will first enter its first alternative, perform-
ing the large computation. Upon return of q (with X pointing to the constant b)
the call to p will fail and the system will backtrack to the second alternative
of q, after which p will succeed with X=a. On the other hand, the sequential
execution would terminate in two or three steps, without performing the large
computation. The fundamental observation is that, in the sequential execution,
p affects q, in the sense that it prunes (limits) its choices. Executing q before
executing p results in performing speculative choices with respect to the sequen-
tial execution. Note that this is in fact very related to executing conditionals
in parallel (or ahead of time) in traditional languages (note that q above could
also be (loosely) written as “q(X) :- if X=b then large computation else if
X=a then true else fail.”).
Something very similar occurs in case (c) above, which corresponds to a
constraint logic program: while execution of the two constraints in the original
order involves two additions and two assignments (the same set of operations
as those of the imperative or functional programs), executing them in reversed
order involves first adding an equation to the system, corresponding to state-
ment s2, and then solving it against s1, which is more expensive. The obvious
conclusion is that, in general, even for pure programs, arbitrary paralleliza-
tion does not guarantee that the two conditions (correctness and efficiency) are
met.16 We return to the very interesting issue of what notions of parallelism
are appropriate for constraint logic programming in Section 8.
Contrary to early beliefs held in the field, most work in the last decade has
considered that violating the efficiency condition is as much a “sign of depen-
dence” among goals as violating the correctness condition. As a result, interest-
ing notions of independence have been developed that capture these two issues
of correctness and efficiency at the same time: intuitively, independent goals are
those whose run-time behavior, if parallelized, produces the same results as their
sequential execution together with an increase (or, at least, no decrease) in
performance. To sepa-
rate issues better, we discuss the issue under the assumption of ideal run-time
conditions, that is, no task creation and scheduling overheads (we deal with
15 Note that in practical languages, however, termination characteristics may change, but termina-
tion can actually also be seen as an extreme effect of the other problem to be discussed: efficiency.
16 In fact, this is similar to the phenomenon that occurs in or-parallelism, where arbitrarily par-
allelizing branches of the search does not produce incorrect results but, if only one solution to a
problem is sought (or, more generally, in the presence of pruning operators), results in speculative
computations that can have a negative effect on efficiency.
overheads later). Note that, even under these ideal conditions, the goals in (c)
and (d) are clearly dependent using the definition.
A fundamental question then is how to guarantee independence (without
having to actually run the goals, as suggested by the definition given above).
A fundamental result in this context is the fact that, if only the Herbrand con-
straint system is used (as in the Prolog language), a goal or procedure call, q,
cannot be affected by another, p, if it does not share logical variables with it at
the point in time just before execution (i.e., in the substitution represented by
s1). That is, in those cases correctness and efficiency hold and no-slowdown is
guaranteed. In practice, the condition implies that there are no shared free vari-
ables (pointers to empty structure fields) between the run-time data structures
passed to q and the data structures passed to p. This condition is called strict
independence [DeGroot 1984; Hermenegildo and Rossi 1995].17 For example, in
the following program:
main :- X=f(K,g(K)),
Y=a,
Z=g(L),
W=h(b,L),
p(X,Y),
q(Y,Z),
r(W).
p and q are strictly independent, because, at the point in execution just before
calling p (the situation depicted in the right part of the figure), X and Z point to
data structures that do not point to each other, and, even though Y is a pointer
which is shared between p and q, Y points to a fixed value, which p cannot
change (note again that we are dealing with single assignment languages). As
a result, the execution of p cannot affect q in any way and q can be safely run
ahead of time in parallel with p (and, again assuming no run-time overheads,
no-slowdown is guaranteed). Furthermore, no locking or copying of the inter-
vening data structures is required (which helps bring the implementation closer
to the ideal situation). Similarly, q and r are not strictly independent, because
there is a pointer in common (L) among the data structures they have access to
and thus the execution of q could affect that of r.
Unfortunately, it is not always easy to determine independence by simply
looking at one procedure, as above. For example, in the program below,
main :- t(X,Y),
p(X),
q(Y).
it cannot be determined before calling t whether p and q will be independent,
since X and Y are both initially shared with t. On the other hand, after execution of t the situation is unknown
since perhaps the structures created by t (and pointed to by X and Y) do not share
variables. Unfortunately, in order to determine this for sure a global (inter-
procedural) analysis of the program (in this case, to determine the behavior of t)
must be performed. Alternatively, a run-time test can be performed just after
the execution of t to detect independence of p and q. This has the undesirable
side-effect that then the no-slowdown property does not automatically hold,
because of the overhead involved in the test, but it is still potentially useful.
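A sketch of such a run-time test (in the style of the conditional graph
expressions discussed below, with indep/2 standing for a check that the terms
bound to X and Y share no unbound variables):

    main :- t(X, Y),
            ( indep(X, Y) ->
                p(X) & q(Y)
            ;   p(X), q(Y)
            ).

If the test succeeds the two goals run in parallel; otherwise execution falls
back to the sequential conjunction.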
A number of approaches have been proposed for addressing the data depen-
dency detection issues discussed above. They range from purely compile-time
techniques to purely run-time ones. There is obviously a trade-off between the
amount of and-parallelism exploited and data dependency analysis overhead
incurred at run-time: purely compile-time techniques may miss many instances
of independent and-parallelism but incur very little run-time overhead, while
purely run-time techniques may capture maximal independent and-parallelism
at the expense of costly overhead which prevents the system from achieving
the theoretical efficiency results. However, data dependencies cannot always
be detected entirely at compile time, although compile-time analysis tools can
uncover a significant portion of such dependencies. The various approaches are
briefly described below.
(1) Input/Output Modes: One way to overcome the data dependency problem
is to require the user to specify the “mode” of the variables, that is, whether
an argument of a predicate is an input or output variable. Input variables
of a subgoal are known to become bound before the subgoal starts and
output variables are variables that will be bound by the subgoal during its
execution.
Modes have also been introduced in the committed choice languages [Tick
1995; Shapiro 1987] to actually control the and-parallel execution (but lead-
ing to an operational semantics different from that of Prolog).
(2) Static Data Dependency Analysis: In this technique the goal and the pro-
gram clauses are globally analyzed at compile time, assuming a worst case
for subgoal dependencies. No checks are done at run-time. This approach
was first attempted in Chang et al. [1985]. However, the relatively simple
compile-time analysis techniques used, combined with no run-time check-
ing means that a lot of parallelism may be lost. The advantage is, of course,
that no overhead is incurred at run-time.
(3) Run-Time Dependency Graphs: Another approach is to generate the de-
pendency graph at run-time. This involves examining bindings of relevant
variables every time a subgoal finishes executing. This approach has been
adopted, for example, by Conery in his and/or model [Conery and Kibler
1981, 1983; Conery 1987b]. The approach has prohibitive run-time cost,
since variables may be bound to large structures with embedded variables.
The advantage of this scheme is that maximal independent and-parallelism
could be potentially exploited (but after paying a significant cost at run-
time). A simplified version of this idea has also been used in the APEX
system [Lin and Kumar 1988]. In this model, a token-passing scheme is
adopted: a token exists for each variable and is made available to the left-
most subgoal accessing the variable. A subgoal is executable as soon as it
owns the tokens for each variable in its binding environment.
(4) A fourth approach, which is midway between (2) and (3), encapsulates
the dependency information in the code generated by the compiler along
with the addition of some extra conditions (tests) on the variables. In
this way, simple run-time tests suffice to determine dependence.
This technique, called Restricted (or Fork/Join) And-Parallelism (RAP),
was first proposed by DeGroot [1984]. Hermenegildo [1986a] defined a
source-level language (Conditional Graph Expressions—CGEs) in which
the conditions and parallel expressions can be expressed either by the
user or by the compiler. The advantage of this approach is that it makes
it possible for the compiler to express the parallelization process in a
user-readable form and for the user to participate in the process. This
effectively eliminates the dichotomy between manual and automatic paral-
lelization. Hermenegildo, Nasr, Rossi, and García de la Banda formalized
and enhanced the Restricted And-Parallelism model further by providing
backtracking semantics, a formal model, and correctness and efficiency
results, showing the conditions under which the “no-slowdown” property
(i.e., that parallel execution is no slower than sequential execution) holds
[Hermenegildo 1986a, 1987; Hermenegildo and Nasr 1986; Hermenegildo
and Rossi 1995; García de la Banda et al. 2000]. A typical CGE has the form:
(conditions => goal_1 & ... & goal_n)
equivalent to (using Prolog’s if-then-else):
(conditions -> goal_1 & ... & goal_n ; goal_1, ..., goal_n)
where “&” indicates a parallel conjunction, that is, subgoals that can be
solved concurrently (while “,” is maintained to represent sequential con-
junction, i.e., to indicate that the subgoals should be solved sequentially).
The Restricted And-Parallelism model is discussed in more detail in Sec-
tion 4.3. Although Restricted And-Parallelism may not capture all the in-
stances of independent and-parallelism present in the program, in practice
it can exploit a substantial part of it.
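For instance, the recursive clause of the Fibonacci program shown at the
beginning of this section could be annotated with a CGE roughly as follows (a
sketch in the style of &-Prolog, where ground/1 tests that M is bound to a
ground term):

    fib(M, N) :-
        ( ground(M) =>
            ( M1 is M - 1, fib(M1, N1) ) &
            ( M2 is M - 2, fib(M2, N2) ) ),
        N is N1 + N2.

If global analysis can prove that fib/2 is always called with its first
argument ground, the condition can be dropped and an unconditional parallel
expression generated instead.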
Approach (1) differs from the rest in that the programmer has to explicitly
specify the dependencies, using annotations. Approach (4) is a nice compromise
between (2), where extensive compile-time analysis is done to get suboptimal
parallelism, and (3), where a costly run-time analysis is needed to get maximal
parallelism. The annotations of (4) can be generated by the compiler [DeGroot
1987a] and the technique has been shown to be successful when powerful global
analysis (generally based on the technique of abstract interpretation [Cousot
and Cousot 1977, 1992]) is used [Hermenegildo and Warren 1987; Winsborough
and Waern 1988; Muthukumar and Hermenegildo 1990, 1992a; Giannotti and
Hermenegildo 1991; Hermenegildo et al. 1992, 2000; Jacobs and Langen 1992;
Bueno et al. 1994, 1999; Muthukumar et al. 1999; Puebla and Hermenegildo
1999, 1996].
4.1.2 Forward Execution Phase. The forward execution phase follows the
ordering phase. It selects independent goals that can be executed in independent
and-parallel fashion and initiates their execution. The execution continues as nor-
mal sequential Prolog execution until either failure occurs, in which case the
backward execution phase is entered, or a solution is found. It is also possible
that the ordering phase might be entered again during forward execution, for
example, in the case of Conery’s scheme when a nonground term is generated.
Implementation of the forward execution phase is relatively straightforward;
the only major problem is the efficient determination of the goals that are ready
for independent and-parallel execution. Different models have adopted differ-
ent approaches to tackle this issue, and they are described in the following
subsections.
Various works have pointed out the importance of good scheduling strate-
gies. Hermenegildo [1987] showed the relationship between scheduling and
memory management, and provided ideas on using more sophisticated schedul-
ing techniques for guaranteeing a better match between the logical organiza-
tion of the computation and its physical distribution on the stacks, with the
aim of simplifying backtracking and memory performance. This issue has been
studied further in Shen and Hermenegildo [1994, 1996a], where flexible re-
lated scheduling and memory management approaches are studied. Related
research on scheduling for independent and-parallel systems has also been
proposed by Dutra [1994]. In Pontelli and Gupta [1995b] a methodology is de-
scribed which adapts scheduling mechanisms developed for or-parallel systems
to the case of independent and-parallel systems. In the same way in which an
or-parallel system tries to schedule first work that is more likely to succeed,
and-parallel systems will gain from scheduling first work that is more likely
to fail. The advantage of doing this comes from the fact that most IAP sys-
tems support intelligent forms of backtracking over and-parallel calls, which
allow us to quickly propagate failure of a subgoal to the whole parallel call.
Thus, if a parallel call does not have solutions, the sooner we find a failing sub-
goal, the sooner backtracking can be started. Some experimental results have
been provided in Pontelli and Gupta [1995b] to support this perspective. This
notion is also close to the first-fail principle widely used in constraint program-
ming [Haralick and Elliot 1980]. The importance of determining goals that will
not fail and/or are deterministic was studied also in Hermenegildo [1986a],
Pontelli et al. [1996], Hermenegildo and Rossi [1995], and García de la Banda
et al. [2000], and techniques have been devised for detecting deterministic and
nonfailing computations at compile-time [Debray and Warren 1989; Debray
et al. 1997].
should backtrack is determined, the machine state is restored, and forward ex-
ecution of the selected subgoal is initiated.
As mentioned before, Hermenegildo [1986a] showed that, in the presence of
IAP, backtracking becomes considerably more complex, especially if the system
strives to explore the search space in the same order as in a sequential Prolog
execution. In particular:
—IAP leads to the loss of correspondence between logical organization of the
computation and its physical layout; this means that logically contiguous
subgoals (i.e., subgoals that are one after the other in the resolvent) may be
physically located in noncontiguous parts of the stack, or in stacks of different
workers. In addition, the order of subgoals in the stacks may not correspond
to their backtracking order.
This is illustrated in the example in Figure 11. Worker 1 starts with the
first parallel call, making b and c available for remote execution and locally
starting the execution of a. Worker 2 immediately starts and completes the
execution of b. In the meantime, Worker 1 opens a new parallel call, locally
executing d and making e available to other workers. At this point, Worker 2
may choose to execute e, and then c. The final placement of subgoals in the
stacks of the two workers is illustrated on the right of Figure 11. As we can
see, the physical order of the subgoals in the stack of Worker 2 does not match
the logical order. This will clearly create a hazard during backtracking, since
Prolog semantics require first exploring the alternatives of b before those of
e, while the computation of b is trapped on the stack below that of e;
—backtracking may need to continue to the (logically) preceding subgoal, which
may still be executing at the time backtracking takes place.
These problems are complicated by the fact that independent and-parallel
subgoals may have nested independent and-parallel subgoals currently execut-
ing which have to be terminated or backtracked over.
Considerably different approaches have been adopted in the literature to
handle the backward execution phase. The simplest approach, as adopted in
models such as Epilog, ROPM, AO-WAM [Wise 1986; Ramkumar and Kalé
1989], is based on removing the need for actual backtracking over and-parallel
goals through the use of parallelism and solution reuse. For example, as shown
in Figure 12, two threads of execution are assigned to the distinct subgoals, and
they will be used to generate (via local standard backtracking) all solutions to
a and b. The backward execution phase is then replaced by a relatively simpler
and let us consider the possible cases that can arise whenever one of the sub-
goals in the query fails.
precedes the parallel call (b2 ). If qi succeeds and produces a new solution,
then some parallelism can be recovered by allowing parallel recomputation
of the subgoals q j for j > i.
(3) If qi (i ∈ {1, 2, 3}) fails (inside backtracking) during its execution, then
(a) the subgoals q j ( j > i) should be removed;
(b) as soon as the computation of qi−1 is completed, backtracking should
move to it and search for new alternatives.
This is illustrated in Case 3 of Figure 13. In practice all these steps can be
avoided by relying on the fact that the parallel subgoals are independent:
thus failure of one of the subgoals cannot be cured by backtracking on
any of the other parallel subgoals. Hermenegildo suggested a form of semi-
intelligent backtracking, in which the failure of either one of the qi causes
the failure of the whole parallel conjunction and backtracking to b2 .
To see why independent and-parallel systems should support this form of semi-
intelligent backtracking consider the goal:
?- a, b, c, d.
Suppose b and c are independent subgoals and can be executed in indepen-
dent and-parallel. Suppose that both b and c are nondeterminate and have a
number of solutions. Consider what happens if c fails. In normal sequential
execution we would backtrack to b and try another solution for it. However,
since b and c do not have any data dependencies, retrying b is not going to
bind any variables that would help c to succeed. So if c fails, we should back-
track and retry a. This kind of backtracking, based on the knowledge of data
dependence, is called intelligent backtracking [Cox 1984]. As should be obvious,
knowledge about data dependencies is needed for both intelligent backtracking
as well as independent and-parallel execution. Thus, if an independent and-
parallel system performs data dependency analysis for parallel execution, it
should take further advantage of it for intelligent backtracking as well. Note
that the intelligent backtracking achieved may be limited, since, in the example
above, a may not be able to cure failure of c. Execution models for independent
be performed after the last order-sensitive predicate in the goal to the left has
been executed. Given that this property is undecidable in general, it is typically
approximated by suspending the side-effect until the branch in which it appears
is the leftmost in the computation tree (i.e., all the branches on the left have
completed). It also means that intelligent backtracking has to be sacrificed,
because considering again the previous example, if c fails and we backtrack
directly into a, without backtracking into b first, then we may miss executing
one or more extralogical predicates (e.g., input/output operations) that would
be executed had we backtracked into b. A form of intelligent backtracking can
be maintained and applied to the subgoals lying on the right of the failing one.
In the same way as or-parallel systems, these systems also include useful “con-
current” versions of order-sensitive predicates, whose semantics do not require
sequencing. In addition, supporting full Prolog also introduces challenges in
other parts of and-parallel systems, such as, for example, in parallelizing com-
pilers that perform global analysis [Bueno et al. 1996].
The issue of speculative computation also arises in independent and-parallel
systems [Tebra 1987; Hermenegildo and Rossi 1995; García de la Banda et al.
2000]. Given two independent goals a(X), b(Y) that are being executed in and-
parallel, if a eventually fails, then work put in for solving b will be wasted (in
sequential Prolog the goal b will never be executed). Therefore, not too many
resources (workers) should be invested in goals to the right. Once again, it
should be stressed that the design of the work-scheduler is very important for a
parallel logic programming system. Also, and as pointed out before, issues such
as nonfailure and determinism analysis can provide important performance
gains.
4.3.1 Conery’s Model. In this method [Conery and Kibler 1983], a dataflow
graph is constructed during the ordering phase making the producer–consumer
relationships between subgoals explicit. If a set of subgoals has an uninstanti-
ated variable V in common, one of the subgoals is designated as the producer
of the value of V and is solved first. Its solution is expected to instantiate V.
When the producer has been solved, the other subgoals, the consumers, may be
scheduled for evaluation. The execution order of the subgoals is expressed as a
dataflow graph, in which an arc is drawn from the producer of a variable to all
its consumers.
Once the dataflow graph is determined, the forward execution phase ensues.
In this phase, independent and-parallel execution of subgoals that do not have
any arcs incident on them in the dataflow graph is initiated. When a subgoal
is resolved away from the body of a clause (i.e., it is successfully solved), the
corresponding node and all of the arcs emanating from it are removed from
the dataflow graph. If a producer creates a nonground term during execution,
the ordering algorithm must be invoked again to incrementally redraw the
dataflow graph.
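As a small, concrete illustration of the scheme (the program is ours):

    gen([1, 2, 3]).

    double_all([], []).
    double_all([X|Xs], [Y|Ys]) :- Y is 2 * X, double_all(Xs, Ys).

    sum([], 0).
    sum([X|Xs], S) :- sum(Xs, S1), S is S1 + X.

    main(R) :- gen(L), double_all(L, L1), sum(L1, R).

In the body of main/1, gen/1 would be designated the producer of L and
double_all/2 the producer of L1, giving a dataflow graph with arcs from gen to
double_all and from double_all to sum; only gen has no incoming arcs and can
start, and each remaining subgoal becomes executable once the arcs incident on
it have been removed.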
When execution fails, some previously solved subgoal must be solved again to
yield a different solution. The backward execution phase picks the last parent
(as defined by a linear ordering of subgoals, obtained by a depth-first traversal
of the dataflow graph) for the purpose of re-solving.
Note that in this method data dependency analysis for constructing the
dataflow graph has to be carried out every time a nonground term is gener-
ated, making its cost prohibitive.
4.3.2 APEX Model. The APEX (And-Parallel EXecution) model has been
devised by Lin and Kumar [1988]. In this method forward execution is im-
plemented via a token-passing mechanism. A token is created for every new
variable that appears during execution of a clause. A subgoal P is a producer
of a variable V if it holds the token for V. A newly created token for a vari-
able V is given to the leftmost subgoal P in the clause which contains that
variable. A subgoal becomes executable when it receives tokens for all the
uninstantiated variables in its current binding environment. Parallelism is
exploited automatically when there is more than one executable subgoal in a
clause.
The backward execution algorithm performs intelligent backtracking at the
clause level. Each subgoal Pi dynamically maintains a list of subgoals (denoted
as B-list(Pi )) consisting of those subgoals in the clause that may be able to
cure the failure of Pi , if it fails, by producing new solutions. When a subgoal Pi
starts execution, B-list(Pi ) consists of those subgoals that have contributed to
the bindings of the variables in the arguments of Pi . When Pi fails, P j = head(B-
list(Pi )) is selected as the subgoal to which to backtrack. The tail of B-list(Pi ) is
also passed to P j and merged into B-list(P j ) so that if P j is unable to cure the
failure of Pi , backtracking may take place to other subgoals in B-list(Pi ).
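To illustrate both the token-passing and the B-list mechanisms on a small
clause (the clause and facts are ours):

    p(1).
    p(2).
    q(a).
    r(2).

    h :- p(X), q(Y), r(X).

Tokens are created for X and Y when the body of h is entered: p receives the
token for X (it is the leftmost subgoal containing X) and q the token for Y, so
p and q are immediately executable and can run in parallel, while r must wait
for X to be instantiated. If r(1) then fails, B-list(r) contains only p (the
subgoal that contributed the binding of X), so backtracking proceeds directly
to p, bypassing q; p then produces X = 2 and r(2) succeeds.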
This method also has significant run-time costs since the B-lists are created,
merged, and manipulated at run-time. APEX has been implemented on shared
memory multiprocessors for pure logic programs [Lin and Kumar 1988].
In the &-Prolog run-time system, a number of agents (processes) run programs from the code area on the stack sets. All agents are
identical (there is no “master” agent). In general, the system starts allocating
only one stack set. Other stack sets are created dynamically as needed upon ap-
pearance of parallel goals. Also, agents are started and put to “sleep” as needed
in order not to overload the system when no parallel work is available. Several
scheduling and memory management strategies have been studied for the &-
Prolog system [Hermenegildo 1987; Hermenegildo and Greene 1991; Shen and
Hermenegildo 1994].
Performance Results. Experimental results for the &-Prolog system are avail-
able in the literature illustrating the performance of both the parallelizing
compiler and the run-time system. The cost and influence of global analysis
in terms of reduction in the number of run-time tests using the MA3 analyzer
was reported in Hermenegildo et al. [1992]. The number of CGEs generated,
the compiler overhead incurred due to the global analysis, and the result both
in terms of number of unconditional CGEs and of reduction of the number of
checks per CGE were studied for some benchmark programs. These results
suggested that, even for this first generation system, the overhead incurred in
performing global analysis is fairly reasonable and the figures obtained close
to what is possible manually.
Experimental results regarding the performance of the second generation
parallelizing compiler in terms of attainable program speedups were reported
in Codish et al. [1995] and Bueno et al. [1994, 1999] both without global anal-
ysis and also with sharing and sharing + freeness analysis running in the
PLAI framework [Muthukumar and Hermenegildo 1992; Muthukumar et al.
1999]. Speedups were obtained from the run-time system itself and also using
the IDRA system [Fernández et al. 1996], which collects traces from sequen-
tial executions and uses them to simulate an ideal parallel execution of the
same program.18 A much more extensive study covering numerous domains
and situations, a much larger class of programs, and the effects of the three
18 Note that simulations are better than actual executions for evaluating the amount of ideal par-
allelism generated by a given annotation, since the effects of the limited number of processors in
actual machines can be factored out.
4.4.2 The &ACE System. The &ACE [Pontelli et al. 1995, 1996] system
is an independent and-parallel Prolog system developed at New Mexico State
University as part of the ACE project. &ACE has been designed as a next-
generation independent and-parallel system and is an evolution of the PWAM
design (used in &-Prolog). As does &-Prolog, &ACE relies on the execution of
Prolog programs annotated with Conditional Graph Expressions.
The forward execution phase is articulated in the following steps. As soon
as a parallel conjunction is reached, a parcall frame is allocated in a separate
19 The notion of nonstrict independence is described in Section 5.3.3.
20 Performance of such systems ranges from about the same as SICStus to about twice the speed,
depending on the program.
stack, unlike &-Prolog, which allocates parcall frames on the environment
stack; this allows for easier memory management21 (e.g., it facilitates
the use of last-call optimization) and for application of various determinacy-
driven optimizations [Pontelli et al. 1996] and alternative scheduling mecha-
nisms [Pontelli et al. 1996]. Slots describing the parallel subgoals are allocated
in the heap and organized in a (dynamic) linked list, thus allowing their dy-
namic manipulation at run-time. Subgoals in the goal stack (as in the PWAM
model) are replaced by a simple frame placed in the goal stack and pointing to
the parcall frame; this has been demonstrated [Pontelli et al. 1995, 1996] to be
more effective and flexible than actual goal stacking. These data structures are
described in Figure 18.
The use of markers to identify segments of the computation has been re-
moved in &ACE and replaced by a novel technique called stack linearization
which allows linking choice points lying in different stacks in the correct logical
order; this keeps the changes to the backtracking algorithm to a minimum, thus
making backtracking over and-parallel goals very efficient. The
only marker needed is the one that indicates the beginning of the continua-
tion of the parallel call. Novel uses of the trail stack (by trailing status flags in
the subgoal slots) allow the integration of outside backtracking without any
explicit change in the backtracking procedure.
Backward execution represents another novelty in &ACE. Although it re-
lies on the same general backtracking scheme developed in PWAM (the point
backtracking scheme described in Section 4.1.3), it introduces the additional
concept of backtracking independence which allows us to take full advantage
of the semi-intelligent backtracking phase during inside backtracking. Given a
subgoal of the form
?- b, (g1 & g2), a
21 &ACE is built on top of the SICStus WAM, which performs on-the-fly computation of the top-of-the-
stack register. The presence of parcall frames on the same stack creates enormous complications
in the correct management of such a register.
5. DEPENDENT AND-PARALLELISM
Dependent And-Parallelism (DAP) generalizes independent and-parallelism by
allowing the concurrent execution of subgoals accessing intersecting sets of
variables. The “classical” example of DAP is represented by a goal of the form
?- p(X) & q(X),22 where the two subgoals may potentially compete (or cooper-
ate) in the creation of a binding for the unbound variable X.
Unrestricted parallel execution of the above query (in Prolog) is likely to pro-
duce nondeterministic behavior: the outcome will depend on the order in which
the two subgoals access X. Thus, the first aim of any system exploiting depen-
dent and-parallelism is to ensure that the operational behavior of dependent
and-parallel execution is consistent with the intended semantics (in this case,
the observable semantics of sequential Prolog). This amounts to
—making sure that all the parallel subgoals agree on the values given to the
shared variables; and
—guaranteeing that the order in which the bindings are performed does not
lead to any violation of the observable behavior of the program (Prolog
semantics).
It is possible to show that the problem of determining the correct moment in
time when a binding can be performed without violating Prolog semantics is
in general undecidable. The different models designed to support DAP differ
in the approach taken to solve this problem; that is, they differ in how they
conservatively approximate such an undecidable property.
22 As for independent and-parallelism, we use “&” to denote parallel conjunction, while “,” is kept
to indicate sequential conjunctions.
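A typical source of this kind of parallelism is a producer and a consumer
operating on a shared, incrementally constructed list, as in the following
sketch (the predicates are ours):

    main(Sum) :- produce(10, L) & consume(L, Sum).

    produce(0, []).
    produce(N, [N|T]) :- N > 0, N1 is N - 1, produce(N1, T).

    consume([], 0).
    consume([X|T], S) :- consume(T, S1), S is S1 + X.

Here produce/2 incrementally binds L while consume/2 traverses it; a DAP system
must ensure that consume only reads (or waits for) the bindings made by
produce, rather than binding the shared list itself.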
5.1 Issues
Supporting DAP requires tackling a number of issues. These include:
(1) detection of parallelism: Determination of which subgoals should be consid-
ered for DAP execution;
(2) management of DAP goals: Activation and management of parallel
subgoals;
(3) management of shared variables: Validation and control of shared variables
to guarantee Prolog semantics; and
(4) backtracking: Management of nondeterminism in the presence of DAP
executions.
In the rest of this section, we deal with all these issues except for issue 2:
management of subgoals does not present any new challenge with respect to the
management of parallel subgoals in the context of independent and-parallelism.
the end of the unification in order to make the structure “public,” and this
overhead will be encountered in general for every structure built, independently
of whether it will be assigned to a dependent variable.
Another solution has been proposed in Andorra-I [Santos Costa et al. 1996];
in this system, terms that need to be matched with a compound term (i.e., using
the get structure instruction in the WAM) are locked (i.e., a mutual exclusion
mechanism is associated with them) and a special instruction (last) is added by the
compiler at the end of the term construction to release the lock (i.e., terminate
the critical section).
Another approach, adopted in the DASWAM system [Shen 1992b],
consists of modifying the unify and get instructions in such a way that they
always overwrite the next location on the heap with a special value.
Every access to a term inspects this next location to verify whether
the binding has been completed. No explicit locks or other mutual exclusion
mechanisms are required. On the other hand:
—while reading the binding for a dependent variable, every location accessed
needs to be checked for validity;
—an additional operation (pushing an invalid status on the next free
location) is performed during each operation involved in the construction of
a dependent binding; and
—a check needs to be performed during each operation that constructs a term,
in order to determine whether the term has been assigned to a dependent
variable, or, alternatively, the operation of pushing the invalid status is per-
formed indiscriminately during the construction of any term, even if it will
not be assigned to a dependent variable.
Another solution [Pontelli 1997], which does not suffer from most of the
drawbacks previously described, is to have the compiler generate a different
sequence of instructions to handle this kind of situation. The get structure and
get list instructions are modified by adding a third argument:
    get structure ⟨functor⟩ ⟨register⟩ ⟨jump label⟩,
where the ⟨jump label⟩ is simply an address in the program code. Whenever the
dereferencing of the ⟨register⟩ leads to an unbound shared variable, instead
of entering write mode (as in standard WAM behavior), the abstract machine
performs a jump to the indicated address (hjump labeli). The address contains
a sequence of instructions that performs the construction of the binding in a
bottom-up fashion, which allows for the correct atomic execution.
detected. The two most significant proposals where this strategy is adopted are
those made by Tebra [1987] and by Drakos [1989]. They both can be identified
as instances of a general scheme, named optimistic parallelism. In optimistic
parallelism, validation of bindings is performed not at binding time (i.e., the
time when the shared variable is bound to a value), but only when a conflict
occurs (i.e., when a producer attempts to bind a shared variable that had al-
ready been bound earlier by a consumer goal). In case of a conflict, the lower
priority binding (made by the consumer) has to be undone, and the consumer
goal rolled back to the point where it first accessed the shared variable. These
models have various drawbacks, ranging from their highly speculative nature
to the limitations of some of the mechanisms adopted (e.g., labeling schemes to
record binding priorities), and to the high costs of rolling back computations.
Preventive Approaches. Preventive approaches are characterized by the fact
that bindings to shared variables are prevented unless they are guaranteed
not to threaten Prolog semantics.
Performed at the goal level, preventive schemes delay the execution of the
whole subgoal until its execution can no longer affect Prolog semantics. Various mod-
els have embraced this solution:
(1) NonStrict Independent And-Parallelism (NSI) and Other Extended Notions
of Independence: The idea of these extensions of the notion of independence
is to greatly extend the scope of independent and-parallelism while still
ensuring correctness and efficiency/“no-slowdown” of the paralleliza-
tion [Hermenegildo and Rossi 1995; Cabeza and Hermenegildo 1994]. The
simplest concept of nonstrict independence allows execution of subgoals
that have variables in common, provided at most one subgoal can bind
each shared variable.23 This kind of independence cannot be determined
in general a priori (i.e., by inspecting the state of the computation prior
to executing the goals to be parallelized) and thus necessarily requires a
global analysis of the program. However, it is very interesting because it
appears often in programs that manipulate “open” data structures, such as
difference lists, dictionaries, and the like. An example of this is the following
flatten example, which eliminates nestings in lists ([X|Xs] represents the
list whose head is X and whose tail is Xs and [] represents the empty list):
flatten(Xs,Ys) :-
flatten(Xs,Ys,[]).
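The auxiliary predicate flatten/3 can be written with difference lists roughly
as follows (our sketch; the formulation in the original example may differ in
details):

    flatten([], Ys, Ys).
    flatten([X|Xs], Ys, Zs) :-
        flatten(X, Ys, Ys1),
        flatten(Xs, Ys1, Zs).
    flatten(X, [X|Ys], Ys) :-
        atomic(X),
        X \== [].

In the recursive clause the two calls share the variable Ys1, but only the
second call ever binds it to a nonvariable term (for the first call, Ys1 is
simply the open tail of the difference list it constructs), so the two calls
are nonstrictly independent and can be annotated with "&".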
23 The condition used in the case of impure goals is that the bindings of a goal will not affect the
computation of the remaining subgoals to its right.
(2) The basic Andorra model [Haridi 1990; Warren 1987a; Santos Costa et al.
1991a], Parallel NU-Prolog [Naish 1988], Pandora [Bahgat 1993], and
P-Prolog [Yang 1987] are all characterized by the fact that parallel exe-
cution is allowed between dependent subgoals only if there is a guarantee
that there exists at most one single matching clause. In the basic Andorra
model, subgoals can be executed ahead of their turn (“turn” in the sense
of Prolog’s depth-first search) in parallel if they are determinate, that is,
if at most one clause matches the subgoal (the determinate phase). These
determinate goals can be dependent on each other. If no determinate goal
can be found for execution, a choice point is created for the leftmost goal in
the goal list (the nondeterminate phase) and parallel execution of determi-
nate goals along each alternative of the choice point continues. Dependent
and-parallelism is obtained by having determinate goals execute in par-
allel. The different alternatives to a goal may be executed in or-parallel.
Executing determinate goals (on which other goals may be dependent) ea-
gerly also provides a coroutining effect that leads to the narrowing of the
search space of logic programs. A similar approach has been adopted in
Pandora [Bahgat 1993], which represents a combination of the Basic An-
dorra Model and the Parlog committed choice approach to execution [Clark
and Gregory 1986]; Pandora introduces nondeterminism to an otherwise
committed choice language. In Pandora, clauses are classified as either
“don’t-care” or “don’t-know”. As with the basic Andorra model, execution
alternates between the and-parallel phase and the deadlock phase. In the
and-parallel phase, all goals in a parallel conjunction are reduced concur-
rently. A goal for a “don’t-care” clause may suspend on input matching if its
arguments are insufficiently instantiated, as in normal Parlog execution. A
goal for a “don’t-know” clause is reduced if it is determinate, as in the Basic
Andorra Model. When none of the “don’t-care” goals can proceed further and
there are no determinate “don’t-know” goals, the deadlock phase is activated
(Parlog would have aborted the execution in such a case) that chooses one
(1) Committed Choice Languages: We only deal briefly with the notion of com-
mitted choice languages in this article, since they implement a semantics
that is radically different from Prolog. Committed choice languages [Tick
1995] disallow (to a large extent) nondeterminism by requiring the com-
putation to commit to the clause selected for resolution. Committed choice
languages support dependent and-parallel execution and handle shared
variables via a preventive scheme based on the notion of producer and
consumers. Producer and consumers are either explicitly identified at the
source level (e.g., via mode declarations) or implicitly through strict rules
on binding of variables that are external to a clause [Shapiro 1987].
(2) Binding-level nonstrict independence: The application of the gen-
eralized (consistency- and determinacy-based) notions of indepen-
dence [Hermenegildo and Rossi 1995; García de la Banda et al. 2000; García
de la Banda 1994] at the finest granularity level—the level of individual
bindings and even the individual steps of the constraint solver—has been
studied formally in Bueno et al. [1994, 1998]. This work arguably repre-
sents the finest grained and “most parallel” model for logic and constraint
logic programming capable of preserving correctness and theoretical
efficiency proposed to date. While this model has not been implemented
directly, it serves as a theoretical basis for a number of other schemes.
(3) DDAS-based schemes: These schemes offer a direct implementation of
strong Prolog semantics through the notion of producer and consumer of
shared variables. At each point of the execution only one subgoal is allowed
to bind each shared variable (producer), and this corresponds to the leftmost
active subgoal that has access to such a variable. All remaining subgoals
are restricted to read-only accesses to the shared variable (consumers);
each attempt by a consumer to bind an unbound shared variable will lead to
the suspension of the subgoal. Each suspended consumer will be resumed
as soon as the shared variable is instantiated. Consumers may also become
producers if they become the leftmost active computations. This can
happen if the designated producer terminates without binding the shared
variable [Shen 1992b].
Detecting producer and consumer status is a complex task. Different
techniques have been described in the literature to handle this process.
Two major implementation models have been proposed to handle pro-
ducer/consumer detection, DASWAM [Shen 1992b, 1996b] and the filtered-
binding model [Pontelli and Gupta 1997a, 1997b] which are described at
5.4 Backtracking
Maintaining Prolog semantics during parallel execution also means support-
ing nondeterministic computations, that is, computations that can potentially
produce multiple solutions. In many approaches DAP has been restricted to
only those cases where p and q are deterministic [Bevemyr et al. 1993; Shapiro
1987; Santos Costa et al. 1991a]. This is largely due to the complexity of dealing
with distributed backtracking. Nevertheless, it has been shown [Shen 1992b,
1996b] that imposing this kind of restriction on DAP execution may severely
limit the amount of parallelism exploited. The goal is to exploit DAP even in
nondeterministic goals.
Backtracking in the context of DAP is more complex than in the case of in-
dependent and-parallelism. While outside backtracking remains unchanged,
inside backtracking (i.e., backtracking within subgoals which are part of a par-
allel call) loses its “independent” nature, which guaranteed the semi-intelligent
backtracking described earlier (Section 4.1.3). Two major issues emerge. First
of all, failure of a subgoal within a parallel conjunction does not lead to the
failure of the whole conjunction, but requires killing the subgoals to its right
and propagating backtracking to the subgoal immediately to its left; this is an
asynchronous activity, since the subgoal on the left may still be running.
In addition, backtracking within a parallel subgoal may also affect the exe-
cution of other parallel subgoals. In a parallel conjunction such as p(X) & q(X),
backtracking within p(X) which leads to a modification of the value of X will re-
quire rolling back the execution of q(X) as well, since q(X) may have consumed
the value of X that has just been untrailed.
Implementations of this scheme have been proposed in Shen [1992a, 1992b] and
Pontelli and Gupta [1997a]; optimizations of this scheme have also been de-
scribed in Shen [1994].
Or-parallelism in Andorra-I is handled in a manner similar to that in Aurora and is based on binding arrays [Warren 1984, 1987c]. Due to
its similarity to Aurora as far as or-parallelism is concerned, Andorra-I is able
to use the schedulers built for Aurora. The current version of Andorra-I is com-
piled [Yang et al. 1993] and is a descendant of the earlier interpreted version
[Santos Costa et al. 1991a].
As a result of exploitation of determinate dependent and-parallelism and the
accompanying coroutining, not only can Andorra-I exploit parallelism from logic
programs, but it can also reduce the number of inferences performed to compute
a solution. As mentioned earlier, this is because execution in the basic Andorra
model is divided into two phases—determinate and nondeterminate—and exe-
cution of the nondeterminate phase is begun only after all “forced choices” (i.e.,
choices for which only one alternative is left) have been made in the determinate
phase, that is, after all determinate goals in the current goal list, irrespective of
their order in this list, have been solved. Any goal that is nondeterminate (i.e.,
has more than one potentially matching clause) will be suspended in the de-
terminate phase. Solving determinate goals early constrains the search space
much more than using the standard sequential Prolog execution order (e.g., for
the 8-queens program the search space is reduced by 44%, for the zebra puzzle
by 70%, etc.). Note that execution of a determinate goal to the right may bind
variables which in turn may make nondeterminate goals to their left deter-
minate. The Andorra-I compiler performs an elaborate determinacy analysis
of the program and generates code so that the determinate status of a goal is
determined as early as possible at run-time [Santos Costa et al. 1996, 1991c].
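A small example of this effect (the clauses are ours):

    p(1).
    p(2).
    p(3).
    q(3).

    main(X) :- p(X), q(X).

Under Prolog's left-to-right, depth-first strategy, p(X) is tried with X = 1
and X = 2 before the call q(X) finally succeeds with X = 3. Under the basic
Andorra model, q(X) is determinate (it has a single clause) and is executed
first, binding X = 3; this in turn leaves only one candidate clause for p(X),
so the query is solved by determinate reductions without creating any choice
point.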
The Andorra-I system supports full Prolog, in that execution can be per-
formed in such a way that sequential Prolog semantics is preserved [Santos
Costa et al. 1996, 1991c]. This is achieved by analyzing the program at compile-
time and preventing early (i.e., out of turn) execution of those determinate goals
that may contain extralogical predicates. These goals will be executed only after
all goals to the left of them have been completely solved.24
The Andorra-I system speeds up execution in two ways: by reducing the num-
ber of inferences performed at run-time, and by exploiting dependent and-
parallelism and or-parallelism from the program. Very good speed-ups have
been obtained by Andorra-I for a variety of benchmark programs. The Andorra-I
engine [Santos Costa et al. 1991b; Yang et al. 1993] combines the implementa-
tion techniques used in implementing Parlog, namely, the JAM system [Cram-
mond 1992], and the Aurora system [Lusk et al. 1990]. The Andorra-I system
had to overcome many problems before an efficient implementation of its en-
gine could be realized. Chief among them was a backtrackable representation
of the goal list. Since goals are solved out of order, they should be inserted back
in the goal list if backtracking is to take place; recall that there is no backtrack-
ing in Parlog so this was not a problem in JAM. The Andorra-I system was
the first to employ the notion of teams of workers, where available workers are
divided into teams, and each team shares all the data structures (except the
queue of ready-to-run goals). Or-parallelism is exploited at the level of teams
24 In spite of this, there are cases where Andorra-I and Prolog lead to different behavior; in partic-
ular, there are nonterminating Prolog programs that will terminate in Andorra-I and vice versa.
6.1 Issues
As one can gather, parallel systems that exploit only one form of parallelism
from logic programs have been efficiently implemented and have reached a ma-
ture stage. A number of prototypes have been implemented and successfully ap-
plied to the development and parallelization of very large real-life applications
25 We are also working under the assumption that the compiler marks goals for DAP execution
conservatively; that is, during execution if a shared variable X is bound to a structure containing
an unbound variable Y before the parallel conjunction corresponding to X is reached then both X
and Y are marked as shared. Otherwise, for correctness, the structure X is bound to will have to be
traversed to find all unbound variables occurring in it and mark them as shared.
(see also Section 10). Public domain parallel logic programming systems are
available (e.g., YapOr [Rocha et al. 1999], KLIC [Chikayama et al. 1994], Ciao
[Bueno et al. 1997], which includes &-Prolog, DASWAM [Shen 1996b]). For
some time, a number of commercial parallel Prolog systems have also appeared
on the market, such as SICStus Prolog, which includes the or-parallel MUSE
system, and ECLiPSe, which includes an or-parallel version of ElipSys. In spite
of the fact that these commercial Prolog systems have progressively dropped
their support for parallelism (this is mostly due to commercial reasons: the
high cost of maintaining the parallel execution mechanisms), these systems
demonstrate that we possess the technology for developing effective and effi-
cient Prolog systems exploiting a single form of parallelism.
Although very general models for parallel execution of logic programs (ex-
ploiting multiple forms of parallelism) have been proposed, such as the Ex-
tended Andorra Model (EAM) (described later in this section), they have not yet
been efficiently realized. A compromise approach that many researchers have
been pursuing, long before the EAM was conceived, is that of combining tech-
niques that have been effective in single-parallelism systems to obtain efficient
systems that exploit more than one source of parallelism in logic programs.26
The implementation of the basic Andorra model [Haridi 1990; Warren 1987a],
namely, Andorra-I [Santos Costa et al. 1991b] can be viewed in that way since it
combines (determinate) dependent and-parallelism, implemented using tech-
niques from JAM [Crammond 1992], with or-parallelism, implemented us-
ing the binding arrays technique [Lusk et al. 1990; Warren 1987c]. Likewise,
the PEPSys model [Westphal et al. 1987; Baron et al. 1988], the AO-WAM
[Gupta and Jayaraman 1993b], ROPM [Kalé 1985; Ramkumar and Kalé 1989,
1992], ACE [Gupta et al. 1994b, 1993], the PBA models [Gupta et al. 1994b,
1993], SBA [Correia et al. 1997], FIRE [Shen 1997], and the COWL models
[Santos Costa 1999] have attempted to combine independent and-parallelism
with or-parallelism; these models differ from one another in the environment
representation technique they use for supporting or-parallelism, and in the
flavor of and-parallelism they support. One should also note that, in fact, Con-
ery’s model described earlier is an and-or parallel model [Conery 1987b] since
solutions to goals may be found in or-parallel. Models combining independent
and-parallelism, or-parallelism, and (determinate) dependent and-parallelism
have also been proposed [Gupta et al. 1991]. The abstract execution models that
these systems employ (including those that only exploit a single source of par-
allelism) can be viewed as subsets of the EAM with some restrictions imposed,
although this is not how they were conceived. In subsequent subsections, we
review these various systems that have been proposed for combining more than
one source of parallelism.
The problems faced in implementing a combined and- and or-parallel sys-
tem are unfortunately not only the sum of problems faced in implementing
and-parallelism and or-parallelism individually. In the combined system the
problems faced in one may worsen those faced in the other, especially those
26 Simulations have shown that indeed better speedups will be achieved if more than one source of
parallelism is exploited [Shen 1992b; Shen and Hermenegildo 1991, 1996b].
(1) consider only pure Prolog programs for parallel execution; this was the
approach taken by many early proposals, for example, AO-WAM [Gupta
and Jayaraman 1993b] and ROPM [Kalé 1985]; or
(2) devise a new language that will allow extralogical features but in a con-
trolled way, for example, PEPSys [Ratcliffe and Syre 1987; Westphal et al.
1987; Chassin de Kergommeaux and Robert 1990].
The disadvantage of both these approaches is that existing Prolog programs
cannot be immediately parallelized. Various approaches have been proposed
that allow support for Prolog’s sequential semantics even during parallel exe-
cution [Santos Costa 1999; Correia et al. 1997; Castro et al. 1999; Ranjan et al.
2000a; Gupta et al. 1994a, 1994b; Santos Costa et al. 1991c].
Another issue that arises in systems that exploit independent and-
parallelism is whether to recompute solutions of independent goals, or to reuse
them. For example, consider the following program for finding “cousins at the
same generation” taken from Ullman [1988],
sg(X, X) :- person(X).
sg(X, Y) :- parent(X, Xp), parent(Y, Yp), sg(Xp, Yp).
In executing a query such as ?- sg(fred, john) under a (typical) purely
or-parallel, a purely independent and-parallel, or a sequential implementa-
tion, the goal parent(john, Yp) will be recomputed for every solution to
parent(fred, Xp).27 This is clearly redundant since the two parent goals are
independent of each other. Theoretically, it would be better to compute their
solutions separately, take a cross-product (join) of these solutions, and then
try the goal sg(Xp, Yp) for each of the combinations. In general, for two in-
dependent goals G1 and G2 with m and n solutions, respectively, the cost of
the computation can be brought down from m ∗ n to m + n by computing the
solutions separately and combining them through a cross-product, assuming
the cost of computing the cross-product is negligible.28 However, for indepen-
dent goals with very small granularity, the gain from solution sharing may be
27 Respecting Prolog semantics, a purely independent and-parallel system can avoid recomputation
6.3.1 The PEPSys Model. The PEPSys model [Westphal et al. 1987; Baron
et al. 1988; Chassin de Kergommeaux and Robert 1990] combines and- and
or-parallelism using a combination of techniques of timestamping and hashing
windows for maintaining multiple environments. In PEPSys (as already dis-
cussed in Section 3.2), each node in the execution tree has a process associated
with it. Each process has its own hash window. All the bindings of conditional
variables generated by a process are timestamped and stored in that process’
hash window. Any PEPSys process can access the stacks and hash windows of
its ancestor processes. The timestamp associated with each binding makes it
possible to distinguish the relevant binding from the others in the ancestor
processes’ stacks and hash windows.
Independent and-parallel goals have to be explicitly annotated by the pro-
grammer. The model can handle only two and-parallel subgoals at a time. If
more than two subgoals are to be executed in and-parallel, the subgoals are
nested in a right associative fashion. If or-parallelism is nested within and-
parallelism, then and-parallel branches can generate multiple solutions. In
this case, the cross-product (join) of the left-hand and right-hand solution sets
must be formed. A process is created for each combination of solutions in the
cross-product set. Each such process can communicate with its two ancestor
processes (one corresponding to the left and-branch and the other to the right
and-branch) that created the corresponding solution. Access to the
bindings of these ancestor processes is handled by join cells. A join cell con-
tains a pointer to the hash window of the left and-branch process and to the
hash window of the right and-branch process. It also contains a pointer to the
hash window that was current at the time of the and-parallel split (Figure 24).
Looking up a variable binding from a goal after the and-parallel join works as
follows: the linear chain of hash windows is followed in the usual way until
a join cell is reached. Now a branch becomes necessary. First, the right-hand
process is searched by following the join cell’s right-hand side hash window
chain. When the least common hash window is encountered, control bounces
back to the join cell and the left branch is searched.
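A much-simplified Prolog model (ours) of this lookup is sketched below. A chain is represented as a list of hw(Bindings) cells and join(RightChain, LeftChain) cells, where the two sub-chains are assumed to stop just below the least common hash window, which instead belongs to the surrounding ancestor chain; bindings are Name-Value pairs:

lookup(Var, [hw(Bindings)|Ancestors], Value) :-
    (   member(Var-Value, Bindings) -> true    % binding found in this hash window
    ;   lookup(Var, Ancestors, Value)          % otherwise keep following the chain
    ).
lookup(Var, [join(Right, Left)|Ancestors], Value) :-
    (   lookup(Var, Right, Value) -> true      % search the right and-branch first
    ;   lookup(Var, Left, Value)  -> true      % then bounce back and search the left one
    ;   lookup(Var, Ancestors, Value)          % finally continue above the common window
    ).

For instance, lookup(x, [hw([]), join([hw([x-a])], [hw([y-b])]), hw([z-c])], V) binds V to a.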
The basic scheme for forming the cross-product, gathering the left-hand so-
lutions and the right-hand solutions in solution lists and eagerly pairing them,
relies on the fact that all solutions to each side are computed incrementally and
coexist at the same time in memory to be paired with newly arriving solutions
to the other side. However, if all solutions to the and-parallel goal on the right
have been found and backtracked over, and there are still more solutions for the
and-parallel goal to the left remaining to be discovered, then the execution of
the right goal will be restarted after discovery of more solutions of the goal to the
left (hence, PEPSys uses a combination of goal-reuse and goal-recomputation).
The PEPSys model uses timestamping and hash windows for environ-
ment representation. This doesn’t permit constant-time access to conditional
variables. Therefore, access to conditional variables is expensive. However,
6.3.2 The ROPM Model. ROPM (Reduce-Or Parallel Model) [Kalé 1991]
was devised by Kalé in his Ph.D. dissertation [Kalé 1985]. The model is based
on a modification of the and-or tree, called the Reduce-Or Tree. There are two
types of nodes in the reduce-or tree, the reduce nodes and the or nodes. The
reduce nodes are labeled with a query (i.e., a set of goals) and the or nodes
are labeled with a single literal. To prevent global checking of variable binding
conflicts, every node in the tree has a partial solution set (PSS) associated with
it. The PSS consists of a set of substitutions for variables that make the subgoal
represented by the node true. Every node in the tree contains the bindings of
all variables that are either present in the node or are reachable through this
node. The reduce-or tree is defined recursively as follows [Kalé 1991].
(1) A reduce node labeled with the top level query and with an empty PSS is a
reduce-or tree.
(2) A tree obtained by extending a reduce-or tree using one of the rules below
is a reduce-or tree:
(a) Let Q be the set of literals in the label of a Reduce node R. Corresponding
to any literal L in Q, one may add an arc from R to a new or node O
labeled with an instance of L. The literal must be instantiated with a
consistent composition of the substitutions from the PSS of subgoals
preceding L in Q.
(b) To any or node, labeled with a goal G, one may add an arc to a new reduce
node corresponding to some clause of the program, say C, whose head
unifies with G. The body of C with appropriate substitutions resulting
from the head unification becomes the label of the new Reduce node (say)
R. If the query is empty (i.e., the clause is a “fact”) the PSS associated
with R becomes a singleton set. The substitution that unifies the goal
with the fact becomes the only member of the set.
(c) Any entry from the PSS of the reduce node can be added to the PSS of its
parent or node. A substitution can be added to the PSS of a reduce node
R representing a composite goal Q if it is a consistent composition of the
substitutions, one for each literal of Q, from the PSSs of the children (or
nodes) of R.
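Rule (c) can be made concrete with a small Prolog sketch (ours): substitutions are modeled as lists of Name-Value pairs, the argument of compose_pss/2 is a list containing one PSS per child or node, and a composition is accepted only if no variable is given two different values (member/2, append/3, and sort/2 are standard list predicates):

compose_pss([], []).
compose_pss([ChildPSS|ChildPSSs], Composed) :-
    member(Subst, ChildPSS),                           % pick one substitution per child
    compose_pss(ChildPSSs, Rest),
    append(Subst, Rest, All),
    \+ (member(V-X, All), member(V-Y, All), X \== Y),  % consistency check
    sort(All, Composed).                               % drop duplicate pairs

For example, compose_pss([[[x-a]], [[x-a, y-b], [x-c]]], S) succeeds only with S = [x-a, y-b].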
ROPM associates a Reduce Process with every Reduce node and an or process
with every or node. The program clauses in ROPM are represented as Data Join
Graphs (DJGs), in which each arc of the graph denotes a literal in the body of
the clause (Figure 25).
DJGs are a means of expressing and-parallelism and are similar in spirit
to Conery’s dataflow graph. A set of variable binding tuples, called a relation
(PSS), is associated with each arc and each node of the DJG. The head of a
clause is matched with a subgoal by an or process. A reduce process is spawned
to execute the body of the clause. In the reduce process, whenever a binding
tuple is available in the relation of a node k, subgoals corresponding to each
of the arcs emanating from k will be started, which leads to the creation of
new Or processes. When a solution for any subgoal arrives, it is inserted in the
corresponding arc relation. The node relation associated with a node n is a join
of the arc relations of all its incoming arcs. So when a solution tuple is inserted
in an arc relation, it is joined with all the solution tuples in the arc relations
of its parallel arcs that originated from the same tuple in the lowest common
ancestor node of the parallel arcs [Ramkumar and Kalé 1990]. A solution to the
top level query is found when the PSS of the root node becomes nonempty.
In ROPM, multiple environments are represented by replicating them at the
time of process creation. Thus, each reduce process and each or process has its
own copy of the variable bindings (the Partial Solution Set above), which is given
to it at the time of spawning; process creation is therefore an expensive operation.
ROPM is a process-based model rather than a stack-based one. As a result, there
is no backtracking, and hence none of the memory reclamation normally associated
with backtracking. Computing the join is an expensive operation since the actual
bindings of variables have to be cross-produced to generate the tuple relations
of the node (as opposed to using symbolic addresses to represent solutions, as
done in PEPSys [Westphal et al. 1987] and AO-WAM [Gupta and Jayaraman
1993b]), and also since the sets being cross-produced have many redundant
elements. Much effort has been invested in eliminating unnecessary elements
from the constituent sets during join computation [Ramkumar and Kalé 1990].
The computation of the join has also been made more efficient
by using structure sharing. One advantage of the ROPM model is that if a pro-
cess switches from one part of the reduce-or tree to another, it doesn’t need to
update its state at all since the entire state information is stored in the tree.
The ROPM model has been implemented in the ROLOG system on a variety
of platforms. ROLOG is a complete implementation, which includes support
for side-effects [Kalé et al. 1988b]. However, although ROLOG yields very good
speedups, its absolute performance does not compare very well with other par-
allel logic programming systems, chiefly because it is a process-based model
and uses the expensive mechanism of environment closing [Ramkumar and
Kalé 1989; Conery 1987a] for multiple environment representation.
ROLOG is probably the most advanced process-based system proposed to
handle concurrent exploitation of and-parallelism and or-parallelism. Other
systems based on similar models have also been proposed in the literature, for
example, OPAL [Conery 1992], where execution is governed by a set of and
processes and or processes: the and processes solve the set of goals in the body of
a rule, while the or processes coordinate the solution of a single goal with multiple
matching clauses. And processes and or processes communicate solely via messages.
6.3.3 The AO-WAM Model. This model [Gupta and Jayaraman 1993b;
Gupta 1994] combines or-parallelism and independent and-parallelism.
Independent and-parallelism is exploited in the same way as in &-Prolog and
&ACE, and solutions to independent goals are reused (and not recomputed).
To represent multiple or-parallel environments in the presence of independent
and-parallelism, the AO-WAM extends the binding arrays technique [Warren
1984, 1987c].
The model works by constructing an Extended And-Or tree. Execution con-
tinues like a standard or-parallel system until a CGE is encountered, at which
point a cross-product node that keeps track of the control information for the
and-parallel goals in the CGE is added to the or-parallel tree. New or-parallel
subtrees are started for each independent and-parallel goal in the CGE. As so-
lutions to goals are found, they are combined with solutions of other goals to
produce their cross-product. For every tuple in the cross-product set, the con-
tinuation goal of the CGE is executed (i.e., its tree is constructed and placed as
a descendant of the cross-product node).
As far as maintenance of multiple environments is concerned, each worker
has its own binding array. In addition, each worker has a base array. Conditional
variables are bound to a pair of numbers consisting of an offset in the base
array and a relative offset in the binding array. Given a variable bound to
the pair <i, v>, the location binding array[base array[i] + v] will contain
the binding for that variable. For each and-parallel goal in a CGE, a different
base array index is used. Thus the binding array contains a number of smaller
binding arrays, one for each and-parallel goal, that are accessible through the
base array. When a worker produces a solution for an and-parallel goal and
computes its corresponding cross-product tuples, then before it can continue
execution with the continuation goal of the CGE, it has to load all the conditional
bindings made by other goals in the CGE that are present in the selected tuple
(See Figure 26). Also, on switching nodes, a worker must update its binding
array and base array with the help of the trail, as in Aurora.
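A minimal Prolog rendering (ours) of this two-level dereferencing is shown below; the two arrays are modeled as lists indexed from 0, deref_pair/4 is a hypothetical name, and nth0(Index, List, Elem) is the 0-based access predicate from the standard list library:

% A conditional variable is represented by the pair I-V: I indexes the base
% array, V is the offset of the variable within its goal's segment.
deref_pair(I-V, BaseArray, BindingArray, Value) :-
    nth0(I, BaseArray, Base),            % base_array[i]
    Pos is Base + V,
    nth0(Pos, BindingArray, Value).      % binding_array[base_array[i] + v]

For instance, with BaseArray = [0,3] (two and-parallel goals whose segments start at offsets 0 and 3) and BindingArray = [a,b,c,d,e], the query deref_pair(1-0, [0,3], [a,b,c,d,e], X) binds X to d.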
29 Notethat the ACE platform has been used to experiment with both combined and/or-parallelism
as well as dependent and-parallelism, as illustrated in Section 5.5.3.
alternative from a choice point created during execution of a goal gi inside the
CGE (true ⇒ g1 & ... & gn). This team will copy all the stack segments in the
branch from the root to the CGE including the parcall frame.30 It will also have
to copy the stack segments corresponding to the goals g1, ..., gi−1 (i.e., goals to
the left). The stack segments up to the CGE need to be copied because each
different alternative within gi might produce a different binding for a variable,
X, defined in an ancestor goal of the CGE. The stack segments corresponding to
goals g1 through gi−1 have to be copied because execution of the goals following
30 As mentioned earlier, the parcall frame [Hermenegildo 1986b] records the control information
for the CGE and its independent and-parallel goals.
the CGE might bind a variable defined in one of the goals g1, ..., gi−1 differently.
The stack segments of the goal gi from the CGE up to the choice point from
where the alternative was taken also need to be copied (note that because of
this, an alternative can be picked up for or-parallel processing from a choice
point that is in the scope of the CGE only if goals to the left, i.e., g1, ..., gi−1,
have finished). The execution of the alternative in gi is begun, and when it
finishes, the goals gi+1, ..., gn are started again so that their solutions can be
recomputed. Because of recomputation of independent goals ACE can support
sequential Prolog semantics [Gupta et al. 1993, 1994a; Gupta and Santos Costa
1996].
This is also illustrated in Figure 27. The four frames represent four teams
working on the computation. The second team recomputes the goal b, while the
third and fourth teams take the second alternative of b, respectively, from the
first and second team.
space in the current one. As in the AO-WAM, conditional variables are bound to a
pair of numbers, where the first element of the pair indicates the page number
in the binding array, and the second element indicates the offset within this
page.
The PBA-based model also employs recomputation of independent goals, and
therefore can support Prolog semantics [Gupta et al. 1993; Gupta and Santos
Costa 1996]. Thus, when a team steals an alternative from a goal inside a CGE,
it updates its binding array and page table so that the computation state
that exists at the corresponding choice point is reflected in the stealing team.
The team then restarts the execution of that alternative, and of all the goals
to the right of the goal in the CGE that led to that alternative. In cases where
the alternative stolen is from a choice point outside the scope of any CGE, the
operations involved are very similar to those in Aurora.
The Paged Binding Array is a very versatile data structure and can also be
used for implementing other forms of and-or parallelism [Gupta et al. 1994b].
So far we have only considered models that combine or- and independent
and-parallelism. There are models that combine independent and-parallelism
and dependent and-parallelism such as DDAS [Shen 1992a], described earlier,
as well as models that combine or-parallelism and dependent and-parallelism
such as Andorra-I [Santos Costa et al. 1991a]. Other combined independent
and- and or- parallel models have also been proposed [Biswas et al. 1988; Gupta
et al. 1991].
6.3.7 The Principle of Orthogonality. One of the overall goals that has been
largely ignored in the design of and-or parallel logic programming systems is
the principle of orthogonality [Correia et al. 1997]. In an orthogonal design,
or-parallel execution should be unaware of and-parallel execution and vice
versa. Thus, orthogonality allows the separate design of the data structures
and execution mechanisms for or-parallelism and and-parallelism. Achieving
this goal is very ambitious. Orthogonality implies that:
(1) each worker should be able to backtrack to a shared choice point and be
aware only of or-parallelism;
(2) whenever a worker enters the public part of the or-tree, the other workers
on the team should be able to continue their and-parallel computations
unaffected.
Most existing proposals for combined and/or-parallelism do not meet the principle
of orthogonality. Consider, for example, the PBA model and the computation
shown in Figure 29.
Let us assume the following configuration.
(1) Workers W1,1 and W1,2 compose the first team, which is operating on the
parallel call on the left; worker W1,1 makes use of pages 1 and 3: page 1
is used before choice point C1 while page 3 is used after that choice point,
and worker W1,2 makes use of page 2.
(2) Workers W2,1 and W2,2 compose team number 2, which is working on the
copy of the parallel call (on the right). The computation originates from
stealing one alternative from choice point C1. In this case, worker W2,2
makes use of both pages 2 and 3.
If worker W2,1 backtracks and asks for a new alternative from the first team
(one of the alternatives of C2), then it will need to use page 3 for installing the
bindings created by team 1 after the choice point C1. But for team 2, page 3
is not available (being used by W2,2). Thus, worker W2,2 will be “affected” by
backtracking of W2,1 on a shared choice point.
Various solutions are currently under exploration to support orthogonality.
Some of the schemes proposed are
—the shared paged binding array (SPBA) [Gupta et al. 1994b] which extends
the PBA scheme by requiring the use of a global and shared paged binding
array;
—the sparse binding array [Correia et al. 1997] in which each conditional vari-
able is guaranteed to have a binding array index that is unique in the whole
computation tree and relies on operating system techniques to maintain the
large address space that each worker needs to create (each worker needs
virtual access to the address space of each worker in the system);
—the COWL methods presented in Section 6.3.5.
A comparison of these three schemes has been presented in Santos Costa et al.
[2000].
6.3.8 The Extended Andorra Model. The extended Andorra model (EAM)
[Warren 1987a; Haridi and Janson 1990; Gupta and Warren 1992] and the
Andorra Kernel Language (AKL) (later renamed Agent Kernel Language)
[Haridi and Janson 1990] combine exploitation of or-parallelism and depen-
dent and-parallelism. Intuitively, both models rely on the creation of copies of
the consumer goal for every alternative of the producer and vice-versa (akin to
computing a join) and letting the computation proceed in each such combina-
tion. Note that the EAM and the Andorra Kernel Language are very similar in
spirit to each other, the major difference being that while the EAM strives to
keep the control as implicit as possible, AKL gives the programmer complete
control over parallel execution through wait guards. In the description below,
we use the term Extended Andorra Model in a generic sense, to include models
such as AKL as well.
The Extended Andorra Model is an extension of the Basic Andorra Model.
The Extended Andorra Model goes a step further and removes the constraint
that goals become determinate before they can execute ahead of their turn.
However, goals that do start computing ahead of their turn may compute only
as long as the (multiple) bindings they produce for the uninstantiated variables
in their arguments remain consistent with those produced by the “outside envi-
ronment.” If such goals attempt to bind a variable in the outside environment,
they suspend. Once a state is reached where execution cannot proceed, then
each suspended goal that is a producer of bindings for one (or more) of its argu-
ment variables “publishes” these bindings to the outside environment. For each
binding published, a copy of the consumer goal is made and its execution con-
tinued. (This operation of “publication” and creation of copies of the consumer
is known as a “nondeterminate promotion” step.) The producer of bindings of a
variable is typically the goal where that variable occurs first. However, if a goal
produces only a single binding (i.e., it is determinate), then it doesn’t need to
suspend; it can publish its binding immediately, thus automatically becoming
the producer for that variable irrespective of whether it contains the leftmost oc-
currence of that variable (as in the Basic Andorra Model). An alternative way
of looking at the EAM is to view it as an extension of the basic Andorra model
where nondeterminate goals are allowed to execute locally so far as they do not
influence the computation going on outside them. This amounts to including in
the Basic Andorra Model the ability to execute independent goals in parallel.
There have been different interpretations of the Extended Andorra Model,
but the essential ideas are summarized below. Consider the following very sim-
ple program,
p(X, Y) :- X=2, m(Y).
p(X, Y) :- X=3, n(Y).
q(X, Y) :- X=3, t(Y).
q(X, Y) :- X=3, s(Y).
r(Y) :- Y=5.
?- p(X, Y), q(X, Y), r(Y).
When the top level goal begins execution, all three goals will be started con-
currently. Note that variables X and Y in the top level query are considered
to be in the environment “outside” goals p, q, and r (this is depicted by ex-
istential quantification of X and Y in Figure 30). Any attempt to bind these
variables from inside these goals will lead to the suspension of these goals.
Thus, as soon as these three goals begin execution, they immediately suspend
since they try to constrain either X or Y. Of these, r is allowed to proceed and
constrain Y to value 5, because it binds Y determinately. Since p will be reck-
oned the producer goal for the binding of X, it will continue as well and publish
its binding. The goal q will, however, suspend since it is neither determinate
nor the producer of bindings of either X or Y. To resolve the suspension of q
and make it active again, the nondeterminate promotion step will have to be
performed. The nondeterminate promotion step will match all alternatives of p
with those for q, resulting in only two combinations remaining active (the rest
having failed because of nonmatching bindings of X). These steps are shown in
Figure 30.
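Tracing the example by hand (this enumeration is ours and is only meant to make the figure concrete), the combinations that survive the nondeterminate promotion step are

X = 3, Y = 5, with remaining goals n(Y) and t(Y)
X = 3, Y = 5, with remaining goals n(Y) and s(Y)

since the first clause of p, which requires X = 2, is incompatible with both clauses of q.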
The above is a very coarse description of the EAM; a full description of the
model is beyond the scope of this article. More details can be found elsewhere
[Warren 1987a; Haridi and Janson 1990; Gupta and Warren 1992]. The EAM is
a very general model, more powerful than the Basic Model, since it can narrow
down the search even further by local searching. It also exploits more paral-
lelism since it exploits all major forms of parallelism present in logic programs:
or-, independent and-, and dependent and-parallelism, including both determi-
nate and nondeterminate dependent and-parallelism. A point to note is that
the EAM does not distinguish between independence and dependence of con-
junctive goals: it tries to execute them in parallel whenever possible. Also note
that the EAM subsumes both the committed choice logic programming (with
nonflat as well as flat guards) and nondeterministic logic programming, that
is, general Prolog.
The generality and the power of the Extended Andorra Model make its efficient
implementation quite difficult. One instance of the EAM (namely, the Andorra
Kernel Language, or AKL) has been implemented sequentially at the Swedish
Institute of Computer Science [Janson and Montelius 1991].
A parallel implementation has also been undertaken by Moolenaar and
Demoen [1993]. A very efficient parallel implementation of AKL has been pro-
posed by Montelius in the Penny system [Montelius 1997; Montelius and Ali
1996]. This implementation combines techniques from or-parallelism and com-
mitted choice languages. Although AKL includes nondeterminism, it differs
from Prolog both in syntax and semantics. However, automatic translators that
transform Prolog programs into AKL programs have been constructed [Bueno
and Hermenegildo 1992]. The development of AKL has been discontinued, al-
though many of the ideas explored in the AKL project have been reused in
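% map/2 applies process/2 to every element of its first argument;
% member/2 nondeterministically selects an element of a list.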
map([],[]).
map([X|Y],[X1|Y1]) :-
process(X,X1),
map(Y,Y1).
member(X, [X|T]).
member(X, [Y|T]) :- member(X, T).
?- member(Z, [1,2,..,100]), process(Z).
p(X̄) :- Φ1, ..., Φn,   p(b̄),   Ψn, ..., Ψ1,
          (1)            (2)       (3)

where Φi and Ψi are the instances of the goals Φ and Ψ obtained at the ith level of
recursion. This clause can be executed by first running, in parallel, the goals
Φ1, ..., Φn, then executing p(b̄) (typically the base case of the recursion), and
finally running the goals Ψn, ..., Ψ1 in parallel as well. In practice, the unfolded
clause is not actually constructed; instead the head unification for the n lev-
els of recursion is performed at the same time as the size of the recursion is
determined, and the body of the unfolded clause is compiled into parallel code.
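As an illustration (ours), unfolding the map/2 program shown above for a call whose first argument is a three-element list yields the clause below; here the Ψ part is empty and the residual base case map([],[]) is trivially true, so all of the unfolded goals Φi = process(Xi, Yi) can run in parallel:

map([X1, X2, X3], [Y1, Y2, Y3]) :-
    process(X1, Y1),    % Φ1
    process(X2, Y2),    % Φ2
    process(X3, Y3).    % Φ3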
Reform Prolog [Bevemyr et al. 1993] is an implementation of a restricted
version of the reform compilation approach. In particular, only predicates per-
forming integer recursion or list recursion and for which the size of the recursion
is known at the time of the first call are considered for parallel execution.
To achieve efficient execution, Reform Prolog requires the generation of de-
terministic bindings to the external variables, thus relieving the system of the
need to perform complex backtracking on parallel calls. Compile-time analysis
tools have been proposed to guarantee the conditions necessary for the parallel
execution and to optimize execution [Lindgren 1993]. Reform Prolog has been
ported to different MIMD architectures, such as the Sequent [Bevemyr et al. 1993]
and KSR-1 [Lindgren et al. 1995].
Exploitation of data and-parallelism explicitly through bounded quantifica-
tion has also been proposed [Barklund and Millroth 1992]. In this case, the
language is extended with constructs used to express bounded forms of uni-
versal quantification (e.g., ∀(X ∈ S)ϕ). Parallelism is exploited by concurrently
executing the body of the quantified formula (e.g., ϕ) for the different values in
the domain of the quantifiers (e.g., the different values in the set S).
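The declarative reading of such a quantified goal can be conveyed in standard Prolog with forall/2 (equivalent to \+ (Cond, \+ Goal)); this rendering is ours, the concrete bounded-quantification syntax of Barklund and Millroth differs, and the predicate name and the body X > 0 are chosen purely for concreteness:

% ∀(X ∈ S) ϕ(X), with S a list and ϕ(X) instantiated here to X > 0.
% A data and-parallel system runs the instances of the body for the different
% elements of S concurrently instead of enumerating them by backtracking.
all_positive(S) :- forall(member(X, S), X > 0).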
Both traditional and-parallelism and data-parallelism offer advantages and
disadvantages. Traditional and-parallel models offer generality, being able to
exploit parallelism in a large class of programs (including the parallelism ex-
ploited by data parallelism techniques). Data and-parallelism techniques on
the other hand offer increased performance for a restricted class of programs.
As a result, various authors have worked on integrating data and-parallelism
into more traditional and-parallelism schemes [Debray and Jain 1994; Pontelli
and Gupta 1995a; Hermenegildo and Carro 1996]. The basic idea is to iden-
tify instances of data and-parallelism in generic and-parallel programs, and to
use specialized and more efficient execution mechanisms to handle these cases
within the more general and-parallel systems. These techniques have been
shown to provide the advantages of both types of parallelism within
the same system.
parallel AC3 [Samal and Henderson 1987] and AC4 [Nguyen and Deville 1998])
and apply them to the specific case of indexical constraints in CLP over finite
domains. Similar work exploring interactions between search strategies in con-
straint logic programming and parallelism has also been presented [Schulte
2000; Perron 1999].
31 As mentioned earlier, this actually implies a better result even for Prolog programs since its
projection on the Herbrand domain is a strict generalization of previous notions of nonstrict inde-
pendence. For example, the sequence p(X), q(X) can be parallelized if p is defined, for example,
as p(a) and q is defined as q(a).
32 For instance, many person-years of effort have been spent in building some of the existing systems,
query by being assigned parts of the computation, and, typically, each thread
is a WAM-like processor. Examples of processor-based systems are &-Prolog,
Aurora, MUSE, Andorra-I, PEPSys, AO-WAM, DDAS, ACE, PBA, and so on.
Processor-based systems are more suited for shared memory machines, al-
though techniques such as stack copying and stack splitting show a high degree
of locality in memory reference behavior and hence are suited for nonshared
memory machines as well [Ali 1988; Ali et al. 1992; Gupta and Pontelli 1999c].
As has been shown by the ACE model, MUSE’s stack copying technique can be
applied to and-or parallel systems as well, so one can envisage implementing
a processor-based system on a nonshared memory machine using stack copy-
ing [Villaverde et al. 2001; Gupta et al. 1992]. Alternatively, one could employ
scalable virtual shared memory architectures that have been proposed [Warren
and Haridi 1988] and built (e.g., KSR, SGI Origin, IBM NUMA-Q).
Ideally, a parallel logic programming system is expected to satisfy the follow-
ing two requirements [Hermenegildo 1986b; Gupta and Pontelli 1997; Pontelli
1997].
—On a single processor, the performance of the parallel system should be com-
parable to sequential logic programming implementations (i.e., there should
be limited slowdown compared to a sequential system). To this end, the par-
allel system should be able to take advantage of the sequential compilation
technology [Warren 1983; Aït-Kaci 1991; Van Roy 1990] that has advanced
rapidly in the last two decades, and thus the basic implementation should be
WAM-based.
—Parallel task creation and management should introduce a small overhead
(which implies using a limited number of processes and efficient scheduling
algorithms).
Systems such as &-Prolog, Aurora, MUSE, and ACE indeed get very close
to achieving these goals. Experience has shown that process-based systems
lose out on both the above counts. Similar accounts have been reported also in
the context of committed choice languages (where the notion of process-based
matches well with the view of each subgoal as an individual process that is
enforced by the concurrent semantics of the language); indeed the fastest par-
allel implementations of committed choice languages (e.g., Crammond [1992]
and Rokusawa et al. [1996]) rely on a processor-based implementation. In the
context of Prolog, the presence of backtracking makes the process model too
complex for nondeterministic parallel logic programming. Furthermore, the
process-based approaches typically exploit parallelism at a level that is too fine-
grained, resulting in high parallel overhead and unpromising absolute performance
(but good speedups, because the large parallel overhead gets evenly distributed!).
Current processor-based systems are not only highly efficient, but can also easily
assimilate future advances in sequential compilation technology.
However, it must be pointed out that increasing
the granularity of processes to achieve better absolute performance has been
attempted for process-based models with good results [Ramkumar and Kalé
1992].
(1) If a node n1 in the search tree is in a branch to the right of another branch
containing node n2 , then the data structures corresponding to node n2 would
be reclaimed before those of n1 are allocated.
(2) If a node n1 is the ancestor of another node n2 in the search tree, then the
data structures corresponding to n2 would be reclaimed before those of n1 .
As a result of these two rules, space is always reclaimed from the top of
the stacks during backtracking in logic programming systems which perform a
depth-first search of the computation tree, as Prolog does.
However, as shown in Lusk et al. [1990], Ali and Karlsson [1990b] and
Hermenegildo [1987], in parallel logic programming systems things are more
complicated. First, these rules may not hold: two branches may be simultane-
ously active due to or-parallelism (making rule 1 difficult to enforce), or two
conjunctive goals may be simultaneously active due to and-parallelism (mak-
ing rule 2 difficult to enforce). Of course, in a parallel logic system, usually
each worker has its own set of stacks (the multiple stacks are referred to as a
cactus stack since each stack corresponds to a part of the branch of the search
tree), so it is possible to enforce the two rules in each stack to ensure that space
is reclaimed only from the top of individual stacks. If this restriction is im-
posed, then while memory management becomes easier, some parallelism may
be lost since an idle worker may not be able to pick up available work from a node
because doing so will violate this restriction. If this restriction is not imposed,
then it becomes necessary to deal with the “garbage slot” problem, namely, a
data structure that has been backtracked over is trapped in the stack below
a goal that is still in use, and the “trapped goal” problem, namely, an active
goal is trapped below another, and there is no space contiguous to this active
goal to expand it further, which results in the LIFO nature of stacks being
destroyed.
There are many possible solutions to these problems [Hermenegildo 1986b;
Pontelli et al. 1995; Shen and Hermenegildo 1994, 1996a]. The approach taken
by many parallel systems (e.g., the ACE and DASWAM and-parallel systems
and the Aurora or-parallel system) is to allow trapped goals and garbage slots
in the stacks. Space needed to expand a trapped goal further is allocated at the
top of the stack (resulting in “stack-frames”—such as choice points and goal
descriptors—corresponding to a given goal becoming noncontiguous). Garbage
slots created are marked as such, and are reclaimed when everything above
them has also turned into garbage. This technique is also employed in the
Aurora, &-Prolog, and Andorra-I systems. In Aurora the garbage slot is referred
to as a ghost node. If garbage slots are allowed, then the system will use up
more memory, but work scheduling becomes simpler and processing resources
are utilized more efficiently.
While considerable effort has been invested in the design of garbage collection
schemes for sequential Prolog implementations (e.g., Pittomvils et al. [1985],
Appleby et al. [1988], Older and Rummell [1992], and Bekkers et al. [1992]),
considerably less effort has been devoted to adapting these mechanisms
to the case of parallel logic programming systems. Garbage collection is indeed
a serious concern, since parallel logic programming systems tend to consume
more memory than sequential ones (e.g., they use additional data structures, such
as parcall frames, to manage parallel execution). For example, results reported
for the Reform Prolog system indicate that on average 15% of the execution time
is spent in garbage collection. Some early work on parallelization of the garbage
collection process (applied mostly to basic copying garbage collection methods)
can be found in the context of parallel execution of functional languages (e.g.,
Halstead [1984]). In the context of parallel logic programming, two relevant
efforts are:
—the proposal by Ali [1995], which provides a parallel version of a copying
garbage collector, refined to avoid unnecessary copying (e.g., copying the
same data twice) and to balance the load between workers during garbage
collection;
—the proposal by Bevemyr [1995], which extends the work by Ali into a gener-
ational copying garbage collector (objects are divided into generations, where
newer generations contain objects more recently created; the new generation
is garbage collected more often than the old one).
Generational garbage collection algorithms have also been proposed in the con-
text of parallel implementation of committed choice languages (on PIM archi-
tectures) [Ozawa et al. 1990; Xu et al. 1989].
9.3 Optimizations
A system that builds an and-or tree to solve a problem with nondeterminism
may look trivial to implement at first, but experience shows that it is quite
a difficult task. A naive parallel implementation may lead to a slowdown or
may incur a severe overhead compared to a corresponding sequential system.
The parallelism present in these frameworks is typically very irregular and
unpredictable; for this reason, parallel implementations of nondeterministic
languages typically rely on dynamic scheduling. Thus, most of the work for
partitioning and managing parallel tasks is performed during run-time. These
duties are absent from a sequential execution and represent parallel overhead.
Excessive parallel overhead may cause a naive parallel system to run many
times slower on one processor compared to a similar sequential system.
A large number of optimizations have been proposed in the literature to
improve the performance of individual parallel logic programming systems
(e.g., Ramkumar and Kalé [1989], Shen [1994], and Pontelli et al. [1996]).
Nevertheless, limited effort has been devoted to determining overall principles
that can be used to design optimization schemes applicable across entire
classes of systems. A proposal in this direction has been put forward by Gupta
and Pontelli [1997]. The proposal presents a number of general optimization
principles that can be employed by implementors of parallel nondeterministic
systems to keep the overhead incurred for exploiting parallelism low. These
principles have been used to design a number of optimization schemes such as
the Last Parallel Call Optimization [Pontelli et al. 1996] (used for independent
and-parallel systems) and the Last Alternative Optimization [Gupta and
Pontelli 1999b] (used for or-parallel systems).
Parallel execution of a logic programming system can be viewed as the paral-
lel traversal/construction of an and-or tree. Given the and-or tree for a program,
its sequential execution amounts to traversing the and-or tree in a predeter-
mined order. Parallel execution is realized by having different workers concur-
rently traversing different parts of the and-or tree in a way consistent with
the operational semantics of the programming language. By operational se-
mantics we mean that dataflow (e.g., variable bindings) and control-flow (e.g.,
input/output operations) dependencies are respected during parallel execution
(similar to loop parallelization of FORTRAN programs, where flow dependencies
have to be preserved). Parallelism allows overlapping of exploration of different
parts of the and-or tree. Nevertheless, as mentioned earlier, this does not always
translate to an improvement in performance. This happens mainly because of
the following reasons:
—The tree structure developed during the parallel computation needs to be
explicitly maintained, in order to allow for proper management of nondeter-
minism and backtracking; this requires the use of additional data structures
not needed in sequential execution. Allocation and management of these data
structures represent overhead during parallel computation with respect to
sequential execution.
—The tree structure of the computation needs to be repeatedly traversed in or-
der to search for multiple alternatives and/or handle possible failure of goals,
and such traversal often requires synchronization between the workers. The
tree structure may be traversed more than once because of backtracking,
and because idle workers may have to find nodes that have work after a fail-
ure takes place or a solution is reported (dynamic scheduling). This traver-
sal is much simpler in a sequential computation, where the management of
nondeterminism is reduced to a linear and fast scan of the branches in a
predetermined order.
Based on this, it is possible to identify ways of reducing these overheads.
Traversal of Tree Structure: There are various ways in which the process of
traversing the complex structure of a parallel computation can be made more
efficient:
(1) simplification of the computation’s structure: by reducing the complexity of
the structure to be traversed it should be possible to achieve improvement in
performance. This principle has been reified in the already mentioned Last
Parallel Call Optimization and the Last Alternative Optimization, used to
flatten the computation tree by collapsing contiguous nodes lying on the
same branch if some simple conditions hold;
(2) use of the knowledge about the computation (e.g., determinacy) in order
to guide the traversal of the computation tree: information collected from
the computation may suggest the possibility of avoiding traversing certain
parts of the computation tree. This has been reified in various optimizations,
including the Determinate Processor Optimization [Pontelli et al. 1996].
Data Structure Management: Since allocating data structures is generally an
expensive operation, the aim should be to reduce the number of new data struc-
tures created. This can be achieved by:
(1) reusing existing data structures whenever possible (as long as this pre-
serves the desired execution behavior). This principle has been imple-
mented, for example, in the Backtracking Families Optimization [Pontelli
et al. 1996];
(2) avoiding allocation of unnecessary structures: most of the new data struc-
tures introduced in a parallel computation serve two purposes: to support
the management of the parallel parts of the computation, and to support
the management of nondeterminism. This principle has been implemented
in various optimizations, including the Shallow Backtracking Optimization
[Carlsson 1989] and the Shallow Parallelism Optimization [Pontelli et al.
1996].
This suggests possible conditions under which one can avoid creation of addi-
tional data structures: (i) no additional data structures are required for parts
of the computation tree that are potentially parallel but are actually explored
by the same computing agent (i.e., potentially parallel but practically sequen-
tial); (ii) no additional data structures are required for parts of the computation
that will not contribute to the nondeterministic nature of the computation (e.g.,
deterministic parts of the computation).
The scheduler is also influenced by how the system manages its memory. For
instance, if the restriction of only reclaiming space from the top of a stack is
imposed and garbage slots/trapped goals are disallowed, then the scheduler has
to take this into account and at any moment schedule only those goals meeting
these criteria.
Schedulers in systems that combine more than one form of parallelism have
to figure out how much of the resources should be committed to exploiting a
particular kind of parallelism. For example, in Andorra-I and ACE systems,
that divide available workers into teams, the scheduler has to determine the
sizes of the teams, and decide when to migrate a worker from a team that
has no work left to another that does have work, and so on [Dutra 1994,
1995].
The fact that Aurora, quite a successful or-parallel system, has about five
schedulers built for it [Calderwood and Szeredi 1989; Beaumont et al. 1991;
Sindaha 1992; Butler et al. 1988], is a testimony to the importance of work-
scheduling for parallel logic programming systems. Design of efficient and flex-
ible schedulers is still a topic of research [Dutra 1994, 1996; Ueda and Montelius
1996].
9.5 Granularity
The implementation techniques mentioned before for both or- and and-
parallelism have proven sufficient for keeping the overheads of communication,
scheduling, and memory management low and obtaining significant speedups
in a wide variety of applications on shared memory multiprocessors (starting
from the early paradigmatic examples: the Sequent Balance and Symmetry se-
ries). However, current trends point towards larger multiprocessors but with
less uniform shared memory access times. Controlling in some way the gran-
ularity (execution time and space) of the tasks to be executed in parallel can
be a useful optimization in such machines, and is in any case a necessity when
parallelizing for machines with slower interconnections. The latter include, for
example, networks of workstations or distribution of work over the Internet. It
is desirable to have a large granularity of computation, so that the scheduling
overhead is a small fraction of the total work done by a worker. The general
idea is that if the gain obtained by executing a task in parallel is less than the
overheads required to support the parallel execution, then the task is better
executed sequentially.
The idea of granularity control is to replace parallel execution with sequen-
tial execution or vice versa based on knowledge (actual data, bounds, or esti-
mations) of task size and overheads. The problem is challenging because, while
the basic communication overhead parameters of a system can be determined
experimentally, the computational cost of the tasks (e.g., procedure calls) being
parallelized, as well as the amount of data that needs to be transferred before
and after a parallel call, usually depend on dynamic characteristics of the input
data. In the following example, we consider for parallel execution q (which, as-
suming it is called with X bound to a list of numbers, adds one to each element
of the list):
q([],[]).
q([I|Is],[I1|Os]):- I1 is I+1, q(Is,Os).
The computational cost of a call to q (and also the communication overhead)
is obviously proportional to the number of elements in the list. The need to
characterize the input data in this way has made the problem difficult to solve
(well) completely at compile-time.
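A source-level rendering of such a run-time test is sketched below. The sketch is ours: the predicate name and the threshold are illustrative, &/2 denotes parallel conjunction as in &-Prolog (so the parallel branch only makes sense on a system providing that operator), and length/2 computes the input size on which the cost of q depends:

:- op(950, xfy, &).   % so the clause parses outside &-Prolog, where & is predefined

% Run two calls to q/2 in parallel only when the input list is long enough for
% parallel execution to pay off; otherwise fall back to sequential execution.
q_pair(Xs1, Ys1, Xs2, Ys2) :-
    length(Xs1, N),
    (   N > 50                          % illustrative threshold
    ->  q(Xs1, Ys1) & q(Xs2, Ys2)       % large input: execute in parallel
    ;   q(Xs1, Ys1), q(Xs2, Ys2)        % small input: execute sequentially
    ).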
The Aurora and MUSE or-parallel systems keep track of granularity by
tracking the richness of nodes, that is, the amount of work—measured in terms
of number of untried alternatives in choice points—that is available in the sub-
tree rooted at a node. Workers will tend to pick work from nodes that have high
richness. The Aurora and MUSE systems also make a distinction between the
private and public parts of the tree to keep granularity high. Essentially, work
created by another worker can only be picked up from the public region. In the
private region, the worker that owns that region is responsible for all the work
generated, thereby keeping the granularity high. In the private region execu-
tion is very close to sequential execution, resulting in high efficiency. Only when
the public region runs out of work is a part of the private region of some worker
made public. In these systems, granularity control is completely performed at
run-time.
Modern systems [López-García et al. 1996; Shen et al. 1998; Tick and Zhong
1993] implement granularity control using the two-phase process proposed in
Debray et al. [1990] and López-García et al. [1996]:
(1) at compile-time a global analysis tool performs an activity typically called
cost estimation. Cost estimates are parametric formulae expressing lower
or upper bounds on the time complexity of the different (potentially) parallel
tasks, as a function of certain measures of the input data;
(2) at run-time the cost estimates are instantiated before execution of the task
and compared with predetermined thresholds; parallel execution of the task
is allowed only if the cost estimate is above the threshold.
Programs are then transformed at compile-time into semantically equivalent
counterparts which automatically control granularity at run-time based on
such functions, following the scheme
( cost_estimate(n1, ..., nk) > τ ⇒ goal1 & ... & goalm )
where the m subgoals will be allowed in a parallel execution only if the result
of the cost estimate is above the threshold τ . The parameters of cost estimate
are those goal input arguments that directly determine the time-complexity of
the parallel subgoals, as identified by the global analysis phase. In the example
above, these tools derive cost functions such as, for example, 2 * length(X) + 1
for q (i.e., the unit of cost is in this case a procedure call, where the addition
is counted for simplicity as one procedure call). If we assume that we should
parallelize when the total computation cost is larger than 100, then we can
transform the parallel call to p and q above into:
where
—n is the representation of the size of the input arguments to the clause C,
—φi(n) is a lower bound on the relative size of the input arguments to Bi,
—Br is the rightmost literal in C that is guaranteed not to fail, and
—h(n) is a lower bound on the cost of head unification and tests for the clause C.
The lower bound Cost_p for a predicate p is obtained by taking the minimum of
the lower bounds for the clauses defining p.
For the more general case of estimation of the lower bound for the computa-
tion of all the solutions, it becomes necessary to estimate a lower bound on the
number of solutions that each literal in the clause will return. In Debray et al.
[1997] the problem is reduced to the computation of the chromatic polynomial
of a graph.
In King et al. [1997] bottom-up abstract interpretation techniques are used
to evaluate lower-bound inequalities (i.e., inequalities of the type d_min ≤ t_min(l),
where d_min represents the threshold to allow spawning of parallel computations,
while t_min(l) represents a lower bound on the computation time for input of
size l) for large classes of programs.
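For instance (the numbers are ours and purely illustrative), with a lower bound t_min(l) = 2l + 1 and a spawning threshold d_min = 100, the inequality 100 ≤ 2l + 1 holds only for l ≥ 50, so parallel execution would be allowed only for inputs of size at least 50.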
Metrics different from task complexity have been proposed to support granularity
control. A related effort is the one by Shen et al. [1998], which makes use
of the amount of work performed between major sources of overhead (called the
distance metric) to measure granularity.
34 VisAndOr, however, can depict Andorra-I executions: that is, or-parallelism and deterministic
dependent and-parallelism.
the trace format adopted by various other systems [Vaupel et al. 1997; Kusalik
and Prestwich 1996; Fonseca et al. 1998]. Must and VisAndOr have been inte-
grated in the ViMust system; a timeline moves on the VisAndOr representation
synchronized with the development of the computation tree in Must [Carro et al.
1993].
Other visualization tools have also been developed for dependent and-
parallelism in the context of committed choice languages, for example, those
for visualizing KL1 and GHC execution [Tick 1992; Aikawa et al. 1992].
Tools have also been developed for visualizing combined and/or-parallelism,
as well as to provide a better balance between dynamic and static representa-
tions, for example, VACE [Vaupel et al. 1997], based on the notion of C-trees
[Gupta et al. 1994], and VisAll [Fonseca et al. 1998]. Figure 33 shows a snapshot
of VACE.
A final note concerns the VisAll system [Fonseca et al. 1998]. VisAll provides
a universal visualization tool that subsumes the features offered by most of
(1) Hermenegildo and Tick [1989; Tick 1987] proposed various studies estimating
the performance of and-parallel systems on shared memory machines,
taking into account different cache coherence algorithms, cache sizes, bus
widths, and so on. These early studies made it possible to predict, for example,
that &-Prolog would later produce speedups over state-of-the-art sequential
systems even on quite fine-grained computations, on shared-memory machines
that were not commercially available at the time.
(2) Montelius and Haridi [1997] have presented a detailed performance analysis
of the Penny system, mostly using the SIMICS Sparc processor
simulator;
(3) Gupta and Pontelli [1999c] have used simulation studies (based on the use
of the SIMICS simulator) to validate the claim that stack splitting improves
the locality of an or-parallel computation based on stack copying;
(4) Santos Costa et al. [1997] have also analyzed the performance of paral-
lel logic programming systems (specifically Aurora and Andorra-I) using
processor simulators (specifically a simulator of the MIPS processor). Their
extensive work has been aimed at determining the behavior of parallel logic
programming systems on parallel architectures (with a particular focus on
highly scalable architectures, for example, distributed shared memory ma-
chines). In Santos Costa et al. [1997], the simulation framework adopted is
presented, along with the development of a methodology for understanding
cache performance. The results obtained have been used to provide concrete
improvements to the implementation of the Andorra-I system [Santos Costa
et al. 2000].
(5) The impact of cache coherence protocols on the performance of parallel
Prolog systems has been studied in more detail in Dutra et al. [2000], Silva
et al. [1999], and Calegario and Dutra [1999].
These works tend to agree on the importance of considering architectural parameters
in the design of parallel logic programming systems. For example, the
results achieved by Santos Costa et al. [1997] for the Andorra-I system indicate that:
—or-parallel Prolog systems provide a very good locality of computation, and
thus do not seem to require very large cache sizes;
—small cache blocks appear to provide better behavior, especially in the presence
of or-parallelism: the experimental work by Dutra et al. [2000] indicates a
high risk of false sharing in the presence of blocks larger than 64 bytes;
—Dutra et al. [2000] compare the effect of Write Invalidate versus Write
Update as cache coherence protocols. The study confirms the early results
of Hermenegildo and Tick [1989] and Tick [1987] and extends them, underlining
the superiority of a particular version of the Write Update algorithm
(a hybrid method where each node independently decides, upon receiving an
update request, whether to update the local copy of the data or simply invalidate
it).
Similar results have been reported in Montelius and Haridi [1997], which un-
derlines the vital importance of good cache behavior and avoidance of false
sharing for exploitation of fine-grain parallelism in Penny.
This body of experimental work indicates that the existing technology for
parallel execution of logic programs is effective when applied to large and com-
plex real-life Prolog applications. A further push for the application of parallelism
comes from the realm of constraint logic programming. Preliminary work on the
Chip and ECLiPSe systems has demonstrated that the techniques described
in this article can be easily applied to parallelization of the relevant phases
of constraint handling. Considering that most constraint logic programming
applications are extremely computation-intensive, the advantages of parallel
execution are evident.
approach to achieving this goal, inspired by the duality [Pontelli and Gupta
1995b] and orthogonality [Correia et al. 1997] principles and by views such as
those argued in Hermenegildo and CLIP Group [1994], would be to configure
an ideal parallel logic programming system as a true “plug-and-play” system,
where a basic Prolog kernel engine can be incrementally extended with differ-
ent modules implementing different parallelization and scheduling strategies,
and the like (as well as other functionality not related to parallelism, of course)
depending on the needs of the user. We hope that with enough research effort
this ideal can be achieved.
From the point of view of compile-time technology, the result of the work
outlined in previous sections is that quite robust parallelizing compilers ex-
ist for various generalizations of independent and dependent and-parallelism,
which automatically exploit parallelism in complex applications. The accuracy,
speed, and robustness of these compilers have also been instrumental in demon-
strating that abstract interpretation provides a very adequate framework for
developing provably correct, powerful, and efficient global analyzers and, con-
sequently, parallelizers. It can be argued that, when compared with work done
in other fields, particularly strong progress has been made in the context of
logic programming in developing techniques for interprocedural analysis and
parallelization of programs with dynamic data structures and pointers, in par-
allelization using conditional dependency graphs (combining compile-time op-
timization with run-time independence tests), and in domains for the abstrac-
tion of the advanced notions of independence that are needed in the presence of
speculative computations. More recently, independence notions, analysis tech-
niques, and practical tools have also been developed for the parallelization of
constraint logic programs and logic programs with dynamic execution reorder-
ing (“delays”) [García de la Banda et al. 2000].
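To ground the point about conditional dependency graphs, the following sketch shows the kind of annotated code such parallelizers produce when independence cannot be fully established at compile time. The clause and the predicates p/1 and q/1 are hypothetical; the &/2 parallel conjunction and the indep/2 run-time test follow the style of &-Prolog-like systems:

    % Original clause:  f(X, Y) :- p(X), q(Y).
    % If analysis cannot prove X and Y independent at compile time, the
    % parallelizer emits a conditional graph expression with a run-time test.
    f(X, Y) :-
        ( indep(X, Y) ->         % run-time independence test
            p(X) & q(Y)          % independent and-parallel conjunction
        ;
            p(X), q(Y)           % fall back to sequential execution
        ).

When the analysis can prove independence at compile time (e.g., by establishing groundness of the shared arguments), the test is removed altogether and the unconditional parallel version is generated.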
The current evolutionary trend in the design of parallel computer systems
is towards building heterogeneous architectures that consist of a large num-
ber of relatively small-sized shared memory machines connected through fast
interconnection networks. Taking full advantage of the computational power
of such architectures is known to be a very difficult problem [Bader and JaJa
1997]. Parallel logic programming systems can potentially constitute a viable
solution to this problem. However, considerable research in the design and im-
plementation of parallel logic programming systems on distributed memory
multiprocessors is still needed before competitive speedups can be obtained
routinely. Distributed implementation of parallel logic programming systems
is another direction where we feel future research effort should be invested.
There are many challenges in the efficient implementation of distributed uni-
fication, in maintaining coordinated execution state and program data eco-
nomically in a noncentralized way, and in the development of adequate
compilation technology (e.g., for granularity control). Fortunately, this is an
area where logic programming has already produced results clearly ahead of
those in other areas. As we have overviewed, interesting techniques have been
proposed for the effective management of computations in a distributed setting,
for intelligent scheduling of different forms of parallelism, as well as for static
inference of task cost functions and their application to static and dynamic granularity control.
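As a simple illustration of how such cost information is used, the following hypothetical sketch applies a granularity test before spawning parallel work; the threshold of 200 is an assumed tuning parameter, and &/2 again denotes the parallel conjunction of &-Prolog-like systems:

    % Granularity-controlled quicksort: the recursive calls are run in
    % parallel only when a cheap size measure suggests that the work will
    % outweigh the cost of creating and scheduling the parallel tasks.
    gqsort([], []).
    gqsort([Pivot|Xs], Sorted) :-
        partition(Xs, Pivot, Small, Large),
        length(Xs, N),
        ( N > 200 ->                                % assumed threshold
            gqsort(Small, SS) & gqsort(Large, SL)   % parallel conjunction
        ;
            gqsort(Small, SS), gqsort(Large, SL)    % sequential version
        ),
        append(SS, [Pivot|SL], Sorted).

    partition([], _, [], []).
    partition([X|Xs], P, [X|S], L) :- X =< P, !, partition(Xs, P, S, L).
    partition([X|Xs], P, S, [X|L]) :- partition(Xs, P, S, L).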
ACKNOWLEDGMENTS
Thanks are due to Bharat Jayaraman for helping with an earlier article on
which this article is based. Thanks to Manuel Carro and Vitor Santos Costa,
who read drafts of this survey. Our deepest thanks to the anonymous referees
whose comments tremendously improved the article.
REFERENCES
AIKAWA, S., KAMIKO, M., KUBO, H., MATSUZAWA, F., AND CHIKAYAMA, T. 1992. ParaGraph: A Graphical
Tuning Tool for Multiprocessor Systems. In Proceedings of the Conference on Fifth Generation
Computer Systems, ICOT Staff, Ed. IOS Press, Tokyo, Japan, 286–293.
AÏT-KACI, H. 1991. Warren’s Abstract Machine: A Tutorial Reconstruction. MIT Press, Cambridge,
MA. www.isg.sfu.ca/~hak/documents/wam.html.
AÏT-KACI, H. 1993. An Introduction to LIFE: Programming with Logic, Inheritance, Functions,
and Equations. In International Logic Programming Symposium, D. Miller, Ed. MIT Press,
Cambridge, MA, 52–68.
BENJUMEA, V. AND TROYA, J. 1993. An OR Parallel Prolog Model for Distributed Memory Systems.
In International Symposium on Programming Languages Implementations and Logic Program-
ming, M. Bruynooghe and J. Penjam, Eds. Springer-Verlag, Heidelberg, 291–301.
BEVEMYR, J. 1995. A Generational Parallel Copying Garbage Collection for Shared Memory
Prolog. In Workshop on Parallel Logic Programming Systems. University of Porto, Portland, OR.
BEVEMYR, J., LINDGREN, T., AND MILLROTH, H. 1993. Reform Prolog: The Language and Its Imple-
mentation. In Proceedings of the International Conference on Logic Programming, D. S. Warren,
Ed. MIT Press, Cambridge, MA, 283–298.
BISWAS, P., SU, S., AND YUN, D. 1988. A Scalable Abstract Machine Model to Support Limited-OR
Restricted AND parallelism in Logic Programs. In Proceedings of the International Conference
and Symposium on Logic Programming, R. Kowalski and K. Bowen, Eds. MIT Press, Cambridge,
MA, 1160–1179.
BORGWARDT, P. 1984. Parallel Prolog Using Stack Segments on Shared Memory Multiprocessors.
In International Symposium on Logic Programming. Atlantic City, IEEE Computer Society, Silver
Spring, MD, 2–12.
BRAND, P. 1988. Wavefront Scheduling. Tech. rep., SICS, Gigalips Project.
BRIAT, J., FAVRE, M., GEYER, C., AND CHASSIN DE KERGOMMEAUX, J. 1992. OPERA: Or-Parallel Prolog
System on Supernode. In Implementations of Distributed Prolog, P. Kacsuk and M. Wise, Eds. J.
Wiley & Sons, New York, 45–64.
BRUYNOOGHE, M. 1991. A Framework for the Abstract Interpretation of Logic Programs. Journal
of Logic Programming 10, 91–124.
BUENO, F. AND HERMENEGILDO, M. 1992. An Automatic Translation Scheme from Prolog to the
Andorra Kernel Language. In Proceedings of the International Conference on Fifth Generation
Computer Systems, ICOT Staff, Ed. IOS Press, Tokyo, Japan, 759–769.
BUENO, F., CABEZA, D., CARRO, M., HERMENEGILDO, M., LÓPEZ-GARCÍA, P., AND PUEBLA, G. 1997. The
Ciao Prolog System. Reference Manual. The Ciao System Documentation Series–TR CLIP3/97.1,
School of Computer Science, Technical University of Madrid (UPM). August. System and on-line
version of the manual available at http://clip.dia.fi.upm.es/Software/Ciao/.
BUENO, F., CABEZA, D., HERMENEGILDO, M., AND PUEBLA, G. 1996. Global Analysis of Standard Prolog
Programs. In European Symposium on Programming. Number 1058 in LNCS. Springer-Verlag,
Sweden, 108–124.
BUENO, F., DEBRAY, S., GARCÍA DE LA BANDA, M., AND HERMENEGILDO, M. 1995. Transformation-Based
Implementation and Optimization of Programs Exploiting the Basic Andorra Model. Technical
Report CLIP11/95.0, Facultad de Informática, UPM. May.
BUENO, F., GARCÍA DE LA BANDA, M., AND HERMENEGILDO, M. 1994a. A Comparative Study of Methods
for Automatic Compile-Time Parallelization of Logic Programs. In Parallel Symbolic Computa-
tion. World Scientific Publishing Company, 63–73.
BUENO, F., GARCÍA DE LA BANDA, M., AND HERMENEGILDO, M. 1999. Effectiveness of Abstract Inter-
pretation in Automatic Parallelization: A Case Study in Logic Programming. ACM Transactions
on Programming Languages and Systems 21, 2, 189–239.
BUENO, F., HERMENEGILDO, M., MONTANARI, U., AND ROSSI, F. 1994b. From Eventual to Atomic and
Locally Atomic CC Programs: A Concurrent Semantics. In Fourth International Conference on
Algebraic and Logic Programming. Number 850 in LNCS. Springer-Verlag, 114–132.
BUENO, F., HERMENEGILDO, M., MONTANARI, U., AND ROSSI, F. 1998. Partial Order and Contextual Net
Semantics for Atomic and Locally Atomic CC Programs. Science of Computer Programming 30,
51–82.
BUTLER, R., DISZ, T., LUSK, E., OLSON, R., OVERBEEK, R., AND STEVENS, R. 1988. Scheduling
Or-Parallelism: An Argonne Perspective. In Proceedings of the International Conference and
Symposium on Logic Programming, R. Kowalski and K. Bowen, Eds. MIT Press, Cambridge,
MA, 1565–1577.
BUTLER, R., LUSK, E., MCCUNE, W., AND OVERBEEK, R. 1986. Parallel Logic Programming for Numer-
ical Applications. In Proceedings of the Third International Conference on Logic Programming,
E. Shapiro, Ed. Springer-Verlag, Heidelberg, 357–388.
CABEZA, D. AND HERMENEGILDO, M. 1994. Extracting Non-Strict Independent And-Parallelism
Using Sharing and Freeness Information. In International Static Analysis Symposium, B. Le
Charlier, Ed. LNCS. Springer-Verlag, Heidelberg, 297–313.
CODISH, M., MULKERS, A., BRUYNOOGHE, M., GARCÍA DE LA BANDA, M., AND HERMENEGILDO, M. 1995.
Improving Abstract Interpretations by Combining Domains. ACM Transactions on Programming
Languages and Systems 17, 1, 28–44.
CODOGNET, C. AND CODOGNET, P. 1990. Non-Deterministic Stream and-Parallelism Based on In-
telligent Backtracking. In Proceedings of the International Conference on Logic Programming,
G. Levi and M. Martelli, Eds. MIT Press, Cambridge, MA, 63–79.
CODOGNET, C., CODOGNET, P., AND FILÉ, G. 1988. Yet Another Intelligent Backtracking Method. In
Proceedings of the International Conference and Symposium on Logic Programming, R. Kowalski
and K. Bowen, Eds. MIT Press, Cambridge, MA, 447–465.
CONERY, J. 1987a. Binding Environments for Parallel Logic Programs in Nonshared Memory
Multiprocessors. In International Symposium on Logic Programming. IEEE Computer Society,
Los Alamitos, CA, 457–467.
CONERY, J. 1987b. Parallel Interpretation of Logic Programs. Kluwer Academic, Norwell, MA.
CONERY, J. 1992. The OPAL Machine. In Implementations of Distributed Prolog, P. Kacsuk and
D. S. Wise, Eds. J. Wiley & Sons, New York, 159–185.
CONERY, J. AND KIBLER, D. 1981. Parallel Interpretation of Logic Programs. In Proceedings of
the ACM Conference on Functional Programming Languages and Computer Architecture (1981).
ACM Press, New York, 163–170.
CONERY, J. AND KIBLER, D. 1983. And Parallelism in Logic Programs. In Proceedings of the Inter-
national Joint Conference on AI, A. Bundy, Ed. William Kaufmann, Los Altos, CA, 539–543.
CORREIA, E., SILVA, F., AND SANTOS COSTA, V. 1997. The SBA: Exploiting Orthogonality in
And-or Parallel System. In Proceedings of the International Symposium on Logic Programming,
J. Małuszyński, Ed. MIT Press, Cambridge, MA, 117–131.
COUSOT, P. AND COUSOT, R. 1977. Abstract Interpretation: A Unified Lattice Model for Static
Analysis of Programs by Construction or Approximation of Fixpoints. In Conference Records
of the ACM Symposium on Principles of Programming Languages. ACM Press, New York,
238–252.
COUSOT, P. AND COUSOT, R. 1992. Abstract Interpretation and Applications to Logic Programs.
Journal of Logic Programming 13, 2–3, 103–179.
COX, P. 1984. Finding backtrack points for intelligent backtracking. In Implementations of
Prolog, J. Campbell, Ed. Ellis Horwood, Hemel Hempstead.
CRABTREE, B. 1991. A Clustering System to Network Control. Tech. rep., British Telecom.
CRAMMOND, J. 1985. A Comparative Study of Unification Algorithms for Or-Parallel Execution of
Logic Languages. IEEE Transactions on Computers 34, 10, 911–971.
CRAMMOND, J. 1992. The Abstract Machine and Implementation of Parallel Parlog. New Genera-
tion Computing 10, 4, 385–422.
DE BOSSCHERE, K. AND TARAU, P. 1996. Blackboard-Based Extensions in Prolog. Software Practice
& Experience 26, 1, 46–69.
DEBRAY, S. AND JAIN, M. 1994. A Simple Program Transformation for Parallelism. In Proceedings
of the 1994 Symposium on Logic Programming. MIT Press.
DEBRAY, S. AND LIN, N. 1993. Cost Analysis of Logic Programs. ACM Transactions on Programming
Languages and Systems 15, 5, 826–875.
DEBRAY, S. AND WARREN, D. S. 1989. Functional Computations in Logic Programs. ACM Transac-
tions on Programming Languages and Systems 11, 3, 451–481.
DEBRAY, S., LÓPEZ-GARCÍA, P., AND HERMENEGILDO, M. 1997. Non-Failure Analysis for Logic Pro-
grams. In International Conference on Logic Programming, L. Naish, Ed. MIT Press, Cambridge,
MA, 48–62.
DEBRAY, S., LÓPEZ-GARCÍA, P., HERMENEGILDO, M., AND LIN, N.-W. 1994. Estimating the Computa-
tional Cost of Logic Programs. In Static Analysis Symposium, SAS’94. Number 864 in LNCS.
Springer-Verlag, Namur, Belgium, 255–265.
DEBRAY, S., LÓPEZ-GARCÍA, P., HERMENEGILDO, M., AND LIN, N.-W. 1997. Lower Bound Cost Estima-
tion for Logic Programs. In International Logic Programming Symposium, J. Maluszyński, Ed.
MIT Press, Cambridge, MA, 291–306.
DEBRAY, S. K., LIN, N.-W., AND HERMENEGILDO, M. 1990. Task Granularity Analysis in Logic
Programs. In Proceedings of the 1990 ACM Conference on Programming Language Design and
Implementation. ACM Press, New York, 174–188.
GARCÍA DE LA BANDA, M., HERMENEGILDO, M., AND MARRIOTT, K. 1996b. Independence in Dynamically
Scheduled Logic Languages. In Proceedings of the International Conference on Algebraic and
Logic Programming, M. Hanus and M. Rodriguez-Artalejo, Eds. Springer-Verlag, Heidelberg,
47–61.
GARCÍA DE LA BANDA, M., HERMENEGILDO, M., AND MARRIOTT, K. 2000. Independence in CLP
Languages. ACM Transactions on Programming Languages and Systems 22, 2 (March),
269–339.
GIACOBAZZI, R. AND RICCI, L. 1990. Pipeline Optimizations in And-parallelism by Abstract Inter-
pretation. In Proceedings of International Conference on Logic Programming, D. H. D. Warren
and P. Szeredi, Eds. MIT Press, Cambridge, MA, 291–305.
GIANNOTTI, F. AND HERMENEGILDO, M. 1991. A Technique for Recursive Invariance Detection and
Selective Program Specialization. In Proceedings 3rd International Symposium on Program-
ming Language Implementation and Logic Programming. Number 528 in LNCS. Springer-Verlag,
323–335.
GREGORY, S. AND YANG, R. 1992. Parallel Constraint Solving in Andorra-I. In Proceedings of the
International Conference on Fifth Generation Computer Systems, ICOT Staff, Ed. IOS Press,
Tokyo, Japan, 843–850.
GUO, H.-F. 2000. High Performance Logic Programming. Ph.D. thesis, New Mexico State
University.
GUO, H.-F. AND GUPTA, G. 2000. A Simple Scheme for Implementing Tabled LP Systems Based on
Dynamic Reordering of Alternatives. In Proceedings of the Workshop on Tabling in Parsing and
Deduction, D. S. Warren, Ed.
GUPTA, G. 1994. Multiprocessor Execution of Logic Programs. Kluwer Academic Press, Dordrecht.
GUPTA, G. AND JAYARAMAN, B. 1993a. Analysis of Or-Parallel Execution Models. ACM Transactions
on Programming Languages and Systems 15, 4, 659–680.
GUPTA, G. AND JAYARAMAN, B. 1993b. And-Or Parallelism on Shared Memory Multiprocessors.
Journal of Logic Programming 17, 1, 59–89.
GUPTA, G. AND PONTELLI, E. 1997. Optimization Schemas for Parallel Implementation of
Nondeterministic Languages and Systems. In International Parallel Processing Symposium.
IEEE Computer Society, Los Alamitos, CA.
GUPTA, G. AND PONTELLI, E. 1999a. Extended Dynamic Dependent And-Parallelism in ACE. Jour-
nal of Functional and Logic Programming 99, Special Issue 1.
GUPTA, G. AND PONTELLI, E. 1999b. Last Alternative Optimization for Or-Parallel Logic Program-
ming Systems. In Parallelism and Implementation Technology for Constraint Logic Program-
ming, I. Dutra et al., Ed. Nova Science, Commack, NY, 107–132.
GUPTA, G. AND PONTELLI, E. 1999c. Stack-Splitting: A Simple Technique for Implementing
Or-Parallelism and And-Parallelism on Distributed Machines. In International Conference on
Logic Programming, D. De Schreye, Ed. MIT Press, Cambridge, MA, 290–304.
GUPTA, G. AND SANTOS COSTA, V. 1996. Cuts and Side-Effects in And/Or Parallel Prolog. Journal
of Logic Programming 27, 1, 45–71.
GUPTA, G., HERMENEGILDO, M., PONTELLI, E., AND SANTOS COSTA, V. 1994. ACE: And/Or-Parallel
Copying-Based Execution of Logic Programs. In Proceedings of the International Conference on
Logic Programming, P. van Hentenryck, Ed. MIT Press, Cambridge, MA, 93–109.
GUPTA, G., HERMENEGILDO, M., AND SANTOS COSTA, V. 1992. Generalized Stack Copying for
And-Or Parallel Implementations. In JICSLP’92 Workshop on Parallel Implementations of Logic
Programming Systems.
GUPTA, G., HERMENEGILDO, M., AND SANTOS COSTA, V. 1993. And-Or Parallel Prolog: A Recomputa-
tion Based Approach. New Generation Computing 11, 3–4, 297–322.
GUPTA, G. AND WARREN, D. H. D. 1992. An Interpreter for the Extended Andorra Model. Internal
Report 92-CS-24, New Mexico State University, Department of Computer Science.
GUPTA, G., SANTOS COSTA, V., AND PONTELLI, E. 1994b. Shared Paged Binding Arrays: A Universal
Data-Structure for Parallel Logic Programming. Proceedings of the NSF/ICOT Workshop on
Parallel Logic Programming and its Environments, CIS-94-04, University of Oregon. Mar.
GUPTA, G., SANTOS COSTA, V., YANG, R., AND HERMENEGILDO, M. 1991. IDIOM: A Model Integrating
Dependent-, Independent-, and Or-parallelism. In International Logic Programming Symposium,
V. Saraswat and K. Ueda, Eds. MIT Press, Cambridge, MA, 152–166.
HERMENEGILDO, M. AND GREENE, K. 1991. The &-Prolog System: Exploiting Independent And-
Parallelism. New Generation Computing 9, 3–4, 233–257.
HERMENEGILDO, M. AND LÓPEZ-GARCÍA, P. 1995. Efficient Term Size Computation for Granularity
Control. In Proceedings of the International Conference on Logic Programming, L. Sterling, Ed.
MIT Press, Cambridge, MA, 647–661.
HERMENEGILDO, M. AND NASR, R. I. 1986. Efficient Management of Backtracking in AND-
Parallelism. In Third International Conference on Logic Programming, E. Shapiro, Ed. Number
225 in Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 40–54.
HERMENEGILDO, M., PUEBLA, G., MARRIOTT, K., AND STUCKEY, P. 2000. Incremental Analysis of
Constraint Logic Programs. ACM Transactions on Programming Languages and Systems 22, 2
(March), 187–223.
HERMENEGILDO, M. AND ROSSI, F. 1995. Strict and Non-Strict Independent And-Parallelism
in Logic Programs: Correctness, Efficiency, and Compile-Time Conditions. Journal of Logic
Programming 22, 1, 1–45.
HERMENEGILDO, M. AND TICK, E. 1989. Memory Performance of AND-Parallel Prolog on Shared-
Memory Architectures. New Generation Computing 7, 1 (October), 37–58.
HERMENEGILDO, M. AND WARREN, R. 1987. Designing a High-Performance Parallel Logic Program-
ming System. Computer Architecture News, Special Issue on Parallel Symbolic Programming 15, 1
(March), 43–53.
HERMENEGILDO, M., WARREN, R., AND DEBRAY, S. 1992. Global Flow Analysis as a Practical Compi-
lation Tool. Journal of Logic Programming 13, 4, 349–367.
HEROLD, A. 1995. The Handbook of Parallel Constraint Logic Programming Applications. Tech.
Rep., ECRC.
HERRARTE, V. AND LUSK, E. 1991. Studying Parallel Program Behaviours with Upshot. Tech. Rep.
ANL-91/15, Argonne National Labs.
HICKEY, T. AND MUDAMBI, S. 1989. Global Compilation of Prolog. Journal of Logic Program-
ming 7, 3, 193–230.
HIRATA, K., YAMAMOTO, R., IMAI, A., KAWAI, H., HIRANO, K., TAKAGI, T., TAKI, K., NAKASE, A., AND
ROKUSAWA, K. 1992. Parallel and Distributed Implementation of Logic Programming Language
KL1. In Proceedings of the International Conference on Fifth Generation Computer Systems, ICOT
Staff, Ed. Ohmsha Ltd., Tokyo, Japan, 436–459.
IQSoft Inc. 1992. CUBIQ - Development and Application of Logic Programming Tools for Knowl-
edge Based Systems. IQSoft Inc. www.iqsoft.hu/projects/cubiq/cubiq.html.
JACOBS, D. AND LANGEN, A. 1992. Static Analysis of Logic Programs for Independent
And-Parallelism. Journal of Logic Programming 13, 1–4, 291–314.
JANAKIRAM, V., AGARWAL, D., AND MALHOTRA, R. 1988. A Randomized Parallel Backtracking
Algorithm. IEEE Transactions on Computers 37, 12, 1665–1676.
JANSON, S. AND MONTELIUS, J. 1991. A Sequential Implementation of AKL. In Proceedings of
ILPS’91 Workshop on Parallel Execution of Logic Programs.
KACSUK, P. 1990. Execution Models of Prolog for Parallel Computers. MIT Press, Cambridge, MA.
KACSUK, P. AND WISE, M. 1992. Implementation of Distributed Prolog. J. Wiley & Sons., New York.
KALÉ, L. 1985. Parallel Architectures for Problem Solving. Ph.D. thesis, SUNY Stony Brook,
Dept. Computer Science.
KALÉ, L. 1991. The REDUCE OR Process Model for Parallel Execution of Logic Programming.
Journal of Logic Programming 11, 1, 55–84.
KALÉ, L., RAMKUMAR, B., AND SHU, W. 1988a. A Memory Organization Independent Binding
Environment for AND and OR Parallel Execution of Logic Programs. In Proceedings of the Fifth
International Conference and Symposium on Logic Programs, R. Kowalski and K. Bowen, Eds.
MIT Press, Cambridge, MA, 1223–1240.
KALÉ, L. V., PADUA, D. A., AND SEHR, D. C. 1988b. Or-Parallel Execution of Prolog with Side Effects.
Journal of Supercomputing 2, 2, 209–223.
KARLSSON, R. 1992. A High Performance Or-Parallel Prolog System. Ph.D. thesis, Royal Institute
of Technology, Stockholm.
KASIF, S., KOHLI, M., AND MINKER, J. 1983. PRISM: A Parallel Inference System for Problem
Solving. In Proceedings of the 8th International Joint Conference on Artificial Intelligence (1983),
A. Bundy, Ed. Morgan Kaufman, San Francisco, CA, 544–546.
KING, A., SHEN, K., AND BENOY, F. 1997. Lower-Bound Time-Complexity Analysis of Logic
Programs. In Proceedings of the International Logic Programming Symposium, J. Maluszyński,
Ed. MIT Press, Cambridge, MA, 261–276.
KLUŹNIAK, F. 1990. Developing Applications for Aurora Or-Parallel System. Tech. Rep. TR-90-17,
Dept. of Computer Science, University of Bristol.
KOWALSKI, R. 1979. Logic for Problem Solving. Elsevier North-Holland, Amsterdam.
KUSALIK, A. AND PRESTWICH, S. 1996. Visualizing Parallel Logic Program Execution for Perfor-
mance Tuning. In Proceedings of Joint International Conference and Symposium on Logic Pro-
gramming, M. Maher, Ed. MIT Press, Cambridge, MA, 498–512.
LAMMA, E., MELLO, P., STEFANELLI, C., AND HENTENRYCK, P. V. 1997. Improving Distributed Unifica-
tion Through Type Analysis. In Proceedings of Euro-Par 1997. LNCS, Vol. 1300. Springer-Verlag,
1181–1190.
LE HUITOUZE, S. 1990. A new data structure for implementing extensions to Prolog. In Sympo-
sium on Programming Languages Implementation and Logic Programming, P. Deransart and
J. Małuszyński, Eds. Springer-Verlag, Heidelberg, 136–150.
LIN, Y. J. 1988. A Parallel Implementation of Logic Programs. Ph.D. thesis, Dept. of Computer
Science, University of Texas at Austin, Austin, TX.
LIN, Y. J. AND KUMAR, V. 1988. AND-Parallel Execution of Logic Programs on a Shared Mem-
ory Multiprocessor: A Summary of Results. In Fifth International Conference and Symposium
on Logic Programming, R. Kowalski and K. Bowen, Eds. MIT Press, Cambridge, MA, 1123–
1141.
LINDGREN, T. 1993. The Compilation and Execution of Recursion Parallel Logic Programs for
Shared Memory Multiprocessors. Ph.D. thesis, Uppsala University.
LINDGREN, T., BEVEMYR, J., AND MILLROTH, H. 1995. Compiler Optimizations in Reform Prolog:
Experiments on the KSR-1 Multiprocessor. In Proceedings of EuroPar, S. Haridi and P. Magnusson,
Eds. Springer-Verlag, Heidelberg, 553–564.
LINDSTROM, G. 1984. Or-Parallelism on Applicative Architectures. In International Logic Pro-
gramming Conference, S. Tarnlund, Ed. Uppsala University, Uppsala, 159–170.
LLOYD, J. 1987. Foundations of Logic Programming. Springer-Verlag, Heidelberg.
LOPES, R. AND SANTOS COSTA, V. 1999. The BEAM: Towards a First EAM Implementation. In
Parallelism and Implementation Technology for Constraint Logic Programming, I. Dutra et al.,
Ed. Nova Science, Commack, NY, 87–106.
LÓPEZ-GARCÍA, P., HERMENEGILDO, M., AND DEBRAY, S. 1996. A Methodology for Granularity Based
Control of Parallelism in Logic Programs. Journal of Symbolic Computation, Special Issue on
Parallel Symbolic Computation 22, 715–734.
LUSK, E., BUTLER, R., DISZ, T., OLSON, R., STEVENS, R., WARREN, D. H. D., CALDERWOOD, A., SZEREDI,
P., BRAND, P., CARLSSON, M., CIEPIELEWSKI, A., HAUSMAN, B., AND HARIDI, S. 1990. The Aurora
Or-Parallel Prolog System. New Generation Computing 7, 2/3, 243–271.
LUSK, E., MUDAMBI, S., OVERBEEK, R., AND SZEREDI, P. 1993. Applications of the Aurora Parallel
Prolog System to Computational Molecular Biology. In Proceedings of the International Logic
Programming Symposium, D. Miller, Ed. MIT Press, Cambridge, MA, 353–369.
MASUZAWA, H., KUMON, K., ITASHIKI, A., SATOH, K., AND SOHMA, Y. 1986. KABU-WAKE: A New
Parallel Inference Method and Its Evaluation. In Proceedings of the Fall Joint Computer Confer-
ence. IEEE Computer Society, Los Alamitos, CA, 955–962.
MILLROTH, H. 1990. Reforming Compilation of Logic Programs. Ph.D. thesis, Uppsala University.
MONTELIUS, J. 1997. Exploiting Fine-Grain Parallelism in Concurrent Constraint Languages.
Ph.D. thesis, Uppsala University.
MONTELIUS, J. AND ALI, K. 1996. A Parallel Implementation of AKL. New Generation Comput-
ing 14, 1, 31–52.
MONTELIUS, J. AND HARIDI, S. 1997. An Evaluation of Penny: A System for Fine Grain Implicit
Parallelism. In International Symposium on Parallel Symbolic Computation. ACM Press, New
York, 46–57.
MOOLENAAR, R. AND DEMOEN, B. 1993. A Parallel Implementation for AKL. In Proceedings
of the Conference on Programming Languages Implementation and Logic Programming,
M. Bruynooghe and J. Penjam, Eds. Number 714 in LNCS. Springer-Verlag, Heidelberg,
246–261.
POLLARD, G. H. 1981. Parallel Execution of Horn Clause Programs. Ph.D. thesis, Imperial College,
London. Dept. of Computing.
PONTELLI, E. 1997. High-Performance Parallel Logic Programming. Ph.D. thesis, New Mexico
State University.
PONTELLI, E. 2000. Concurrent Web Programming in CLP(WEB). In 23rd Hawaiian International
Conference of Computers and Systems Science. IEEE Computer Society, Los Alamitos, CA.
PONTELLI, E. AND EL-KHATIB, O. 2001. Construction and Optimization of a Parallel Engine for
Answer Set Programming. In Practical Aspects of Declarative Languages, I. V. Ramakrishnan,
Ed. LNCS, Vol. 1990. Springer-Verlag, Heidelberg, 288–303.
PONTELLI, E. AND GUPTA, G. 1995a. Data And-Parallel Logic Programming in &ACE. In Proceed-
ings of the Symposium on Parallel and Distributed Processing. IEEE Computer Society, Los
Alamitos, CA, 424–431.
PONTELLI, E. AND GUPTA, G. 1995b. On the Duality Between And-Parallelism and Or-Parallelism.
In Proceedings of EuroPar, S. Haridi and P. Magnusson, Eds. Springer-Verlag, Heidelberg, 43–54.
PONTELLI, E. AND GUPTA, G. 1997a. Implementation Mechanisms for Dependent And-Parallelism.
In Proceedings of the International Conference on Logic Programming, L. Naish, Ed. MIT Press,
Cambridge, MA, 123–137.
PONTELLI, E. AND GUPTA, G. 1997b. Parallel Symbolic Computation with ACE. Annals of AI and
Mathematics 21, 2–4, 359–395.
PONTELLI, E. AND GUPTA, G. 1998. Efficient Backtracking in And-Parallel Implementations of Non-
Deterministic Languages. In Proceedings of the International Conference on Parallel Processing,
T. Lai, Ed. IEEE Computer Society, Los Alamitos, CA, 338–345.
PONTELLI, E., GUPTA, G., AND HERMENEGILDO, M. 1995. &ACE: A High-Performance Parallel Prolog
System. In Proceedings of the International Parallel Processing Symposium. IEEE Computer
Society, Los Alamitos, CA, 564–571.
PONTELLI, E., GUPTA, G., PULVIRENTI, F., AND FERRO, A. 1997a. Automatic Compile-Time Paral-
lelization of Prolog Programs for Dependent And-Parallelism. In International Conference on
Logic Programming, L. Naish, Ed. MIT Press, Cambridge, MA, 108–122.
PONTELLI, E., GUPTA, G., TANG, D., CARRO, M., AND HERMENEGILDO, M. 1996. Improving the Effi-
ciency of Non-Deterministic Independent And-Parallel Systems. Computer Languages 22, 2/3,
115–142.
PONTELLI, E., GUPTA, G., WIEBE, J., AND FARWELL, D. 1998. Natural Language Multiprocessing:
A Case Study. In Proceedings of the Fifteenth National Conference on Artificial Intelligence.
AAAI/MIT Press, Cambridge, MA, 76–82.
PONTELLI, E., RANJAN, D., AND GUPTA, G. 1997b. On the Complexity of Parallel Implementation
of Logic Programs. In Proceedings of the International Conference on Foundations of Software
Technology and Theoretical Computer Science, S. Ramesh and G. Sivakumar, Eds. Springer-
Verlag, Heidelberg, 123–137.
POPOV, K. 1997. A Parallel Abstract Machine for the Thread-Based Concurrent Language Oz. In
Proceedings of the Workshop on Parallelism and Implementation Technology for Constraint Logic
Programming, E. Pontelli and V. Santos Costa, Eds. New Mexico State University.
PUEBLA, G. AND HERMENEGILDO, M. 1996. Optimized Algorithms for the Incremental Analysis of
Logic Programs. In International Static Analysis Symposium. Number 1145 in LNCS. Springer-
Verlag, 270–284.
PUEBLA, G. AND HERMENEGILDO, M. 1999. Abstract Multiple Specialization and Its Application to
Program Parallelization. J. of Logic Programming. Special Issue on Synthesis, Transformation
and Analysis of Logic Programs 41, 2&3 (November), 279–316.
RAMESH, R., RAMAKRISHNAN, I. V., AND WARREN, D. S. 1990. Automata-Driven Indexing of Prolog
Clauses. In Proceedings of the Symposium on Principles of Programming Languages. ACM Press,
New York, 281–290.
RAMKUMAR, B. AND KALÉ, L. 1989. Compiled Execution of the Reduce-OR Process Model on Mul-
tiprocessors. In Proceedings of the North American Conference on Logic Programming, E. Lusk
and R. Overbeek, Eds. MIT Press, Cambridge, MA, 313–331.
RAMKUMAR, B. AND KALÉ, L. 1990. And Parallel Solutions in And/Or Parallel Systems. In Proceed-
ings of the North American Conference on Logic Programming, S. Debray and M. Hermenegildo,
Eds. MIT Press, Cambridge, MA, 624–641.
RAMKUMAR, B. AND KALÉ, L. 1992. Machine Independent AND and OR Parallel Execution of Logic
Programs. Parts I and II. IEEE Transactions on Parallel and Distributed Systems 2, 5.
RANJAN, D., PONTELLI, E., AND GUPTA, G. 1999. On the Complexity of Or-Parallelism. New Gener-
ation Computing 17, 3, 285–308.
RANJAN, D., PONTELLI, E., AND GUPTA, G. 2000a. Data Structures for Order-Sensitive Predicates
in Parallel Nondeterministic Systems. Acta Informatica 37, 1, 21–43.
RANJAN, D., PONTELLI, E., LONGPRE, L., AND GUPTA, G. 2000b. The Temporal Precedence Problem.
Algorithmica 28, 288–306.
RATCLIFFE, M. AND SYRE, J. C. 1987. A Parallel Logic Programming Language for PEPSys. In
Proceedings of IJCAI, J. McDermott, Ed., Morgan-Kaufmann, San Francisco, CA, 48–55.
ROCHA, R., SILVA, F., AND SANTOS COSTA, V. 1999a. Or-Parallelism Within Tabling. In Proceedings
of the Symposium on Practical Aspects of Declarative Languages, G. Gupta, Ed. Springer-Verlag,
Heidelberg, 137–151.
ROCHA, R., SILVA, F., AND SANTOS COSTA, V. 1999b. YapOr: An Or-Parallel Prolog System Based on
Environment Copying. In LNAI 1695, Proceedings of EPIA’99: The 9th Portuguese Conference
on Artificial Intelligence. Springer-Verlag LNAI Series, 178–192.
ROKUSAWA, K., NAKASE, A., AND CHIKAYAMA, T. 1996. Distributed Memory Implementation of KLIC.
New Generation Computing 14, 3, 261–280.
RUIZ-ANDINO, A., ARAUJO, L., SÁENZ, F., AND RUZ, J. 1999. Parallel Execution Models for Con-
straint Programming over Finite Domains. In Proceedings of the Conference on Principles
and Practice of Declarative Programming, G. Nadathur, Ed. Springer-Verlag, Heidelberg, 134–
151.
SAMAL, A. AND HENDERSON, T. 1987. Parallel Consistent Labeling Algorithms. International Jour-
nal of Parallel Programming 16, 5, 341–364.
SANTOS COSTA, V. 1999. COWL: Copy-On-Write for Logic Programs. In Proceedings of
IPPS/SPDP. IEEE Computer Society, Los Alamitos, CA, 720–727.
SANTOS COSTA, V. 2000. Encyclopedia of Computer Science and Technology. Vol. 42. Marcel Dekker
Inc., New York, Chapter Parallelism and Implementation Technology for Logic Programming
Languages, 197–237.
SANTOS COSTA, V., BIANCHINI, R., AND DUTRA, I. C. 1997. Parallel Logic Programming Systems on
Scalable Multiprocessors. In Proceedings of the International Symposium on Parallel Symbolic
Computation. ACM Press, Los Alamitos, CA, 58–67.
SANTOS COSTA, V., BIANCHINI, R., AND DUTRA, I. C. 2000. Parallel Logic Programming Systems on
Scalable Architectures. Journal of Parallel and Distributed Computing 60, 7, 835–852.
SANTOS COSTA, V., DAMAS, L., REIS, R., AND AZEVEDO, R. 1999. YAP User’s Manual. University of
Porto. www.ncc.up.pt/~vsc/Yap.
SANTOS COSTA, V., ROCHA, R., AND SILVA, F. 2000. Novel Models for Or-Parallel Logic Programs: A
Performance Analysis. In Proceedings of EuroPar, A. B. et al., Ed. Springer-Verlag, Heidelberg,
744–753.
SANTOS COSTA, V., WARREN, D. H. D., AND YANG, R. 1991a. Andorra-I: A Parallel Prolog System That
Transparently Exploits Both And- and Or-Parallelism. In Proceedings of the ACM Symposium
on Principles and Practice of Parallel Programming. ACM Press, New York, 83–93.
SANTOS COSTA, V., WARREN, D. H. D., AND YANG, R. 1991b. The Andorra-I Engine: A Parallel Imple-
mentation of the Basic Andorra Model. In Proceedings of the International Conference on Logic
Programming, K. Furukawa, Ed. MIT Press, Cambridge, MA, 825–839.
SANTOS COSTA, V., WARREN, D. H. D., AND YANG, R. 1991c. The Andorra-I Preprocessor: Supporting
Full Prolog on the Basic Andorra Model. In Proceedings of the International Conference on Logic
Programming, K. Furukawa, Ed. MIT Press, Cambridge, MA, 443–456.
SANTOS COSTA, V., WARREN, D. H. D., AND YANG, R. 1996. Andorra-I Compilation. New Generation
Computing 14, 1, 3–30.
SARASWAT, V. 1989. Concurrent Constraint Programming Languages. Ph.D. thesis, Carnegie
Mellon, Pittsburgh. School of Computer Science.
SCHULTE, C. 2000. Parallel Search Made Simple. In Proceedings of Techniques for Implementing
Constraint Programming Systems, Post-conference workshop of CP 2000, N. Beldiceanu et al.,
Ed. Number TRA9/00. University of Singapore, 41–57.
SHAPIRO, E. 1987. Concurrent Prolog: Collected Papers. MIT Press, Cambridge MA.
SHAPIRO, E. 1989. The Family of Concurrent Logic Programming Languages. ACM Computing
Surveys 21, 3, 413–510.
SHEN, K. 1992a. Exploiting Dependent And-Parallelism in Prolog: The Dynamic Dependent And-
Parallel Scheme. In Proceedings of the Joint International Conference and Symposium on Logic
Programming, K. Apt, Ed. MIT Press, Cambridge, MA, 717–731.
SHEN, K. 1992b. Studies in And/Or Parallelism in Prolog. Ph.D. thesis, University of Cambridge.
SHEN, K. 1994. Improving the Execution of the Dependent And-Parallel Prolog DDAS. In
Proceedings of Parallel Architectures and Languages Europe, C. Halatsis et al., Ed. Springer-
Verlag, Heidelberg, 438–452.
SHEN, K. 1996a. Initial Results from the Parallel Implementation of DASWAM. In Proceedings
of the Joint International Conference and Symposium on Logic Programming. MIT Press, Cam-
bridge, MA.
SHEN, K. 1996b. Overview of DASWAM: Exploitation of Dependent And-Parallelism. Journal of
Logic Programming 29, 1/3, 245–293.
SHEN, K. 1997. A New Implementation Scheme for Combining And/Or Parallelism. In Pro-
ceedings of the Workshop on Parallelism and Implementation Technology for Constraint Logic
Programming, E. Pontelli and V. Santos Costa, Eds. New Mexico State University, Dept. Com-
puter Science.
SHEN, K. AND HERMENEGILDO, M. 1991. A Simulation Study of Or- and Independent And-
Parallelism. In Proceedings of the International Logic Programming Symposium, V. Saraswat
and K. Ueda, Eds. MIT Press, Cambridge, MA, 135–151.
SHEN, K. AND HERMENEGILDO, M. 1994. Divided We Stand: Parallel Distributed Stack Memory
Management. In Implementations of Logic Programming Systems, E. Tick and G. Succi, Eds.
Kluwer Academic Press, Boston, MA.
SHEN, K. AND HERMENEGILDO, M. 1996a. Flexible Scheduling for Non-Deterministic, And-Parallel
Execution of Logic Programs. In Proceedings of EuroPar’96. Number 1124 in LNCS. Springer-
Verlag, 635–640.
SHEN, K. AND HERMENEGILDO, M. 1996b. High-Level Characteristics of Or- and Independent And-
Parallelism in Prolog. International Journal of Parallel Programming 24, 5, 433–478.
SHEN, K., SANTOS COSTA, V., AND KING, A. 1998. Distance: A New Metric for Controlling Granularity
for Parallel Execution. In Proceedings of the Joint International Conference and Symposium on
Logic Programming, J. Jaffar, Ed. MIT Press, Cambridge, MA, 85–99.
SILVA, F. AND WATSON, P. 2000. Or-Parallel Prolog on a Distributed Memory Architecture. Journal
of Logic Programming 43, 2, 173–186.
SILVA, M., DUTRA, I. C., BIANCHINI, R., AND SANTOS COSTA, V. 1999. The Influence of Computer Archi-
tectural Parameters on Parallel Logic Programming Systems. In Proceedings of the Workshop on
Practical Aspects of Declarative Languages, G. Gupta, Ed. Springer-Verlag, Heidelberg, 122–136.
SINDAHA, R. 1992. The Dharma Scheduler — Definitive Scheduling in Aurora on Multiprocessor
Architecture. In Proceedings of the Symposium on Parallel and Distributed Processing. IEEE
Computer Society, Los Alamitos, CA, 296–303.
SINDAHA, R. 1993. Branch-Level Scheduling in Aurora: The Dharma Scheduler. In Proceedings
of International Logic Programming Symposium, D. Miller, Ed. MIT Press, Cambridge, MA,
403–419.
SINGHAL, A. AND PATT, Y. 1989. Unification Parallelism: How Much Can We Exploit? In Proceedings
of the North American Conference on Logic Programming, E. Lusk and R. Overbeek, Eds. MIT
Press, Cambridge, MA, 1135–1147.
SMITH, D. 1996. MultiLog and Data Or-Parallelism. Journal of Logic Programming 29, 1–3, 195–
244.
SMOLKA, G. 1996. Constraints in Oz. ACM Computing Surveys 28, 4 (December), 75–76.
STERLING, L. AND SHAPIRO, E. 1994. The Art of Prolog. MIT Press, Cambridge MA.
SZEREDI, P. 1989. Performance Analysis of the Aurora Or-Parallel Prolog System. In Proceedings
of the North American Conference on Logic Programming, E. Lusk and R. Overbeek, Eds. MIT
Press, Cambridge, MA, 713–732.
SZEREDI, P. 1991. Using Dynamic Predicates in an Or-Parallel Prolog System. In Proceedings of
the International Logic Programming Symposium, V. Saraswat and K. Ueda, Eds. MIT Press,
Cambridge, MA, 355–371.
VAN HENTENRYCK, P., SARASWAT, V., AND DEVILLE, Y. 1998. Design, Implementation and Evaluation
of the Constraint Language cc(FD). Journal of Logic Programming 37, 1–3, 139–164.
VAN ROY, P. 1990. Can Logic Programming Execute as Fast as Imperative Programming? Ph.D.
thesis, U.C. Berkeley.
VAN ROY, P. 1994. 1983-1993: The Wonder Years of Sequential Prolog Implementation. Journal
of Logic Programming 19/20, 385–441.
VAN ROY, P. AND DESPAIN, A. 1992. High-Performance Logic Programming with the Aquarius
Prolog Compiler. IEEE Computer 25, 1, 54–68.
VAUPEL, R., PONTELLI, E., AND GUPTA, G. 1997. Visualization of And/Or-Parallel Execution of Logic
Programs. In Proceedings of the International Conference on Logic Programming, L. Naish, Ed.
MIT Press, Cambridge, MA, 271–285.
VÉRON, A., SCHUERMAN, K., REEVE, M., AND LI, L.-L. 1993. Why and How in the ElipSys Or-Parallel
CLP System. In Proceedings of the Conference on Parallel Architectures and Languages Europe,
A. Bode, M. Reeve, and G. Wolf, Eds. Springer-Verlag, Heidelberg, 291–303.
VILLAVERDE, K., GUO, H.-F., PONTELLI, E., AND GUPTA, G. 2000. Incremental Stack Splitting. In
Proceedings of the Workshop on Parallelism and Implementation Technology for Constraint Logic
Programming, I. C. Dutra, Ed. Federal University of Rio de Janeiro, London.
VILLAVERDE, K., PONTELLI, E., GUPTA, G., AND GUO, H. 2001. Incremental Stack Splitting Mecha-
nisms for Efficient Parallel Implementation of Search-Based Systems. In International Confer-
ence on Parallel Processing. IEEE Computer Society, Los Alamitos, CA.
WALLACE, M., NOVELLO, S., AND SCHIMPF, J. 1997. ECLiPSe: A Platform for Constraint Logic
Programming. Tech. rep., IC-Parc, Imperial College.
WARREN, D. H. D. 1980. An Improved Prolog Implementation Which Optimises Tail Recursion.
Research Paper 156, Dept. of Artificial Intelligence, University of Edinburgh.
WARREN, D. H. D. 1983. An Abstract Prolog Instruction Set. Technical Report 309, SRI
International.
WARREN, D. H. D. 1987a. The Andorra Principle. Presented at Gigalips workshop, Unpublished.
WARREN, D. H. D. 1987b. OR-Parallel Execution Models of Prolog. In Proceedings of TAP-
SOFT, H. E. et al., Ed. Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 243–
259.
WARREN, D. H. D. 1987c. The SRI Model for OR-Parallel Execution of Prolog—Abstract Design
and Implementation. In Proceedings of the Symposium on Logic Programming. IEEE Computer
Society, Los Alamitos, CA, 92–102.
WARREN, D. H. D. AND HARIDI, S. 1988. Data Diffusion Machine—A Scalable Shared Virtual Mem-
ory Multiprocessor. In Proceedings of the International Conference on Fifth Generation Computer
Systems. ICOT, Springer-Verlag, Tokyo, Japan, 943–952.
WARREN, D. S. 1984. Efficient Prolog Memory Management for Flexible Control Strategies. In
Proceedings of the Symposium on Logic Programming. IEEE Computer Society, Los Alamitos,
CA, 198–203.
WEEMEEUW, P. AND DEMOEN, B. 1990. Memory Compaction for Shared Memory Multiproces-
sors, Design and Specification. In Proceedings of the North American Conference on Logic
Programming, S. Debray and M. Hermenegildo, Eds. MIT Press, Cambridge, MA, 306–
320.
WESTPHAL, H., ROBERT, P., CHASSIN DE KERGOMMEAUX, J., AND SYRE, J. 1987. The PEPSys Model:
Combining Backtracking, AND- and OR- Parallelism. In Proceedings of the Symposium on Logic
Programming. IEEE Computer Society, Los Alamitos, CA, 436–448.
WINSBOROUGH, W. 1987. Semantically Transparent Reset for And Parallel Interpreters Based on
the Origin of Failure. In Proceedings of the Fourth Symposium on Logic Programming. IEEE
Computer Society, Los Alamitos, CA, 134–152.
WINSBOROUGH, W. AND WAERN, A. 1988. Transparent And-Parallelism in the Presence of Shared
Free Variables. In Fifth International Conference and Symposium on Logic Programming.
749–764.
WISE, D. S. 1986. Prolog Multiprocessors. Prentice-Hall, New Jersey.
WOLFSON, O. AND SILBERSCHATZ, A. 1988. Distributed Processing of Logic Programs. In Proceedings
of the SIGMOD International Conference on Management of Data, H. Boral and P. Larson, Eds.
ACM, ACM Press, New York, 329–336.
WOO, N. AND CHOE, K. 1986. Selecting the Backtrack Literal in And/Or Process Model. In Pro-
ceedings of the Symposium on Logic Programming. IEEE, Los Alamitos, CA, 200–210.
XU, L., KOIKE, H., AND TANAKA, H. 1989. Distributed Garbage Collection for the Parallel Inference
Engine PIE64. In Proceedings of the North American Conference on Logic Programming, E. Lusk
and R. Overbeek, Eds. MIT Press, Cambridge, MA, 922–943.
YANG, R. 1987. P-Prolog: A Parallel Logic Programming Language. Ph.D. thesis, Keio University.
YANG, R., BEAUMONT, A., DUTRA, I. C., SANTOS COSTA, V., AND WARREN, D. H. D. 1993. Performance
of the Compiler-Based Andorra-I System. In Proceedings of the Tenth International Conference
on Logic Programming, D. S. Warren, Ed. MIT Press, Cambridge, MA, 150–166.
ZHONG, X., TICK, E., DUVVURU, S., HANSEN, L., SASTRY, A., AND SUNDARARAJAN, R. 1992. Towards an
Efficient Compile-Time Granularity Algorithm. In Proceedings of the International Conference
on Fifth Generation Computer Systems, ICOT Staff, Ed. Ohmsha Ltd., Tokyo, Japan, 809–816.
ZIMA, H. AND CHAPMAN, B. 1991. Supercompilers for Parallel and Vector Computers. ACM Press,
New York.