Dependency-Based Automatic Parallelization of Java Applications
1 Introduction
Our contribution is both a framework and a tool for performing the automatic parallelization of sequential JAVA code. Our solution extracts instruction signatures (read from memory, write to memory, control flow, etc.) from the application's AST and infers data dependencies between instructions. Using this information, we create a set of tasks containing the same operations as the original program. The execution of these tasks is conducted by the Æminium Runtime, which schedules the workload across all available cores using a work-stealing algorithm [7]. This approach adapts to different numbers of processor cores by adjusting the number of worker threads and generated tasks, as long as there is enough latent parallelism in the program. With a simple runtime optimization, our experiments show a 9.0x speedup on a 12-core machine for the naive recursive Fibonacci implementation.
The remainder of this paper is organized as follows: in Section 2 we discuss the
related work. Section 3 specifies the methodology used by the Æminium compiler
throughout the entire process, from signature analysis to code generation. In
Section 4 we conduct benchmarking tests and analyze the results. Finally, in
Section 5 we present a summary of this paper’s contributions and discuss future
work.
2 Related Work
When parallelizing a sequential program, the semantics of the original program must be preserved. As such, compilers must ensure the correct execution order of operations across multiple threads, taking their precedence in the sequential version into account.
One of the primary targets for automatic parallelization is loops. Numerical and scientific applications often contain loops consisting mostly of arithmetic operations. Such loops are a good source of parallelism due to their lack of complex control structures, and can be parallelized with techniques such as doall, doacross and dopipe [10]. When dependencies between iterations are found, the compiler may attempt to remove them by applying transformations such as variable privatization, loop distribution, skewing and reversal. These modifications are extensively described in [4].
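As a brief illustration (a generic example, not taken from the paper), the first loop below is a doall candidate because its iterations are independent, while the second carries a dependency between consecutive iterations and would need such a transformation before it could be parallelized.

  // Generic illustration of loop-level parallelism (not from the paper).
  class LoopExample {
      static void scale(double[] a, double[] b) {
          // doall candidate: iteration i touches only a[i] and b[i],
          // so all iterations may execute in parallel.
          for (int i = 0; i < a.length; i++) {
              a[i] = 2.0 * b[i];
          }
      }

      static void prefixSum(double[] a) {
          // Loop-carried dependency: iteration i reads a[i - 1], written by
          // iteration i - 1, so a plain doall transformation is not valid here.
          for (int i = 1; i < a.length; i++) {
              a[i] = a[i] + a[i - 1];
          }
      }
  }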
Many algorithms, however, are best expressed recursively, as this often matches the nature of the problem itself. Solving the resulting sub-problems in parallel has also been studied. In [11] this approach is applied to
the functional language LISP by the introduction of the letpar construct. This
model can be used with success because the semantics of functional programming
imply that there is no interference between sub-expressions. For non-functional
languages, a technique known as thread-level speculation executes the operations
optimistically, assuming no interference. If the speculation turns out to be wrong, specialized hardware is used to roll back the faulty threads to a previous checkpoint [12].
In [13], recursion-based parallelism is applied to the JAVA language. In order to avoid interference between sub-expressions, a static analysis of read and write signatures is performed and the resulting data is stored. At runtime, this information is used to check which methods can be executed in parallel, by replacing the parameters with the actual variables in the stored method signatures. However, this runtime verification necessarily introduces overhead. Our approach, on the other hand, does not resort to runtime support to deal with this problem: by adding two new signatures, merge and control, we solve it without any runtime penalty.
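For comparison, recursion-based parallelism can be expressed manually in plain Java with the standard java.util.concurrent fork/join framework. The sketch below is only a hand-written baseline for the naive Fibonacci function, not the Æminium mechanism; it is safe because the two recursive sub-calls do not interfere.

  import java.util.concurrent.ForkJoinPool;
  import java.util.concurrent.RecursiveTask;

  // Manual recursion-based parallelism using the fork/join framework.
  class Fib extends RecursiveTask<Long> {
      private final int n;
      Fib(int n) { this.n = n; }

      @Override
      protected Long compute() {
          if (n < 2) {
              return (long) n;
          }
          Fib left = new Fib(n - 1);
          left.fork();                            // run f(n - 1) asynchronously
          long right = new Fib(n - 2).compute();  // compute f(n - 2) in this thread
          return left.join() + right;             // safe: the sub-calls do not interfere
      }

      public static void main(String[] args) {
          System.out.println(new ForkJoinPool().invoke(new Fib(30)));
      }
  }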
3 Methodology
[Fig. 1: Compilation and execution pipeline — .java source, generated tasks, Java compiler, Æminium runtime, JVM.]
The control(α, τ) predicate indicates that the operation α may alter the execution flow of other operations inside the scope marked by the datagroup τ. The last predicate, call_m(α, τ_o, τ_r, τ_p0, ..., τ_pn), is used as a placeholder for method calls; τ_o is the datagroup of the object that owns the method, τ_r is the datagroup where the return value is saved, and τ_px is the datagroup of each invocation argument. Program 1 shows an example of the signatures extracted by the compiler. Note that a merge(α_ret1, τ_n, τ_ret) signature is also detected; however, since n and ret are both integers, this signature can be omitted.
  int f(int n) {
      if (n < 2) {                  // read(α_cond, τ_n)
          return n;                 // write(α_ret1, τ_ret), control(α_ret1, τ_f)
      }
      return f(n - 1) + f(n - 2);   // call_f(α_inv1, ∅, τ_inv1, τ_p0)
  }

Program 1: The recursive Fibonacci function and the signatures extracted for each statement.
Fig. 2: Tasks generated for Program 1 without optimization. Dotted arrows identify child scheduling, solid arrows represent strong dependencies, dashed arrows indicate weak dependencies, and the filled task is the function's root task.
1. A task that may read from a datagroup must wait for the termination of the
last task that writes to it;
2. A task that may write to a datagroup must wait for the conclusion of all
tasks that read from it since the last write;
3. If two tasks may write to the same datagroup and there is no intermediary task that reads from it, then the latter task must wait for the former to complete (this rule applies when operations require both read and write access, such as the increment operator, or when tasks span more than a single operation).
   These precedence rules also apply across merge signatures, which link two datagroups (a worked example follows this list): for tasks α ≺_t β ≺_t γ, with α, β, γ ∈ A,

      write(α, τ_a), merge(β, τ_a, τ_b), read(γ, τ_b)    ∴ α ≺_p γ
      read(α, τ_a),  merge(β, τ_a, τ_b), write(γ, τ_b)   ∴ α ≺_p γ
      write(α, τ_a), merge(β, τ_a, τ_b), write(γ, τ_b)   ∴ α ≺_p γ
4. Control signatures enforce dependencies from all the tasks of the scope whose
execution path can be altered.
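To make the first rules concrete, consider the following three statements, each compiled into its own task. The task names (α, β, γ) and datagroups (τ_x, τ_y) are only illustrative labels.

  class DependencyExample {
      static void example() {
          int x = 0;     // task α: write(τ_x)
          int y = x + 1; // task β: read(τ_x), write(τ_y)  -> rule 1: β waits for α
          x = y + 2;     // task γ: read(τ_y), write(τ_x)  -> rule 1: γ waits for β's write of τ_y;
                         //    rule 2: γ's write of τ_x also waits for β's read of τ_x
      }
  }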
3.3 Optimization
Task Aggregation The first pass creates one task for each node of the AST. However, executing a parallel program with one task per AST node is several times slower than the sequential program, which makes task aggregation mandatory. By coarsening tasks we lower both the scheduling overhead and the memory usage. This optimization step reduces the number of generated tasks by merging the code of several tasks into one.
The aggregate(α, β) operation has the following semantics: given two tasks α, β ∈ A such that α is a strong dependency of β, we merge α into β by transferring all the dependencies of α to β and placing the instructions of α before the instructions of β, or in a position with equivalent execution semantics (such as the right-hand side of an assignment expression).
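A minimal sketch of this operation over a hypothetical task-graph representation (the compiler's actual data structures are not shown here, so TaskNode and its fields are assumptions) could look as follows:

  import java.util.ArrayList;
  import java.util.LinkedHashSet;
  import java.util.List;
  import java.util.Set;

  // Hypothetical task-graph node: a list of instructions plus the tasks it waits for.
  class TaskNode {
      final List<String> instructions = new ArrayList<>();
      final Set<TaskNode> dependencies = new LinkedHashSet<>();
  }

  class Aggregator {
      // Merge alpha into beta: beta inherits alpha's dependencies and
      // alpha's instructions run before beta's own instructions.
      static void aggregate(TaskNode alpha, TaskNode beta) {
          beta.dependencies.remove(alpha);                 // beta no longer waits on alpha itself
          beta.dependencies.addAll(alpha.dependencies);    // transfer alpha's dependencies
          beta.instructions.addAll(0, alpha.instructions); // alpha's code executes first
      }
  }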
Since the code inside each task executes sequentially, over-aggregating tasks reduces the parallelism of the program. As such, we identify two types of task aggregation. Soft aggregation reduces the number of tasks without hindering parallelism: if task β depends on α, and there is no other task γ that also depends on α, then α can be merged into β without loss of parallelism.

   soft: α ≺_p β ∧ ∄ γ ≠ β : α ≺_p γ ⇒ aggregate(α, β),   α, β, γ ∈ A
Hard aggregation, on the other hand, attempts to merge tasks even in other scenarios, such as lightweight arithmetic operations. Currently the optimizer aggregates all expressions with the exception of method invocations (including constructor calls). Statements whose execution must be sequential (e.g., the then block of an if statement) are also aggregated, provided their aggregation does not violate dependency constraints. Optionally, loops can be fully sequentialized; this disables the parallelization of loops but yields a lower runtime memory footprint.
The Æminium runtime executes tasks and handles the dependencies between them. Task objects contain information about their state and their dependencies. The actual code executed by each task lives in an execute() method of a class that implements the Body interface. This factorization allows the same body object to be reused for multiple tasks. Bodies are constructed with a single parameter: a reference to the parent body if it exists, and null otherwise. This gives access to fields of enclosing tasks, where local variables and method parameters are stored. Inside the constructor of the body, its task is created by calling the Aeminium.createTask() function, which receives the body as its first parameter. The second parameter is a hints object used by the runtime to optimize scheduling; this functionality is not used by the compiler, so the default value NO_HINTS is passed. Strong dependencies of the task are instantiated in the constructor of the task body. This operation must take place after the creation of the task object (since it must be available as the parent task when scheduling those dependencies), and before the task itself is scheduled (since those tasks will be listed in its dependency list).
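The following sketch illustrates the shape of a generated task body. The Body, Task and Aeminium types below are simplified stand-ins mirroring the description above, not the runtime's exact API, and the class name AddBody is purely illustrative.

  // Simplified stand-ins for the runtime types described above (assumptions, not the real API).
  interface Body { void execute(); }
  class Task { /* state, dependency list, ... */ }
  class Aeminium {
      static final Object NO_HINTS = null;
      static Task createTask(Body body, Object hints) { return new Task(); }
  }

  // Shape of a generated task body.
  class AddBody implements Body {
      final Body parent; // access to fields (locals, parameters) of enclosing tasks
      final Task task;   // the task that will execute this body

      AddBody(Body parent) {
          this.parent = parent;
          // Create the task for this body; the compiler passes NO_HINTS.
          this.task = Aeminium.createTask(this, Aeminium.NO_HINTS);
          // Strong dependencies are instantiated here: after the task object
          // exists (it acts as the parent when scheduling them) and before
          // this task itself is scheduled with its dependency list.
      }

      @Override
      public void execute() {
          // Original sequential code of this task goes here.
      }
  }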
Loops Loop statements such as while, for, and do...while allow multiple iterations to execute the same lines of code, although the actual instructions may vary from iteration to iteration. Furthermore, the instructions of the first iteration must wait for instructions prior to the loop (e.g. a variable declaration), while instructions of subsequent iterations only need to wait for the last modification in the previous iteration. To allow this duality of dependencies, two trees of tasks are created for each loop. The former contains the dependencies belonging to the first iteration, while the latter contains the dependencies associated with the following iterations. The parent task of this second tree contains a previous field that points to the preceding instance and, inside its execute() method, creates another instance of itself. Sub-tasks use this field to reference tasks of the previous iteration in their dependency lists.
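A rough sketch of this per-iteration parent body is shown below; the types and loop state are hypothetical, and the wiring of the actual Æminium Task objects (created in the constructor, as in the earlier sketch) is omitted.

  // Hypothetical sketch of the second tree's parent body.
  class LoopIterationBody {
      final LoopIterationBody previous; // tasks of the preceding iteration
      int i;                            // example loop state carried across iterations

      LoopIterationBody(LoopIterationBody previous, int i) {
          this.previous = previous;
          this.i = i;
      }

      void execute() {
          // Sub-tasks created here take their dependencies from tasks reachable
          // through `previous` (the last modification in the preceding iteration).
          if (i < 10) { // example loop condition
              // Creating the next body creates (and lets the runtime schedule)
              // the next iteration's task, as in the constructor pattern above.
              new LoopIterationBody(this, i + 1);
          }
      }
  }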
4 Evaluation
To validate our approach we compiled three sample applications with the Æminium compiler and executed the resulting tasks on a machine with the following specification: two Intel® Xeon® X5660 processors (6 cores each, with hyper-threading, for a total of 24 hardware threads) and 24 GB of RAM. The applications are the recursive implementation of the Fibonacci program already mentioned in Section 3, an application that numerically approximates the integral of a function over a given interval, and a simple implementation of the Fast Fourier Transform (FFT) on an array of 2^22 random complex numbers. The FFT application requires the generation of an array of Complex objects; this step is not included in the benchmark time, as it requires sequential invocations of Random.nextDouble(). Also, in order to minimize the runtime overhead of loop scheduling, the option to sequentialize loops (described in Section 3.3) was used. Each experiment was repeated 30 times. The results are depicted in Table 1 and Figure 3.
[Table 1 / Fig. 3: Execution time in seconds of the original and parallelized versions of the Fibonacci, Integrate and FFT benchmarks. Fig. 4: Speedup of each benchmark as a function of the number of cores (1–12).]
The first benchmark computes the 50th Fibonacci number. The sequential execution of this problem took on average 55.56 seconds to complete, while the parallel version took only 6.17 seconds. Although this corresponds to a 9.00x increase in performance (p = 0.973), it is well below the theoretical 12x (linear) speedup. The scalability test shown in Figure 4 indicates the relation between the number of cores and the speedup. The dashed line is the ideal linear speedup. The dotted lines are least-squares fits to Amdahl's law [14], with the exception of the third benchmark, where an adjustment for a linear overhead h was added.
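For reference, the fitted model is Amdahl's law, where p is the parallel fraction of the program and N the number of cores; the exact form of the linear-overhead adjustment h used for the FFT fit is not reproduced here.

   S(N) = 1 / ((1 − p) + p / N)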
The second benchmark computes the integral of the function f(x) = x^3 + x over the interval [−2101.0, 200.0] to a precision of 10^-14. The behaviour of this test is similar to the previous one, but with a slightly higher p = 0.978. The FFT benchmark shows the lowest speedup among the three benchmarks (p_Amdahl = 0.311, or p = 0.972 with h = 0.746). It is also the one with the highest memory usage, which suggests that memory bandwidth is the primary bottleneck of this particular implementation. In fact, this is the case for naïve FFT implementations, as indicated in [15]. As a consequence, for larger arrays the speedup decreases, as cache hits become less and less frequent due to false sharing.
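For context, such an integration benchmark typically takes the form of a recursive adaptive quadrature, sketched below; this is an assumption about the benchmark's structure, not the paper's actual code. The two recursive calls operate on disjoint sub-intervals, which is precisely the kind of independence the compiler turns into parallel tasks.

  // Plausible sequential shape of the Integrate benchmark (assumed, not the paper's code):
  // recursive adaptive quadrature based on the trapezoidal rule.
  class Integrate {
      static double f(double x) { return x * x * x + x; }

      static double integrate(double a, double b, double epsilon) {
          double m = (a + b) / 2.0;
          double whole  = (f(a) + f(b)) * (b - a) / 2.0;   // one trapezoid
          double halves = (f(a) + f(m)) * (m - a) / 2.0
                        + (f(m) + f(b)) * (b - m) / 2.0;   // two trapezoids
          if (Math.abs(whole - halves) < epsilon) {
              return halves;
          }
          // Independent sub-problems: candidates for parallel tasks.
          return integrate(a, m, epsilon / 2.0) + integrate(m, b, epsilon / 2.0);
      }

      public static void main(String[] args) {
          // Small example run; the benchmark uses a much larger interval and 10^-14 precision.
          System.out.println(integrate(0.0, 10.0, 1e-9));
      }
  }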
Acknowledgments
This work would not have been possible without the contributions to the Æminium language and runtime from Sven Stork, Paulo Marques and Jonathan Aldrich. This work was partially supported by the Portuguese Research Agency FCT, through CISUC (R&D Unit 326/97), the CMU|Portugal program (R&D Project Aeminium CMU-PT/SE/0038/2008), the iCIS project (CENTRO-07-ST24-FEDER-002003), co-financed by QREN in the scope of the Mais Centro Program and the European Union's FEDER, and by the COST framework, under Actions IC0804 and IC0906. The third author was also supported by the Portuguese National Foundation for Science and Technology (FCT) through Doctoral Grant SFRH/BD/84448/2012.
References
1. Arnold, K., Gosling, J., Holmes, D.: The Java Programming Language. Addison-Wesley Professional (2005)
2. Biema, M.v.: A survey of parallel programming constructs. In: Columbia University
Computer Science Technical Reports. Department of Computer Science, Columbia
University (1999)
3. Banerjee, U., Eigenmann, R., Nicolau, A., Padua, D.: Automatic program parallelization. Proceedings of the IEEE 81(2) (Feb 1993) 211–243
4. Banerjee, U.: Loop Transformations for Restructuring Compilers: The Foundations. Springer (1993)
5. Feautrier, P.: Automatic parallelization in the polytope model. In Perrin, G.R., Darte, A., eds.: The Data Parallel Programming Model. Volume 1132 of Lecture Notes in Computer Science. Springer Berlin/Heidelberg (1996) 79–103
6. Bik, A.J., Gannon, D.B.: Automatically exploiting implicit parallelism in Java. Concurrency: Practice and Experience 9(6) (1997) 579–619
7. Blumofe, R.D., Leiserson, C.E.: Scheduling multithreaded computations by work
stealing. J. ACM 46(5) (September 1999) 720–748
8. Randall, K.: Cilk: Efficient multithreaded computing. Technical report, Cambridge,
MA, USA (1998)
9. Dagum, L., Menon, R.: OpenMP: An industry-standard API for shared-memory programming. IEEE Computational Science & Engineering 5(1) (Jan-Mar 1998) 46–55
10. Ottoni, G., Rangan, R., Stoler, A., August, D.: Automatic thread extraction with decoupled software pipelining. In: Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-38) (Nov 2005) 12 pp.
11. Hogen, G., Kindler, A., Loogen, R.: Automatic parallelization of lazy functional programs. In: Proceedings of the 4th European Symposium on Programming (ESOP'92). LNCS 582, Springer-Verlag (1992) 254–268
12. Bhowmik, A., Franklin, M.: A general compiler framework for speculative multi-
threading. In: Proceedings of the fourteenth annual ACM symposium on Parallel
algorithms and architectures. SPAA ’02, New York, NY, USA, ACM (2002) 99–108
13. Chan, B., Abdelrahman, T.S.: Run-time support for the automatic parallelization of Java programs. J. Supercomput. 28(1) (April 2004) 91–117
14. Amdahl, G.M.: Validity of the single processor approach to achieving large scale
computing capabilities. In: Proceedings of the April 18-20, 1967, spring joint com-
puter conference. AFIPS ’67 (Spring), New York, NY, USA, ACM (1967) 483–485
15. da Silva, C.P., Cupertino, L.F., Chevitarese, D., Pacheco, M.A.C., Bentes, C.: Exploring data streaming to improve 3D FFT implementation on multiple GPUs. In: Computer Architecture and High Performance Computing Workshops (SBAC-PADW), 2010 22nd International Symposium on, IEEE (2010) 13–18