
IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016

DwarfCode: A Performance Prediction Tool for Parallel Applications
Weizhe Zhang, Member, IEEE, Albert M. K. Cheng, Senior Member, IEEE, and Jaspal Subhlok, Member, IEEE

W. Zhang is with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China. E-mail: wzzhang@hit.edu.cn.
A. M. K. Cheng and J. Subhlok are with the Department of Computer Science, University of Houston, Houston, TX 77204. E-mail: cheng@cs.uh.edu, jsubhlok@central.uh.edu.

Manuscript received 19 Mar. 2014; revised 1 Mar. 2015; accepted 17 Mar. 2015. Date of publication 29 Apr. 2015; date of current version 15 Jan. 2016. Recommended for acceptance by J. Xue. Digital Object Identifier no. 10.1109/TC.2015.2417526.

Abstract—We present DwarfCode, a performance prediction tool for MPI applications on diverse computing platforms. The goal is to
accurately predict the running time of applications for task scheduling and job migration. First, DwarfCode collects the execution traces
to record the computing and communication events. Then, it merges the traces from different processes into a single trace. After that,
DwarfCode identifies and compresses the repeating patterns in the final trace to shrink the size of the events. Finally, a dwarf code is
generated to mimic the original program behavior. This smaller, shorter-running benchmark is replayed on the target platform to predict the
performance of the original application. In order to generate such a benchmark, two major challenges are to reduce the time complexity
of trace merging and repeat compression algorithms. We propose an O(mpn) trace merging algorithm to combine the traces generated
by separate MPI processes, where m denotes the upper bound of tracing distance, p denotes the number of processes, and n denotes
the maximum of event numbers of all the traces. More importantly, we put forward a novel repeat compression algorithm, whose time
complexity is O(nlogn). Experimental results show that DwarfCode can accurately predict the running time of MPI applications. The
error rate is below 10 percent for compute- and communication-intensive applications. This toolkit has been released for free download under the GNU General Public License v3.

Index Terms—Performance prediction, MPI application, DwarfCode, trace merging, trace compressing

1 INTRODUCTION

PERFORMANCE prediction of applications is a prerequisite for resource management in various computing platforms, such as task scheduling and job migration [1], [2].

Traditional performance prediction approaches fall into two branches: trace-driven and model-driven. Trace-driven prediction generates the execution logs of applications. When an identical application is rescheduled on these platforms, its running time can be inferred directly. However, these historical records are static snapshots of previous setups, which are difficult to fit into a complicated network computing environment.

Model-driven prediction does not rely on any specific result on a real platform. Instead, it builds performance models for the computing platforms and applications. The running time is then predicted by calculating and analyzing these models. This method needs to understand the implementations of applications and the features of the hardware in detail. If the model is not accurate, the real execution time may deviate greatly from the predicted one. Besides, a heterogeneous and shared network complicates the prediction. Even worse, predicting some features of platforms (such as network traffic) is still an open problem [3], [4].

The most promising solution is benchmarking. If we can find a "representative" benchmark that exercises the application's characteristics, we can use the benchmarking results to quantify the application performance on different platforms. If we simply rely on a limited benchmark library, it is almost impossible to find a "representative" one, since real applications are diverse.

In this paper, we present DwarfCode, a tool that can accurately predict process-level, coarse-grained (MPI) application performance on multicore systems. We focus on MPI applications because they are the most popular type of HPC application and their behaviors are well understood. The method can be easily extended to HPC or Cloud computing environments. The key idea is to use computation and communication traces as a platform-independent abstraction of real application behaviors: DwarfCode captures the computation and communication traces of an executing application using a lightweight tracing engine, analyzes these traces, and then generates a "dwarf code" automatically. The dwarf code is a shorter-running benchmark of the real application which mimics the behavior of the application. Its running time is expected to be proportional to that of the original program on platforms with a similar architecture. Finally, the dwarf code can be replayed on a target platform to predict the application's performance.

The modules of DwarfCode include trace recording, trace merging, repeat compression and dwarf code generation. Although several related studies have been conducted and are well-grounded in trace recording and code generation [5], [6], [7], [8], [9], [10], [11], [12], challenges remain
in trace merging and repeat compression. In order to accomplish this, first, we need to tackle function conflicts in trace merging with lower time complexity. Second, we need to develop a lower time complexity algorithm for repeat compression, attacking the existing O(n^3) algorithm. Finally, DwarfCode is implemented and evaluated on small-scale and large-scale platforms to confirm its prediction accuracy and scalability.

We have implemented DwarfCode, and developed an O(mpn) trace merging algorithm as well as an O(nlogn) repeat compression algorithm. DwarfCode is deployed on two small-scale clusters (named Dawning1000 and IA32) and one large-scale supercomputer (named KongFu). We find that DwarfCode can predict the response times of the NAS Parallel Benchmarks (NPB) [13] and a Parallel NBody Simulation application with low error rates (<3 percent on Dawning1000, <10 percent on IA32 and on KongFu). This toolkit has been released for free download on Github [40] under the GNU General Public License (GPL) v3.

The rest of this paper is organized as follows. Section 2 presents the trace recording procedure. We describe the trace merging and repeat compression algorithms in Sections 3 and 4, respectively. Section 5 explains the dwarf code generation. DwarfCode is evaluated in Section 6. Section 7 presents the related works. Architecture and deadlock issues are discussed in Section 8. Section 9 concludes and discusses future work.

2 TRACE RECORDING

To generate an execution trace, DwarfCode needs to decide the event types to trace. The CPU, storage, network and lock events are the important events to profile for general programs. For MPI applications, the communication and computation events are critical to be recorded.

2.1 Recording the Computation

To record the time for computation operations, DwarfCode measures the time spent between the end of an MPI call and the start of the next MPI call.

2.2 Recording the Communication

To record the behaviors of MPI calls, DwarfCode links the MPI application against a standard MPI profiling layer (PMPI) [14]. The PMPI interface is widely used by performance analysis tools and libraries, such as VampirTrace [15]. DwarfCode uses PMPI to intercept MPI calls during application execution and redirect them to wrapper MPI calls. When an MPI call is executed, the profiling library intercepts it and records its parameters and timestamps. Once these are logged, the original MPI routine is invoked. DwarfCode uses the well-known MPI profiler mpiP [16] to record the traces. Fig. 1 shows a fragment of the runtime trace log, taken from the NPB BT application (Class C; ProcRank = 0).

Fig. 1. Fragment of running traces (BT, Class = C, ProcRank = 0).

The trace consists of many records with the same data structure. The format of each record is shown as a cell in Fig. 1. Each record contains several domains, including Rank, Function, Parameters, Paravalues, Starttime, Endtime and Durtime. Rank records the process running the code. Function is the function name of the intercepted MPI call. Parameters and Paravalues record the parameter names and values of the intercepted MPI call, respectively. Starttime and Endtime are the starting time and the ending time of the call, respectively. Durtime is calculated as the difference of Starttime and Endtime.

The values of Starttime and Endtime are logical times. The logical time is the difference of the timestamp of the current call relative to the timestamp when the MPI process starts, i.e., when the MPI_Init function is invoked. When an MPI application starts, all the processes invoke MPI_Init(). At this time, DwarfCode records their timestamps as Global_Init_Time. Note that the values of Global_Init_Time might be different in different processes. When these processes intercept the start of a new MPI call, they record the difference between the current timestamp and the corresponding Global_Init_Time as Starttime. When these processes intercept the end of a new MPI call, they record the difference between the current timestamp and the corresponding Global_Init_Time as Endtime. Even so, there might be some errors incurred by unbalanced or overlapped computation or communication. Only if the deviation of their Starttime/Endtime is below a certain percentage of the error rate can they be determined as the same logical time. In our experiments, the threshold error rate is set to 8 percent based on experimental analysis.

The procedure to generate the logical traces is called trace logicalization. A communication matrix that identifies process pairs with traffic during execution is also generated by summarizing the number of messages exchanged between process pairs. Our former studies [7], [8] leverage the graph isomorphism checking algorithm to identify the application-level communication topology. Finally, all message sends and receives are to/from a logical neighbor in terms of a logical communication topology (e.g., a torus or a grid) instead of a physical process rank. The logical communication trace keeps similar behaviors to the physical communication trace, which can make the generated code scalable.
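The mpiP-based tracer is described only at a high level above; purely as an illustration, a PMPI wrapper that records one call's parameters and logical timestamps might look like the sketch below. The record format written by fprintf and the choice of MPI_Send are ours, not DwarfCode's actual implementation.

```c
/* Illustrative PMPI wrapper (not DwarfCode's actual tracer, which uses mpiP).
 * Link this object before the MPI library so MPI_Send resolves here and the
 * real implementation is reached through PMPI_Send. */
#include <mpi.h>
#include <stdio.h>

static double global_init_time;   /* Global_Init_Time of this process */
static int    my_rank;

int MPI_Init(int *argc, char ***argv)
{
    int rc = PMPI_Init(argc, argv);
    PMPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    global_init_time = PMPI_Wtime();          /* timestamp taken at MPI_Init */
    return rc;
}

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double start = PMPI_Wtime() - global_init_time;   /* logical Starttime */
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    double end = PMPI_Wtime() - global_init_time;     /* logical Endtime   */

    /* One trace record: Rank, Function, Parameters/Paravalues, times. */
    fprintf(stderr, "%d MPI_Send count=%d dest=%d tag=%d start=%.6f end=%.6f dur=%.6f\n",
            my_rank, count, dest, tag, start, end, end - start);
    return rc;
}
```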
3 TRACE MERGING

DwarfCode records the traces for separate MPI processes, which could be used to build a family of process-level dwarf codes independently. However, we aim to construct a single SPMD dwarf program to mimic the original program behavior. Thus, the traces from these processes should be merged into a single trace before the dwarf code generation.

In this section, we propose a trace merging algorithm to combine these similar events. Its main challenges are threefold: 1) how to identify similar events from distinct traces (Section 3.1); 2) how to tackle function conflicts (Section 3.2); 3) how to devise a lower time complexity algorithm for trace merging (Section 3.3).

3.1 Identifying Similar Events

Only similar events originating from the same MPI call can be merged in the final single trace. Thus, several rules are proposed to delimit similar events.

First, similar events should have the same logical time (recorded in the Starttime and Endtime domains in Fig. 1) and function name (Function domain in Fig. 1) in the different traces. Clearly, collective and point-to-point communication events should not be merged together because of the function name violation; neither should blocking and non-blocking events. Additionally, even for point-to-point communication events with the same logical time and function name, one needs to further compare their communication parameters (Parameters and Paravalues domains in Fig. 1), e.g., count, source and dest in the MPI_Send and MPI_Recv calls. Only if the communication parameter (e.g., count) deviation is below a threshold are these communication events considered similar and merged. In our experiment, the deviation threshold is set to 5 percent based on experimental analysis.

Finally, as far as computation events are concerned, DwarfCode treats the computation events between the same pair of communication events as similar events. The computational time is determined by the longest trace.

3.2 Tackling Function Conflicts

Function conflicts arise when one or more of the processes, but not all, invoke certain MPI functions. As shown in Fig. 2a, Process 0 invokes MPI_A after MPI_B while the other processes only invoke MPI_A. As shown in Fig. 2b, this causes the logical time of MPI_A recorded in Process 0 to conflict with those from the other processes.

Fig. 2. Communication that leads to function conflicts.

The key to solving function conflicts is to sequence the conflicted calls, align the similar ones and merge them to the extent possible. We propose a maximal m-step downward tracing heuristic to compare the tracing distances of conflicted calls.

Fig. 3. The case with multiple conflicted calls.

First, when the conflicts happen, each conflicted call individually searches for similar calls in all of the remaining traces and records their distances when the first similar call is found. The tracing distance denotes the maximal distance among the recorded ones. For example, Fig. 3 shows that MPI_B of Process 0 conflicts with MPI_A of the other processes in the first line. The tracing distance of MPI_B is 2 because in two steps MPI_B searches down and finds the farthest MPI_B of Process 3 in the third line; similarly, the tracing distance of MPI_A is 3 because MPI_A of Process 2 in the last line is the farthest match. Note that an upper bound m should be set for the tracing distance; otherwise, some unrelated calls may be merged incorrectly. According to the experiments, m is set to 10: calls are merged correctly only if the tracing distance is less than m = 10; otherwise, the merge of unrelated calls would be incorrect.

Then, we sequence the conflicted calls in descending order of their tracing distances. For example, in Fig. 4a, MPI_B of Process 0 conflicts with MPI_A of the other processes in the first line. Then, we compute the tracing distances of MPI_B and MPI_A individually. MPI_A can be merged in the next line, thus its tracing distance is 1. However, MPI_B cannot be merged until the end of the trace, thus its tracing distance is m (the upper bound). Because m is greater than 1, MPI_B should be placed in front of MPI_A in the merged trace. The same procedure also applies to MPI_D and MPI_C. Fig. 4b illustrates the result of the merged trace.

Fig. 4. Maximal m-step downward tracing heuristic to tackle the function conflicts.
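To make the merging rules of Section 3.1 concrete, the following sketch checks whether two point-to-point events may be merged. The Event structure and field names are illustrative, not DwarfCode's actual data layout; the 8 percent logical-time threshold and 5 percent parameter-deviation threshold are the values quoted above.

```c
#include <math.h>
#include <string.h>
#include <stdbool.h>

/* Illustrative trace record; fields mirror Fig. 1 but the layout is ours. */
typedef struct {
    char   function[32];   /* e.g. "MPI_Send" */
    double starttime;      /* logical time relative to MPI_Init */
    double endtime;
    int    count;          /* message length parameter */
    int    peer;           /* source or dest rank */
} Event;

/* Relative deviation |a-b| / max(|a|,|b|), guarded against zero. */
static double deviation(double a, double b)
{
    double m = fmax(fabs(a), fabs(b));
    return m == 0.0 ? 0.0 : fabs(a - b) / m;
}

/* Two events are similar (mergeable) if the function names match, the
 * logical times agree within the 8% threshold, and the count parameter
 * agrees within the 5% threshold (Section 3.1). */
bool similar(const Event *x, const Event *y)
{
    if (strcmp(x->function, y->function) != 0)
        return false;
    if (deviation(x->starttime, y->starttime) > 0.08 ||
        deviation(x->endtime,   y->endtime)   > 0.08)
        return false;
    return deviation(x->count, y->count) <= 0.05;
}
```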
3.3 Trace-Merging Algorithm

The number of events recorded in the traces can be very large. For example, the number of events of the LU application from NPB 3.3 is up to 10^6. Thus, more scalable algorithms with lower complexity are preferred. DwarfCode introduces a new trace-merging algorithm with an overall time complexity of O(mpn), where m denotes the upper bound of the tracing distance, p denotes the number of processes, and n denotes the maximum of the event numbers of all the traces. The complete algorithm is shown in Algorithm 1. We initialize the pointers to the top of each trace (lines 1-3). Then, we tackle the function conflicts and merge similar events (lines 4-21). If the corresponding events can be merged without function conflicts, they are simply joined into the final trace. Otherwise, the tracing distance is calculated, and the event with the maximal tracing distance is merged, with time complexity O(mpn). Thus, the overall time complexity of Algorithm 1 is O(mpn).

Algorithm 1. Trace-Merging Algorithm
Input: T—a set of all the traces
Output: F—the merged trace
Variables: p—the number of traces in T
  n—the maximum of the event numbers of all the traces
  t_i—the ith trace (a set of events from process i), where 0 ≤ i ≤ p, t_i ∈ T
  l_i—the location of the current event required to be merged in t_i, where 0 ≤ l_i ≤ n
  m—the upper bound of the tracing distance
  d_i^k—the tracing distance of the kth event of t_i
1.  foreach 0 ≤ i ≤ p do
2.    l_i = 0
3.  endfor
4.  repeat
5.    if the events pointed to by all the l_i can be merged, then
6.      merge the events and insert the merged event at the bottom of F
7.      foreach 0 ≤ i ≤ p do
8.        l_i = l_i + 1
9.      endfor
10.     continue
11.   endif
12.   foreach 0 ≤ i ≤ p do            // tackle the function conflicts
13.     calculate the tracing distance d_i^k of the event at position k = l_i of trace t_i
14.     if d_i^k > m, then d_i^k = m
15.     endif
16.   endfor
17.   sort the d_i^k in descending order
18.   merge the events with the highest d_i^k
19.   insert the merged event at the bottom of F
20.   l_i = l_i + 1
21. until (each l_i reaches the end of its trace t_i)

The time complexity of the trace merging algorithm can be further reduced to O(mnlogp) if a parallel trace merging architecture is introduced. This advanced architecture is used in Scalasca [49] and VampirServer [50] with Open Trace Format 2 (OTF2). First, our system loads the trace files into memory. Then, the trace files are bisected and recursively merged based on Algorithm 1. Finally, all trace files are merged into a single trace in parallel.
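The O(mnlogp) parallel variant merges the traces pairwise along a binary tree, so the critical path is only log p merge levels when the merges of one level run concurrently. The sketch below assumes a merge_pair() routine that applies Algorithm 1 to exactly two traces; both the Trace type and merge_pair() are placeholders, shown only to make the recursion depth explicit.

```c
/* Placeholder trace type; a real implementation would hold the event list. */
typedef struct { int nevents; } Trace;

/* Placeholder for Algorithm 1 applied to exactly two traces (O(mn) cost).
 * Here it simply keeps the longer trace so that the sketch compiles. */
static Trace *merge_pair(Trace *a, Trace *b)
{
    return (a->nevents >= b->nevents) ? a : b;
}

/* Recursively bisect the p traces and merge the halves.  If the merges at
 * each level of the tree are executed in parallel, the critical path is
 * O(log p) levels, each costing O(mn), hence O(mn log p) overall. */
Trace *merge_tree(Trace **t, int lo, int hi)
{
    if (lo == hi)
        return t[lo];
    int mid = (lo + hi) / 2;
    Trace *left  = merge_tree(t, lo, mid);
    Trace *right = merge_tree(t, mid + 1, hi);
    return merge_pair(left, right);
}
```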
4 REPEAT COMPRESSION

After obtaining the merged trace, the core task of DwarfCode is to identify and compress the repeating patterns in the final trace to shrink the number of events. The purpose of repeat compression is two-fold: 1) the computation and communication events inside a loop are spread out during trace recording, and these loops should be discovered and folded again; 2) to downsize the original program and generate the dwarf code, similar and successive events are also recognized and compressed even if these events are not in the same loop.

In this section, first, the repeat compression problem is formulated as an optimal string compression problem. Then, the key is two-fold: 1) find all the primitive and inextensible tandem arrays (PIT arrays); 2) find the optimal combination of tandem arrays to acquire the optimal compression of the original string. After solving the two sub-problems, the repeat compression algorithm is provided and its complexity is analyzed in detail.

4.1 Formulation

The merged trace can be converted into a string of symbols. Each event with a similar function is represented with the same symbol. The similarity of the events is determined as stated in Section 3.1. Then, the merged trace is symbolized as a string S with a finite alphabet of a fixed size, such as S = xyaxyabcdabcdabcdae. The aim of repeat compression is to discover and shrink the loop nest structures or similar successive events. Thus, the repeat compression problem can be converted into finding and reducing the repeating substrings in S.

Repeating substrings are divided into three categories: 1) a tandem repeat means two successive identical substrings immediately follow each other (such as abcabc, where abc is the repeated substring); 2) an overlap repeat means two identical substrings overlap (such as abaabaab, where abaab is the repeated substring); 3) a split repeat means two identical substrings are separated by some nonempty substring (such as abcdeabc, where abc is the repeated substring). Note that a loop structure in the trace is the same as a tandem repeat in the string. Thus, we are only interested in finding the tandem repeats; the overlap and split ones are omitted.

Specifically, if a tandem repeat does not contain shorter repeats, it is called a primitive tandem repeat. A tandem repeat is inextensible if there is no identical substring immediately before or after the repeats. For instance, if the string is represented as (abab)^2, the repeat (abab)^2 is not primitive, for it contains the shorter repeat ab. If the string is represented as (ab)^3 ab, the repeat (ab)^3 is not inextensible, for there is an identical substring after it.

We define the trace length after the compression as the metric to evaluate the optimal compression. The shorter the length, the better the compression. For example, consider the string xyaxyabcdabcdabcdae. There exist two primitive and inextensible compressions, (xya)^2 (bcda)^3 e and xyaxy(abcd)^3 ae. The former's length is 8, and the latter's is 11. Moreover, more than one compression may be optimal. For example, consider the string abcabcabca. Both (abc)^3 a and a(bca)^3 are optimal.
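As an illustration of the symbolization step, the toy sketch below assigns one symbol per distinct event signature and prints the resulting string S; the signature format and table size are invented for the example, not DwarfCode's encoding.

```c
#include <stdio.h>
#include <string.h>

#define MAX_SYMBOLS 64

/* Table of event signatures seen so far; index i maps to symbol 'A'+i. */
static char signatures[MAX_SYMBOLS][64];
static int  nsymbols;

/* Map an event signature (e.g. "MPI_Send:+1") to a symbol.  Events judged
 * similar under Section 3.1 should produce the same signature. */
char symbol_of(const char *signature)
{
    for (int i = 0; i < nsymbols; i++)
        if (strcmp(signatures[i], signature) == 0)
            return (char)('A' + i);
    if (nsymbols == MAX_SYMBOLS)
        return '?';                          /* alphabet exhausted */
    strncpy(signatures[nsymbols], signature, sizeof signatures[0] - 1);
    return (char)('A' + nsymbols++);
}

int main(void)
{
    /* A toy merged trace: the string built here plays the role of S. */
    const char *events[] = { "comp:10ms", "MPI_Send:+1", "MPI_Recv:-1",
                             "comp:10ms", "MPI_Send:+1", "MPI_Recv:-1" };
    char s[16] = {0};
    for (int i = 0; i < 6; i++)
        s[i] = symbol_of(events[i]);
    printf("S = %s\n", s);                   /* prints "ABCABC": one tandem repeat */
    return 0;
}
```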
Before formulating the repeat compression problem more precisely, we introduce some terminology on strings.

Definition 1. A string S denotes an ordered list of symbols with a finite alphabet of a fixed size. The length of S is |S| = n. For 1 ≤ i ≤ j ≤ n, S[i..j] denotes the substring of S beginning with the ith and ending with the jth symbol of S. suffix(i) denotes S[i..n].

Definition 2. A tandem array denotes a substring S[i..i+|a|·p−1] of S where S[i..i+|a|−1] = S[i+|a|..i+|a|·2−1] = ... = S[i+|a|·(p−1)..i+|a|·p−1], i (1 ≤ i ≤ n) is the starting position, a (1 ≤ |a| ≤ n) is the repeated substring, and p (2 ≤ p ≤ n) is the number of repeated periods of a. For simplicity, it is denoted by a triple (i, a, p). If p = 2, (i, a, 2) denotes a tandem repeat. A tandem array is primitive if and only if a is not periodic. A tandem array is inextensible if and only if there is no occurrence of a right before or after the tandem array. Σ = {(i, a, p) | 1 ≤ i, |a| ≤ n; 2 ≤ p ≤ n; (i, a, p) is primitive and inextensible} denotes the set of the primitive and inextensible tandem (PIT) arrays.

Definition 3. A compression of a string S denotes a tandem array set C_i ∈ 2^Σ (1 ≤ i ≤ |2^Σ|), which compresses the original string S into S_i. C_i is an optimal compression if and only if, for all j with 1 ≤ j ≤ |2^Σ|, |S_i| ≤ |S_j|.

Problem 1. The Optimal String Compression (OSC) Problem.
Query: In a string S, find a combination set C_j of the primitive and inextensible tandem arrays which provides an optimal compression solution S_i: C_j ∈ 2^Σ, Σ = {(i, a, p) | 1 ≤ i, |a| ≤ n; 2 ≤ p ≤ n; (i, a, p) is primitive and inextensible}.

The methodology to solve the OSC problem is two-fold: 1) find all the primitive and inextensible tandem arrays (i, a, p) in the string and form the set Σ of tandem arrays; 2) find the optimal combination C_i of tandem arrays from Σ to acquire the optimal compression of the original string.

4.2 Finding the Primitive and Inextensible Tandem Arrays

The computation of all the primitive and inextensible tandem arrays is a classical string matching problem with various application areas, most notably molecular biology [17]. There are several different O(nlogn) algorithms for finding all the PIT arrays. The problem was first studied in 1981 by Crochemore [18], who gave an optimal O(nlogn) algorithm. In recent years, most of the algorithms have been based on the suffix tree. Apostolico and Preparata [19] present an O(nlogn) algorithm for finding the leftmost PIT arrays. Main and Lorentz [20] propose another algorithm which actually finds all PIT arrays in O(nlogn) time. The algorithms based on the suffix tree are efficient, but building and processing the suffix tree with several auxiliary data structures consumes much memory [21]. In this section, we design an O(nlogn) algorithm based on the suffix array. Its advantage over the suffix tree is that, in practice, it uses three to five times less memory.

The idea of our algorithm is based on Theorem 1, derived by us, and Theorem 2, mentioned in [22].

Theorem 1. If there is a tandem repeat (i, a, 2) in the string S, then there exists j ∈ {0, |a|, 2|a|, ..., ⌊|S|/|a|⌋·|a|} such that S[j] = S[j+|a|].

Proof. The length of a tandem repeat (i, a, 2) is 2|a|, thus there exist j and j+|a| ∈ {0, |a|, 2|a|, ..., ⌊|S|/|a|⌋·|a|} that are covered by (i, a, 2). Since S[i..i+|a|−1] = S[i+|a|..i+|a|·2−1] according to Definition 2, it follows that S[j] = S[j+|a|].

Theorem 2. There exists a tandem array with repeat a and j = |a| in the string S = xy that contains the frontier between x and y and has a root in y if and only if

  LCS(j) + LCP(j+1) ≥ j,

where LCS(i) (1 ≤ i ≤ n) = max{ j | x[m−j+1..m] = S[m+i−j+1..m+i] } is the longest common suffix (with m = |x|), and LCP(i) (2 ≤ i ≤ n+1) = max{ j | y[1..j] = y[i..i+j−1] } is the longest common prefix (LCP).

Proof. Refer to [22].

The complete algorithm for finding all the PIT arrays is shown below.

Algorithm 2. All the PIT Arrays Finding Algorithm
Input: S—the string denoting the symbolized trace
Output: Σ—a set of all the PIT arrays
Variables: l—the length of each loop body or repeated substring
1.  n = |S|
2.  foreach 1 ≤ l ≤ n/2 do
3.    foreach j ∈ {0, l, 2·l, ..., ⌊n/l⌋·l} do
4.      if (S[j] ≠ S[j+l]) then continue
5.      else Lp = Longest_Common_Prefix(j+1, l)
6.           Ls = Longest_Common_Suffix(j, l)
7.      endif
8.      if ((Lp + Ls) ≥ l)
9.        Σ ← Σ ∪ { S[(j − Lp)..(j + l + Ls)] }
10.     endif
11.   endfor
12. endfor

In Algorithm 2, the length of the complete string is |S| = n (line 1). For each tandem array (i, a, p) in S, the length l = |a| must not exceed n/2 because the number of repeated periods of a is greater than 1. We therefore enumerate 1 ≤ l ≤ n/2 to find all the tandem arrays (i, a, p) where |a| = l (lines 2-12). For a fixed l, first, we find a position j where S[j] = S[j+l] (lines 3-4). Then, we calculate the longest common suffix (LCS) and prefix (LCP) from S[j] and S[j+l] (lines 5-7). Finally, if the sum of the prefix and suffix lengths is greater than l−1, we have found a tandem array with |a| = l and record it (lines 8-10).

Fig. 5. Identification of a PIT array in a string.

Fig. 5 illustrates the procedure to find one PIT array (lines 3 to 11). The string is S = BABCABCABCDACD, and l = 3. Because in the first iteration S[0] ≠ S[3] (line 4), the loop is continued (line 3). Then, because in the next iteration S[3] = S[6] = C (line 4), the longest common prefix and longest common suffix are calculated as Lp = 2 and Ls = 3 (lines 5 and 6). Because Lp + Ls = 5 ≥ l = 3 (line 8), one PIT array M = ABCABCABC is identified and added into the set (line 9).
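The sketch below follows the j, Lp, Ls bookkeeping of Algorithm 2, but computes the common prefix and suffix by direct character comparison instead of the suffix-array RMQ queries described next, so it runs in O(n^2) rather than O(nlogn); it may also report the same array from several candidate positions. It is meant only to make the candidate positions and the Lp + Ls ≥ l test concrete.

```c
#include <stdio.h>
#include <string.h>

/* Longest common prefix of S[a..] and S[b..], compared naively. */
static int lcp(const char *s, int n, int a, int b)
{
    int k = 0;
    while (a + k < n && b + k < n && s[a + k] == s[b + k]) k++;
    return k;
}

/* Longest common suffix of S[..a] and S[..b], compared naively. */
static int lcs(const char *s, int a, int b)
{
    int k = 0;
    while (a - k >= 0 && b - k >= 0 && s[a - k] == s[b - k]) k++;
    return k;
}

/* Report tandem arrays with period l for every l <= n/2, scanning the
 * candidate positions j = 0, l, 2l, ... as in Algorithm 2. */
void find_tandem_arrays(const char *s)
{
    int n = (int)strlen(s);
    for (int l = 1; l <= n / 2; l++) {
        for (int j = 0; j + l < n; j += l) {
            if (s[j] != s[j + l]) continue;
            int lp = lcp(s, n, j + 1, j + l + 1);   /* extension to the right */
            int ls = lcs(s, j, j + l);              /* extension to the left  */
            if (lp + ls >= l) {                     /* repeat of period l spans j */
                int start = j - ls + 1;
                int end   = j + l + lp;             /* inclusive */
                printf("period %d: S[%d..%d] = %.*s\n",
                       l, start, end, end - start + 1, s + start);
            }
        }
    }
}

int main(void)
{
    find_tandem_arrays("BABCABCABCDACD");   /* the example string of Fig. 5 */
    return 0;
}
```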
To reduce the time complexity of Algorithm 2, the key is to reduce the time spent on the longest common prefix and suffix computations (lines 5-6). The LCS and LCP problems can be converted into the Range Minimum Query (RMQ) problem. We use a fast (O(n), O(1)) algorithm for the RMQ problem, which can also be applied to the LCS and LCP problems [23], [24]. The O(n) is the preprocessing time to construct the suffix array, and the O(1) is the query time to find the longest common prefix and suffix based on the preprocessed suffix array. The complete (O(n), O(1)) algorithm for the LCP and LCS problems is detailed in [23].

The suffix array is the basic data structure for the LCP and LCS problems. We define the suffix array of a string S as a pair of arrays (SA, Rank). The sort array SA is the lexicographically ordered list of all suffixes of S. That is, SA[i] = j if suffix(j) is lexicographically the ith suffix among all suffixes suffix(1), suffix(2), ..., suffix(n) of S. The number i is called the rank of suffix(j), denoted by Rank(j) = i, which is the inverse of SA. That is, SA[Rank[j]] = j. We adopt the DC3 algorithm [25] to construct the suffix array.

4.3 Finding the Optimal Combination of Tandem Arrays

After finding all the PIT arrays of the string S (|S| = n), we acquire a set Σ = {(i, a, p) | 1 ≤ i, |a| ≤ n; 2 ≤ p ≤ n; (i, a, p) is primitive and inextensible}. The next step is to find a subset C_i ∈ 2^Σ so that, for all j with 1 ≤ j ≤ |2^Σ|, |S_i| ≤ |S_j| if the string S is compressed with C_i.

The maximal number of all the PIT arrays is O(nlog(n)) [18], [19] for a string with |S| = n. The time complexity to enumerate all the solutions is O(2^(nlog(n))), which makes this an NP-hard problem. Thus, we aim to provide a heuristic algorithm for a near-optimal solution.

An O(n^3) dynamic programming algorithm can easily be designed to find the near-optimal combination of tandem arrays. However, the complexity needs to be reduced further. Xu and Subhlok [8] propose a greedy heuristic called "Bottom-Up" to iteratively choose the longest PIT array with the smallest starting point. Its time complexity is O(n^2), but there is a risk of missing the optimal solution. For example, the PIT arrays are {(1, xya, 2), (6, abcd, 3), (7, bcda, 3)} for the string S = xyaxyabcdabcdabcdaef. According to the "loop filtering" algorithm, we get C_i = {(6, abcd, 3)} and the compressed string S' = xyaxy(abcd)^3 aef with |S'| = 12. Obviously, the optimal solution is S'' = (xya)^2 (bcda)^3 ef with |S''| = 9 and C_i = {(1, xya, 2), (7, bcda, 3)}. Thus, we propose a new algorithm, shown in Algorithm 3. Note that the algorithm only acquires the intermediate results of the final combination; the complete solution relies on Algorithm 4.

Algorithm 3. The Greedy Selection Algorithm
Input: Σ—a set of all the PIT arrays
Input: S—the string denoting the symbolized trace
Input: mark[n]—an array of Boolean type (False/True), whose length is |S| = n
Output: C—a combination of PIT arrays, C ∈ 2^Σ
1.  initialization: C ← ∅
2.  foreach 0 < i ≤ n do mark[i] = False
3.  sort the PIT arrays in Σ according to their lengths in descending order
4.  repeat
5.    choose the longest PIT array (i, a, p) from Σ. If two or more PIT arrays satisfy the conditions, choose the one with the smallest starting point.
6.    if (mark[i] == True || mark[i + |a|·p − 1] == True)
7.      { delete (i, a, p) from Σ
8.        continue; }
9.    else
10.     { set the values from mark[i] to mark[i + |a|·p − 1] as True
11.       delete (i, a, p) from Σ and C ← C ∪ {(i, a, p)} }
12. until (Σ is empty)

The advantage of Algorithm 3 lies in the non-overlapping detection of PIT arrays (lines 5-11). First, when the first (longest) PIT array (i, a, p) is chosen, the values from mark[i] to mark[i + |a|·p − 1] are set to True. Then, when the next PIT array (j, b, q) is chosen, we judge whether the value of mark[j] or mark[j + |b|·q − 1] has been set to True. If either one is True, the PIT array (j, b, q) overlaps with formerly selected PIT arrays and this PIT array should be discarded. Otherwise, we add the PIT array to the final combination of PIT arrays and set the values from mark[j] to mark[j + |b|·q − 1] to True.

The time complexity of the non-overlapping detection of PIT arrays (lines 5-11) is O(1). First, choosing the longest PIT array (line 5) is O(1) because the PIT arrays have been sorted according to their lengths in line 3. Then, comparing the two endpoints (lines 6-8) is O(1). Finally, the marking procedure (lines 9-11) is amortized O(1). Although the marking procedure seems to scan and mark the Boolean array O(nlog(n)) times, not every PIT array needs to scan and mark the Boolean array. Because the length of the Boolean array is n, in the worst case it is marked O(n) times in total. While the total iteration number is O(nlog(n)), by amortized analysis the amortized time complexity for each iteration of lines 9-11 is O(1).

4.4 Repeat Compression Algorithm and Complexity Analysis

The complete repeat compression algorithm is shown in Algorithm 4. First, the string S (|S| = n) is converted into the suffix array with the DC3 algorithm (line 2). Then, we use the Range Minimum Query algorithm to preprocess the suffix array (line 3). Third, Algorithm 2 is introduced to find all the PIT arrays (line 4). Finally, Algorithm 3 is used to greedily select and order the sequences of the PIT arrays (line 5). The procedure repeats until the set of PIT arrays is empty (line 7).

Algorithm 4. The Repeat Compression Algorithm
Input: S—the string denoting the symbolized trace
Output: S'—the compressed string
Variables: SA—the sort array of the suffix array
  Rank—the rank array of the suffix array
  C—a combination of PIT arrays, C ∈ 2^Σ
  Σ—a set of all the PIT arrays
1.  repeat
2.    (SA, Rank) ← DC3(S)
3.    (SA, Rank) ← RMQ(SA, Rank)
4.    Σ ← Finding the primitive and inextensible tandem arrays(SA, Rank)   // Algorithm 2
5.    C ← The Greedy Selection(Σ)   // Algorithm 3
6.    S ← Compress S with C
7.  until (Σ is empty)
8.  S' ← S

The time complexity of the repeat compression algorithm is O(nlog(n)), which is proved by the following lemmas and theorems.

Lemma 1. The time complexity of Algorithm DC3 is O(n).

Proof. Refer to [25].

Lemma 2. The time complexity of Algorithm RMQ is (O(n), O(1)).

Proof. Refer to [24].
Lemma 3. The time complexity of Algorithm 2 (the All the PIT Arrays Finding Algorithm) is O(nlogn).

Proof. The outermost loop variable l of Algorithm 2 runs from 1 to n/2. The inner loop number is n/l. Thus, the total iteration number is P = n/1 + n/2 + ... + n/(n/2) = n(1 + 1/2 + ... + 2/n) < n(1 + 1/2 + ... + 1/n) = nH_n, where H_n = 1 + 1/2 + ... + 1/n. Note that the sum of the reciprocals of the first n natural numbers, H_n, is the nth harmonic number. The sum H_n is approximated by the integral of 1/x from 1 to n, whose value is log(n). Thus, P < nH_n ≈ nlogn. Also, the loop body consists of the LCP and LCS queries. Note that, by Lemma 2, the query time of LCP and LCS is O(1). Thus, the time complexity is determined by the iteration number, whose upper bound is O(nlogn).
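For reference, the iteration count bounded in the proof above can be written out as a harmonic sum (a restatement of the same argument, not an additional result):

```latex
P \;=\; \sum_{l=1}^{n/2} \frac{n}{l}
  \;=\; n \sum_{l=1}^{n/2} \frac{1}{l}
  \;=\; n\,H_{n/2}
  \;\le\; n\left(1 + \int_{1}^{n/2} \frac{dx}{x}\right)
  \;=\; n\bigl(1 + \ln(n/2)\bigr)
  \;=\; O(n \log n).
```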
Lemma 4. The time complexity of Algorithm 3 (the Greedy Selection Algorithm) is O(nlogn).

Proof. Note that the maximal number of all the PIT arrays found by Algorithm 2 is nlogn [18], [19]. In Algorithm 3, each PIT array is traversed once and the amortized complexity for each traversal is O(1). Thus, the time complexity of Algorithm 3 is O(nlogn).

Theorem 3. The time complexity of Algorithm 4 (the Repeat Compression Algorithm) is O(nlogn).

Proof. The time complexity between repeat (line 1) and until (line 7) is O(n) + O(n) + O(nlogn) + O(nlogn) = O(nlogn), based on Lemmas 1-4. The original string S is compressed into a new string S' in line 6. Note that S is shortened at least by half in each iteration. Thus, the complexity of Algorithm 4 is O(nlogn) + O(n/2 log(n/2)) + ... + O(1) = O(nlogn).
we create a src/dest matrix according to the merging infor-
5 GENERATING DWARF CODE mation recorded, and then invoke the communication calls.
The final step to build the dwarf code is to convert the The parameters in the corresponding positions can be
merged and compressed trace into an executable program, obtained according to the value of id_proc in the assign-
which mimics the behavior represented in the trace. The ment matrix. Fig. 6 illustrates the procedure of one-to-one
compressed trace in this phase contains the primitive tan- communication in BT’s dwarf code. The parameter count can
dem arrays with the loop numbers. These arrays consist of a be adjusted to scale down the running time of the dwarf
series of symbols, which denotes an MPI communication code proportionally.
call or a computation in a certain period. The trace is con-
5.3 Tackling the PIT Arrays
verted into the executable C dwarf code by resuming these
symbols with the communication or computation calls. The PIT arrays (i, a, p) are converted into the loop state-
ments. The repeat number p is converted into the loop
5.1 Tackling the Computation number. The primitive substring a is mapped to the corre-
sponding computation and communication calls. The
We can replace the symbols representing computation by
repeat number (or loop number) p can be adjusted to scale
synthetic computation codes with equal duration time, such
down the running time of the dwarf code in proportion.
as the busy waiting or spinning, to generate the dwarf code.
The iteration number (abbreviated as IN, IN > 1) of the loop
6 EVALUATION
can be adjusted with the predefined compression ratio (abbre-
viated as CR, 0 < CR < 1). For example, a computation 6.1 Experiment Setup
symbol can be replaced by the following loop statement, To evaluate the correctness and efficiency of the Dwarf-
where the iteration number is multiplied by compression ratio Code system, we perform extensive experiments with
to decrease the running time in proportion: NAS parallel benchmarks on two small clusters named
“Dawning1000” and “IA32”, and one large cluster named
for ði ¼ 1; i <¼ IN CR; i þ þÞfg; “Kongfu” with different sizes, architectures, interconnection
types, and operating systems, as described below:
However, busy waiting or spinning is an inefficient pro-
gramming pattern and should be avoided. Also, it might be  Dawning1000 is a 16-node tightly coupled cluster. Its
removed by modern compilers as dead codes. As an peak performance achieves 2.5 GFLOPS, and its
6 EVALUATION

6.1 Experiment Setup

To evaluate the correctness and efficiency of the DwarfCode system, we perform extensive experiments with the NAS parallel benchmarks on two small clusters named "Dawning1000" and "IA32", and one large cluster named "KongFu", with different sizes, architectures, interconnection types, and operating systems, as described below:

• Dawning1000 is a 16-node tightly coupled cluster. Its peak performance achieves 2.5 GFLOPS, and its practical computing speed is 1.58 GFLOPS. It has up to 32 GB memory. Each node has quad-core Intel Xeon(TM) 2.4 GHz processors and 2 GB physical memory. The communication network is 1 Gigabit Ethernet. The operating system is Turbo Linux 8.0.

• IA32 is a 16-node loosely coupled Beowulf cluster. Four nodes have dual-core Intel Pentium III (Coppermine) 866 MHz processors and 2 GB physical memory. The other nodes have dual-core Intel Pentium III (Coppermine) 1 GHz processors and 2 GB physical memory. The 16 hosts are distributed in four subnets. The inner bandwidth is 1 Gb/s while the bandwidth between the subnets is 100 Mb/s. Fig. 7 shows the architecture of the IA32 Beowulf cluster.

• KongFu is a cluster containing 250 nodes and 1,000 cores in total. Each node contains a four-core Westmere-based Intel Xeon E5620 with a 64-bit instruction set and a clock speed of 2.4 GHz. The L1 caches are 128 KB for code and 128 KB for data. The L2 and L3 caches are 1 and 12 MB, respectively. Each node contains 8 GB memory and 2 TB internal storage. The KongFu cluster also has 60 TB external storage. The nodes are interconnected via an InfiniBand 40 Gbit/s network.

Fig. 6. Process of parameters of one-to-one communication in BT (Class = C, NPROCS = 16)'s dwarf code.

Fig. 7. Architecture of the IA32 Beowulf cluster.

On all of the clusters, the MPI library is MPICH version 1.2.7 and MPICH 2. The C compiler is gcc 3.4.3, while the Fortran compiler is F77. The random number generator is randdp. All experimental results are based on the MPI implementation of the NPB and a real application, Parallel NBody Simulation. The NPB includes eight benchmarks that mimic the computation and data movement in CFD applications. Five are kernels: IS, EP, CG, MG, and FT. Three are pseudo applications: BT, SP, and LU. We construct the dwarf codes for each Class C benchmark.

6.2 Validation of Trace Recording

The aim is to check whether all the computation and communication events can be recorded, especially the communication calls. Note that the number of events recorded in different traces, even from the same application, might not be the same. This is due to unstructured point-to-point communication that can cause unbalanced communication. Thus, we record and compare the largest and smallest numbers of communication events from the traces of the NPB applications.

Table 1 shows the number of communication events recorded when running eight NPB applications on Dawning1000. By comparing the recorded communication events with the MPI statements in their source codes, we verify that all the events are completely recorded. We run the NPB applications from Class A to F. We find that the numbers of communication events are approximately equal to each other while the running times change dramatically. Thus, we choose Class C (the middle size) to continue.

TABLE 1
The Number of Communication Events Recorded in NPB Applications (Class C)

NPB Application                            BT      CG      EP   FT   IS   LU       MG      SP
Largest number of communication events     17,111  41,954  5    47   38   324,355  10,043  26,891
Smallest number of communication events    17,111  41,954  5    47   36   162,189  9,329   26,891

6.3 Validation of Trace Merging

In this section, the aim is to check whether the merged trace maintains the original computation and communication behaviors. Also, we evaluate the running time of the trace merging algorithm. Table 2 shows the number of merged traces and the merging time when running eight NPB applications on Dawning1000. The trace merging algorithm can successfully regularize and merge the traces, and the number of merged traces in Table 2 is identical to the maximal number of communication events in Table 1. The merging time is much shorter than the running time of the NPB applications. The longest merging time, 375.84 seconds for LU, is far less than the running time of the LU application itself.
TABLE 2
Results of Trace Merging

NPB Application            BT      CG      EP    FT    IS    LU       MG      SP
The merged trace number    17,111  41,954  5     47    38    324,355  10,043  26,891
The merging time (s)       20.22   49.68   0.02  0.12  0.08  375.84   12.57   35.47

6.4 Validation of Repeat Compression

In this section, the aim is to evaluate the compression ratio and the compression time of the repeat compression algorithm.

The compression is affected by the loop number. The NPB applications are scientific computing applications, and the loop structure dominates the main body in most of them. However, the traces of the EP, FT and IS applications are so small that we omit their results. Because enumerating all the inner and outer loops is complicated, we focus on the outer loops. Table 3 shows the main loop structure, original length, compressed length and compression ratio of the BT, CG, LU, MG and SP applications. The main loop structures are represented by the product of the statement number and the loop number of each loop, i.e., (15) x 200 denotes that the statement number inside the loop is 15 and the loop number is 200 for the BT application. The compression ratios are very high: even the lowest compression ratio, for MG, is 96.9 percent, and the highest, for LU, is 99.998 percent.

TABLE 3
Results of Repeat Compression

NPB Application  Main loop structure  Initial length  Compressed length  Compression ratio (%)
BT               (15) x 200           17,111          44                 99.743
CG               (12) x 75            41,954          26                 99.938
LU               (7) x 249            324,355         38                 99.998
MG               (134) x 20           10,043          302                96.993
SP               (15) x 400           26,891          43                 98.840

In order to further evaluate our repeat compression algorithm, we compare it with a "Top-Down" algorithm in [8] and a "Bottom-Up" algorithm in [9] in terms of compressed length and running time.

Table 4 shows that when the initial trace lengths are small, as for the BT, CG, MG, and SP applications, the compressed lengths with our algorithm are the same as those with the "Top-Down" and "Bottom-Up" algorithms. However, when the initial trace length is large, as for the LU application, the compressed length with our algorithm is shorter than that with the other algorithms.

TABLE 4
Compressed Length of Three Algorithms

NPB application  Initial length  Our algorithm  Top-Down  Bottom-Up
BT               17,111          44             44        44
CG               41,954          26             26        26
LU               324,355         38             –         47
MG               10,043          302            302       302
SP               26,891          43             43        43

Table 5 shows that the running time of our algorithm is shorter than that of the "Top-Down" and "Bottom-Up" algorithms for all NPB applications. The "Top-Down" algorithm is the most time-consuming. The running time of our algorithm is 50 percent shorter than that of the "Bottom-Up" algorithm for most applications. Tables 4 and 5 do not list the results of the "Top-Down" algorithm for the trace of LU. The reason is that the time to compress LU with the "Top-Down" algorithm is too long, far beyond 10^5 seconds.

TABLE 5
Running Time Comparison of Three Algorithms

NPB application  Initial length  Time (s): Our algorithm  Top-Down   Bottom-Up
BT               17,111          4.05                     276.46     6.96
CG               41,954          6.81                     1,498.27   7.59
LU               324,355         26.96                    >10^5      41.87
MG               10,043          2.76                     104.84     9.05
SP               26,891          5.61                     649.12     10.21

The time complexity of our algorithm is O(nlogn) while the time complexity of the "Top-Down" and "Bottom-Up" algorithms is O(n^2). Therefore, our algorithm shows much better asymptotic time complexity than the two other algorithms.

6.5 Validation of Prediction Accuracy

The aim is to evaluate the prediction accuracy of the dwarf code. We generate the dwarf codes of the BT, CG, LU, MG and SP applications with Class C on Dawning1000. The dwarf codes that are generated are 10 times smaller than the original programs. Then, the original programs and dwarf codes run separately on the Dawning1000 and IA32 clusters with all 16 nodes.

Table 6 shows the actual running time of the original program, the running time of the dwarf code, the predicted time of the dwarf code and the error rates on Dawning1000. Table 7 shows the results on IA32.

TABLE 6
Results of Time Prediction on Dawning1000 Cluster

NPB Application  Original program running time (s)  Dwarf code running time (s)  Dwarf code prediction time (s)  Error rate (%)
BT               1,094.65                           113.04                       1,130.4                        3.27
CG               384.81                             37.93                        379.3                          1.43
LU               877.27                             85.16                        851.6                          2.92
MG               786.34                             77.57                        775.7                          1.35
SP               1,219.16                           118.56                       1,185.6                        2.76
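The prediction columns of Tables 6 and 7 are consistent with the predicted time being the dwarf code's measured running time scaled back by the compression ratio (here CR = 0.1, i.e., a factor of 10); for the BT row of Table 6:

```latex
T_{\text{pred}} = \frac{T_{\text{dwarf}}}{CR} = \frac{113.04\,\text{s}}{0.1} = 1{,}130.4\,\text{s},
\qquad
\text{error} = \frac{|T_{\text{pred}} - T_{\text{orig}}|}{T_{\text{orig}}}
             = \frac{|1{,}130.4 - 1{,}094.65|}{1{,}094.65} \approx 3.27\%.
```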
As for Dawning1000, the prediction error rates are less than 3 percent for the 5 NPB applications; as for IA32, the error rates do not exceed 10 percent. The prediction difference between the Dawning1000 and IA32 clusters is that IA32 is a loosely coupled cluster across two subnets. It consists of processors of different speeds (866 MHz and 1 GHz) and a low-bandwidth network (100 Mb/s for interconnections and 1 Gb/s for intraconnections).

TABLE 7
Results of Time Prediction on IA32 Cluster

NPB Application  Original program running time (s)  Dwarf code running time (s)  Dwarf code prediction time (s)  Error rate (%)
BT               3,558.72                           337.65                       3,376.5                        5.12
CG               1,042.09                           101.72                       1,017.2                        2.39
LU               3,191.50                           297.54                       2,975.4                        6.77
MG               2,040.17                           183.94                       1,839.4                        9.84
SP               4,295.63                           401.81                       4,018.1                        6.46

6.6 Validation of Scalability

To verify the scalability of DwarfCode, real applications were deployed on modern hardware with fast processors and low-latency networking. The scalability is measured on the KongFu high performance cluster, which is installed at the High Performance Computing Center of Harbin Institute of Technology.

Parallel NBody Simulation is introduced as a real application. The NBody application aims to predict the motion of a great number of celestial objects that interact through gravitation and repulsion forces. The code of the application is acquired from the Petascale Education Program supported by the NCSA Blue Waters project [26]. The parallel version contains approximately 3,000 lines of code, written in C plus MPI. The input parameter body number is important in determining the computing time and is equally divided among the nodes. In the following experiments, the number of bodies is set to 256 K, which is an upper bound size to fit the memory of each node. The dwarf code of Parallel NBody Simulation is generated from the traces on 16 cores.

TABLE 8
Validation of Scalability on KongFu Cluster for Parallel NBody Application with 256 K Bodies

#Cores  Original program running time (s)  Dwarf code prediction time (s)  Error rate (%)
4       6,797.22                           5,496.71                        19.13
8       4,236.74                           3,435.89                        18.90
16      2,564.52                           2,090.18                        18.49
32      1,563.77                           1,418.03                        9.32
64      968.53                             905.96                          6.46
128     578.42                             542.21                          6.26
256     348.89                             326.11                          6.53

Table 8 shows the results when the application runs on four, eight and 16 cores. The error rates are approximately 20 percent. However, when it runs on 32, 64, 128 and 256 cores, the error rates are no more than 10 percent. By analyzing the utilization of memory and swap space, we find that when the application runs on four, eight and 16 cores, it uses more RAM than is physically available and the paging procedure consumes more time than expected. However, when the number of cores is greater than 32, the memory is sufficient and the computing and communication parts dominate the running time. Thus, the error rates are under 10 percent and the scalability can be verified.

The procedure to handle collective communication is as follows. First, DwarfCode uses PMPI to intercept the collective calls when they are executed. All the processes involved in the group communication record the collective calls in their traces. Their parameters are recorded, including the buffers (sendbuf & recvbuf), list length (count), data type (datatype), MPI operation (op), and MPI communicator (comm). Meanwhile, the function names and the timestamps are recorded. Second, collective communication does not need to reorder event sequences, unlike point-to-point communication, because the whole communication group is involved in the communication. Third, the procedure of repeat compression is identical to that for point-to-point communication. Finally, when we generate the dwarf code, we make sure to correctly recover the parameters comm, count, datatype and op.

To verify DwarfCode's scalability for collective communication, we summarize the collective calls invoked by the NPB and NBody applications. Table 9 shows that most of the common collective operations are covered and the generated dwarf codes work well, including MPI_Bcast, MPI_Gather, MPI_Scatter, etc.

TABLE 9
Collective Functions Involved in NPB & NBody Applications

Invoked by at least one of the nine applications: MPI_Bcast, MPI_Gather, MPI_Scatter, MPI_Allgather, MPI_Alltoall, MPI_Alltoallv, MPI_Barrier, MPI_Reduce, MPI_Allreduce, MPI_Reduce_scatter.
Not invoked: MPI_Gatherv, MPI_Scatterv, MPI_Allgatherv, MPI_Scan, MPI_Op_create, MPI_Op_free.

7 RELATED WORK

Relevant previous projects mainly focus on finding the benchmark most similar to an application to predict its performance. In 1989, Cole originally put forward Algorithmic Skeletons [27] by designing program templates for several frequently used parallel programs. In 1993, Dikaiakos et al. [28] extended the simulation scale of parallel programs through the method of functional algorithm simulation. They built a FAST prototype which collects external information to forecast the performance of massively parallel programs. In 1999, Dinda and Hallaron found that the running time of applications is closely related to the workload [29]. Hoste et al. [30] describe architecture-independent characteristics to find the most similar benchmarks to predict the performance of CPU-intensive
ZHANG ET AL.: DWARFCODE: A PERFORMANCE PREDICTION TOOL FOR PARALLEL APPLICATIONS 505

Lu and Reed [31] propose a method that uses curve fitting to compress parallel programs, reducing the program running time significantly. Sherwood et al. [32] study automatic analysis of the periodicity of parallel programs. Some works also focus on predicting a web application's performance through modeling [33], [34]. In contrast to DwarfCode, these approaches rely on a library of existing benchmarks, which neglects the diversity and complexity of applications and platforms.

HPC simulators, such as SST [41], BigSim [42], ROSS [43], and PSINS [44], allow simulation of diverse aspects of hardware and software, but the prediction accuracy for the application running time is reduced by the loss of detail in the modeling process. Our system automatically generates the dwarf code, a customized benchmark, which can be replayed in real time without modeling either the application or the platform.

Some recent attempts aim to generate a shorter-running benchmark of the real application and replay it on the target platform. CloudProphet [35] is an end-to-end performance prediction tool for web applications in the cloud. It replays the trace log by capturing the resource usage and extracting the dependency. CloudProphet only focuses on web applications, while DwarfCode pays close attention to MPI applications.

Several studies address performance prediction for MPI applications. Dimemas [5] is a performance prediction tool for MPI applications in the Grid environment. It captures the CPU bursts and the communication patterns, and it models the target architecture with a configuration file. Meanwhile, Sodhi et al. [6], [7] propose a framework for automatic generation of performance skeletons. Xu et al. [8], [9] present the generation of coordinated performance skeletons, similar to dwarf code, with logicalization and compression procedures. Parallel application signatures for performance prediction (PAS2P) is a tool studied by Wong et al. [10]: based on the application's message-passing activity, representative phases can be identified and extracted, with which a parallel application signature can be created to predict the application's performance. Mueller et al. [11], [12], [36], [37] introduce intra-node and inter-node compression techniques for MPI events that are capable of extracting an application's communication structure, and they present an automatic generation mechanism for replaying the traces. Chen et al. [38] implement a performance prediction framework, called PHANTOM, which integrates the computation-time acquisition approach with a trace-driven network simulator. Also, part of our preliminary work to build representative benchmarks is shown in [39].

These approaches are the closest to the DwarfCode work presented in this paper. However, there are some key differences:

1. Trace merging. Several studies, for example, [6], [7], [8], [11], [12], have been conducted on trace merging algorithms, but they all have some pitfalls. Sodhi et al. [6], [7] put forward methods to identify and cluster similar events without considering sequence differences and function conflicts. Xu et al. [8], [9] match communication patterns with application communication graphs represented by matrices; the upper complexity bound of the graph spectrum and isomorphism algorithms is O(n^3) for n events in each trace, and the approach neglects function conflicts. Mueller's studies in [11], [12] maintain a dependence graph during the entire merge algorithm, and the upper complexity bound of the overall merge operation is O(n^2) for n events in each trace. DwarfCode not only considers the sequence differences and function conflicts, but also reduces the time complexity of the trace merging algorithm to O(mpn).

2. Repeat compression. Related approaches suffer from high time complexity in the repeat compression step. Sodhi et al. [6], [7] recognize and compress repeated execution behaviors as loops to generate the final execution skeleton; the complexity of their compression algorithm is O(n^3). Xu et al. [8], [9] take a variant approach that identifies the loop structures in a trace based on Crochemore's algorithm [18]; the complexity of this compression algorithm is O(n^2). Wong et al. [10] introduce a pattern identification algorithm to find the most relevant phases of the parallel applications, whose complexity is O(n^2). Mueller et al. [12] propose intra-node and inter-node compression techniques for MPI events that are capable of extracting an application's communication structure; their complexity is O(n^2). DwarfCode introduces a novel repeat compression algorithm based on suffix arrays whose time complexity is O(nlogn).
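To make the contrast concrete, the fragment below is a minimal sketch of the repeat-compression idea: it folds immediately repeating event subsequences of a trace into (start, length, count) loop records. It is only a simplified greedy illustration of folding tandem repeats, not the suffix-array-based O(nlogn) algorithm used by DwarfCode; the integer event encoding and the dwarf_loop record layout are assumptions made for this example.

    #include <stdio.h>

    /* One compressed trace entry: a block of 'len' events starting at index
     * 'start' in the original trace, executed 'count' times in a row.
     * (Hypothetical record layout, for illustration only.)                  */
    typedef struct { int start; int len; int count; } dwarf_loop;

    /* Greedy tandem-repeat folding over an event-ID trace.  For each position
     * it tries every period length and keeps the fold that covers the most
     * events.  This simplified scan can be quadratic or worse; DwarfCode
     * reaches O(nlogn) with suffix arrays, which is not reproduced here.     */
    static int fold_repeats(const int *ev, int n, dwarf_loop *out) {
        int i = 0, m = 0;
        while (i < n) {
            int best_len = 1, best_cnt = 1;
            for (int len = 1; i + 2 * len <= n; len++) {
                int cnt = 1;
                while (i + (cnt + 1) * len <= n) {
                    int same = 1;
                    for (int k = 0; k < len; k++)
                        if (ev[i + k] != ev[i + cnt * len + k]) { same = 0; break; }
                    if (!same) break;
                    cnt++;
                }
                if (cnt >= 2 && cnt * len > best_cnt * best_len) {
                    best_len = len; best_cnt = cnt;
                }
            }
            out[m].start = i; out[m].len = best_len; out[m].count = best_cnt;
            m++;
            i += best_len * best_cnt;
        }
        return m;
    }

    int main(void) {
        /* Toy trace: compute(0), MPI_Send(1), MPI_Recv(2) repeated three times,
         * followed by a barrier(3).                                            */
        int trace[] = { 0, 1, 2, 0, 1, 2, 0, 1, 2, 3 };
        dwarf_loop loops[10];
        int m = fold_repeats(trace, 10, loops);
        for (int j = 0; j < m; j++)
            printf("events [%d..%d) x %d\n",
                   loops[j].start, loops[j].start + loops[j].len, loops[j].count);
        return 0;
    }

On the toy trace above, the sketch reports the three-event block repeated three times followed by the single barrier event, which is the kind of loop structure the dwarf code replays.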
8 DISCUSSIONS

DwarfCode is mainly designed for performance prediction of MPI applications on cluster systems, but its principle can aid performance prediction for hybrid MPI + OpenMP applications on multicore systems and for hybrid MPI + GPU applications on hybrid-core systems with hardware accelerators.

1) Hybrid MPI + OpenMP applications combine MPI-level parallelism with loop-level (OpenMP) parallelism. Their running time is the sum of the intra-node OpenMP and inter-node MPI call costs, adjusted by an overlapping factor [45] (a rough sketch of measuring such component costs is given after this list). Our method can help build a parameterized communication model for the inter-node MPI calls, while intra-node OpenMP performance can be acquired by analyzing the memory bandwidth contention.

2) Hybrid MPI + GPU applications on hybrid-core systems conform to either a classic MPI + GPU model or a GPU-integrated MPI model (MPI-ACC [46] and MVAPICH-GPU [47]). For the classic MPI + GPU model, similar to hybrid MPI + OpenMP, the running time is the sum of the MPI calls between hosts and the data copies performed between main memory and the local GPU's device memory. We can leverage our method for the inter-node MPI calls and calculate the costs of cudaMemcpy or clEnqueueWriteBuffer for the data copies. For the GPU-integrated MPI model, the programmer can use the GPU buffer directly as the communication parameter in MPI routines; this makes it difficult to create the dwarf code and needs further investigation.
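As a rough illustration of how such component costs could be measured, the sketch below separately times a host-to-device copy (cudaMemcpy here; an OpenCL variant would time clEnqueueWriteBuffer instead) and an inter-node MPI ping-pong with MPI_Wtime, and then reports their simple additive combination. The payload size, the iteration count, and the additive model itself (for MPI + OpenMP one could similarly assume, say, T ≈ T_OpenMP + T_MPI − β·min(T_OpenMP, T_MPI) with an overlap factor β between 0 and 1) are illustrative assumptions rather than the calibrated models of [45].

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Measures the two cost components of the classic MPI + GPU model:
     * host<->device copies (cudaMemcpy) and inter-node MPI transfers.
     * Payload size and repetition count are arbitrary example values.   */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int bytes = 1 << 20;              /* 1 MiB payload (assumption) */
        const int iters = 100;
        char *host = (char *)malloc(bytes);
        char *dev  = NULL;
        cudaMalloc((void **)&dev, bytes);

        /* Component 1: main memory <-> local GPU device copies. */
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        cudaDeviceSynchronize();
        double t_copy = (MPI_Wtime() - t0) / iters;

        /* Component 2: MPI calls between hosts (ping-pong of ranks 0 and 1). */
        double t_mpi = 0.0;
        if (size >= 2 && rank < 2) {
            int peer = 1 - rank;
            t0 = MPI_Wtime();
            for (int i = 0; i < iters; i++) {
                if (rank == 0) {
                    MPI_Send(host, bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
                    MPI_Recv(host, bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else {
                    MPI_Recv(host, bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(host, bytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
                }
            }
            t_mpi = (MPI_Wtime() - t0) / (2.0 * iters);   /* one-way estimate */
        }

        if (rank == 0)
            printf("per-copy %.6f s, per-message %.6f s, additive estimate %.6f s\n",
                   t_copy, t_mpi, t_copy + t_mpi);

        cudaFree(dev);
        free(host);
        MPI_Finalize();
        return 0;
    }

Built against both an MPI and a CUDA toolchain, such micro-measurements provide the per-message and per-copy terms that an additive running-time estimate sums.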
Due to event reordering and potential information loss caused by inter-process trace merging, the code generated from traces may have deadlocks. The key issue in ensuring deadlock freedom is to identify and label the non-matching calls.
1) A procedure to mark the non-matching calls is outlined in our former work [7], [8]. It is based on the basic deadlock-free patterns, which are a) a non-blocking Send/Recv with a matching Recv/Send before the corresponding Wait, and b) one or more blocking Send/Recv calls followed by matching Recv/Send calls (a minimal example of pattern a) is sketched after this list). The non-matching calls are labeled with our algorithm in [7], [8] and ignored for code generation.

2) We also solve this with the help of an MPI runtime error detection tool, the Marmot Umpire Scalable Tool (MUST) [48]. MUST can cover various process-level correctness checks and is especially skilled in deadlock detection. We introduce the following steps to ensure the correctness of the final dwarf code: a) run the dwarf code and intercept all MPI calls of all processes at runtime; b) generate a message dependence graph (MDG) or a wait-for graph (WFG); c) perform type matching, collective verification, and deadlock detection with MUST's centralized deadlock detector. MUST's AND⊕OR model can achieve sub-linear analysis time.
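For concreteness, the fragment below is a minimal sketch of pattern a) above: every rank posts a non-blocking send, completes the matching receive, and only then waits on the send request, so the exchange cannot deadlock. The ring neighbors, tag, and payload are arbitrary example choices.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal sketch of deadlock-free pattern a): a non-blocking send whose
     * matching receive completes before the corresponding MPI_Wait.          */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int right = (rank + 1) % size;             /* ring neighbors (example) */
        int left  = (rank - 1 + size) % size;
        double sendbuf = (double)rank, recvbuf = -1.0;
        MPI_Request req;

        MPI_Isend(&sendbuf, 1, MPI_DOUBLE, right, 42, MPI_COMM_WORLD, &req);
        MPI_Recv(&recvbuf, 1, MPI_DOUBLE, left, 42, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* the Wait follows the matching Recv */

        printf("rank %d received %.0f from rank %d\n", rank, recvbuf, left);
        MPI_Finalize();
        return 0;
    }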
Trace recording needs further improvement to reduce the trace size. Our approach collects raw communication traces for each process of a parallel application. The size of the uncompressed process-level trace usually increases with the number of communication calls. When the number of calls is too large, the size of the raw trace may exceed the storage capacity of a single node. However, there are three ways to alleviate this problem: 1) the trace can be stored in the HPC storage system instead of on the node generating the trace; 2) the records in the trace can be represented in binary form rather than the current ASCII encoding; 3) online trace merging can be introduced once the trace length exceeds the trace distance, so that merging does not wait for all the traces to be generated.
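As an illustration of option 2), a fixed-size binary record can stand in for a formatted ASCII line per communication event; the field layout below is a hypothetical one chosen for the example, not DwarfCode's actual trace format.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical fixed-size binary trace record replacing an ASCII line
     * such as "MPI_Send dst=5 bytes=8192 t0=0.001342 t1=0.001417".         */
    typedef struct {
        uint16_t op;       /* event type: 0 = compute, 1 = MPI_Send, 2 = MPI_Recv, ... */
        uint16_t peer;     /* partner rank, if any                                     */
        uint32_t bytes;    /* message size                                             */
        double   t_start;  /* timestamps, e.g. taken from MPI_Wtime                    */
        double   t_end;
    } trace_rec;

    int main(void) {
        FILE *f = fopen("trace.bin", "wb");
        if (!f) return 1;

        trace_rec r = { 1, 5, 8192, 0.001342, 0.001417 };
        fwrite(&r, sizeof r, 1, f);        /* one event = sizeof(trace_rec) bytes */
        fclose(f);

        f = fopen("trace.bin", "rb");      /* reading back is one fread per event */
        if (!f) return 1;
        trace_rec in;
        if (fread(&in, sizeof in, 1, f) == 1)
            printf("op=%u peer=%u bytes=%u dt=%.6f s\n",
                   (unsigned)in.op, (unsigned)in.peer, (unsigned)in.bytes,
                   in.t_end - in.t_start);
        fclose(f);
        return 0;
    }

On typical platforms this record occupies 24 bytes regardless of the field values, whereas the equivalent ASCII line grows with the printed precision and the call arguments.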
9 CONCLUSIONS

Model-driven and trace-driven performance prediction techniques are of limited use in practice. We present DwarfCode, a performance prediction tool for MPI applications. It includes procedures for trace recording, trace merging, repeat compression, and dwarf code generation. Researchers can download our toolkit for free under a GNU GPL v3 license. Our main contribution is three-fold: 1) An O(mpn) trace merging algorithm is proposed, which can also tackle sequence differences and function conflicts. 2) A novel repeat compression algorithm based on suffix arrays is designed, whose time complexity is O(nlogn). It converts the original problem into an optimal string compression problem: first, we find all the primitive and inextensible tandem arrays; then, we acquire the optimal combination of the tandem arrays to form the solution. 3) The dwarf code can be built on fewer cores and predict the running time of the application on clusters with a similar architecture but more cores. The results show that DwarfCode can accurately predict the running time of MPI applications. The error rate is less than 10 percent for computing and communication intensive applications.

Current research mainly focuses on modeling the computation and communication, which are the typical events of scientific applications such as the NPB applications. However, more complicated and irregular codes should be considered. Future work includes addressing memory and I/O intensive codes and validation with complete multi-phase applications. We are porting the MpiBlast and SPH applications to our platform.

ACKNOWLEDGMENTS

The authors would like to thank Prof. Marc Snir and Dr. Babak Behzad at the University of Illinois at Urbana-Champaign for insightful discussion about the paper revision. They also thank the anonymous reviewers for their comments, which are all valuable and very helpful for revising and improving our paper. This work was supported in part by the National Basic Research Program of China under Grant No. G2011CB302605. This work was partially supported by the National Natural Science Foundation of China (NSFC) under grant No. 61173145 and also the Doctoral Program of Higher Education of China under grant No. 20132302110037. Prof. Albert Cheng was supported by the US National Science Foundation under Awards No. 0720856 and No. 1219082.

REFERENCES

[1] I. Foster, "The Grid: A new infrastructure for 21st century science," in Grid Computing: Making the Global Infrastructure a Reality. Hoboken, NJ, USA: Wiley, 2003, pp. 51–63.
[2] W. Zhang, B. Fang, M. Hu, X. Liu, H. Zhang, and L. Gao, "Multisite co-allocation scheduling algorithms for parallel jobs in computing grid environments," Sci. China Ser. F: Inf. Sci., vol. 49, no. 6, pp. 906–926, 2006.
[3] X. Gao, A. Snavely, and L. Carter, "Path grammar guided trace compression and trace approximation," in Proc. 15th IEEE Int. Symp. High Perform. Distrib. Comput., 2006, pp. 57–68.
[4] N. Cardwell, S. Savage, and T. Anderson, "Modeling TCP latency," in Proc. INFOCOM, 2000, pp. 1742–1751.
[5] R. M. Badia, F. Escale, E. Gabriel, J. Gimenez, R. Keller, J. Labarta, and M. S. Müller, "Performance prediction in a grid environment," in Grid Computing. Berlin, Germany: Springer, 2004, pp. 257–264.
[6] S. Sodhi and J. Subhlok, "Skeleton based performance prediction on shared networks," in Proc. IEEE Int. Symp. Cluster Comput. Grid, 2004, pp. 723–730.
[7] S. Sodhi, J. Subhlok, and Q. Xu, "Performance prediction with skeletons," Cluster Comput., vol. 11, no. 2, pp. 151–165, 2008.
[8] Q. Xu and J. Subhlok, "Construction and evaluation of coordinated performance skeletons," in Proc. 15th Int. Joint Conf. High Perform. Comput., 2008, pp. 73–86.
[9] Q. Xu, J. Subhlok, and N. Hammen, "Efficient discovery of loop nests in execution traces," in Proc. IEEE Int. Symp. Model., Anal. Simul. Comput. Telecommun. Syst., 2010, pp. 193–202.
[10] A. Wong, D. Rexachs, and E. Luque, "Extraction of parallel application signatures for performance prediction," in Proc. 12th IEEE Int. Conf. High Perform. Comput. Commun., 2010, pp. 223–230.
[11] M. Noeth, P. Ratn, F. Mueller, S. Martin, and B. R. de Supinski, "ScalaTrace: Scalable compression and replay of communication traces for high-performance computing," J. Parallel Distrib. Comput., vol. 69, no. 8, pp. 696–710, 2009.
[12] P. Ratn, F. Mueller, B. R. de Supinski, and M. Schulz, "Preserving time in large-scale communication traces," in Proc. 22nd Annu. Int. Conf. Supercomput., 2008, pp. 46–55.
[13] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga, "The NAS parallel benchmarks summary and preliminary results," in Proc. 5th Annu. Int. Conf. Supercomput., 1991, pp. 158–165.
[14] Message Passing Interface Forum [Online]. Available: http://www.mpi-forum.org/, 2012.
[15] R. Schöne, R. Tschüter, T. Ilsche, and D. Hackenberg, "The vampirtrace plugin counter interface: Introduction and examples," in Proc. Euro-Par Parallel Process. Workshops, 2011, pp. 501–511.
[16] J. S. Vetter and O. M. Michael, "Statistical scalability analysis of communication operations in distributed applications," ACM SIGPLAN Notices, vol. 36, no. 7, pp. 123–132, 2001.
[17] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge, U.K.: Cambridge Univ. Press, 1997.
[18] M. Crochemore, "An optimal algorithm for computing the repetitions in a word," Inf. Process. Lett., vol. 12, no. 5, pp. 244–250, 1981.
[19] A. Apostolico and F. P. Preparata, "Optimal off-line detection of repetitions in a string," Theor. Comput. Sci., vol. 22, no. 3, pp. 297–315, 1983.
[20] M. G. Main and R. J. Lorentz, "An O(nlogn) algorithm for finding all repetitions in a string," J. Algorithms, vol. 5, no. 3, pp. 422–432, 1984.
[21] U. Manber and G. Myers, "Suffix arrays: A new method for on-line string searches," SIAM J. Comput., vol. 22, no. 5, pp. 935–948, 1993.
[22] M. Lothaire, Applied Combinatorics on Words, vol. 105. Cambridge, U.K.: Cambridge Univ. Press, 2005.
[23] M. A. Bender and M. Farach-Colton, "The LCA problem revisited," in Proc. 4th LATIN Amer. Symp.: Theoretical Informat., 2000, pp. 88–94.
[24] J. Fischer and V. Heun, "A new succinct representation of RMQ-information and improvements in the enhanced suffix array," in Combinatorics, Algorithms, Probabilistic and Experimental Methodologies. Berlin, Germany: Springer, 2007, pp. 459–470.
[25] J. Kärkkäinen, P. Sanders, and S. Burkhardt, "Linear work suffix array construction," J. ACM, vol. 53, no. 6, pp. 918–936, 2006.
[26] NCSA Blue Waters project, Undergraduate Petascale Education Program [Online]. Available: http://www.shodor.org/petascale/, 2015.
[27] M. I. Cole, Algorithmic Skeletons: Structured Management of Parallel Computation. London, U.K.: Pitman, 1989.
[28] M. D. Dikaiakos, A. Rogers, and K. Steiglitz, "Fast: A functional algorithm simulation testbed," in Proc. 2nd Int. Workshop Model., Anal., Simul. Comput. Telecommun. Syst., 1994, pp. 142–146.
[29] P. A. Dinda and D. R. O'Hallaron, "An evaluation of linear models for host load prediction," in Proc. 8th Int. Symp. High Perform. Distrib. Comput., 1999, pp. 87–96.
[30] K. Hoste, A. Phansalkar, L. Eeckhout, A. Georges, L. K. John, and K. De Bosschere, "Performance prediction based on inherent program similarity," in Proc. 15th Int. Conf. Parallel Archit. Compilation Techn., Sep. 2006, pp. 114–122.
[31] C. D. Lu and D. A. Reed, "Compact application signatures for parallel and distributed scientific codes," in Proc. ACM/IEEE Conf. Supercomput., Nov. 2002, pp. 1–10.
[32] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically characterizing large scale program behavior," ACM SIGARCH Comput. Archit. News, vol. 30, no. 5, pp. 45–57, 2002.
[33] C. Stewart and S. Kai, "Performance modeling and system management for multi-component online services," in Proc. 2nd Conf. Symp. Netw. Syst. Des. Implementation, 2005, vol. 2, pp. 71–84.
[34] B. Urgaonkar, G. Pacifici, P. Shenoy, M. Spreitzer, and A. Tantawi, "An analytical model for multi-tier internet services and its applications," ACM SIGMETRICS Perform. Eval. Rev., vol. 33, no. 1, pp. 291–302, 2005.
[35] A. Li, X. Zong, S. Kandula, X. Yang, and M. Zhang, "CloudProphet: Towards application performance prediction in cloud," ACM SIGCOMM Comput. Commun. Rev., vol. 41, no. 4, pp. 426–427, 2011.
[36] X. Wu, K. Vijayakumar, F. Mueller, X. Ma, and P. C. Roth, "Probabilistic communication and I/O tracing with deterministic replay at scale," in Proc. Int. Conf. Parallel Process., 2011, pp. 196–205.
[37] X. Wu, F. Mueller, and S. Pakin, "Automatic generation of executable communication specifications from parallel applications," in Proc. Int. Conf. Supercomput., 2011, pp. 12–21.
[38] J. Zhai, W. Chen, and W. Zheng, "Phantom: Predicting performance of parallel applications on large-scale parallel machines using a single node," ACM Sigplan Notices, vol. 45, no. 5, pp. 305–314, 2010.
[39] W. Zhang, T. Han, Y. Zhang, and A. M. Cheng, "Performance prediction for MPI parallel jobs," in Proc. IEEE Int. Conf. Cluster Comput. Workshops, 2012, pp. 136–142.
[40] DwarfCode [Online]. Available: https://github.com/wzzhang-HIT/DwarfCode, 2014.
[41] Sandia National Laboratories. SST: The structural simulation toolkit [Online]. Available: http://sst.sandia.gov/, 2011.
[42] G. Zheng, K. Gunavardhan, and V. K. Laxmikant. (2004). Bigsim: A parallel simulator for performance prediction of extremely large parallel machines. Proc. 18th Int. Parallel Distributed Process. Symp., p. 78 [Online]. Available: http://charm.cs.uiuc.edu/research/bigsim
[43] ROSS: Rensselaer's Optimistic simulation system [Online]. Available: https://github.com/carothersc/ROSS/wiki, 2013.
[44] M. Tikir, M. Laurenzano, L. Carrington, and A. Snavely, "PSINS: An open source event tracer and execution simulator for MPI applications," in Proc. Euro-Par Parallel Process., 2009, pp. 135–148.
[45] X. Wu and V. Taylor, "Performance modeling of hybrid MPI/OpenMP scientific applications on large-scale multicore cluster systems," in Proc. IEEE 14th Int. Conf. Comput. Sci. Eng., 2011, pp. 181–190.
[46] A. Aji, J. Dinan, D. Buntinas, P. Balaji, W. Feng, K. Bisset, and R. Thakur, "MPI-ACC: An integrated and extensible approach to data movement in accelerator-based systems," in Proc. IEEE 14th Int. Conf. High Perform. Comput. Commun. / IEEE 9th Int. Conf. Embedded Softw. Syst., 2012, pp. 647–654.
[47] A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur, and D. Panda, "MPI alltoall personalized exchange on GPGPU clusters: Design alternatives and benefit," in Proc. IEEE Int. Conf. Cluster Comput., 2011, pp. 420–427.
[48] T. Hilbrich, J. Protze, M. Schulz, B. de Supinski, and M. Müller, "MPI runtime error detection with MUST: Advances in deadlock detection," Sci. Programm., vol. 21, no. 3, pp. 109–121, 2013.
[49] M. Geimer, F. Wolf, B. J. N. Wylie, E. Ábrahám, D. Becker, and B. Mohr, "The Scalasca performance toolset architecture," Concurrency Comput.: Practice Exp., vol. 22, no. 6, pp. 702–719, Apr. 2010.
[50] VAMPIR - Performance Optimization [Online]. Available: https://www.vampir.eu/, 2015.

Weizhe Zhang is a professor in the School of Computer Science and Technology, Harbin Institute of Technology, China. He has been a visiting scholar in the Department of Computer Science, University of Illinois at Urbana-Champaign and the University of Houston. His research interests are primarily in parallel computing, distributed computing, cloud computing. He has published more than 100 academic papers in journals, books, and conference proceedings. He is a member of the IEEE.

Albert M.K. Cheng received the BA degree with highest honors in computer science, graduating Phi Beta Kappa, the MS degree in computer science with a minor in electrical engineering, and the PhD degree in computer science, all from The University of Texas at Austin, where he held a GTE Foundation Doctoral Fellowship. He is a professor and a former interim associate chair of the Computer Science Department, University of Houston. He received numerous awards. He is the author of the popular textbook entitled Real-Time Systems: Scheduling, Analysis, and Verification (Wiley) and more than 200 refereed publications on real-time, embedded, and cyber-physical systems. He is a senior member of the IEEE and a fellow of the Institute of Physics.

Jaspal Subhlok received the PhD degree in computer science from Rice University. His research interest involves high performance computing. He is a professor and chair of the Computer Science Department, University of Houston. He has published more than 100 academic papers in journals, books, and conference proceedings. He is a member of the IEEE.