DwarfCode: A Performance Prediction Tool for Parallel Applications
Abstract—We present DwarfCode, a performance prediction tool for MPI applications on diverse computing platforms. The goal is to accurately predict the running time of applications for task scheduling and job migration. First, DwarfCode collects execution traces to record the computing and communication events. Then, it merges the traces from different processes into a single trace. After that, DwarfCode identifies and compresses the repeating patterns in the final trace to shrink the size of the trace. Finally, a dwarf code is generated to mimic the original program behavior. This smaller running benchmark is replayed on the target platform to predict the performance of the original application. Generating such a benchmark poses two major challenges: reducing the time complexity of the trace merging and repeat compression algorithms. We propose an O(mpn) trace merging algorithm to combine the traces generated by separate MPI processes, where m denotes the upper bound of the tracing distance, p denotes the number of processes, and n denotes the maximum number of events over all the traces. More importantly, we put forward a novel repeat compression algorithm whose time complexity is O(n log n). Experimental results show that DwarfCode can accurately predict the running time of MPI applications. The error rate is below 10 percent for compute- and communication-intensive applications. The toolkit has been released for free download under the GNU General Public License v3.
Index Terms—Performance prediction, MPI application, DwarfCode, trace merging, trace compression
1 INTRODUCTION
O(mpn), where m denotes the upper bound of the tracing distance, p denotes the number of processes, and n denotes the maximum number of events over all the traces. The complete algorithm is shown in Algorithm 1. We initialize the pointers to the top of each trace (lines 1-3). Then, we tackle the function conflicts and merge similar events (lines 4-21). If the corresponding events can be merged without function conflicts, they are simply joined into the final trace. Otherwise, the tracing distance is calculated, and the maximal tracing distance is merged with time complexity O(mpn). Thus, the overall time complexity of Algorithm 1 is O(mpn).

Algorithm 1. Trace-Merging Algorithm
Input: T—a set of all the traces
Output: F—the merged trace
Variables: p—the number of traces in T
  n—the maximum number of events over all the traces
  ti—the ith trace (a set of events from process i), where 0 ≤ i < p, ti ∈ T
  li—the location of the current event to be merged in ti, where 0 ≤ li < n
  m—the upper bound of the tracing distance
  dki—the tracing distance of the kth event of ti
1. foreach 0 ≤ i < p do
2.   li = 0
3. endfor
4. repeat
5.   if the events pointed to by all the li can be merged, then
6.     merge the events and insert the merged events at the bottom of F
7.     foreach 0 ≤ i < p do
8.       li = li + 1
9.     endfor
10.    continue
11.  endif
12.  foreach 0 ≤ i < p do // tackle the function conflicts
13.    calculate the tracing distance dki of the event at position k = li of trace ti
14.    if dki > m, then dki = m
15.    endif
16.  endfor
17.  sort the dki in descending order
18.  merge the events with the highest dki
19.  insert the merged event at the bottom of F
20.  li = li + 1
21. until (each li reaches the end of its trace ti)

The time complexity of trace merging can be further reduced to O(mn log p) if a parallel trace merging architecture is introduced. This advanced architecture is used in Scalasca [49] and VampirServer [50] with Open Trace Format 2 (OTF2). First, our system loads the trace files into memory. Then, the trace files are bisected and recursively merged based on Algorithm 1. Finally, all trace files are merged into a single trace in parallel.

4 REPEAT COMPRESSION
After gaining the merged trace, the core task of DwarfCode is to identify and compress the repeating patterns in the final trace to shrink the size of the trace. The purpose of repeat compression is two-fold: 1) the computation and communication events inside a loop are spread out during trace recording, and these loops should be discovered and folded again; 2) to downsize the original program and generate the dwarf code, similar and successive events are also recognized and compressed even if these events are not in the same loop.

In this section, the repeat compression problem is first formulated as an optimal string compression problem. The key is then two-fold: 1) find all the primitive and inextensible tandem (PIT) arrays; 2) find the optimal combination of tandem arrays to acquire the optimal compression of the original string. After solving the two sub-problems, the repeat compression algorithm is presented and its complexity is analyzed in detail.

4.1 Formulation
The merged trace can be converted into a string of symbols. Events with similar functions are represented with the same symbol; the similarity of events is determined as stated in Section 3.1. The merged trace is thus symbolized as a string S over a finite alphabet of fixed size, such as S = xyaxyabcdabcdabcdae. The aim of repeat compression is to discover and shrink the loop nest structures or similar successive events. Thus, the repeat compression problem can be converted into finding and reducing the repeating substrings in S.

Repeating substrings are divided into three categories: 1) a tandem repeat means two successive identical substrings immediately follow each other (e.g., in abcabc, abc is the repeated substring); 2) an overlap repeat means two identical substrings overlap (e.g., in abaabaab, abaab is the repeated substring); 3) a split repeat means two identical substrings are separated by some nonempty substring (e.g., in abcdeabc, abc is the repeated substring). Note that a loop structure in the trace corresponds exactly to a tandem repeat in the string. Thus, we are only interested in finding the tandem repeats; the overlap and split ones are omitted.

Specifically, if a tandem repeat does not contain shorter repeats, it is called a primitive tandem repeat. A tandem repeat is inextensible if there is no identical substring immediately before or after the repeats. For instance, if the string is represented as (abab)^2, the repeat (abab)^2 is not primitive because it contains the shorter repeat ab. If the string is represented as (ab)^3 ab, the repeat (ab)^3 is not inextensible because an identical substring follows it.

We define the trace length after compression as the metric to evaluate the optimal compression: the shorter the length, the better the compression. For example, consider the string xyaxyabcdabcdabcdae. There exist two primitive and inextensible compressions, (xya)^2 (bcda)^3 e and xyaxy (abcd)^3 ae. The former's length is 8, and the latter's is 11. Moreover, more than one compression may be optimal. For example, consider the string abcabcabca. Both (abc)^3 a and a (bca)^3 are optimal.

Before formulating the repeat compression problem more precisely, we introduce some terminology on strings.

Definition 1. A string S denotes an ordered list of symbols over a finite alphabet of fixed size. The length of S is |S| = n. For 1 ≤ i ≤ j ≤ n, S[i..j] denotes the substring of S beginning with the ith and ending with the jth symbol of S. suffix(i) denotes S[i..n].
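To make the length metric concrete, the following sketch (illustrative only; the helper name and the 0-based (start, body, repeats) triples are our own convention, not DwarfCode's) computes the compressed length of S under a candidate combination of non-overlapping tandem arrays:

```python
def compressed_length(s, combination):
    """Length of s after shrinking each selected tandem array
    (start, body, repeats) down to a single copy of its body.
    Assumes the selected arrays do not overlap; starts are 0-based."""
    length = len(s)
    for start, body, repeats in combination:
        # sanity check: the array really is `repeats` copies of `body`
        assert s[start:start + len(body) * repeats] == body * repeats
        length -= len(body) * (repeats - 1)   # p copies shrink to 1
    return length

s = "xyaxyabcdabcdabcdae"
print(compressed_length(s, [(0, "xya", 2), (6, "bcda", 3)]))  # (xya)^2(bcda)^3e -> 8
print(compressed_length(s, [(5, "abcd", 3)]))                 # xyaxy(abcd)^3ae -> 11
```

The first combination reproduces the optimal length 8 from the example above; the second reproduces the length-11 alternative.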
ZHANG ET AL.: DWARFCODE: A PERFORMANCE PREDICTION TOOL FOR PARALLEL APPLICATIONS 499
The methodology to solve the OSC problem is two-fold: 1) find all the primitive and inextensible tandem arrays (i, a, p) in the string and form a set Σ of the tandem arrays; 2) find the optimal combination Ci of the tandem arrays from Σ to acquire the optimal compression of the original string.

4.2 Finding the Primitive and Inextensible Tandem Arrays
The computation of all the primitive and inextensible tandem arrays is a classical string matching problem with various application areas, most notably molecular biology [17]. There are several different O(n log n) algorithms for finding all the PIT arrays. The problem was first studied by Crochemore [18] in 1981, where an optimal O(n log n) algorithm is given. In recent years, most of the algorithms have been based on the suffix tree. Apostolico and Preparata [19] present an O(n log n) algorithm for finding the leftmost PIT arrays. Main and Lorentz [20] propose another algorithm which finds all PIT arrays in O(n log n) time. The algorithms based on the suffix tree are efficient, but building and processing the suffix tree with several auxiliary data structures consumes much memory [21]. In this section, we design an O(n log n) algorithm based on the suffix array. Its advantage over the suffix tree is that, in practice, it uses three to five times less memory.

The idea of our algorithm is based on Theorem 1, derived by us, and Theorem 2, mentioned in [22].

Theorem 1. If there is a tandem repeat (i, a, 2) in the string S, then ∃ j ∈ {0, |a|, 2|a|, ..., ⌊|S|/|a|⌋ · |a|} such that S[j] = S[j + |a|].

Proof. The length of a tandem repeat (i, a, 2) is 2|a|; thus there exist j and j + |a| ∈ {0, |a|, 2|a|, ..., ⌊|S|/|a|⌋ · |a|} that are covered by (i, a, 2). Since S[i..i + |a| − 1] = S[i + |a|..i + 2|a| − 1] according to Definition 2, it follows that S[j] = S[j + |a|]. □

Theorem 2. There exists a tandem array with repeat a and j = |a| in the string S = xy, that contains the frontier between x and y and has a root in y, if and only if

Algorithm 2. All the PIT Arrays Finding Algorithm
Input: S—the string denoting the symbolized trace
Output: Σ—a set of all the PIT arrays
Variables: l—the length of each loop body or repeated substring
1. n = |S|
2. foreach 1 ≤ l ≤ n/2 do
3.   foreach j ∈ {0, l, 2l, ..., ⌊n/l⌋ · l} do
4.     if (S[j] ≠ S[j + l]) then continue
5.     else Lp = Longest_Common_Prefix(j + 1, l)
6.          Ls = Longest_Common_Suffix(j, l)
7.     endif
8.     if ((Lp + Ls) ≥ l)
9.       Σ ← Σ ∪ {S[(j − Lp)..(j + l + Ls)]}
10.    endif
11.  endfor
12. endfor

In Algorithm 2, the length of the complete string is |S| = n (line 1). For each tandem array (i, a, p) in S, the length l = |a| must not exceed n/2 because the number of repeated periods of a is greater than 1. We therefore enumerate 1 ≤ l ≤ n/2 to find all the tandem arrays (i, a, p) where |a| = l (lines 2-12). For a fixed l, we first find a position j where S[j] = S[j + l] (lines 3-4). Then, we calculate the longest common prefix and suffix around S[j] and S[j + l] (lines 5-7). Finally, if the sum of the prefix and suffix lengths is greater than l − 1, we have found a tandem array with |a| = l and record it (lines 8-10).

Fig. 5 illustrates the procedure to find one PIT array (lines 3 to 11). The string is S = BABCABCABCDACD and l = 3. Because in the first iteration S[0] ≠ S[3] (line 4), the loop continues (line 3). Then, because in the next iteration S[3] = S[6] = C (line 4), the longest common prefix and suffix are calculated as Lp = 2 and Ls = 3 (lines 5-6). Because Lp + Ls = 5 > l = 3 (line 8), one PIT array M = ABCABCABC is identified and added into the set Σ (line 9).
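A simplified sketch of the scan that Algorithm 2 performs (our illustration, not the paper's code): anchors are placed every l positions as justified by Theorem 1, and the match run around each anchor is extended naively in place of the RMQ-based LCP/LCS queries, so this version is O(n^2) rather than O(n log n):

```python
def find_tandem_arrays(s):
    """Naively find tandem repeat regions as (start, period, repeats).
    Anchors are spaced every l positions (Theorem 1); naive left/right
    extension stands in for the (O(n), O(1)) LCP/LCS queries."""
    n = len(s)
    found = set()
    for l in range(1, n // 2 + 1):              # candidate period |a| = l
        for j in range(0, n - l, l):             # anchors: 0, l, 2l, ...
            if s[j] != s[j + l]:
                continue
            left = j                             # extend run of s[x] == s[x+l]
            while left > 0 and s[left - 1] == s[left - 1 + l]:
                left -= 1
            right = j
            while right < n - l - 1 and s[right + 1] == s[right + 1 + l]:
                right += 1
            if right - left + 1 >= l:            # run long enough: tandem found
                span = right - left + 1 + l      # length of the covered region
                found.add((left, l, span // l))  # (start, period, full repeats)
    return found

# The Fig. 5 string: one region ABCABCABC at 0-based offset 1, period 3
print(find_tandem_arrays("BABCABCABCDACD"))   # {(1, 3, 3)}
```

On the Section 4.1 example string xyaxyabcdabcdabcdae it reports the (xya)^2 region at offset 0 and the (abcd)^3 region at offset 5.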
500 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016
To reduce the time complexity of Algorithm 2, the key is to reduce the time of the longest common prefix and suffix computation (lines 5-6). The LCP and LCS problem can be converted into the Range Minimum Query (RMQ) problem. We use a fast (O(n), O(1))-time algorithm for the RMQ problem, which can also be applied to the LCP and LCS problem [23], [24]. The O(n) is the preprocessing time to construct the suffix array, and the O(1) is the query time to find the longest common prefix and suffix based on the preprocessed suffix array. The complete (O(n), O(1)) algorithm for the LCP and LCS problem is detailed in [23].

The suffix array is the basic data structure for the LCP and LCS problem. We define the suffix array of a string S as a pair of arrays (SA, Rank). The sort array SA is the lexicographically ordered list of all suffixes of S. That is, SA[i] = j if suffix(j) is lexicographically the ith suffix among all suffixes suffix(1), suffix(2), ..., suffix(n) of S. The number i is called the rank of suffix(j), denoted by Rank(j) = i, which is an inverse of SA; that is, SA[Rank[j]] = j. We adopt the DC3 algorithm [25] to construct the suffix array.

4.3 Finding the Optimal Combination of Tandem Arrays
After finding all the PIT arrays of the string S (|S| = n), we acquire a set Σ = {(i, a, p) | 1 ≤ i ≤ n, |a| ≤ n, 2 ≤ p ≤ n, (i, a, p) is primitive and inextensible}. The next step is to find a subset Ci ∈ 2^Σ such that, for every Cj ∈ 2^Σ, |Si| ≤ |Sj|, where Sk denotes the string S compressed with Ck.

The maximal number of all the PIT arrays is O(n log n) [18], [19] for a string with |S| = n. The time complexity to enumerate all the solutions is O(2^(n log n)), which makes exhaustive search intractable. Thus, we aim to provide a heuristic algorithm for a near-optimal solution.

An O(n^3) dynamic programming algorithm can easily be designed to find the near-optimal combination of tandem arrays. However, the complexity needs to be reduced further. Xu and Subhlok [8] propose a greedy heuristic called "Bottom-Up" to iteratively choose the longest PIT array with the smallest starting point. Its time complexity is O(n^2), but it risks missing the optimal solution. For example, the PIT arrays are {(1, xya, 2), (6, abcd, 3), (7, bcda, 3)} for the string S = xyaxyabcdabcdabcdaef. According to the "loop filtering" algorithm, we get Ci = {(6, abcd, 3)} and the compressed string S' = xyaxy(abcd)^3 aef with |S'| = 12. Obviously, the optimal solution is S'' = (xya)^2 (bcda)^3 ef with |S''| = 9 and Ci = {(1, xya, 2), (7, bcda, 3)}. Thus, we propose a new algorithm, shown in Algorithm 3. Note that Algorithm 3 only acquires intermediate results of the final combination; the complete solution relies on Algorithm 4.

The advantage of Algorithm 3 lies in the non-overlapping detection of PIT arrays (lines 5-11). First, when the longest PIT array (i, a, p) is chosen, the values from mark[i] to mark[i + |a| · p − 1] are set to True. Then, when the next PIT array (j, b, q) is chosen, we judge whether the value of mark[j] or mark[j + |b| · q − 1] has already been set to True. If either one is True, the PIT array (j, b, q) overlaps with formerly selected PIT arrays and is discarded. Otherwise, we add the PIT array to the final combination and set the values from mark[j] to mark[j + |b| · q − 1] to True.

The time complexity of the non-overlapping detection of PIT arrays (lines 5-11) is O(1). First, choosing the longest PIT array (line 5) is O(1) because the PIT arrays have been sorted by length in line 3. Then, comparing the two endpoints (lines 6-8) is O(1). Finally, the marking procedure (lines 9-11) is amortized O(1). Although the marking procedure seems to scan and mark the Boolean array O(n log n) times, note that not every PIT array needs to scan and mark the Boolean array. Because the length of the Boolean array is n, in the worst case it is marked O(n) times in total. Since the total iteration number is O(n log n), by amortized analysis the amortized time complexity of each iteration of lines 9-11 is O(1).

Algorithm 3. The Greedy Selection Algorithm
Input: Σ—a set of all the PIT arrays
Input: S—the string denoting the symbolized trace
Input: mark[n]—an array of Boolean type (False/True), whose length is |S| = n
Output: C—a combination of PIT arrays, C ∈ 2^Σ
1. initialization: C ← ∅
2. foreach 0 < i ≤ n do
     mark[i] = False
3. sort the PIT arrays in Σ by length in descending order
4. repeat
5.   choose the longest PIT array (i, a, p) from Σ. If two or more PIT arrays satisfy the condition, choose the one with the smallest starting point.
6.   if (mark[i] == True || mark[i + |a| · p − 1] == True)
7.     { delete (i, a, p) from Σ
8.       continue; }
9.   else
10.    { set the values from mark[i] to mark[i + |a| · p − 1] to True
11.      delete (i, a, p) from Σ and C ← C ∪ {(i, a, p)} }
12. until (Σ is empty)

4.4 Repeat Compression Algorithm and Complexity Analysis
The complete repeat compression algorithm is shown in Algorithm 4. First, the string S (|S| = n) is converted into the suffix array with the DC3 algorithm (line 2). Then, the Range Minimum Query algorithm is used to preprocess the suffix array (line 3). Third, Algorithm 2 is applied to find all the PIT arrays (line 4). Finally, Algorithm 3 is used to greedily select the PIT arrays (line 5). The procedure repeats until the set of PIT arrays is empty (line 7).

The time complexity of the repeat compression algorithm is O(n log n), which is proved by the following lemmas and theorems.

Lemma 1. The time complexity of Algorithm DC3 is O(n).

Proof. Refer to [25]. □

Lemma 2. The time complexity of Algorithm RMQ is (O(n), O(1)).

Proof. Refer to [24]. □
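The selection loop of Algorithm 3 (Section 4.3) can be sketched as follows (our 0-based rendering of the paper's example; the mark array and the endpoint test follow lines 6-11):

```python
def greedy_select(pit_arrays, n):
    """Sketch of Algorithm 3: take PIT arrays longest-first (ties
    broken by smallest start), discarding any whose endpoints fall
    in an already-marked span; i is a 0-based start index."""
    mark = [False] * n
    chosen = []
    # line 3: sort by covered length |a|*p descending, then by start
    for i, a, p in sorted(pit_arrays, key=lambda t: (-len(t[1]) * t[2], t[0])):
        end = i + len(a) * p - 1
        if mark[i] or mark[end]:          # lines 6-8: overlap, discard
            continue
        for k in range(i, end + 1):       # lines 9-11: mark and keep
            mark[k] = True
        chosen.append((i, a, p))
    return chosen

# Section 4.3 example in 0-based form: {(1,xya,2),(6,abcd,3),(7,bcda,3)}
pits = {(0, "xya", 2), (5, "abcd", 3), (6, "bcda", 3)}
print(greedy_select(pits, n=len("xyaxyabcdabcdabcdaef")))  # [(5, 'abcd', 3)]
```

This single pass yields only the intermediate result xyaxy(abcd)^3 aef; as the text notes, the outer iteration of Algorithm 4 re-runs the search on the compressed string.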
Lemma 3. The time complexity of Algorithm 2 (All the PIT Arrays Finding Algorithm) is O(n log n).

Proof. The outermost loop variable l of Algorithm 2 runs from 1 to n/2, and the inner loop iterates n/l times. Thus, the total iteration number is P = n/1 + n/2 + ... + n/(n/2) = n(1 + 1/2 + ... + 2/n) < n(1 + 1/2 + ... + 1/n) = nHn, where Hn = 1 + 1/2 + ... + 1/n. Note that the sum of the reciprocals of the first n natural numbers, Hn, is the nth harmonic number, which is approximated by the integral of 1/x from 1 to n, whose value is log n. Thus, P < nHn ≈ n log n. The loop body consists of LCP and LCS queries; by Lemma 2, each query takes O(1) time. Thus, the time complexity is determined by the iteration number, whose upper bound is O(n log n). □

Lemma 4. The time complexity of Algorithm 3 (Greedy Selection Algorithm) is O(n log n).

Proof. Note that the maximal number of PIT arrays found by Algorithm 2 is O(n log n) [18], [19]. In Algorithm 3, each PIT array is traversed once and the amortized complexity of each traversal is O(1). Thus, the time complexity of Algorithm 3 is O(n log n). □

Theorem 3. The time complexity of Algorithm 4 (The Repeat Compression Algorithm) is O(n log n).

Proof. The time complexity of one iteration from repeat (line 1) to until (line 7) is O(n) + O(n) + O(n log n) + O(n log n) = O(n log n) by Lemmas 1-4. The original string S is compressed into a new string S' in line 6. Note that S is shortened by at least half in each iteration. Thus, the complexity of Algorithm 4 is O(n log n) + O((n/2) log(n/2)) + ... + O(1) = O(n log n). □

Algorithm 4. The Repeat Compression Algorithm
Input: S—the string denoting the symbolized trace
Output: S'—the compressed string
Variables: SA—the sort array of the suffix array
  Rank—the rank array of the suffix array
  C—a combination of PIT arrays, C ∈ 2^Σ
  Σ—a set of all the PIT arrays
1. repeat
2.   (SA, Rank) ← DC3(S)
3.   (SA, Rank) ← RMQ(SA, Rank)
4.   Σ ← Find the primitive and inextensible tandem arrays (SA, Rank) // Algorithm 2
5.   C ← Greedy Selection (Σ) // Algorithm 3
6.   S ← Compress S with C
7. until (Σ is empty)
8. S' ← S

5 GENERATING DWARF CODE
The final step to build the dwarf code is to convert the merged and compressed trace into an executable program which mimics the behavior represented in the trace. The compressed trace in this phase contains the primitive tandem arrays with their loop numbers. These arrays consist of a series of symbols, each denoting an MPI communication call or a computation in a certain period. The trace is converted into executable C dwarf code by resuming these symbols with the communication or computation calls.

5.1 Tackling the Computation
We can replace the symbols representing computation by synthetic computation code of equal duration, such as busy waiting or spinning, to generate the dwarf code. The iteration number (abbreviated as IN, IN > 1) of the loop can be adjusted with a predefined compression ratio (abbreviated as CR, 0 < CR < 1). For example, a computation symbol can be replaced by the following loop statement, where the iteration number is multiplied by the compression ratio to decrease the running time proportionally:

for (i = 1; i <= IN * CR; i++) {};

However, busy waiting or spinning is an inefficient programming pattern and should be avoided. Also, it might be removed by modern compilers as dead code. As an alternative, we can use a delay function (e.g., sleep()) found in most operating systems. This puts a process to sleep for a specified time, during which it wastes no CPU time. For example, a computation symbol can be replaced by the following statement:

sleep(Original Runtime * CR);

5.2 Tackling the Communication
In the trace recording and merging phases, the algorithm preserves the src and dest information of point-to-point communication. The dwarf code needs to restore these parameters. In order to unify the dwarf code, we first obtain the rank of the current process immediately after executing the MPI_Init call and save it in a global variable id_proc. Before each point-to-point communication call is performed, we create a src/dest matrix according to the recorded merging information, and then invoke the communication calls. The parameters in the corresponding positions can be obtained according to the value of id_proc in the assignment matrix. Fig. 6 illustrates the procedure of one-to-one communication in BT's dwarf code. The parameter count can be adjusted to scale down the running time of the dwarf code proportionally.

5.3 Tackling the PIT Arrays
The PIT arrays (i, a, p) are converted into loop statements. The repeat number p is converted into the loop number, and the primitive substring a is mapped to the corresponding computation and communication calls. The repeat number (or loop number) p can be adjusted to scale down the running time of the dwarf code in proportion.

6 EVALUATION
6.1 Experiment Setup
To evaluate the correctness and efficiency of the DwarfCode system, we perform extensive experiments with the NAS parallel benchmarks on two small clusters named "Dawning1000" and "IA32", and one large cluster named "Kongfu", with different sizes, architectures, interconnection types, and operating systems, as described below:

Dawning1000 is a 16-node tightly coupled cluster. Its peak performance achieves 2.5 GFLOPS, and its
TABLE 2
Results of Trace Merging

NPB Application        BT      CG      EP    FT    IS    LU       MG      SP
Merged trace number    17,111  41,954  5     47    38    324,355  10,043  26,891
Merging time (s)       20.22   49.68   0.02  0.12  0.08  375.84   12.57   35.47
6.4 Validation of Repeat Compression
In this section, the aim is to evaluate the compression ratio and the compression time of the repeat compression algorithm.

The compression is affected by the loop number. The NPB applications are scientific computing applications, and loop structures dominate the main body of most of them. However, the traces of the EP, FT and IS applications are so small that we omit their results. Because enumerating all the inner and outer loops is complicated, we focus on the outer loops. Table 3 shows the main loop structure, original length, compressed length and compression ratio of the BT, CG, LU, MG and SP applications. The main loop structures are represented by the product of the statement number and the loop number; i.e., (15) × 200 denotes that the statement number inside the loop is 15 and the loop number is 200 for the BT application. The compression ratios are very high: even the lowest compression ratio, for MG, is 96.9 percent, and the highest, for LU, is 99.998 percent.

In order to further evaluate our repeat compression algorithm, we compare it with the "Top-Down" algorithm in [8] and the "Bottom-Up" algorithm in [9] in terms of compressed length and running time.

Table 4 shows that when the initial trace lengths are small, as for the BT, CG, MG, and SP applications, the compressed lengths with our algorithm are the same as those with the "Top-Down" and "Bottom-Up" algorithms. However, when the initial trace length is large, as for the LU application, the compressed length with our algorithm is shorter than that with the other algorithms.

Table 5 shows that the running time of our algorithm is shorter than that of the "Top-Down" and "Bottom-Up" algorithms for all NPB applications. The "Top-Down" algorithm is the most time-consuming. The running time of our algorithm is 50 percent shorter than that of the "Bottom-Up" algorithm for most applications. Tables 4 and 5 do not list the results of the "Top-Down" algorithm for the trace of LU, because compressing LU with the "Top-Down" algorithm takes far beyond 10^5 seconds.

The time complexity of our algorithm is O(n log n), while the time complexity of the "Top-Down" and "Bottom-Up" algorithms is O(n^2). Therefore, our algorithm shows much better asymptotic time complexity than the two other algorithms.

6.5 Validation of Prediction Accuracy
The aim is to evaluate the prediction accuracy of the dwarf code. We generate the dwarf codes of the BT, CG, LU, MG and SP applications with Class C on Dawning1000. The generated dwarf codes are 10 times smaller than the original programs. Then, the original programs and dwarf codes are run separately on the Dawning1000 and IA32 clusters, each with 16 nodes.

Table 6 shows the actual running time of the original program, the running time of the dwarf code, the predicted time of the dwarf code and the error rates on Dawning1000. Table 7 shows the results on IA32.

As for Dawning1000, the prediction error rates are less than 3 percent for the 5 NPB applications; as for IA32, the error rates do not exceed 10 percent. The difference between the Dawning1000 and IA32 predictions is that IA32 is a
TABLE 3
Results of Repeat Compression

NPB          Main loop   Initial   Compressed   Compression
Application  structure   length    length       ratio (%)
BT           (15) × 200  17,111    44           99.743
CG           (12) × 75   41,954    26           99.938
LU           (7) × 249   324,355   38           99.998
MG           (134) × 20  10,043    302          96.993
SP           (15) × 400  26,891    43           98.840

TABLE 4
Compressed Length of Three Algorithms

NPB          Initial   Compressed length
application  length    Our algorithm  Top-Down  Bottom-Up
BT           17,111    44             44        44
CG           41,954    26             26        26
LU           324,355   38             –         47
MG           10,043    302            302       302
SP           26,891    43             43        43

TABLE 5
Running Time Comparison of Three Algorithms

NPB          Initial   Time (s)
application  length    Our algorithm  Top-Down   Bottom-Up
BT           17,111    4.05           276.46     6.96
CG           41,954    6.81           1,498.27   7.59
LU           324,355   26.96          >10^5      41.87
MG           10,043    2.76           104.84     9.05
SP           26,891    5.61           649.12     10.21

TABLE 6
Results of Time Prediction on Dawning1000 Cluster

NPB          Original program   Dwarf code        Dwarf code           Error
Application  running time (s)   running time (s)  prediction time (s)  rate (%)
BT           1,094.65           113.04            1,130.4              3.27
CG           384.81             37.93             379.3                1.43
LU           877.27             85.16             851.6                2.92
MG           786.34             77.57             775.7                1.35
SP           1,219.16           118.56            1,185.6              2.76
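The Table 6 numbers are internally consistent with a simple scaling rule (our reading of the setup: since the dwarf codes are 10 times smaller, the predicted time is the dwarf runtime divided by CR = 0.1):

```python
CR = 0.1   # compression ratio used for the Class C dwarf codes

def predicted_time(dwarf_runtime, cr=CR):
    """Predicted original runtime from the dwarf code runtime."""
    return dwarf_runtime / cr

def error_rate(actual, predicted):
    """Prediction error in percent, as reported in Tables 6 and 7."""
    return abs(predicted - actual) / actual * 100.0

# BT row of Table 6: dwarf 113.04 s -> predicted 1,130.4 s, error 3.27 %
print(round(predicted_time(113.04), 1),
      round(error_rate(1094.65, predicted_time(113.04)), 2))
```

The same rule reproduces the CG row (379.3 s predicted, 1.43 percent error).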
TABLE 7
Results of Time Prediction on IA32 Cluster

TABLE 9
Collective Functions Involved in NPB & Nbody Applications
applications. Lu and Reed [31] propose a method using The upper complexity bound of the graph spectrum
curve fitting to compress parallel programs for reducing and isomorphism algorithms is Oðn3 Þ for n events in
the program running time significantly. Sherwood et al. each trace. Moreover, it neglects function conflicts.
[32] study automatic analysis of the periodicity of parallel Mueller’s studies in [11], [12] maintain a dependence
programs. Also, some works focus on predicting web graph during the entire merge algorithm. The upper
application’s performance through modeling [33], [34]. complexity bound of the overall merge operation is
In contrast to DwarfCode, these approaches rely on the Oðn2 Þ for n events in each trace. DwarfCode not only
library of existing benchmarks, which neglect the diversity considers the sequence differences and function conflicts,
and complexity of applications and platforms. but also reduces the time complexity of the trace merging
HPC simulators, such as SST [41], BigSim [42], ROSS algorithm to O(mpn).
[43], and PSINS [44], can allow simulation of diverse 2. Repeat compression. Related approaches suffer high
aspects of hardware and software. But the prediction time complexity for the repeat compression algo-
accuracy of the application running time is reduced for rithm. Sodhi et al. [6], [7] recognize and compress
loss of details in modeling process. Our system automati- repeated execution behaviors as loops to generate
cally generates the dwarf code, a customized benchmark, the final execution skeleton. The complexity of their
which can be replayed in real-time without modeling compression algorithm is Oðn3 Þ. Xu et al. [8], [9] take
either the application or the platform. a variant approach to identify the loop structures in
Some recent attempts aim to generate a shorter running a trace based on the Crochemore’s algorithm [18].
benchmark of the real application and replay it on the target The complexity of this compression algorithm is
platform. CloudProphet [35] is an end-to-end performance Oðn2 Þ. Wong et al. [10] introduce a pattern identifica-
prediction tool for web applications in the cloud. It replays tion algorithm to find the most relevant phases of the
the trace log by capturing the resource usage and extracting parallel applications, whose complexity is Oðn2 Þ.
the dependency. CloudProphet only focuses on web appli- Mueller et al. [12] propose intra-node and inter-node
cations while DwarfCode pays close attention to MPI compression techniques of MPI events that are capa-
applications. ble of extracting an application’s communication
Several studies address performance prediction for MPI structure. Its complexity is Oðn2 Þ. DwarfCode introdu-
applications. Dimemas [5] is a performance prediction tool for MPI applications in the Grid environment. It captures the CPU bursts and the communication patterns, and models the target architecture with a configuration file. Meanwhile, Sodhi et al. [6], [7] propose a framework for the automatic generation of performance skeletons. Xu et al. [8], [9] present the generation of coordinated performance skeletons, similar to dwarf code with logicalization and compression procedures. Parallel application signatures for performance prediction (PAS2P) is a tool studied by Wong et al. [10]. Based on the application's message-passing activity, representative phases can be identified and extracted, from which a parallel application signature can be created to predict the application's performance. Mueller et al. [11], [12], [36], [37] introduce intra-node and inter-node compression techniques for MPI events that are capable of extracting an application's communication structure, and present an automatic generation mechanism for replaying the traces. Chen et al. [38] implement a performance prediction framework, called PHANTOM, which integrates a computation-time acquisition approach with a trace-driven network simulator. Also, part of our preliminary work to build the representative benchmarks is shown in [39].

These approaches are the closest to the DwarfCode work presented in this paper. However, there are some key differences:

1. Trace merging. Several studies, for example, [6], [7], [8], [11], [12], have been conducted on trace merging algorithms, but they all have some pitfalls. Sodhi et al. [6], [7] put forward methods to identify and cluster similar events without considering sequence differences and function conflicts. Xu et al. [8], [9] match communication patterns with application communication graphs represented by matrices.

2. Repeat compression. In contrast to these approaches, DwarfCode introduces a novel repeat compression algorithm based on suffix arrays, whose time complexity is O(nlogn).

8 DISCUSSIONS

DwarfCode is mainly designed for performance prediction of MPI applications on cluster systems, but its principle can aid performance prediction for hybrid MPI + OpenMP applications on multicore systems and for hybrid MPI + GPU applications on hybrid-core systems with hardware accelerators.

1) Hybrid MPI + OpenMP applications combine process-level (MPI) and loop-level (OpenMP) parallelism. Their running time is the sum of intra-node OpenMP and inter-node MPI call costs, adjusted by an overlapping factor [45]. Our method can help build a parameterized communication model for the inter-node MPI calls, while intra-node OpenMP performance can be acquired by analyzing the memory bandwidth contention.

2) Hybrid MPI + GPU applications on hybrid-core systems conform to either the classic MPI + GPU model or a GPU-integrated MPI model (MPI-ACC [46] and MVAPICH-GPU [47]). For the classic MPI + GPU model, similar to hybrid MPI + OpenMP, the running time is the sum of the MPI calls between hosts and the data copies performed between main memory and the local GPU device. We can leverage our method for the inter-node MPI calls and calculate the costs of cudaMemcpy or clEnqueueWriteBuffer for the data copies. For the GPU-integrated MPI model, the programmer can use a GPU buffer directly as a communication parameter in MPI routines; creating the dwarf code for this model is difficult and needs further investigation.
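The cost decomposition for the hybrid MPI + OpenMP case can be sketched as a toy parameterized model. The function names, the Amdahl-style OpenMP estimate, and all constants below are illustrative assumptions for this sketch, not part of DwarfCode:

```python
# Toy parameterized runtime model for a hybrid MPI + OpenMP phase:
# total time is the sum of inter-node MPI call costs and intra-node
# OpenMP costs, reduced by an overlap factor. All names and values
# here are illustrative, not DwarfCode's actual model.

def mpi_cost(events):
    """Sum predicted costs of inter-node MPI calls (seconds)."""
    return sum(e["latency"] + e["bytes"] / e["bandwidth"] for e in events)

def openmp_cost(serial_time, threads, parallel_fraction):
    """Amdahl-style estimate for an intra-node OpenMP region."""
    return serial_time * ((1 - parallel_fraction) + parallel_fraction / threads)

def hybrid_time(mpi_events, serial_time, threads, parallel_fraction, overlap=0.0):
    """Phase time: MPI cost + OpenMP cost, minus the overlapped part."""
    t_mpi = mpi_cost(mpi_events)
    t_omp = openmp_cost(serial_time, threads, parallel_fraction)
    return t_mpi + t_omp - overlap * min(t_mpi, t_omp)

# One 1 MiB message over an assumed 5 GB/s, 2 us link; 8 OpenMP threads.
events = [{"latency": 2e-6, "bytes": 1 << 20, "bandwidth": 5e9}]
print(hybrid_time(events, serial_time=1.0, threads=8, parallel_fraction=0.9, overlap=0.5))
```

A real model would fit the latency, bandwidth, and overlap parameters from measured traces rather than assume them.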
506 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016
Due to event reordering and potential information loss caused by inter-process trace merging, the code generated from traces may have deadlocks. The key issue in ensuring deadlock freedom is to identify and label the non-matching calls.

1) A procedure to mark the non-matching calls is outlined in our former work [7], [8]. It is based on the basic deadlock-free patterns, which are: a) a non-blocking Send/Recv with a matching Recv/Send before the corresponding Wait; b) one or more blocking Send/Recv calls followed by matching Recv/Send calls. Such calls are labeled with our algorithm in [7], [8] and ignored for code generation.

2) We also solve this with the help of an MPI runtime error detection tool, the Marmot Umpire Scalable Tool (MUST) [48]. MUST covers various process-level correctness checks and is especially strong at deadlock detection. We introduce the following steps to ensure the correctness of the final dwarf code: a) run the dwarf code and intercept all MPI calls of all processes at runtime; b) generate a message dependence graph (MDG) or a wait-for graph (WFG); c) perform type matching, collective verification, and deadlock detection with MUST's centralized deadlock detector. MUST's AND-OR model can achieve sub-linear analysis time.
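The essence of steps b) and c) can be illustrated with a minimal wait-for-graph cycle check. This sketch is a simplified stand-in for MUST's centralized detector (which uses a richer AND-OR model); the graph encoding is assumed for illustration only:

```python
# Minimal wait-for graph (WFG) deadlock check: a cycle in the WFG
# (rank i blocks waiting on rank j, ...) indicates a potential deadlock.
# Simplified stand-in for MUST's detector; every rank must appear as a key.

def has_deadlock(wfg):
    """wfg: dict mapping each process rank to the ranks it waits for.
    Returns True iff the graph contains a cycle (DFS with three colors)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {p: WHITE for p in wfg}

    def dfs(p):
        color[p] = GRAY
        for q in wfg[p]:
            # A gray successor means a back edge, i.e., a cycle.
            if color[q] == GRAY or (color[q] == WHITE and dfs(q)):
                return True
        color[p] = BLACK
        return False

    return any(color[p] == WHITE and dfs(p) for p in wfg)

# Rank 0 waits on 1 and 1 waits on 0: a classic blocking send/send cycle.
print(has_deadlock({0: [1], 1: [0]}))         # True
print(has_deadlock({0: [1], 1: [2], 2: []}))  # False
```

In the real tool chain, the WFG edges would be derived from the intercepted MPI calls of step a) rather than written by hand.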
Trace recording needs further improvement to reduce the trace size. Our approach collects raw communication traces for each process of a parallel application. The size of the uncompressed process-level trace usually increases with the number of communication calls, and when the number of calls is too large, the raw trace may exceed the storage capacity of a single node. There are three ways to alleviate this problem: 1) the trace can be stored in the HPC storage system instead of on the node generating the trace; 2) the records in the trace can be represented in binary code rather than the current ASCII code; 3) online trace merging can be introduced once the trace length exceeds the trace distance, so that merging does not wait for all the traces to be generated.
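As a sketch of option 2), a fixed-layout binary record can replace an ASCII line. The field layout below (event id, source rank, destination rank, byte count, timestamp) is a hypothetical example, not DwarfCode's actual trace format:

```python
# Sketch of binary trace records: pack each event into a fixed-size
# binary layout instead of an ASCII line. The field layout is
# hypothetical, not DwarfCode's real format.
import struct

# event id (u32), src rank (i32), dst rank (i32), bytes (u64), time (f64)
RECORD = struct.Struct("<IiiQd")

def pack_record(event_id, src, dst, nbytes, t):
    """Serialize one trace event into a fixed-size binary record."""
    return RECORD.pack(event_id, src, dst, nbytes, t)

def unpack_record(buf):
    """Recover the event tuple from a binary record."""
    return RECORD.unpack(buf)

ascii_rec = "MPI_Send src=0 dst=1 bytes=1048576 t=0.000123\n".encode()
bin_rec = pack_record(1, 0, 1, 1 << 20, 0.000123)
print(len(ascii_rec), len(bin_rec))   # the binary record is much smaller
```

Fixed-size records also allow random access into the trace file, which an ASCII format does not.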
9 CONCLUSIONS

Model-driven and trace-driven performance prediction techniques are of limited use in practice. We present DwarfCode, a performance prediction tool for MPI applications. It includes procedures for trace recording, trace merging, repeat compression, and dwarf code generation. Researchers can download our toolkit for free under the GNU GPL v3 license. Our main contribution is three-fold: 1) An O(mpn) trace merging algorithm is proposed, which can also tackle sequence differences and function conflicts. 2) A novel repeat compression algorithm based on suffix arrays is designed, whose time complexity is O(nlogn). It converts the original problem into an optimal string compression problem: first, we find all the primitive and inextensible tandem arrays; then, we acquire the optimal combination of the tandem arrays to form the solution. 3) The dwarf code can be built on fewer cores and predict the running time of the application on clusters with a similar architecture but more cores. The results show that DwarfCode can accurately predict the running time of MPI applications. The error rate is less than 10 percent for computing and communication intensive applications.

Current research mainly focuses on modeling the computation and communication, which are typical events of scientific applications such as the NPB applications. However, more complicated and irregular codes should be considered. Future work includes addressing memory and I/O intensive codes and validation with complete multi-phase applications. We are porting the MpiBlast and SPH applications to our platform.

ACKNOWLEDGMENTS

The authors would like to thank Prof. Marc Snir and Dr. Babak Behzad at the University of Illinois at Urbana-Champaign for insightful discussions about the paper revision. They also thank the anonymous reviewers for their comments, which were valuable and very helpful for revising and improving our paper. This work was supported in part by the National Basic Research Program of China under Grant No. G2011CB302605, and partially supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61173145 and the Doctoral Program of Higher Education of China under Grant No. 20132302110037. Prof. Albert Cheng was supported by the US National Science Foundation under Awards No. 0720856 and No. 1219082.

REFERENCES

[1] I. Foster, "The Grid: A new infrastructure for 21st century science," in Grid Computing: Making the Global Infrastructure a Reality. Hoboken, NJ, USA: Wiley, 2003, pp. 51–63.
[2] W. Zhang, B. Fang, M. Hu, X. Liu, H. Zhang, and L. Gao, "Multisite co-allocation scheduling algorithms for parallel jobs in computing grid environments," Sci. China Ser. F: Inf. Sci., vol. 49, no. 6, pp. 906–926, 2006.
[3] X. Gao, A. Snavely, and L. Carter, "Path grammar guided trace compression and trace approximation," in Proc. 15th IEEE Int. Symp. High Perform. Distrib. Comput., 2006, pp. 57–68.
[4] N. Cardwell, S. Savage, and T. Anderson, "Modeling TCP latency," in Proc. INFOCOM, 2000, pp. 1742–1751.
[5] R. M. Badia, F. Escale, E. Gabriel, J. Gimenez, R. Keller, J. Labarta, and M. S. Müller, "Performance prediction in a grid environment," in Grid Computing. Berlin, Germany: Springer, 2004, pp. 257–264.
[6] S. Sodhi and J. Subhlok, "Skeleton based performance prediction on shared networks," in Proc. IEEE Int. Symp. Cluster Comput. Grid, 2004, pp. 723–730.
[7] S. Sodhi, J. Subhlok, and Q. Xu, "Performance prediction with skeletons," Cluster Comput., vol. 11, no. 2, pp. 151–165, 2008.
[8] Q. Xu and J. Subhlok, "Construction and evaluation of coordinated performance skeletons," in Proc. 15th Int. Joint Conf. High Perform. Comput., 2008, pp. 73–86.
[9] Q. Xu, J. Subhlok, and N. Hammen, "Efficient discovery of loop nests in execution traces," in Proc. IEEE Int. Symp. Model., Anal. Simul. Comput. Telecommun. Syst., 2010, pp. 193–202.
[10] A. Wong, D. Rexachs, and E. Luque, "Extraction of parallel application signatures for performance prediction," in Proc. 12th IEEE Int. Conf. High Perform. Comput. Commun., 2010, pp. 223–230.
[11] M. Noeth, P. Ratn, F. Mueller, S. Martin, and B. R. de Supinski, "ScalaTrace: Scalable compression and replay of communication traces for high-performance computing," J. Parallel Distrib. Comput., vol. 69, no. 8, pp. 696–710, 2009.
[12] P. Ratn, F. Mueller, B. R. de Supinski, and M. Schulz, "Preserving time in large-scale communication traces," in Proc. 22nd Annu. Int. Conf. Supercomput., 2008, pp. 46–55.
[13] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga, "The NAS parallel benchmarks summary and preliminary results," in Proc. 5th Annu. Int. Conf. Supercomput., 1991, pp. 158–165.
ZHANG ET AL.: DWARFCODE: A PERFORMANCE PREDICTION TOOL FOR PARALLEL APPLICATIONS 507
[14] Message Passing Interface Forum [Online]. Available: http://www.mpi-forum.org/, 2012.
[15] R. Schöne, R. Tschüter, T. Ilsche, and D. Hackenberg, "The vampirtrace plugin counter interface: Introduction and examples," in Proc. Euro-Par Parallel Process. Workshops, 2011, pp. 501–511.
[16] J. S. Vetter and O. M. Michael, "Statistical scalability analysis of communication operations in distributed applications," ACM SIGPLAN Notices, vol. 36, no. 7, pp. 123–132, 2001.
[17] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge, U.K.: Cambridge Univ. Press, 1997.
[18] M. Crochemore, "An optimal algorithm for computing the repetitions in a word," Inf. Process. Lett., vol. 12, no. 5, pp. 244–250, 1981.
[19] A. Apostolico and F. P. Preparata, "Optimal off-line detection of repetitions in a string," Theor. Comput. Sci., vol. 22, no. 3, pp. 297–315, 1983.
[20] M. G. Main and R. J. Lorentz, "An O(nlogn) algorithm for finding all repetitions in a string," J. Algorithms, vol. 5, no. 3, pp. 422–432, 1984.
[21] U. Manber and G. Myers, "Suffix arrays: A new method for on-line string searches," SIAM J. Comput., vol. 22, no. 5, pp. 935–948, 1993.
[22] M. Lothaire, Applied Combinatorics on Words, vol. 105. Cambridge, U.K.: Cambridge Univ. Press, 2005.
[23] M. A. Bender and M. Farach-Colton, "The LCA problem revisited," in Proc. 4th LATIN Amer. Symp.: Theoretical Informat., 2000, pp. 88–94.
[24] J. Fischer and V. Heun, "A new succinct representation of RMQ-information and improvements in the enhanced suffix array," in Combinatorics, Algorithms, Probabilistic and Experimental Methodologies. Berlin, Germany: Springer, 2007, pp. 459–470.
[25] J. Kärkkäinen, P. Sanders, and S. Burkhardt, "Linear work suffix array construction," J. ACM, vol. 53, no. 6, pp. 918–936, 2006.
[26] NCSA Blue Waters project, Undergraduate Petascale Education Program [Online]. Available: http://www.shodor.org/petascale/, 2015.
[27] M. I. Cole, Algorithmic Skeletons: Structured Management of Parallel Computation. London, U.K.: Pitman, 1989.
[28] M. D. Dikaiakos, A. Rogers, and K. Steiglitz, "Fast: A functional algorithm simulation testbed," in Proc. 2nd Int. Workshop Model., Anal., Simul. Comput. Telecommun. Syst., 1994, pp. 142–146.
[29] P. A. Dinda and D. R. O'Hallaron, "An evaluation of linear models for host load prediction," in Proc. 8th Int. Symp. High Perform. Distrib. Comput., 1999, pp. 87–96.
[30] K. Hoste, A. Phansalkar, L. Eeckhout, A. Georges, L. K. John, and K. De Bosschere, "Performance prediction based on inherent program similarity," in Proc. 15th Int. Conf. Parallel Archit. Compilation Techn., Sep. 2006, pp. 114–122.
[31] C. D. Lu and D. A. Reed, "Compact application signatures for parallel and distributed scientific codes," in Proc. ACM/IEEE Conf. Supercomput., Nov. 2002, pp. 1–10.
[32] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, "Automatically characterizing large scale program behavior," ACM SIGARCH Comput. Archit. News, vol. 30, no. 5, pp. 45–57, 2002.
[33] C. Stewart and S. Kai, "Performance modeling and system management for multi-component online services," in Proc. 2nd Conf. Symp. Netw. Syst. Des. Implementation, 2005, vol. 2, pp. 71–84.
[34] B. Urgaonkar, G. Pacifici, P. Shenoy, M. Spreitzer, and A. Tantawi, "An analytical model for multi-tier internet services and its applications," ACM SIGMETRICS Perform. Eval. Rev., vol. 33, no. 1, pp. 291–302, 2005.
[35] A. Li, X. Zong, S. Kandula, X. Yang, and M. Zhang, "CloudProphet: Towards application performance prediction in cloud," ACM SIGCOMM Comput. Commun. Rev., vol. 41, no. 4, pp. 426–427, 2011.
[36] X. Wu, K. Vijayakumar, F. Mueller, X. Ma, and P. C. Roth, "Probabilistic communication and I/O tracing with deterministic replay at scale," in Proc. Int. Conf. Parallel Process., 2011, pp. 196–205.
[37] X. Wu, F. Mueller, and S. Pakin, "Automatic generation of executable communication specifications from parallel applications," in Proc. Int. Conf. Supercomput., 2011, pp. 12–21.
[38] J. Zhai, W. Chen, and W. Zheng, "Phantom: Predicting performance of parallel applications on large-scale parallel machines using a single node," ACM SIGPLAN Notices, vol. 45, no. 5, pp. 305–314, 2010.
[39] W. Zhang, T. Han, Y. Zhang, and A. M. Cheng, "Performance prediction for MPI parallel jobs," in Proc. IEEE Int. Conf. Cluster Comput. Workshops, 2012, pp. 136–142.
[40] DwarfCode [Online]. Available: https://github.com/wzzhang-HIT/DwarfCode, 2014.
[41] Sandia National Laboratories, SST: The structural simulation toolkit [Online]. Available: http://sst.sandia.gov/, 2011.
[42] G. Zheng, K. Gunavardhan, and V. K. Laxmikant, "BigSim: A parallel simulator for performance prediction of extremely large parallel machines," in Proc. 18th Int. Parallel Distrib. Process. Symp., 2004, p. 78 [Online]. Available: http://charm.cs.uiuc.edu/research/bigsim
[43] ROSS: Rensselaer's Optimistic Simulation System [Online]. Available: https://github.com/carothersc/ROSS/wiki, 2013.
[44] M. Tikir, M. Laurenzano, L. Carrington, and A. Snavely, "PSINS: An open source event tracer and execution simulator for MPI applications," in Proc. Euro-Par Parallel Process., 2009, pp. 135–148.
[45] X. Wu and V. Taylor, "Performance modeling of hybrid MPI/OpenMP scientific applications on large-scale multicore cluster systems," in Proc. IEEE 14th Int. Conf. Comput. Sci. Eng., 2011, pp. 181–190.
[46] A. Aji, J. Dinan, D. Buntinas, P. Balaji, W. Feng, K. Bisset, and R. Thakur, "MPI-ACC: An integrated and extensible approach to data movement in accelerator-based systems," in Proc. IEEE 14th Int. Conf. High Perform. Comput. Commun. and IEEE 9th Int. Conf. Embedded Softw. Syst., 2012, pp. 647–654.
[47] A. Singh, S. Potluri, H. Wang, K. Kandalla, S. Sur, and D. Panda, "MPI alltoall personalized exchange on GPGPU clusters: Design alternatives and benefit," in Proc. IEEE Int. Conf. Cluster Comput., 2011, pp. 420–427.
[48] T. Hilbrich, J. Protze, M. Schulz, B. de Supinski, and M. Müller, "MPI runtime error detection with MUST: Advances in deadlock detection," Sci. Programm., vol. 21, no. 3, pp. 109–121, 2013.
[49] M. Geimer, F. Wolf, B. J. N. Wylie, E. Ábrahám, D. Becker, and B. Mohr, "The Scalasca performance toolset architecture," Concurrency Comput.: Practice Exp., vol. 22, no. 6, pp. 702–719, Apr. 2010.
[50] VAMPIR Performance Optimization [Online]. Available: https://www.vampir.eu/, 2015.

Weizhe Zhang is a professor in the School of Computer Science and Technology, Harbin Institute of Technology, China. He has been a visiting scholar in the Department of Computer Science, University of Illinois at Urbana-Champaign and the University of Houston. His research interests are primarily in parallel computing, distributed computing, and cloud computing. He has published more than 100 academic papers in journals, books, and conference proceedings. He is a member of the IEEE.

Albert M.K. Cheng received the BA degree with highest honors in computer science, graduating Phi Beta Kappa, the MS degree in computer science with a minor in electrical engineering, and the PhD degree in computer science, all from The University of Texas at Austin, where he held a GTE Foundation Doctoral Fellowship. He is a professor and a former interim associate chair of the Computer Science Department, University of Houston. He has received numerous awards. He is the author of the popular textbook entitled Real-Time Systems: Scheduling, Analysis, and Verification (Wiley) and more than 200 refereed publications on real-time, embedded, and cyber-physical systems. He is a senior member of the IEEE and a fellow of the Institute of Physics.

Jaspal Subhlok received the PhD degree in computer science from Rice University. His research interest involves high performance computing. He is a professor and chair of the Computer Science Department, University of Houston. He has published more than 100 academic papers in journals, books, and conference proceedings. He is a member of the IEEE.