
Automatic Tracing in Task-Based Runtime Systems

Rohan Yadav (Stanford University, USA), Michael Bauer (NVIDIA, USA), David Broman (KTH Royal Institute of Technology, Sweden), Michael Garland (NVIDIA, USA), Alex Aiken (Stanford University, USA), and Fredrik Kjolstad (Stanford University, USA)
Abstract.

Implicitly parallel task-based runtime systems often perform dynamic analysis to discover dependencies in and extract parallelism from sequential programs. Dependence analysis becomes expensive as task granularity drops below a threshold. Tracing techniques have been developed where programmers annotate repeated program fragments (traces) issued by the application, and the runtime system memoizes the dependence analysis for those fragments, greatly reducing overhead when the fragments are executed again. However, manual trace annotation can be brittle and is not easily applicable to complex programs built through the composition of independent components. We introduce Apophenia, a system that automatically traces the dependence analysis of task-based runtime systems, removing the burden of manual annotations from programmers and enabling new and complex programs to be traced. Apophenia identifies traces through a series of online string analyses, which find repeated program fragments in the stream of tasks issued to the runtime system. We show that Apophenia achieves between 0.92x and 1.03x of the performance of manually traced programs, and effectively traces previously untraced programs to yield speedups of between 0.91x and 2.82x on the Perlmutter and Eos supercomputers.

1. Introduction

Implicitly parallel programming systems (Bauer et al., 2012; Moritz et al., 2017; Augonnet et al., 2011; Bosilca et al., 2011; Zaharia et al., 2012) automatically extract parallelism from a sequential source program through different forms of dynamic dependence analysis. Automatic parallelization and communication inference has enabled composable high-level libraries (Bauer and Garland, 2019; Yadav et al., 2023) to be built on top of implicitly parallel task-based runtime systems. However, the cost of the dependence analysis affects the performance of implicitly parallel systems at scale and places a floor on the minimum problem size that can be executed efficiently (Slaughter et al., 2020). Applications whose tasks are too small to amortize the cost of dependence analysis are dominated by it and run at low efficiency.

To improve the performance of implicitly parallel task-based runtime systems, researchers have proposed techniques (Lee et al., 2018; Mashayekhi et al., 2017) to memoize, or trace, the dependence analysis. Tracing records the results of the dependence analysis for an issued program fragment, and then replays the results of the analysis the next time an identical program fragment is issued. Tracing has been shown to yield significant speedups by eliminating the cost of the dependence analysis on iterative programs. For example, tracing can reduce the per-task overhead in the Legion (Bauer et al., 2012) runtime system from ∼1ms to ∼100μs (Bauer et al., 2021), widening the scope of applications for which task-based runtime systems can be effective.

A significant limitation of existing tracing techniques is that they require the programmer to annotate repeatedly issued program fragments with start/stop markers for the runtime system. Programmer-inserted annotations undermine an important feature of implicitly parallel programming systems: their correctness under program composition. As users develop modular programs that pass data from one component to another, the runtime system ensures that computations launched by different modules maintain sequential semantics by implicitly inserting the necessary data movement and synchronization. Programmer-introduced trace annotations, however, do not obey these composition principles, and the correct placement of trace annotations when composing complex software becomes unclear. Functions defined in a third-party library may contain operations that cannot be traced by a practical tracing implementation, or may issue a different sequence of operations on each invocation. Each of these cases results in a runtime error, as the incorrect trace annotations construct an ill-formed sequence of operations. Furthermore, even simple programs using high-level implicitly parallel libraries can have traces that do not correspond to syntactic loop structures in the source program, making it difficult to correctly place tracing annotations. We elaborate on such an example program in Section 2.

In order to improve programmer productivity and to enable the tracing of modular high-level programs, implicitly parallel task-based systems should automatically identify repeated sequences of operations, memoize their analysis results and cheaply replay the analysis as needed. We call this the problem of automatic trace identification, which is similar to Just-In-Time (JIT) compilation in the context of dynamic language implementations (Hölzle and Ungar, 1996; Gal et al., 2009; Paleczny et al., 2001). JIT compilers for dynamic languages interpret bytecode during program startup, and compile bytecode to native instructions as repeatedly invoked program fragments become hot. Following this architecture, implicitly parallel task-based runtimes should interpret issued operations with a dynamic dependence analysis, and switch to an analysis-free compiled execution once repeated sequences of operations are encountered.

We introduce our system Apophenia (named for apophenia, the tendency to notice patterns between unrelated things), which acts as a JIT compiler for the dependence analysis of an implicitly parallel task-based runtime system. The key challenge that Apophenia faces is the identification of repeated sequences of operations produced by the target program. Unlike the input to a JIT compiler, the input to a task-based runtime system is a stream of tasks that lacks information about control flow, such as basic block labels or function definitions. As such, Apophenia cannot rely on these code landmarks or predictable execution flow to identify repeated sequences of operations. Instead, Apophenia analyzes the input stream of operations to find repetitions by solving a series of online string analysis problems.

To demonstrate Apophenia, we develop an implementation within the Legion (Bauer et al., 2012) runtime system as a front-end component that sits between the application and Legion’s dependence analysis engine. As operations are issued to Legion, Apophenia performs a series of dynamic analyses to identify repeatedly issued sequences of operations, and correspondingly invokes Legion’s tracing engine (Lee et al., 2018) to memoize and replay dependence analysis on these sequences. While our prototype targets Legion, we believe that the ideas in Apophenia can be directly applied to other task-based runtime systems that perform a dynamic dependence analysis.

The specific contributions of this work are:

(1) A formulation of the desirable properties of traces to identify (Section 3).

(2) Algorithms to dynamically identify traces in an application's stream of operations (Section 4).

(3) An implementation of Apophenia that targets the Legion (Bauer et al., 2012) runtime system.

To evaluate Apophenia, we apply it to the largest and most complex Legion applications written to date, including production-grade scientific simulations and machine learning applications. We show that on up to 64 GPUs of the Perlmutter and Eos supercomputers, Apophenia is able to achieve between 0.92x and 1.03x of the performance of manually traced code, and to effectively trace previously untraced code built from the composition of high-level components, yielding end-to-end speedups of between 0.91x and 2.82x. As such, Apophenia insulates programmers from the overheads of task-based runtime systems across varying applications and problem sizes, transparently and without programmer intervention.

2. Motivating Example

We now show an example of high-level implicitly parallel code where it is difficult for a programmer to place tracing annotations. As part of developing the example, we provide the necessary background on the Legion (Bauer et al., 2012) runtime system.

Figure 1(a) performs Jacobi iteration using cuNumeric (Bauer and Garland, 2019), a distributed drop-in replacement for NumPy. cuNumeric distributes NumPy through a dynamic translation to Legion. cuNumeric implements NumPy operations by issuing one or more Legion tasks, which are designated functions registered with the runtime system. Each NumPy array is mapped to a Legion region, which is a multi-dimensional array tracked by Legion. Each task takes a list of regions as arguments. The stream of tasks launched by the main loop of the cuNumeric program is in Figure 1(b). For each task, the first two arguments denote the inputs, while the third argument is the output. Legion extracts parallelism from the issued stream of tasks by analyzing the data dependencies between tasks and the usage of their region arguments (Bauer et al., 2023).

import cunumeric as np
# Generate random system.
A = np.random.rand(N, N)
b = np.random.rand(N)
# Initialize solution and extract diagonal.
x = np.zeros(A.shape[1])
d = np.diag(A)
R = A - np.diag(d)
# Jacobi iteration.
for i in range(iters):
    x = (b - np.dot(R, x)) / d
(a) Python source code.
DOT(R, x1, t1)
SUB(b, t1, t2)
DIV(t2, d, x2)  # Iteration 1
DOT(R, x2, t1)
SUB(b, t1, t2)
DIV(t2, d, x1)  # Iteration 2
DOT(R, x1, t1)
SUB(b, t1, t2)
DIV(t2, d, x2)  # Iteration 3
DOT(R, x2, t1)
SUB(b, t1, t2)
DIV(t2, d, x1)  # Iteration 4
(b) Main loop task stream.
Figure 1. A cuNumeric (Bauer and Garland, 2019) program and the stream of tasks it issues at runtime. An intuitive trace around the main loop does not correspond to a repeated program fragment.

To trace a program fragment, the programmer issues a tbegin(id) call (standing for "trace begin") before the fragment and a tend(id) call after it. The first time Legion executes a trace with a particular id, it records the results of the dependence analysis, and it replays those results when executing the same trace id again (Lee et al., 2018). For a trace to be valid, the sequence of tasks and their region arguments encapsulated by the tbegin(id) and tend(id) calls must be exactly the same for a given id. The same region arguments must be used across trace invocations, as the dependence analysis is affected by how the regions are used and partitioned. While we consider regions for a Legion implementation of Apophenia, this restriction generalizes to any form of argument that affects the dependence analysis.
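For concreteness, the sketch below shows these annotations wrapped around one iteration of the task stream from Figure 1(b). The Python-style tbegin/tend spellings are hypothetical stand-ins for Legion's actual API, and DOT, SUB, and DIV stand in for the corresponding task launches.

# A minimal sketch of manual tracing, assuming hypothetical Python-style
# bindings for the tbegin/tend calls described above. The tasks and their
# region arguments inside the trace must be identical every time id 0 is used.
tbegin(0)
DOT(R, x1, t1)  # recorded on the first execution of trace 0,
SUB(b, t1, t2)  # replayed without dependence analysis afterwards
DIV(t2, d, x2)
tend(0)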

A natural attempt to trace the program in Figure 1(a) would place the tbegin and tend around the body of the main for loop. However, this annotation results in an invalid trace (raising a runtime error), for a subtle reason that requires knowledge of the internals of cuNumeric. The problem with this natural annotation is the loop-carried use of the Python variable x, which is bound to different cuNumeric arrays (regions) at different points of execution. Upon entering loop iteration $i$, x is bound to a region arbitrarily named x1, which is used as an argument for the first dot operation. As execution proceeds, cuNumeric allocates a new region x2 for the result of the division with d, and binds the Python variable x to the region x2. Therefore, the next iteration $i+1$ issues a dot on x2, causing iteration $i+1$ to issue a different sequence of tasks than iteration $i$! This program illustrates a real-world case where abstraction and composition make it difficult to apply the low-level tracing technique.

To correctly trace the program in Figure 1(a), a programmer must either add trace annotations around every two iterations of the main loop, or use two different trace IDs, one for each iteration's repetition pattern. This steady state of groups of two iterations arises because when x is reassigned, the region it previously referred to can be collected and immediately reused by cuNumeric. Relying on this steady state is brittle, as the addition of more operations in the main loop or a change in cuNumeric's region allocation policy could perturb the way in which the necessary steady state for tracing is achieved. Instead, Apophenia dynamically analyzes the stream of tasks and automatically discovers which fragments of the application should be traced, removing this concern from the programmer.
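As a sketch of the first workaround, the hypothetical annotations below trace two iterations at a time so that the region bound to x lines up across trace invocations; as discussed above, this placement is brittle and depends on cuNumeric's region allocation policy.

# Hypothetical annotation of Figure 1(a): each trace body spans two
# iterations, so every invocation of trace 1 issues the same sequence of
# tasks and region arguments.
for i in range(iters // 2):
    tbegin(1)
    x = (b - np.dot(R, x)) / d  # reads x1, writes x2
    x = (b - np.dot(R, x)) / d  # reads x2, writes x1
    tend(1)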

3. What Are Good Traces?

The overarching goal of Apophenia is to reduce the amount of time the runtime spends performing dynamic dependence analysis by selecting traces to replay. A simple model of a tasking runtime system's dependence analysis is that the runtime spends time $\alpha$ analyzing each task. The first time a trace is issued, the dependence analysis results are memoized, so the runtime spends time $\alpha_m$ (memoization time) on each task in the trace, where $\alpha_m$ is slightly larger than $\alpha$. Then, on subsequent executions of the trace, there is some constant overhead $c$ to replaying the trace, but every task in the trace incurs an analysis cost of only $\alpha_r$ (replay time), where $\alpha_r \ll \alpha$.

Using this model of the runtime system, we derive several properties of the traces that Apophenia should find. First, the selected traces should maximize the number of traced operations, to minimize the number of tasks that contribute an $\alpha$ to the overall analysis cost. Next, the selected traces should be relatively long, so that the constant replay cost $c$ does not accumulate. Finally, the set of selected traces should be small, so that Apophenia does not continually memoize new traces and pay $\alpha_m$ per task in each new trace. Intuitively, the ideal set of traces corresponds to the loops in the target program.
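As a back-of-the-envelope comparison derived only from the costs defined above, consider a fragment of $n$ tasks that is executed $r+1$ times:

% Analysis cost of a length-n fragment executed r+1 times,
% with and without tracing, under the model above.
\[
T_{\text{untraced}} = (r+1)\,n\,\alpha,
\qquad
T_{\text{traced}} = n\,\alpha_m + r\,(c + n\,\alpha_r).
\]

Since $\alpha_r \ll \alpha$, tracing wins whenever the fragment is long enough to amortize the constant replay cost $c$ and repeats often enough to amortize the extra memoization cost $\alpha_m - \alpha$.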

We now formalize the good traces that Apophenia should find as the solutions of a concrete optimization problem. Consider the sequence of tasks $S$ constructed from a complete execution of the target program. A system for automatic trace identification must construct from $S$

  • A set of traces $T$, containing sub-strings of $S$,

  • A function $f : T \to \textsf{interval set}$, mapping each $t \in T$ to a set of intervals in $S$ that are matched by $t$,

that maximizes the coverage of $f$, defined by $\textsf{coverage}(T, f) = \sum_{t \in T} \sum_{i \in f(t)} |i|$, subject to the constraints

  (1) $\forall t \in T$, $t$ is longer than a minimum length,

  (2) $\bigcup_{t \in T} f(t)$ is a disjoint set of intervals.

Multiple solutions exist for this problem, so we prefer solutions that first maximize the number of matched intervals ($\sum_{t \in T} |f(t)|$) and then minimize the total number of selected traces ($|T|$). Maximizing $\textsf{coverage}(T, f)$ directly minimizes the number of untraced tasks, and selecting a small set of traces that repeats many times minimizes the memoization cost of $\alpha_m$ per task. Finally, a minimum length is placed on traces to ensure that the constant replay cost $c$ can be effectively amortized. We present a concrete problem instance and example solutions in Figure 2.
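As a small illustration of the objective (a toy instance in Python, separate from Figure 2), consider a stream in which a three-task fragment repeats around an interrupting token:

# Toy instance of the trace-selection problem (illustrative data only).
S = list("abcabcdabc")                  # token stream; "d" interrupts the loop
T = {"abc"}                             # selected trace set
f = {"abc": [(0, 3), (3, 6), (7, 10)]}  # disjoint intervals matched in S
coverage = sum(hi - lo for ivs in f.values() for lo, hi in ivs)
assert coverage == 9                    # only the lone "d" is left untraced

Choosing the longer trace "abcabc" instead would match only the interval [0, 6) and leave the final "abc" untraced, which is why coverage, rather than trace length alone, drives the objective.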

Figure 2. Example of a task stream and fixed trace set $T$ with an invalid matching function $f$, and two matching functions with different $\textsf{coverage}(T, f)$.

The presented optimization problem precisely defines the properties of traces that a system like Apophenia should attempt to find, but it does not directly yield an algorithm to discover good solutions. Additionally, the optimization problem has a post-hoc formulation, where an optimal solution is constructed from the results of the entire program execution. In practice, a system like Apophenia must construct the solution $(T, f)$ in an online manner, using the currently visible prefix of the sequence of tasks launched by the application. In the next section, we discuss algorithms for dynamically finding good solutions to this optimization problem through a set of string processing algorithms.

4. Trace Identification

Dynamically finding good traces requires processing information about the tasks seen so far, and then using that information to record and replay traces in the future. An overview of Apophenia's dynamic analysis procedure is sketched in Algorithm 1. Apophenia has two components that correspond to the targets of the optimization problem in Section 3. The trace finder constructs the candidate set of traces $T$ by accumulating the tasks issued by the application into a buffer, and asynchronously mining the buffer for candidate traces. The trace replayer then constructs the matching function $f$ by ingesting the candidate traces into a trie, and identifying candidate traces in the application stream by maintaining pointers into the trie that represent potential matches. A concrete example of how Apophenia identifies a trace in an application is shown in Figure 3. We now describe each of these components in detail.

/* Initialize token history buffer B and pending async analyses J. */
B, J ← [], []

/* Initialize trie of candidates C, potential current traces A, and pending tasks P. */
C, A, P ← Trie(), [], []

/* Discussed in Section 4.2. */
TraceFinder(H):
      B ← B + [H]
      if ShouldAnalyzeHistory(B) then
            /* What subset of the history to analyze is discussed in Section 4.4. */
            B′ ← GetAnalysisSubset(B)
            /* Find repeated sub-strings. */
            j ← async FindRepeats(B′)
            J ← J + [j]
            B ← MaybeClearHistory(B)

/* Discussed in Section 4.3. */
TraceReplayer(T, H):
      if ∃ j ∈ J such that j is complete then
            IngestCandidates(j, C)
      P ← P + [T]
      /* Advance all potential traces by H in the trie if possible. Remove impossible traces, and extract fully matched candidates. */
      A ← AdvanceActiveCandidates(C, A, H)
      A ← FilterInvalidCandidates(C, A)
      D, A ← FilterCompletedCandidates(C, A)
      if |D| > 0 then
            /* Select one of the pending candidates to replay. Execute any tasks before it, and issue a trace replay for the candidate. */
            R ← SelectReplayTrace(D, P, A)
            P, A ← ExecuteAndReplay(R, P, A)

/* Applications issue tasks through Apophenia's ExecuteTask function. */
ExecuteTask(T):
      H ← Hash(T)
      TraceFinder(H)
      TraceReplayer(T, H)

Algorithm 1 Apophenia's Dynamic Analysis.
Figure 3. Visualization of Apophenia’s dynamic analysis.

4.1. A Stream of Tokens

An insight of our work is that automatic trace identification is inherently an online string analysis problem of finding repeated sub-sequences in the application’s task stream. As seen in Figure 1(b), the task stream is not just a list of identifiers—tasks have arguments that must also be the same across iterations to be used in a trace. To capture all aspects of a task that can affect the dependence analysis, Apophenia constructs a hash from each task and its region arguments. Converting the input stream of tasks into a stream of hash tokens enables more direct application of string processing techniques, and straightforward handling of traceable operations that are not tasks.
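The sketch below illustrates this tokenization. The task and region-requirement field names are hypothetical; Apophenia's actual encoding is internal to Legion, but any scheme that folds every analysis-relevant attribute into the token works.

import hashlib

def task_token(task):
    # Reduce a task launch to a hash token. Everything that can affect the
    # dependence analysis (the task, its regions, their privileges and
    # partitions) is folded into the hash. Field names are hypothetical.
    h = hashlib.sha256()
    h.update(task.task_id.to_bytes(8, "little"))
    for req in task.region_requirements:
        h.update(req.region_id.to_bytes(8, "little"))
        h.update(req.privilege.encode())  # e.g. "read", "write", "reduce"
        h.update(req.partition_id.to_bytes(8, "little"))
    return h.digest()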

4.2. Finding Traces With High Coverage

Apophenia’s trace finder records tasks as they are issued by the application into a buffer (we describe a refinement to this scheme in Section 4.4). Once the buffer fills up, Apophenia launches an asynchronous analysis of the buffer to find a set of traces that maximizes coverage of the buffer. We first discuss previous ideas related to this goal, and then describe the solution used in Apophenia. (We discuss more related work in Section 7.)

Existing Techniques

The Lempel-Ziv family of algorithms uses repeated sub-strings for compression. Algorithms like LZ77 (Ziv and Lempel, 1977, 1978; Storer and Szymanski, 1982) maintain a sliding window of previous tokens in which to search for repeats when encoding upcoming tokens. The LZW (Welch, 1984) algorithm avoids the use of a sliding window by growing any candidate repeat by only a single token at a time. While not directly finding a set of repeats with high coverage, similar algorithms that use a sliding window would need to maintain and search a window the size of the analyzed buffer, resulting in quadratic time complexity. To recognize a trace of length $n$, an LZW-style algorithm would also need to encounter the trace $n-1$ times. We wanted an algorithm that is sub-quadratic in order to scale to large buffer sizes: the real-world applications we discuss in Section 6 have traces that contain more than 2000 tasks, requiring token buffers of at least twice that size to detect a single repeat.

Within the programming languages community, recent work by Sisco et al. (Sisco et al., 2023) used a technique called tandem repeat analysis (Stoye and Gusfield, 2002) to find loops in the netlists that result from compiling hardware description languages. A tandem repeat is a sub-string $\alpha$ that repeats contiguously within a larger string $S$, such that $\alpha^k$ is a sub-string of $S$ for some $k$. Despite the success that Sisco et al. had using tandem repeat analysis, we found that even simple real-world cuNumeric programs did not contain enough tandem repeats for the analysis to reliably identify a trace set with high coverage. The reason is that while these real-world programs tended to have repetitive main loops, there would often be irregularly appearing computations such as convergence checks or statistics calculations that occur infrequently between loop iterations. As such, the strings that represented these programs would not contain tandem repeats, but instead repeated sub-strings separated by other tokens.

A relaxation of tandem repeat analysis is to search for non-overlapping repeated sub-strings, which removes the contiguity requirement on the repeats. Concretely, given the string $ababab$, $abab$ is an overlapping repeat, while $ab$ is non-overlapping. We could use non-overlapping repeated sub-strings to assemble a set of traces $T$ and a disjoint mapping $f$ that achieves high coverage. While there exist standard suffix-tree algorithms to find repeated sub-strings, we found that the natural extensions of these algorithms to detect non-overlapping repeated sub-strings also resulted in quadratic runtime complexity.

Our Algorithm

In this work, we design a repeat-finding algorithm that is directly aware of the optimization problem in Section 3 and runs in $O(n \log n)$ time, where $n$ is the size of the token history buffer. At a high level, our algorithm makes a pass through a suffix array constructed from the input buffer to collect a set of candidate repeats. It then greedily selects the largest repeated sub-strings that do not overlap with any previously chosen sub-strings. Pseudocode for our algorithm is given in Algorithm 2, which takes a string $S$ and returns a set of sub-strings that achieve high coverage of $S$. We assume that the reader is familiar with suffix arrays and their structural properties; however, understanding Algorithm 2 is not required to understand its usage in Apophenia, as discussed in Section 4.3 and Section 4.4.

FindRepeats(S):
      SA, LCP ← SuffixArray(S)
      /* Candidates are tuples of string length, the repeated sub-string, and starting position. */
      C ← []
      foreach i ∈ [0, |SA| − 1) do
            /* Extract adjacent suffix array entries and their overlap length. */
            s1, s2, p ← SA[i], SA[i+1], LCP[i]
            if [s1 : s1+p) ∩ [s2 : s2+p) = ∅ then
                  /* S[s1 : s1+p] and S[s2 : s2+p] are repeated strings that do not overlap in S, so they are candidates. */
                  r ← S[s1 : s1+p]
                  C ← C + [(p, r, s1), (p, r, s2)]
            else
                  /* S[s1 : s1+p] and S[s2 : s2+p] overlap in S. Assume s2 > s1; the other case is symmetric. In this case, the overlap is a collection of repeats of S[s1 : s1+d], by the structure of the suffix array. */
                  d ← s2 − s1
                  /* Break the prefix into two chunks of repeated pieces of S[s1 : s1+d]. */
                  l ← (p + d) / 2
                  /* Remove trailing tokens. */
                  l ← l − (l mod d)
                  r ← S[s1 : s1+l]
                  C ← C + [(l, r, s1), (l, r, s1+l)]
      /* Sort the candidates by decreasing length and by increasing sub-string and start position. */
      Sort(C)
      /* Greedily collect sub-strings that do not overlap with previously chosen sub-strings. */
      I, R ← [], []
      foreach (l, _, s) ∈ C do
            if [s, s+l) does not intersect I then
                  I ← I + [[s, s+l)]
                  R ← R + [S[s : s+l]]
      return R

Algorithm 2 Non-overlapping repeated sub-strings.
Figure 4. Execution of Algorithm 2 on “aabcbcbaa”. The candidates for each suffix pair are shown between the pair.

As a first step, we construct a suffix array and a longest common prefix (LCP) array from the input buffer of tokens. We then iterate through adjacent pairs of suffixes to construct a set of candidate repeats, which are tuples of the repeat's length, the repeated sub-string itself, and its starting position in $S$. These candidates are constructed based on whether or not the shared prefixes of adjacent suffix array entries overlap. Once all of the candidates have been constructed, we sort them so as to greedily select candidates in order of decreasing length, taking as many occurrences of a particular sub-string as possible. We select only candidates that do not overlap with any previously selected candidates, and then deduplicate the chosen set of candidates to form the result. A sample execution of Algorithm 2 is shown in Figure 4.

Our algorithm can be implemented with time complexity $O(n \log n)$. Linear-time algorithms exist for suffix array and LCP array construction (Kasai et al., 2001). Two candidates are generated for each entry in the suffix array, so sorting the candidates takes $O(n \log n)$ time. The interval intersection step can be reduced to constant time by leveraging the candidate iteration order, so the entire loop executes in $O(n)$ time. In particular, an array of length $|S|$ can be maintained, and as each candidate is selected, all positions covered by the candidate are marked. Then, as candidates are iterated over in decreasing length and increasing start position order, interval intersection can be checked by testing whether the start or end of an interval is marked. Finally, deduplication can be done by generating a unique ID for each candidate sub-string in the candidate generation phase and adjusting the candidate representation to a tuple of length, ID, and starting position; this sort order allows deduplication to be performed at each iteration of the candidate selection loop.
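The following is a minimal runnable Python sketch of Algorithm 2. For clarity it uses a naive O(n² log n) suffix-array construction and a linear-scan overlap check rather than the linear-time constructions and constant-time intersection trick described above; it illustrates the algorithm and is not Apophenia's production implementation.

def find_repeats(S, min_len=2):
    # Find non-overlapping repeated sub-strings of the token list S.
    n = len(S)
    sa = sorted(range(n), key=lambda i: S[i:])  # naive suffix array

    def lcp(a, b):
        # Longest common prefix of the suffixes starting at a and b.
        k = 0
        while a + k < n and b + k < n and S[a + k] == S[b + k]:
            k += 1
        return k

    candidates = []
    for i in range(n - 1):
        s1, s2 = sorted((sa[i], sa[i + 1]))
        p = lcp(s1, s2)
        if p < min_len:
            continue
        if s1 + p <= s2:  # the two occurrences do not overlap
            r = tuple(S[s1:s1 + p])
            candidates += [(p, r, s1), (p, r, s2)]
        else:             # overlapping occurrences form a run of period d
            d = s2 - s1
            l = (p + d) // 2
            l -= l % d    # trim to a whole number of periods
            if l >= min_len:
                r = tuple(S[s1:s1 + l])
                candidates += [(l, r, s1), (l, r, s1 + l)]

    # Greedily keep the longest candidates that do not overlap earlier picks.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    covered = [False] * n
    repeats = []
    for l, r, s in candidates:
        if not any(covered[s:s + l]):
            for j in range(s, s + l):
                covered[j] = True
            if r not in repeats:  # deduplicate chosen sub-strings
                repeats.append(r)
    return repeats

For example, find_repeats(list("aabcbcbaa")) returns repeats corresponding to "aa" and "bc", each of which occurs twice without overlap.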

Our algorithm aims to find good solutions to the optimization problem in Section 3 by identifying long repeated sub-strings and selecting as many as possible that do not overlap with each other. We trade optimality for a lower asymptotic runtime through two heuristics. First, when adjacent suffix array entries contain a repetition, we consider only the maximal-length repetition instead of all sub-strings of the repetition. Second, when we select which candidates to keep, we greedily choose the largest candidates instead of performing a bin-packing computation. While we do not provide theoretical guarantees on the optimality of our algorithm, we show in Section 6 that Apophenia, using our algorithm, is able to identify good traces in complex, real-world applications.

4.3. Recognizing and Replaying Candidate Traces

Apophenia’s trace replayer uses Algorithm 2 to find candidate traces from the application’s history of tasks. In this section, we discuss how Apophenia’s trace replayer identifies and selects these candidate traces from the task stream to record and replay. Our design of the trace replayer has two major goals. First, the per-task overhead must be low: for good performance, the application must be able to issue tasks into the runtime as quickly as possible so that the runtime can either replay traces or perform dependence analysis ahead of execution, and slowing down the task launch rate would expose latency from various sources in the runtime. Second, Apophenia must balance exploration and exploitation when selecting traces. As more information about the application is gained, Apophenia should switch to better traces as it finds them. However, Apophenia should not leave a steady state until it is confident that performance can be improved, as recording new traces has a cost.

As discussed previously, Apophenia accumulates a history of tasks launched by the application and asynchronously uses Algorithm 2 to select candidate traces. Asynchronous analysis of task histories is important to avoid stalling the application by waiting for the analysis to finish before accepting the next task from the application.

When an asynchronous analysis completes, Apophenia ingests the results into a trie that maintains the current set of candidate traces. Along with this trie, Apophenia maintains a set of pointers into the trie that represent potentially matched traces. As tasks are issued, Apophenia updates the set of pointers by creating a new pointer for each new task, stepping existing pointers down the trie when possible, and removing pointers that become invalid. Once a pointer reaches a leaf of the trie and has matched a trace, Apophenia has the option to forward the trace to the tasking runtime, wrapped in tbegin and tend calls.
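A sketch of the pointer-advancement step in Python (the TrieNode structure and function names are hypothetical; the real trie also tracks trace metadata):

class TrieNode:
    def __init__(self):
        self.children = {}  # token -> TrieNode
        self.trace = None   # completed candidate trace, set at leaves

def advance(root, pointers, token):
    # Each issued token may begin a new potential match, extend existing
    # ones, or invalidate them; pointers that reach a leaf yield fully
    # matched candidate traces.
    pointers = pointers + [root]  # a new match may start at this token
    advanced, completed = [], []
    for node in pointers:
        nxt = node.children.get(token)
        if nxt is None:
            continue  # this potential match is no longer possible
        if nxt.trace is not None:
            completed.append(nxt.trace)  # candidate fully matched
        else:
            advanced.append(nxt)
    return advanced, completed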

Apophenia uses a scoring function to select which matched trace to replay when faced with multiple valid choices. The scoring function is based on the length of the candidate trace multiplied by a count of the number of times the trace has appeared. When calculating the score, we cap the count at a maximum value and exponentially decay it by the number of tasks encountered since the trace last appeared. Finally, we increase the score slightly if a trace has already been replayed.

Our scoring function encodes heuristics about trace selection and aims to balance exploration and exploitation. We naturally prefer long traces over shorter ones, as longer traces have the potential to eliminate more runtime overhead. Capping the appearance count allows Apophenia to eventually switch from a trace that appeared early during program execution to a better trace that appears later in the execution. Next, decaying the appearance count ensures that a seemingly promising trace that occurs infrequently does not eventually hit the count cap and disrupt a steady state. Finally, since recording new traces has a cost, when faced with traces of similar score, we bias Apophenia towards a trace it has already replayed.
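A sketch of such a scoring function follows; the structure mirrors the heuristics above, but the constants are illustrative, not Apophenia's tuned values.

def trace_score(length, count, tasks_since_last_seen, already_replayed,
                count_cap=16, decay=0.99, replay_bonus=1.1):
    # Length times a capped appearance count, decayed by how long ago the
    # trace last appeared, with a small bias toward already-replayed traces.
    effective_count = min(count, count_cap) * (decay ** tasks_since_last_seen)
    score = length * effective_count
    if already_replayed:
        score *= replay_bonus
    return score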

4.4. Achieving Responsiveness and Quality

Apophenia’s trace finder accumulates tasks into a buffer and mines the buffer for traces using Algorithm 2. An important question is what the size of that buffer should be. The buffer size trades off between the responsiveness of Apophenia’s trace identification and the quality of the traces Apophenia is able to find. With a small buffer, Apophenia can identify traces early, but will not be able to identify traces in programs with large loops. Meanwhile, a large buffer allows Apophenia to identify long traces in complex applications, but introduces significant startup delay in smaller applications.

Figure 5. Visualization of Apophenia’s buffer sampling strategy on a buffer of size 8. After processing the $i$’th task, Apophenia mines the buffer slice labeled $i$.

We did not want end users to have to continually adjust the buffer size parameter as their application changes, so some strategy for adapting the buffer size along this tradeoff space is necessary. We found that a strategy that attempts to dynamically resize the buffer based on the traces found so far is unsatisfactory, as the system is unable to differentiate between an application that is currently not repeating operations and an application that is repeating a sequence of operations larger than the buffer size. Instead, we propose a strategy that selects a large fixed buffer size upfront, and then samples smaller pieces of the buffer in a principled manner to remain responsive to the occurrence of short traces.

Apophenia samples from the buffer guided by the ruler function sequence (Wikipedia, 2024), which provides a practically useful sampling strategy with provable guarantees. The ruler function counts the number of times a number can be evenly divided by two: applying it to the sequence 1, 2, 3, 4, … yields 0, 1, 0, 2, …. Raising two to each element yields 1, 2, 1, 4, …, which we interpret as the sizes of buffer slices to analyze. For example, with a buffer size of four, as tasks arrive Apophenia would first analyze the first task, then the first two tasks, then the third task, and finally all four tasks. A visualization of this sampling policy is shown in Figure 5. This policy lets Apophenia quickly react to changes in the application by analyzing recent pieces of the buffer, while still allowing larger traces to be found by infrequently analyzing longer portions of the buffer. For example, sampling the full buffer in Figure 5 is required to find a trace that repeats in positions H2–H4 and H5–H7. In practice, we scale the exponentiated ruler function by a larger constant (such as 250) to determine the slices of the buffer to sample. Finally, given that our algorithm in Section 4 runs in $O(n \log n)$, our sampling strategy increases the total runtime complexity of processing the buffer by only an extra $\log$ factor, yielding a total of $O(n \log^2 n)$. This technique enables all of the experiments in Section 6 to be run with the same buffer size configuration parameter.
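A small sketch of this sampling schedule in Python (the scaling constant is illustrative):

def ruler(i):
    # Number of times i is evenly divisible by two (the ruler sequence).
    k = 0
    while i % 2 == 0:
        i //= 2
        k += 1
    return k

# Slice sizes analyzed after each arriving task, for a buffer of size 8:
sizes = [2 ** ruler(i) for i in range(1, 9)]  # [1, 2, 1, 4, 1, 2, 1, 8]
# After the i'th task, Apophenia mines the most recent sizes[i-1] tokens,
# so short traces are found quickly while the full buffer is still scanned
# once every 8 tasks. In practice each size is multiplied by a larger
# constant (such as 250).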

5. Implementation Discussion

We now discuss important aspects of a realistic implementation of Apophenia. In particular, we discuss the specifics of implementing Apophenia in a distributed context and a decision not to perform speculation when replaying traces.

5.1. Distributing the Analysis

Apophenia’s analysis as presented in Section 4 is sequential, processing tasks as they are issued by the application. In a distributed setting, Apophenia leverages Legion’s dynamic control replication (Bauer et al., 2021) to act as a sequential analysis, except for one component, which we discuss next. With control replication, the application executes on each node and Legion shards the dependence analysis and execution across nodes. The main restriction of control replication is that the application must issue the same sequence of tasks on every node. We implement Apophenia as a layer between the application and Legion, so Apophenia inherits the control replication requirements of the application. In particular, each node must agree on which traces to replay and when during program execution to record and replay the traces.

The only source of non-determinism in Apophenia that may result in divergent decisions between nodes is the asynchronous processing of token buffers described in Section 4.2. The asynchronous analysis may complete earlier on one node than another, resulting in that node replaying a trace before another node has identified that trace as a candidate. However, making the analysis synchronous would result in stalling the application until analyses complete. We resolve this tension by having each node agree on a count of processed operations to issue before ingesting the results of an asynchronous analysis. If any node had to wait on an asynchronous analysis to complete, all nodes increase their count of operations to wait on for the next analysis. This strategy reaches a steady state where analysis results are ingested in a deterministic manner without stalling the application.
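The sketch below illustrates this agreement scheme. The pending-analysis and communicator APIs are hypothetical, and the doubling back-off is one illustrative policy for "increasing the count"; the actual mechanism is built on Legion's control replication machinery.

def maybe_ingest(pending, ops_since_launch, threshold, comm):
    # All nodes ingest an asynchronous analysis only after issuing
    # `threshold` further operations, so ingestion happens at a
    # deterministic point in the task stream on every node.
    if ops_since_launch < threshold:
        return threshold, None  # keep issuing operations
    stalled = not pending.is_complete()
    results = pending.wait()    # blocks only if the analysis is late
    # Collectively agree whether any node stalled (a logical OR reduction).
    if comm.allreduce_or(stalled):
        threshold *= 2          # ingest later next time, avoiding stalls
    return threshold, results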

5.2. (The Lack of) Speculation

Speculation is a common technique in computer architecture to efficiently execute programs with data-dependent control flow. As Apophenia has similarities to speculative components in architecture like trace caches (Section 7), a natural design question was whether Apophenia should speculate on whether traces would be issued by the application. Our implementation of Apophenia does not speculate, and waits for the entirety of a trace to arrive before issuing the trace to Legion. The relative costs of different operations within the Legion runtime system made the potential upside of speculation not worth the implementation complexity.

Legion employs a pipelined architecture where a task flows through three stages: 1) the application phase, where the task is launched (into Apophenia), 2) the analysis phase, where the task is analyzed or replayed as part of a trace, and 3) the execution phase, where the task is executed. Depending on the cost ratio of the application and analysis phases, speculation may be beneficial, as Apophenia waits for an entire trace to pass through the application phase. However, Legion’s analysis phase is an order of magnitude more expensive than the application phase, letting the application phase run far ahead of the analysis phase. Thus, waiting for an entire trace to be issued by the application rarely stalls the pipeline or exposes latency in the overall runtime, and we found that designing a trace prediction algorithm and implementing a backup-rollback-recover scheme for speculation failures was not worth the complexity.

6. Evaluation

Figure 6. Weak scaling on previously traced Legion applications, (a) S3D and (b) HTR-Solver on Perlmutter, where Apophenia (“auto” in the figure) performs competitively.
Figure 7. Weak scaling on cuNumeric applications, (a) CFD and (b) TorchSWE on Eos, where Apophenia (“auto” in the figure) outperforms the untraced version.

Overview

We evaluate Apophenia on the largest and most complex Legion applications written to date, including production scientific simulations and a distributed deep learning framework. Our results show that Apophenia is able to effectively find traces in complex programs with low overhead, enabling programmers to experience the benefits of tracing without manual effort and allowing a more general set of applications to be traced.

Experimental Setup

We evaluated Apophenia on the Eos and Perlmutter supercomputers. Each node of Eos is an NVIDIA DGX H100, containing 8 H100 GPUs with 80 GB of memory each and a 112-core Intel Xeon Platinum. Each node of Perlmutter contains 4 NVIDIA A100 GPUs with 40 GB of memory each and a 64-core AMD EPYC 7763. Nodes of Eos are connected with an Infiniband interconnect, while Perlmutter uses a Slingshot interconnect. We compile Legion on Eos with the UCX networking module, and use the GASNet-EX (Bonachea and Hargrove, 2018) networking module on Perlmutter. We do not execute each application on both Perlmutter and Eos due to differences between the local environments on each machine; our experiments evaluate the relative performance differences between traced and untraced programs, so comparisons between machines are not significant.

6.1. Weak Scaling

In this section, we discuss weak scaling results of applications using Apophenia, as shown in Figure 6 and Figure 7. In a weak scaling study, we increase the problem size as the size of the target machine grows to keep the problem size per processor constant. For each application, we perform a sweep over different sizes of the problem to vary the task granularity, thus affecting the impact of runtime overhead. These problem sizes are denoted in the graphs by the “-s”, “-m”, and “-l” label suffixes, which stand for small, medium, and large. At smaller problem sizes, more runtime overhead can be exposed, while larger problem sizes make it easier to hide runtime overhead. In each weak-scaling plot, we report the steady-state throughput, in iterations per second, of each configuration and problem size after a number of warmup iterations (discussed in Section 6.3). Within a particular problem size, higher is better; across problem sizes, smaller problem sizes naturally achieve more iterations per second than larger ones.

S3D

S3D (Treichler et al., 2017) is a production combustion chemistry simulation code that has been developed over the course of many years by different scientists and engineers. The Legion port of S3D implements the right-hand-side function of the Runge-Kutta scheme, and interoperates with the legacy Fortran+MPI driver of the simulation. The integration between Legion and the legacy Fortran+MPI code leads to various constraints that the manual trace annotations interact with. For example, during the first 10 iterations, a hand-off between Legion and Fortran+MPI must occur every iteration, while after the first 10 iterations a hand-off is required only every 10 iterations. While not unmanageable, these interactions have led to relatively complicated logic to manually trace the main loop. We scale S3D on Perlmutter, and compare the performance of Apophenia to manually traced and untraced versions of S3D. The results are shown in Figure 6(a). Even on a single node, tracing has a noticeable performance impact on the smaller problem sizes and affects the scalability of S3D. Apophenia achieves within 0.92x–1.03x of the performance of the manually traced version, and speedups of between 0.98x and 1.82x over the untraced version.

HTR

HTR (Di Renzo et al., 2020) is a production hypersonic aerothermodynamics application. HTR performs multi-physics simulations of hypersonic flows at high enthalpies and Mach numbers, such as for simulations of the reentry of spacecraft into the atmosphere. Like S3D, we evaluate Apophenia’s performance on HTR on Perlmutter, and compare it against a manually traced version and an untraced version. While HTR without tracing performs competitively with the traced version at small GPU counts, Figure 6(b) shows that tracing is necessary for performance at scale. Apophenia achieves within 0.99x–1.01x of the performance of the manually traced version, and speedups of between 0.96x and 1.21x over the untraced version.

CFD

CFD is a cuNumeric application that solves the Navier-Stokes equations for 2D channel flow (Barba and Forsyth, 2019). Unlike S3D and HTR, there is not a manually traced version of CFD, due to the difficulties around composition discussed in Section 2. Developing a manually traced implementation of CFD would require either rewriting the application to remove any dynamic region allocation, or manual examination of allocator logs to find the number of iterations in the steady state. As a result, we compare CFD with Apophenia to the standard untraced version on different problem sizes, which is the performance that cuNumeric users are able to achieve today.

Figure 7(a) shows weak scaling results for CFD on Eos. These results are similar to HTR’s, where leveraging tracing is necessary for performance at scale. On the smallest problem size, even though tracing removes a large amount of runtime overhead, the tasks are too small to hide the communication latency at larger scales, leading to the observed fall-off in performance. On larger problems, CFD with Apophenia is able to maintain high performance while the untraced version falls off, yielding speedups of between 0.92x and 2.64x.

TorchSWE

TorchSWE is a cuNumeric port of the MPI-based TorchSWE (Chuang, 2021) shallow-water equation solver, and is the largest cuNumeric application developed so far. Similarly to CFD, there is no manually traced version to compare to. However, unlike CFD, performing a rewrite of TorchSWE to enable manual tracing would be difficult, as TorchSWE contains an order of magnitude more lines of code. Weak scaling results for TorchSWE on Eos are shown in Figure 7(b).

These results demonstrate that there does not exist a problem size for TorchSWE on Eos that can hide Legion’s runtime overhead without tracing. Even the large problem size, which nearly reaches the GPUs’ memory capacity, exposes Legion runtime overhead at 8 GPUs. The reason is that TorchSWE maintains a large number of fields for each simulated point, and issues different array operations on each field. The amount of data needed for each element in the simulation does not allow the task granularity to be easily increased, as each new element added increases the memory footprint more than it increases the average task granularity. For such applications, leveraging tracing is a requirement, and Apophenia enables complex applications like TorchSWE to do so automatically. TorchSWE itself contains enough task parallelism to hide communication latencies, but needs tracing to first lower runtime overhead. With Apophenia, we achieve speedups of between 0.91x and 2.82x on TorchSWE, achieving nearly perfect scalability on 64 GPUs.

6.2. Strong-Scaling

Figure 8. Strong scaling of FlexFlow on Eos.

We now move from scientific simulation codes to distributed deep neural network training with FlexFlow (Jia et al., 2018; Unger et al., 2022). FlexFlow is a deep neural network framework that searches for hybrid parallelization strategies for different layers of the network. We perform a strong-scaling experiment with FlexFlow on Eos to train the largest (pilot1) network from the CANDLE (can, [n. d.]) initiative (due to engineering limitations in FlexFlow at the time of writing, the network was parallelized only with data parallelism). A strong-scaling study fixes the total problem size and increases the number of processors. To strong scale the training, we fix the batch size for a single GPU, and then increase the number of GPUs available.

We compare the performance of FlexFlow with manual trace annotations, two configurations of Apophenia (discussed next), and no tracing. As seen in Figure 8, as FlexFlow scales up, the tasks become smaller and begin to expose Legion runtime overhead without tracing, leading to slowdowns when scaling up. The two configurations of Apophenia differ in the maximum trace length to be replayed (Apophenia’s history buffer is the same, but recorded traces are broken into pieces of a given maximum size). The first (auto-5000) is the standard configuration with no maximum, as used in all other experiments, and the second (auto-200) has a maximum length of 200 tasks, which is similar to the length of the manually annotated trace. As FlexFlow strong scales, the cost of Legion issuing the trace replay becomes exposed as the execution time of the trace decreases, so shorter traces expose less latency and thus perform better (the Legion team is aware of this shortcoming and plans to address it in the future). On 32 GPUs, the configuration of Apophenia with a maximum trace length of 200 achieves 0.97x the performance of the manually traced FlexFlow, and a 1.5x speedup over the untraced FlexFlow.

6.3. Overheads of Apophenia

We now discuss the overheads that Apophenia imposes over standard execution with Legion. While we inherit the overheads of Legion’s existing tracing infrastructure (Lee et al., 2018) (the cost of memoizing traces), Apophenia imposes two new sources of overhead to measure: 1) the overhead on task launches and 2) the time taken until a steady state is reached.

As discussed in Section 4, Apophenia intercepts the application’s task launches and performs some analysis work before forwarding the task launches to Legion. This analysis work includes launching asynchronous token buffer processing jobs and manipulating traversals of the trie data structures used for online trace identification. To quantify this overhead, we ran a two-node experiment on Perlmutter and measured the time it took to launch (not analyze or execute) Legion tasks with and without Apophenia enabled. We ran a two-node experiment to ensure that the coordination logic discussed in Section 5.1 was included in the timing. We found that task launching took on average 7μs without Apophenia, and on average 12μs with Apophenia. While Apophenia increases the task launch overhead, this overhead is still significantly lower than the amount of time it takes to replay a task as part of a trace, which is ∼100μs. As such, the task launching cost of Apophenia can still be effectively hidden by the asynchronous runtime architecture. The asynchronous analysis jobs that Apophenia launches to process task histories do not affect the critical path, and utilize Legion’s background worker threads. While in theory these jobs could compete for the resources necessary for Legion’s dependence analysis, we have not yet encountered an application where they caused a detriment in performance.

To measure the time taken until Apophenia reaches a steady state of replaying traces on our iterative applications, we report the number of iterations until a steady state is reached. Figure 9 contains the iteration counts needed for each application in Section 6.1 and Section 6.2, which range from 30 to 300. These simulation and machine learning workloads would be run in production for a significantly larger number of iterations. We note that the cuNumeric applications have a larger number of required warmup iterations due to the dynamic behavior discussed in Section 2, where a single application-level iteration of the program does not necessarily correspond to a repeated sequence of tasks.

Application Iterations Until Steady State
S3D 50
HTR 50
CFD 300
TorchSWE 300
FlexFlow 30
Figure 9. Warmup iterations before Apophenia reaches a replaying steady state.
Figure 10. Visualization of Apophenia finding traces in S3D.

6.4. Trace Search

To give intuition about the search process that Apophenia performs, we constructed a visualization of the amount of runtime overhead that Apophenia removes over time. Figure 10 visualizes S3D over time (for 70 iterations): for each task launched by S3D, we display how many of the previous 5000 tasks were traced. For iterative computations, this procedure yields the expected result, where Apophenia spends time during program startup discovering new traces, and then settles into a steady state. The amount of traced operations increases slightly by the end of the execution, as Apophenia finds a better set of traces that lowers the number of untraced operations.

7. Related Work

Just-In-Time Compilers

Just-In-Time (JIT) compilers (Hölzle and Ungar, 1996; Gal et al., 2009; Paleczny et al., 2001) for dynamic languages have a tiered execution system, where the target language is first translated to bytecode, which is executed by an interpreter. Frequently executed program fragments are then compiled into native instructions for significantly faster execution. Apophenia employs a similar architecture where a task-based runtime system’s dynamic analysis acts as the slow but general interpreter, and uses a tracing engine as the fast but specialized compiler. JIT compilers rely on code landmarks like function definitions and basic block addresses to maintain counters of frequently executed program fragments. Since Apophenia views an unrolled stream of tasks, it must employ novel techniques for identification of traceable program fragments.

Trace Caches

Trace caches (Rotenberg et al., 1996) have been used in processors to improve instruction fetch bandwidth. At a high level, trace caches record the common jump paths taken through basic blocks and pre-fetch those paths when the same basic blocks are revisited. Apophenia shares a similar architecture with trace caches, which also use patterns in running programs to improve the performance of a slower dynamic component (in their case, control-dependent instruction fetching). Like JIT compilers, trace caches use landmarks in executing programs to guide their decisions, which Apophenia cannot exploit. Also, by virtue of being implemented in hardware, the mechanisms that trace caches use must be simpler than the software analyses that Apophenia can employ.

String Analysis

Section 4.2 contains a partial discussion of related string analysis work; we continue the discussion here. The most relevant string processing problem in the bioinformatics community is motif finding (Das and Dai, 2007): the problem of finding short (5–20 tokens long), fixed-length repeated strings in a larger corpus. The focus on short, fixed-length sub-strings and the tendency to use genomic information to guide the search make these techniques inapplicable to our problem. Algorithms for document fingerprinting such as Moss (Schleimer et al., 2003) accurately identify copies between documents; in particular, they are guaranteed to detect repetitions above a minimum size across documents. Fingerprinting techniques are thus useful for detecting whether repeated sub-strings exist, but do not directly help find the high-coverage sub-strings themselves.
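For concreteness, below is a compact sketch of winnowing-style fingerprinting over a token stream; the hash choice and the O(nw) window scan are simplifications of Schleimer et al.’s algorithm. Any repetition of length at least w + k − 1 is guaranteed to share a fingerprint, which certifies that a repeat exists without recovering the repeated sub-strings.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hash every k-gram of `s`, then keep the minimum hash within each window
// of w consecutive k-grams; the resulting set is the document fingerprint.
std::set<std::uint64_t> winnow(const std::string& s, std::size_t k,
                               std::size_t w) {
  std::vector<std::uint64_t> grams;
  for (std::size_t i = 0; i + k <= s.size(); ++i)
    grams.push_back(std::hash<std::string>{}(s.substr(i, k)));
  std::set<std::uint64_t> fingerprints;
  for (std::size_t i = 0; i + w <= grams.size(); ++i)
    fingerprints.insert(
        *std::min_element(grams.begin() + i, grams.begin() + i + w));
  return fingerprints;
}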

Inspector-Executor Frameworks

Apophenia is similar in spirit to Inspector-Executor (I/E) frameworks that dynamically analyze program behavior and then perform optimizations (Ravishankar et al., 2015, 2012). I/E frameworks generally record information about array accesses and use knowledge of these accesses to perform compiler optimizations that parallelize or distribute loops. In contrast, Apophenia observes a dynamic sequence of tasks and searches for repeated sub-sequences of tasks to record as traces.

Task-Based Runtime Systems

Several task-based runtime systems have been developed for high performance computing (Bauer et al., 2012; Augonnet et al., 2011; Bosilca et al., 2011), data science (Dask Development Team, 2016; Zaharia et al., 2012), and machine learning (Moritz et al., 2017; Barham et al., 2022). One axis of runtime overhead that these systems impose on applications is the cost of dependence analysis, which is directly related to the expressivity and flexibility of the runtime system’s programming model. Legion has an expressive data model that supports content-based coherence (Bauer et al., 2023), leading to a relatively expensive dependence analysis; as a result, tracing (Lee et al., 2018) was developed to reduce its cost. A tracing-like technique called Execution Templates (Mashayekhi et al., 2017) was similarly developed to cache control plane decisions in runtime systems for cloud-based environments.

8. Conclusion

In this work, we introduce Apophenia, a system and framework that enables task-based runtime systems to automatically trace the dependence analysis of repeated program fragments. By automatically detecting traces, Apophenia improves programmer productivity by insulating programmers from changes in task granularity, and enables new applications to take advantage of tracing. We develop an implementation of Apophenia that targets the Legion runtime system and show that, on the most complex Legion applications written to date, Apophenia matches the performance of manually traced code and effectively optimizes previously untraceable programs to improve performance at scale by up to 2.82x.

References

  • can ([n. d.]) [n. d.]. CANDLE — Exascale Deep Learning and Simulation Enabled Precision Medicine for Cancer — wordpress.cels.anl.gov. https://wordpress.cels.anl.gov/candle/. [Accessed 06-05-2024].
  • Augonnet et al. (2011) Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André Wacrenier. 2011. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience 23, 2 (2011), 187–198. https://doi.org/10.1002/cpe.1631
  • Barba and Forsyth (2019) Lorena Barba and Gilbert Forsyth. 2019. CFD Python: the 12 steps to Navier-Stokes equations. Journal of Open Source Education 2, 16 (2019), 21. https://doi.org/10.21105/jose.00021
  • Barham et al. (2022) Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Dan Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, Brennan Saeta, Parker Schuh, Ryan Sepassi, Laurent El Shafey, Chandramohan A. Thekkath, and Yonghui Wu. 2022. Pathways: Asynchronous Distributed Dataflow for ML. arXiv:2203.12533 [cs.DC]
  • Bauer and Garland (2019) Michael Bauer and Michael Garland. 2019. Legate NumPy: accelerated and distributed array computing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, Colorado) (SC ’19). Association for Computing Machinery, New York, NY, USA, Article 23, 23 pages. https://doi.org/10.1145/3295500.3356175
  • Bauer et al. (2021) Michael Bauer, Wonchan Lee, Elliott Slaughter, Zhihao Jia, Mario Di Renzo, Manolis Papadakis, Galen Shipman, Patrick McCormick, Michael Garland, and Alex Aiken. 2021. Scaling implicit parallelism via dynamic control replication. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Virtual Event, Republic of Korea) (PPoPP ’21). Association for Computing Machinery, New York, NY, USA, 105–118. https://doi.org/10.1145/3437801.3441587
  • Bauer et al. (2023) Michael Bauer, Elliott Slaughter, Sean Treichler, Wonchan Lee, Michael Garland, and Alex Aiken. 2023. Visibility Algorithms for Dynamic Dependence Analysis and Distributed Coherence. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (Montreal, QC, Canada) (PPoPP ’23). Association for Computing Machinery, New York, NY, USA, 218–231. https://doi.org/10.1145/3572848.3577515
  • Bauer et al. (2012) Michael Bauer, Sean Treichler, Elliott Slaughter, and Alex Aiken. 2012. Legion: expressing locality and independence with logical regions. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (Salt Lake City, Utah) (SC ’12). IEEE Computer Society Press, Washington, DC, USA, Article 66, 11 pages.
  • Bonachea and Hargrove (2018) Dan Bonachea and Paul H. Hargrove. 2018. GASNet-EX: A High-Performance, Portable Communication Library for Exascale. In Proceedings of Languages and Compilers for Parallel Computing (LCPC’18) (Lecture Notes in Computer Science, Vol. 11882). Springer International Publishing. https://doi.org/10.25344/S4QP4W
  • Bosilca et al. (2011) George Bosilca, Aurelien Bouteiller, Anthony Danalis, Thomas Herault, Pierre Lemarinier, and Jack Dongarra. 2011. DAGuE: A Generic Distributed DAG Engine for High Performance Computing. In 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum. 1151–1158. https://doi.org/10.1109/IPDPS.2011.281
  • Chuang (2021) Pi-Yueh Chuang. 2021. TorchSWE: GPU shallow-water equation solver.
  • Das and Dai (2007) Modan K. Das and Ho-Kwok Dai. 2007. A survey of DNA motif finding algorithms. BMC Bioinformatics 8, 7 (01 Nov 2007), S21. https://doi.org/10.1186/1471-2105-8-S7-S21
  • Dask Development Team (2016) Dask Development Team. 2016. Dask: Library for dynamic task scheduling. http://dask.pydata.org
  • Di Renzo et al. (2020) Mario Di Renzo, Lin Fu, and Javier Urzay. 2020. HTR solver: An open-source exascale-oriented task-based multi-GPU high-order code for hypersonic aerothermodynamics. Computer Physics Communications 255 (2020), 107262. https://doi.org/10.1016/j.cpc.2020.107262
  • Gal et al. (2009) Andreas Gal, Brendan Eich, Mike Shaver, David Anderson, David Mandelin, Mohammad R. Haghighat, Blake Kaplan, Graydon Hoare, Boris Zbarsky, Jason Orendorff, Jesse Ruderman, Edwin W. Smith, Rick Reitmaier, Michael Bebenita, Mason Chang, and Michael Franz. 2009. Trace-based just-in-time type specialization for dynamic languages. In Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation (Dublin, Ireland) (PLDI ’09). Association for Computing Machinery, New York, NY, USA, 465–478. https://doi.org/10.1145/1542476.1542528
  • Hölzle and Ungar (1996) Urs Hölzle and David Ungar. 1996. Reconciling responsiveness with performance in pure object-oriented languages. ACM Trans. Program. Lang. Syst. 18, 4 (jul 1996), 355–400. https://doi.org/10.1145/233561.233562
  • Jia et al. (2018) Zhihao Jia, Matei Zaharia, and Alex Aiken. 2018. Beyond Data and Model Parallelism for Deep Neural Networks. CoRR abs/1807.05358 (2018). arXiv:1807.05358 http://arxiv.org/abs/1807.05358
  • Kasai et al. (2001) Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park. 2001. Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In Combinatorial Pattern Matching, Amihood Amir (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 181–192.
  • Lee et al. (2018) Wonchan Lee, Elliott Slaughter, Michael Bauer, Sean Treichler, Todd Warszawski, Michael Garland, and Alex Aiken. 2018. Dynamic tracing: memoization of task graphs for dynamic task-based runtimes. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (Dallas, Texas) (SC ’18). IEEE Press, Article 34, 13 pages.
  • Mashayekhi et al. (2017) Omid Mashayekhi, Hang Qu, Chinmayee Shah, and Philip Levis. 2017. Execution templates: caching control plane decisions for strong scaling of data analytics. In Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference (Santa Clara, CA, USA) (USENIX ATC ’17). USENIX Association, USA, 513–526.
  • Moritz et al. (2017) Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, William Paul, Michael I. Jordan, and Ion Stoica. 2017. Ray: A Distributed Framework for Emerging AI Applications. CoRR abs/1712.05889 (2017). arXiv:1712.05889 http://arxiv.org/abs/1712.05889
  • Paleczny et al. (2001) Michael Paleczny, Christopher Vick, and Cliff Click. 2001. The Java HotSpot™ server compiler. In Proceedings of the 2001 Symposium on Java™ Virtual Machine Research and Technology Symposium - Volume 1 (Monterey, California) (JVM’01). USENIX Association, USA, 1.
  • Ravishankar et al. (2015) Mahesh Ravishankar, Roshan Dathathri, Venmugil Elango, Louis-Noël Pouchet, J. Ramanujam, Atanas Rountev, and P. Sadayappan. 2015. Distributed memory code generation for mixed Irregular/Regular computations. SIGPLAN Not. 50, 8 (jan 2015), 65–75. https://doi.org/10.1145/2858788.2688515
  • Ravishankar et al. (2012) Mahesh Ravishankar, John Eisenlohr, Louis-Noel Pouchet, J. Ramanujam, Atanas Rountev, and P. Sadayappan. 2012. Code generation for parallel execution of a class of irregular loops on distributed memory systems. In SC ’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 1–11. https://doi.org/10.1109/SC.2012.30
  • Rotenberg et al. (1996) E. Rotenberg, S. Bennett, and J.E. Smith. 1996. Trace cache: a low latency approach to high bandwidth instruction fetching. In Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29. 24–34. https://doi.org/10.1109/MICRO.1996.566447
  • Schleimer et al. (2003) Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken. 2003. Winnowing: local algorithms for document fingerprinting. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (San Diego, California) (SIGMOD ’03). Association for Computing Machinery, New York, NY, USA, 76–85. https://doi.org/10.1145/872757.872770
  • Sisco et al. (2023) Zachary D. Sisco, Jonathan Balkind, Timothy Sherwood, and Ben Hardekopf. 2023. Loop Rerolling for Hardware Decompilation. Proc. ACM Program. Lang. 7, PLDI, Article 123 (jun 2023), 23 pages. https://doi.org/10.1145/3591237
  • Slaughter et al. (2020) Elliott Slaughter, Wei Wu, Yuankun Fu, Legend Brandenburg, Nicolai Garcia, Wilhem Kautz, Emily Marx, Kaleb S. Morris, Qinglei Cao, George Bosilca, Seema Mirchandaney, Wonchan Leek, Sean Treichlerk, Patrick McCormick, and Alex Aiken. 2020. Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 1–15. https://doi.org/10.1109/SC41405.2020.00066
  • Storer and Szymanski (1982) James A. Storer and Thomas G. Szymanski. 1982. Data compression via textual substitution. J. ACM 29, 4 (oct 1982), 928–951. https://doi.org/10.1145/322344.322346
  • Stoye and Gusfield (2002) Jens Stoye and Dan Gusfield. 2002. Simple and flexible detection of contiguous repeats using a suffix tree. Theoretical Computer Science 270, 1 (2002), 843–856. https://doi.org/10.1016/S0304-3975(01)00121-9
  • Treichler et al. (2017) Sean Treichler, Michael Bauer, Ankit Bhagatwala, Giulio Borghesi, Ramanan Sankaran, Hemanth Kolla, Patrick Mccormick, Elliott Slaughter, Wonchan Lee, Alex Aiken, and Jacqueline H. Chen. 2017. S3D-Legion: An Exascale Software for Direct Numerical Simulation of Turbulent Combustion with Complex Multicomponent Chemistry. (11 2017). https://doi.org/10.1201/b21930-12
  • Unger et al. (2022) Colin Unger, Zhihao Jia, Wei Wu, Sina Lin, Mandeep Baines, Carlos Efrain Quintero Narvaez, Vinay Ramakrishnaiah, Nirmal Prajapati, Pat McCormick, Jamaludin Mohd-Yusof, Xi Luo, Dheevatsa Mudigere, Jongsoo Park, Misha Smelyanskiy, and Alex Aiken. 2022. Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). USENIX Association, Carlsbad, CA, 267–284. https://www.usenix.org/conference/osdi22/presentation/unger
  • Welch (1984) Terry A. Welch. 1984. A Technique for High-Performance Data Compression. Computer 17, 6 (1984), 8–19. https://doi.org/10.1109/MC.1984.1659158
  • Wikipedia (2024) Wikipedia. 2024. Ruler function — Wikipedia, The Free Encyclopedia. http://en.wikipedia.org/w/index.php?title=Ruler%20function&oldid=1193825609. [Online; accessed 02-May-2024].
  • Yadav et al. (2023) Rohan Yadav, Wonchan Lee, Melih Elibol, Manolis Papadakis, Taylor Lee-Patti, Michael Garland, Alex Aiken, Fredrik Kjolstad, and Michael Bauer. 2023. Legate Sparse: Distributed Sparse Computing in Python. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Denver, CO, USA) (SC ’23). Association for Computing Machinery, New York, NY, USA, Article 13, 13 pages. https://doi.org/10.1145/3581784.3607033
  • Zaharia et al. (2012) Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (San Jose, CA) (NSDI’12). USENIX Association, USA, 2.
  • Ziv and Lempel (1977) J. Ziv and A. Lempel. 1977. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23, 3 (1977), 337–343. https://doi.org/10.1109/TIT.1977.1055714
  • Ziv and Lempel (1978) J. Ziv and A. Lempel. 1978. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory 24, 5 (1978), 530–536. https://doi.org/10.1109/TIT.1978.1055934