ACM Trans. Program. Lang. Syst., Vol. 42, No. 2, Article 8, Publication date: May 2020.
DOI: https://doi.org/10.1145/3382092
Computing precise (fully flow- and context-sensitive) and exhaustive (as against demand-driven) points-to information is known to be expensive. Top-down approaches require repeated analysis of a procedure for separate contexts. Bottom-up approaches need to model unknown pointees accessed indirectly through pointers that may be defined in the callers and hence do not scale while preserving precision. Therefore, most approaches to precise points-to analysis begin with a scalable but imprecise method and then seek to increase its precision. We take the opposite approach in that we begin with a precise method and increase its scalability. In a nutshell, we create naive but possibly non-scalable procedure summaries and then use novel optimizations to compact them while retaining their soundness and precision.
For this purpose, we propose a novel abstraction called the generalized points-to graph (GPG), which views points-to relations as memory updates and generalizes them using the counts of indirection levels leaving the unknown pointees implicit. This allows us to construct GPGs as compact representations of bottom-up procedure summaries in terms of memory updates and control flow between them. Their compactness is ensured by strength reduction (which reduces the indirection levels), control flow minimization (which removes control flow edges while preserving soundness and precision), and call inlining (which enhances the opportunities of these optimizations).
The effectiveness of GPGs lies in the fact that they discard as much control flow as possible without losing precision. This is the reason GPGs are very small even for main procedures that contain the effect of the entire program. This allows our implementation to scale to 158 kLoC for C programs.
At a more general level, GPGs provide a convenient abstraction to represent and transform memory in the presence of pointers. Future investigations can try to combine it with other abstractions for static analyses that can benefit from points-to information.
ACM Reference format:
Pritam M. Gharat, Uday P. Khedker, and Alan Mycroft. 2020. Generalized Points-to Graphs: A Precise and Scalable Abstraction for Points-to Analysis. ACM Trans. Program. Lang. Syst. 42, 2, Article 8 (May 2020), 78 pages. https://doi.org/10.1145/3382092
Points-to analysis discovers information about indirect accesses in a program. Its precision influences the precision and scalability of client program analyses significantly. Computationally intensive analyses such as model checking are noted as being ineffective on programs containing pointers, partly because of imprecision of points-to analysis [2].
We focus on exhaustive as against demand-driven [7, 13, 36, 37] points-to analysis. A demand-driven points-to analysis computes points-to information that is relevant to a query raised by a client analysis; for a different query, the points-to analysis needs to be repeated. An exhaustive analysis, however, computes all points-to information that can be queried later by a client analysis; multiple queries do not require points-to analysis to be repeated. For precision of points-to information, we are interested in full flow- and context-sensitive points-to analysis. A flow-sensitive analysis respects the control flow and computes separate dataflow information at each program point. This matters because a pointer could have different pointees at different program points because of redefinitions. Hence, a flow-sensitive analysis provides more precise results than a flow-insensitive analysis but can become inefficient at the interprocedural level. A context-sensitive analysis distinguishes between different calling contexts of procedures and restricts the analysis to interprocedurally valid control flow paths (i.e., control flow paths from program entry to program exit in which every return from a procedure is matched with a call to the procedure such that all call-return matchings are properly nested). A fully context-sensitive analysis does not lose precision even in the presence of recursion. Both flow- and context-sensitivity enhance precision, and we aim to achieve this without compromising efficiency.
A top-down approach to interprocedural context-sensitive analysis propagates information from callers to callees [47] effectively traversing the call graph top-down. In the process, it analyzes a procedure each time a new dataflow value reaches it from some call. Several popular approaches fall in this category: the call-strings method [34], its value-based variants [20, 29], and the tabulation-based functional method [30, 34]. By contrast, bottom-up approaches [5, 9, 12, 16, 26, 32, 39, 42, 43, 44, 45, 46, 47] avoid analyzing a procedure multiple times by constructing its procedure summary, which is used to incorporate the effect of calls to the procedure. Effectively, this approach traverses the call graph bottom-up.1 A flow- and context-sensitive interprocedural analysis using procedure summaries is performed in two phases: the first phase constructs the procedure summaries, and the second phase uses them to represent the effect of the calls at the call sites.
For points-to analysis, an additional dimension of context sensitivity arises because heap locations are typically abstracted using allocation sites—all locations allocated by the same statement are treated alike. These allocation sites could be created context insensitively or could be cloned based on the contexts. We summarize various methods of points-to analysis using the metric described in Figure 20 and use it to position our work in Section 11.
Most approaches to precise points-to analysis begin with a scalable but imprecise method and then seek to increase its precision. We take the opposite approach in that we begin with a precise method and increase its scalability. We create naive, possibly non-scalable, procedure summaries and then use novel optimizations to compact them while retaining their soundness and precision. More specifically, we advocate a new form of bottom-up procedure summaries, called generalized points-to graphs (GPGs), for flow- and context-sensitive points-to analysis. GPGs represent memory transformers (summarizing the effect of a procedure) and contain generalized points-to updates (GPUs) representing individual memory updates along with the control flow between them. GPGs are compact—their compactness is achieved by a careful choice of a suitable representation and a series of optimizations as described next:
These optimizations are based on the following novel operations and analyses:
At a practical level, our main contribution is a method of flow-sensitive, field-sensitive, and context-sensitive exhaustive points-to analysis of C programs that scales to large real-life programs.
The core ideas of GPGs have been presented before [11]. This article provides a complete treatment and enhances the core ideas significantly. We describe our formulations for a C-like language.
Section 2 describes the limitations of past approaches. Section 3 introduces the concept of GPUs that form the basis of GPGs and provides an overview of GPG construction through a motivating example. Section 4 describes the strength reduction optimization performed on GPGs. Section 5 explains dead GPU elimination. Section 6 describes control flow minimization optimizations performed on GPGs. Section 7 explains the interprocedural use of GPGs by defining call inlining and shows how recursion is handled. Section 8 shows how GPGs are used for performing points-to analysis. Section 9 proves soundness and precision of our method by showing its equivalence with a top-down flow- and context-sensitive classical points-to analysis. Section 10 presents empirical evaluation on SPEC benchmarks, and Section 11 describes related work. Section 12 concludes the article.
Some details (handling fields of structures and unions, heap memory, function pointers, etc.) are available in an appendix available electronically.2 We have included cross-references to the material in the appendix where relevant.
This section reviews some basic concepts and describes the challenges in constructing procedure summaries for efficient points-to analysis. It concludes by describing the limitations of the past approaches and outlining our key ideas. For further details of related work, see Section 11.
In this section, we describe the nature of memory, memory updates, and memory transformers.
2.1.1 Abstract and Concrete Memory. There are two views of memory and operations on it. First, we have the concrete memory view corresponding to runtime operations, where a memory location representing a pointer always points to exactly one memory location or holds an invalid address (e.g., NULL). Second, we have the abstract memory view used by static analysis, where a pointer may point to any of a set of possible locations.
2.1.2 Memory Transformer. A procedure summary for points-to analysis should represent memory updates in terms of copying locations, loading from locations, or storing to locations. We call it a memory transformer because it computes the memory after a call to a procedure from the memory before the call. Given a memory $M$ and a memory transformer $\Delta$, the updated memory $M'$ is computed by $M' = \Delta(M)$ as illustrated in Example 2 (Section 2.3).
2.1.3 Strong and Weak Updates. In concrete memory, every assignment overwrites the contents of the (single) memory location corresponding to the LHS of the assignment. However, in abstract memory, we may be uncertain as to which of several locations a variable (say $p$) points to. Hence, an indirect assignment such as $*p=\&x$ does not overwrite any of these locations but merely adds $x$ to their possible pointees. This is a weak update. Sometimes, however, there is only one possible abstract location described by the LHS of an assignment, and in this case we may, in general, replace the contents of this location. This is a strong update. There is just one subtlety that we return to later: prior to the preceding assignment, we may have only one assignment to $p$ (say $p=\&a$). If this latter assignment dominates the former, then a strong update is appropriate. But if the latter assignment appears only on some control flow paths to the former, then we say that the read of $p$ in $*p=\&x$ is upwards exposed (i.e., live on entry to the current procedure) and therefore may have additional pointees unknown to the current procedure. Thus, the criterion for a strong update in an assignment is that its LHS references a single location and the location referenced is not upwards exposed (for more details, see Section 4.4). A direct assignment to a variable (e.g., $p = \&x$) is a special case of a strong update.
When a value is stored in a location, we say that the location is defined without specifying whether the update is strong or weak and make the distinction only where required.
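The strong/weak distinction can be sketched concretely. The following Python fragment is our illustration, not the authors' implementation; the representation of abstract memory as a map from variables to sets of pointees and the `upwards_exposed` set are assumptions made for the sketch.

```python
# Abstract memory: var -> set of possible pointees.
# An indirect assignment *p = &rhs is a strong update only when p has
# exactly one pointee and p is not upwards exposed; otherwise each
# possible pointee of p merely accumulates rhs (weak update).
def assign_indirect(mem, p, rhs, upwards_exposed):
    targets = mem.get(p, set())
    if len(targets) == 1 and p not in upwards_exposed:
        (loc,) = targets
        mem[loc] = {rhs}                      # strong update: overwrite
    else:
        for loc in targets:                   # weak update: accumulate
            mem.setdefault(loc, set()).add(rhs)
    return mem
```

With `mem = {"p": {"a"}, "a": {"b"}}`, the update overwrites the pointees of `a`; with two possible pointees of `p`, or with `p` upwards exposed, it merely accumulates the new pointee.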
In the absence of indirect assignments involving pointers, data dependence between memory updates within a procedure can be inferred by using variable names without requiring any information from the callers. In such a situation, procedure summaries for some analyses, including various bit-vector dataflow analyses (e.g., live variables analysis), can be precisely represented by constant gen and kill sets [1, 22] or graph paths discovered using reachability [30].
Procedure summaries for points-to analysis, however, cannot be represented in terms of constant gen and kill sets because the association between pointer variables and their pointee locations could change in the procedure and may depend on the aliases between pointer variables established in the callers of the procedure. Often, and particularly for points-to analysis, we have a situation where a procedure summary must either lose information or retain internal details that can only be resolved when its caller is known.
The preceding example illustrates the following challenges in constructing flow-sensitive memory transformers: (a) representing indirectly accessed unknown pointees, (b) identifying blocking assignments and postponing some optimizations, and (c) recording control flow between memory updates so that potential data dependence between them is neither violated nor overapproximated.
Thus, a flow-sensitive memory transformer for points-to analysis requires a compact representation for memory updates that captures the minimal control flow between them succinctly.
A common solution for modeling indirect accesses of unknown pointees in a memory transformer is to use placeholders (also known as external variables [26, 39, 42] and extended parameters [43]). They are pattern-matched against the input memory to compute the output memory. Here we describe two broad approaches that use placeholders.
The first approach, which we call a multiple transfer functions (MTFs) approach, proposes a precise representation of a procedure summary for points-to analysis as a collection of (conditional) partial transfer functions (PTFs) [5, 16, 43, 46]. Each PTF corresponds to a combination of aliases that might occur in the callers of a procedure. Our work is inspired by the second approach, which we call a single transfer function (STF) approach [4, 6, 23, 26, 27, 39, 42]. This approach does not customize procedure summaries for combinations of aliases. However, the existing STF approach fails to be precise. We illustrate this approach and its limitations using Figure 1 to motivate our key ideas. It shows a procedure, two memory transformers for it ($\Delta'$ and $\Delta''$), and the associated input and output memories. The effect of $\Delta'$ is explained in Example 2 and that of $\Delta''$ in Example 3.
Transformer $\Delta'$ in Figure 1 is constructed by the STF approach. It is an abstract points-to graph containing placeholders $\phi_i$ for modeling unknown pointees. For example, $\phi_1$ represents the pointees of $y$, and $\phi_2$ represents the pointees of pointees of $y$. Note that a memory is a snapshot of points-to edges, whereas a memory transformer needs to distinguish the points-to edges that are generated by it (shown by thick edges) from those that are carried forward from the input memory (shown by thin edges).
The two accesses of $y$ in statements 1 and 3 may or may not refer to the same location because of a possible side effect of the intervening assignment in statement 2. If $x$ and $y$ are aliased in the input memory (e.g., in $M_2$), statement 2 redefines the pointee of $y$, and hence $p$ and $q$ will not be aliased in the output memory. However, $\Delta'$ uses the same placeholder for all accesses of a pointee. Further, $\Delta'$ also suppresses strong updates because the control flow between memory updates is not recorded. Hence, points-to edge $s \rightarrow c$ in $M_1'$ is not deleted. Similarly, points-to edge $r \rightarrow a$ in $M_2'$ is not deleted, and $q$ spuriously points to $a$. Additionally, $p$ spuriously points to $b$. Hence, $p$ and $q$ appear to be aliased in the output memory $M_2'$.
The use of control flow ordering between the points-to edges that are generated by a memory transformer can improve its precision as shown by the following example.
In Figure 1, memory transformer $\Delta''$ differs from $\Delta'$ in two ways. First, it uses a separate placeholder for every access of a pointee to avoid an overapproximation of memory (e.g., placeholders $\phi_1$ and $\phi_2$ represent $*y$ in statement 1, and $\phi_5$ and $\phi_6$ represent $*y$ in statement 3). This, along with control flow, allows strong updates, thereby killing the points-to edge $r \rightarrow a$, and hence $q$ does not point to $a$ (as shown in $M_2''$). Second, the points-to edges generated by the memory transformer are ordered based on the control flow of the procedure, thereby adding a form of flow sensitivity that $\Delta'$ lacks. To see the role of control flow, observe that if the points-to edge corresponding to statement 2 is considered first, then $p$ and $q$ will always be aliased because the possible side effect of statement 2 will be ignored.
The output memories $M_1''$ and $M_2''$ computed using $\Delta''$ are more precise than the corresponding output memories $M_1'$ and $M_2'$ computed using $\Delta'$.
Observe that although $\Delta''$ is more precise than $\Delta'$, it uses a larger number of placeholders and also requires control flow information. This affects the scalability of points-to analysis.
A fundamental problem with placeholders is that they use a low-level representation of memory expressed in terms of classical points-to edges. Hence, a placeholder-based approach is forced to explicate unknown pointees by naming them, resulting in either a large number of placeholders (in the STF approach) or multiple PTFs (in the MTF approach). The need for control flow ordering further increases the number of placeholders in the former approach.
We propose a GPG as a representation for a memory transformer of a procedure; special cases of GPGs also represent memory as a points-to relation. A GPG is characterized by the following key ideas that overcome the two limitations described in Section 2.3:
Section 3 illustrates them using a motivating example and gives a big-picture view.
In this section, we define a GPG that serves as our memory transformer. It is a graph with generalized points-to blocks (GPBs) as nodes, each containing a set of GPUs. We provide an overview of our ideas and algorithms in the limited setting of our motivating example of Figure 2. Toward the end of this section, Figure 6 summarizes them as a collection of abstractions, operations, dataflow analyses, and optimizations.
We model the effect of a pointer assignment on an abstract memory by defining the concept of GPU in Definition 1, which gives the abstract semantics of a GPU. The concrete semantics of a GPU $x \mathop {\longrightarrow }\limits _{s}^{i|j} y$ can be viewed as the following C-style pointer assignment with $i-1$ dereferences of $x$ (or $i$ dereferences of $\&x$) and $j$ dereferences of $\&y$.
This conceptual understanding of a GPU is central to the development of this work. However, most compiler intermediate languages are at a lower level of abstraction and instead represent this GPU using (placeholder) temporaries $l_k\,(0 \le k < i)$ and $r_k\,(0 \le k \le j)$ as a sequence of C-style assignments (illustrated in Figure 3):4
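As a concrete illustration of this reading, the following sketch (our own encoding, not the paper's code: variable names serve as their addresses, and a dict maps each location to its single concrete pointee) assigns to the location named by $i-1$ dereferences of $x$ the address obtained by $j$ dereferences of $\&y$.

```python
# Concrete memory: each location holds exactly one pointee
# (names act as addresses, so &y is simply "y").
def deref(mem, loc, times):
    for _ in range(times):
        loc = mem[loc]
    return loc

def execute_gpu(mem, x, i, j, y):
    lhs = deref(mem, x, i - 1)   # location named by i-1 dereferences of x
    mem[lhs] = deref(mem, y, j)  # address given by j dereferences of &y
    return mem
```

For instance, with `mem = {"p": "a"}`, executing the GPU with $i|j = 2|0$ (i.e., `*p = &m`) stores the address of `m` into location `a`.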
Statement labels, $s$, in GPUs are unique across procedures to distinguish between the statements of different procedures after call inlining. They facilitate distinguishing between strong and weak updates by identifying may-defined pointers (Section 3.1.1). Further, since GPUs are simplified in the calling contexts, statement labels allow back-annotation of points-to information within the procedure to which they belong. For simplicity, we omit the statement labels from GPUs when they are not required.
A GPU $\gamma: x \mathop {\longrightarrow }\limits _{s}^{i|j} y$ generalizes a points-to edge5 from $x$ to $y$ with the following properties:
We refer to a GPU with $i=1$ and $j=0$ as a classical points-to edge, as it encodes the same information as edges in classical points-to graphs.
The pointer assignment in statement 01 in Figure 2 is represented by a GPU $r \mathop {\longrightarrow }\limits _{01}^{1|0} a$ where the indirection levels “$1|0$” appear above the arrow and the statement number “01” appears below the arrow. The indirection level 1 in “$1|0$” indicates that $r$ is defined by the assignment, and the indirection level 0 in “$1|0$” indicates that the address of $a$ is read. Similarly, statement 02 is represented by a GPU $q \mathop {\longrightarrow }\limits _{02}^{2|0} m$. The indirection level 2 for $q$ indicates that some pointee of $q$ is being defined, and the indirection level 0 indicates that the address of $m$ is read.
Figure 3 presents the GPUs for basic pointer assignments in C and for the general GPU $x \mathop {\longrightarrow }\limits _{s}^{i|j} y$. (To deal with C structs and unions, GPUs are extended to encode lists of field names—for details see Figure B.1 in Appendix B).
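The mapping for the four basic pointer assignments can be sketched as a small translator (an illustration; the statement-string format and the tuple encoding of GPUs are our assumptions).

```python
# Translate a basic C pointer assignment into a GPU tuple
# (source, i, j, target, statement label):
#   x = &y  ->  x 1|0 y      x = y   ->  x 1|1 y
#   x = *y  ->  x 1|2 y      *x = y  ->  x 2|1 y
def gpu_of(stmt, label):
    lhs, rhs = (side.strip() for side in stmt.split("="))
    i = 2 if lhs.startswith("*") else 1
    j = 0 if rhs.startswith("&") else (2 if rhs.startswith("*") else 1)
    return (lhs.lstrip("*"), i, j, rhs.lstrip("&*"), label)
```

For example, `gpu_of("*x = y", 4)` yields `("x", 2, 1, "y", 4)`, the GPU $x \mathop {\longrightarrow }\limits _{4}^{2|1} y$.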
GPUs are the basic building blocks of our abstractions because they can be composed to construct new simplified GPUs (i.e., GPUs with smaller indirection levels) whenever possible, thereby converting them progressively to classical points-to edges. The composition of GPUs eliminates a RaW dependence between them and thereby the need for control flow ordering between them.
A GPU can be seen as a primitive memory transformer that is used as a building block for the GPG as a memory transformer for a procedure (Definition 2). The optimized GPG for a procedure differs from its control flow graph (CFG) in the following way:
3.1.1 Abstract Semantics of GPBs. Abstract semantics of GPBs is a generalization of the semantics of pointer assignment in two ways. The first generalization is from a pointer assignment to a GPU, and the second generalization is from a single statement to multiple statements.
The semantics of a GPU (Definition 1) forms the basis of the semantics of a GPB. However, since a GPB has no control flow ordering on its GPUs and may contain multiple (simplified forms of) GPUs for a single source-language statement, or GPUs for multiple statements, we need to specify the combined effect of these multiple GPUs. In particular, differing concrete runs may execute only a subset of the GPUs in some order. Let $\delta$ be a GPB and $\mu$ be its associated may-definition set, and let $S$ be the set of source-language labels $s$ occurring as labels of GPUs in $\delta$. Now write $\delta|_s$ for $\lbrace x \mathop {\longrightarrow }\limits _{s}^{i|j} y \in \delta \rbrace$. The abstract execution of $\delta$ is characterized by the following two features:
Consider a GPB $\delta = \lbrace \gamma_1\!:\! x \mathop {\longrightarrow }\limits _{11}^{1|0} a,\; \gamma_2\!:\! x \mathop {\longrightarrow }\limits _{11}^{1|0} b,\; \gamma_3\!:\! y \mathop {\longrightarrow }\limits _{12}^{1|0} c,\; \gamma_4\!:\! z \mathop {\longrightarrow }\limits _{13}^{1|0} d,\; \gamma_5\!:\! t \mathop {\longrightarrow }\limits _{13}^{1|0} d \rbrace$ and its associated may-definition set $\mu = \lbrace (z,1),(t,1) \rbrace$ because $\gamma_4$ and $\gamma_5$ correspond to a single statement (statement 13) but define multiple sources. Note that $\gamma_1$ and $\gamma_2$ also correspond to a single statement (statement 11), but they define a single source $(x,1)$. Then, after executing $\delta$ abstractly, we know that the points-to set of $x$ is overwritten to become $\lbrace a,b \rbrace$ (i.e., $x$ definitely points to one of $a$ and $b$). Similarly, the points-to set of $y$ is overwritten to become $\lbrace c \rbrace$ because $\gamma_3$ defines a single location $c$ in statement 12. However, $\delta$ causes the points-to sets of $z$ and $t$ to include $\lbrace d \rbrace$ (without removing the existing pointees) because their sources are members of $\mu$. Thus, $x$ and $y$ are strongly updated (their previous pointees are removed), but $z$ and $t$ are weakly updated (their previous pointees are augmented).
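The worked example above can be replayed mechanically. The sketch below is our simplification (it handles only GPUs that are already classical points-to edges, i.e., $1|0$, as in the example): it executes a GPB against an abstract memory, applying strong updates except for sources listed in the may-definition set $\mu$.

```python
# gpb: set of GPU tuples (source, i, j, target, statement label),
# all with i|j = 1|0 here; mem: var -> set of pointees;
# mu: set of may-defined (source, indirection level) pairs.
def execute_gpb(mem, gpb, mu):
    supplied = {}
    for (src, i, j, tgt, s) in gpb:
        supplied.setdefault((src, i), set()).add(tgt)
    for (src, lvl), tgts in supplied.items():
        if (src, lvl) in mu:
            mem[src] = mem.get(src, set()) | tgts   # weak update
        else:
            mem[src] = tgts                         # strong update
    return mem
```

On the GPB of the example, $x$ and $y$ are overwritten while $z$ and $t$ only accumulate $d$.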
3.1.2 Data Dependence Between GPUs. We use the usual notion of data dependence based on Bernstein's conditions [3]: two statements have a data dependence between them if they access the same memory location and at least one of them writes into the location [19, 28]. However, we restrict ourselves to locations that are pointers and use more intuitive names such as read-after-write, write-after-write, and write-after-read for flow, output, and anti dependence, respectively.
Formally, suppose $\gamma_1: x \mathop {\longrightarrow }\limits _{s}^{i|j} y$ is followed by $\gamma_2: w \mathop {\longrightarrow }\limits _{t}^{k|l} z$ on some control flow path. Then, $\gamma_2$ depends on $\gamma_1$ in the following cases (note that $i,k > 0$ and $j,l \ge 0$ in all cases):
Note that putting $i=j=k=l=1$ reduces to the classical definitions of these dependences.
We call these dependences definite dependences. They correspond to case A.i in Section 1.2. In addition, if $\gamma_2$ postdominates $\gamma_1$ (i.e., follows $\gamma_1$ on every control flow path), we call the dependence strict. As illustrated in Example 1, $\gamma_1$ and $\gamma_2$ can have a dependence even when they do not have a common variable. Such a dependence is called a potential dependence (case B in Section 1.2).
Two GPUs on a control flow path cannot be placed within a single GPB if there is a definite or potential RaW or WaW dependence between them. However, it is safe to include them in the case of WaR dependence because of the “all reads precede all writes” semantics of GPBs.
3.1.3 Finiteness of the Sets of GPUs. For two variables $x$ and $y$, the number of GPUs $x \mathop {\longrightarrow }\limits _{s}^{i|j} y$ depends on the number of possible statement labels $s$ and indirection levels $i|j$; since both are bounded for a given program, the set of GPUs is finite.
In this section, we intuitively describe GPU composition and GPU reduction.
3.2.1 GPU Composition. In a compiler, the sequence $p=\&a; *p=x$ is usually simplified to $p=\&a; a=x$ to facilitate further optimizations. Similarly, the sequence $p=\&a; q=p$ is usually simplified to $p=\&a; q=\&a$. GPU composition facilitates similar simplifications: Suppose a GPU $\gamma_1$ precedes $\gamma_2$ on some control flow path. If $\gamma_2$ has a RaW dependence on $\gamma_1$, then $\gamma_2$ is a consumer of the pointer information represented by the producer $\gamma_1$. In such a situation, a GPU composition $\gamma_3 = \gamma_2 \circ \gamma_1$ computes a new GPU $\gamma_3$ whose indirection levels do not exceed those of $\gamma_2$ and whose RaW dependence on $\gamma_1$ is eliminated.
For statement sequence $p=\&a; *p=x$, the consumer GPU $\gamma_2\!:\! p \mathop {\longrightarrow }\limits _{2}^{2|1} x$ (statement 2) is simplified to $\gamma_3\!:\! a \mathop {\longrightarrow }\limits _{2}^{1|1} x$ by replacing the source $p$ of $\gamma_2$ using the producer GPU $\gamma_1\!:\! p \mathop {\longrightarrow }\limits _{1}^{1|0} a$ (statement 1). GPU $\gamma_3$ can be further simplified to one or more points-to edges (i.e., GPUs with $i=1$ and $j=0$) when the pointees of $x$ become known.
The preceding example illustrates that multiple GPU compositions may be required to reduce the indirection levels of a GPU to those of a classical points-to edge.
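The two simplifications above can be sketched as one composition function. This is an illustrative reconstruction with simplified guard conditions (the precise conditions are developed later in the paper); a producer can simplify either the consumer's source (as in $*p=x$) or its target (as in $q=p$).

```python
# GPUs are tuples (source, i, j, target).  A producer x k|l y can
# simplify a consumer that reads x: either in its source (read at
# levels 1..i-1, so k < i) or in its target (read at levels 1..j,
# so k <= j).  Returns the simplified GPU, or None if inapplicable.
def compose(consumer, producer):
    w, i, j, z = consumer
    x, k, l, y = producer
    if w == x and k < i:            # simplify the consumer's source
        return (y, i - k + l, j, z)
    if z == x and k <= j:           # simplify the consumer's target
        return (w, i, j - k + l, y)
    return None
```

For `p=&a; *p=x`, composing the consumer `(p,2,1,x)` with the producer `(p,1,0,a)` yields `(a,1,1,x)`; for `p=&a; q=p`, composing `(q,1,1,p)` with `(p,1,0,a)` yields `(q,1,0,a)`, a classical points-to edge.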
3.2.2 GPU Reduction. We generalize the operation of composition as follows. If, instead of a single producer GPU as defined in Section 3.2.1, we have a set $\mathcal{R}$ of GPUs (representing generalized points-to knowledge from all control flow paths to node $n$ and obtained from the reaching GPUs analyses of Sections 4.5 and 4.6) and a single GPU $\gamma \in \delta_n$ corresponding to statement $s$, then GPU reduction $\gamma \circ \mathcal{R}$ constructs a set of one or more GPUs, all of which correspond to statement $s$. The union of all such sets, as $\gamma$ varies over $\delta_n$, is the information generated for node $n$; it is semantically equivalent to $\delta_n$ in the context of $\mathcal{R}$ and, as suggested earlier, may beneficially replace $\delta_n$.
GPU reduction $\gamma \circ \mathcal{R}$ eliminates the RaW data dependence of $\gamma$ on the GPUs in $\mathcal{R}$, wherever possible, thereby eliminating the need for control flow between $\gamma$ and the GPUs in $\mathcal{R}$.
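GPU reduction can then be sketched as repeated composition. This is again an illustrative simplification: it omits the boundary definitions the paper uses to detect pointee information that is available only along some paths, in which case the original GPU must additionally be retained.

```python
def compose(consumer, producer):
    # Illustrative GPU composition: simplify the consumer's source
    # (k < i) or target (k <= j) using a producer x k|l y.
    w, i, j, z = consumer
    x, k, l, y = producer
    if w == x and k < i:
        return (y, i - k + l, j, z)
    if z == x and k <= j:
        return (w, i, j - k + l, y)
    return None

# gamma o R: compose gamma with every applicable producer in R,
# iterating until no producer applies; irreducible GPUs are kept.
def reduce_gpu(gamma, R):
    work, seen, done = {gamma}, {gamma}, set()
    while work:
        g = work.pop()
        simplified = {compose(g, p) for p in R} - {None}
        if simplified:
            work |= simplified - seen
            seen |= simplified
        else:
            done.add(g)
    return done
```

For example, reducing $q \mathop {\longrightarrow }\limits _{}^{2|0} m$ against $\mathcal{R} = \lbrace r \mathop {\longrightarrow }\limits _{}^{1|0} a, q \mathop {\longrightarrow }\limits _{}^{1|0} b \rbrace$ yields the single classical edge $b \mathop {\longrightarrow }\limits _{}^{1|0} m$.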
The GPG of procedure $R$ (denoted $\Delta_R$) is constructed by traversing a spanning tree of the call graph starting with its leaf nodes. It involves the following steps:
We illustrate these steps intuitively using the motivating example in Figure 2.
3.3.1 Creating a GPG and Call Inlining. To construct a GPG from a CFG, we first map the CFG naively into a GPG by the following transformations: each pointer assignment is replaced by its corresponding GPU, each CFG node becomes a GPB containing the GPUs of its statements (statements irrelevant to points-to analysis are ignored), and each call is replaced by the (optimized) GPG of the callee.
Examples 9 through 12 explain the analyses and optimizations over $\Delta_Q$ and $\Delta_R$ (the GPGs for procedures $Q$ and $R$) at an intuitive level.
3.3.2 Strength Reduction Optimization. This step simplifies the GPB $\delta_n$ of each node $n$ by replacing its GPUs with the simplified GPUs computed by GPU reduction with respect to the GPUs reaching $n$.
Effectively, strength reduction simplifies each GPB as much as possible without needing the knowledge of aliasing in the caller. In the process, data dependences are eliminated to the extent possible, facilitating dead GPU elimination and control flow minimization. Note that strength reduction does not create new GPBs; it only creates new (equivalent) GPUs within the same GPB. The statement labels in GPUs remain unchanged because the simplified GPUs of a statement continue to represent the same statement.
To reduce the GPUs of a GPB, we need to know the GPUs that reach it along all control flow paths; these are computed by a dataflow analysis called reaching GPUs analysis (Sections 4.5 and 4.6).
The following two issues in reaching GPUs analysis are not illustrated in this section:
We intuitively explain the reaching GPUs analysis for procedure $Q$ over its initial GPG (Figure 4). The final result is shown later in Figure 8. GPU $r \mathop {\longrightarrow }\limits _{01}^{1|0} a$ representing statement 01 reaches $\delta_{02}$ in the first iteration. However, it does not simplify GPU $q \mathop {\longrightarrow }\limits _{02}^{2|0} m$ in $\delta_{02}$. The GPUs $\lbrace r \mathop {\longrightarrow }\limits _{01}^{1|0} a, q \mathop {\longrightarrow }\limits _{02}^{2|0} m \rbrace$ reach the GPB $\delta_{03}$. GPU $q \mathop {\longrightarrow }\limits _{03}^{1|0} b$ cannot be simplified any further. In the second iteration, GPUs $\lbrace r \mathop {\longrightarrow }\limits _{01}^{1|0} a, q \mathop {\longrightarrow }\limits _{02}^{2|0} m, q \mathop {\longrightarrow }\limits _{03}^{1|0} b \rbrace$ reach $\delta_{02}$ and $\delta_{03}$. Composing $q \mathop {\longrightarrow }\limits _{02}^{2|0} m$ with $q \mathop {\longrightarrow }\limits _{03}^{1|0} b$ results in $b \mathop {\longrightarrow }\limits _{02}^{1|0} m$. In addition, the pointee information of $q$ is available only along one path (identified with the help of boundary definitions not shown here). Hence, the assignment causes a weak update, and GPU $q \mathop {\longrightarrow }\limits _{02}^{2|0} m$ is also retained. Thus, GPB $\delta_{02}$ contains two GPUs, $b \mathop {\longrightarrow }\limits _{02}^{1|0} m$ and $q \mathop {\longrightarrow }\limits _{02}^{2|0} m$, after simplification, and sources $(b,1)$ and $(q,2)$ are both included in $\mu_{02}$. This process continues until the least fixed point is reached. Strength reduction optimization based on these results gives the GPG shown in the third column of Figure 4.
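The iteration described above is an ordinary forward dataflow fixed point with union as the meet. The sketch below is our reconstruction under strong simplifications: kill sets are ignored, a reduced GPU's original form is always retained (corresponding to the conservative weak-update case of the example), and a back edge from node 03 to node 02 is assumed, as the two-iteration convergence in the example suggests.

```python
def compose(consumer, producer):
    # Illustrative GPU composition (source rule k < i, target rule k <= j).
    w, i, j, z = consumer
    x, k, l, y = producer
    if w == x and k < i:
        return (y, i - k + l, j, z)
    if z == x and k <= j:
        return (w, i, j - k + l, y)
    return None

def gen(gpb, reaching):
    # Reduced GPUs of a node; the original GPU is conservatively kept.
    out = set()
    for g in gpb:
        out.add(g)
        out |= {compose(g, p) for p in reaching} - {None}
    return out

def reaching_gpus(nodes, preds, gpbs):
    # Round-robin forward analysis with union meet, to a fixed point.
    OUT = {n: set() for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            IN = set().union(*(OUT[p] for p in preds[n])) if preds[n] else set()
            out = IN | gen(gpbs[n], IN)
            if out != OUT[n]:
                OUT[n], changed = out, True
    return OUT
```

On the example (GPUs written as label-free tuples), node 02 ends up generating both $b \mathop {\longrightarrow }\limits _{}^{1|0} m$ and the retained $q \mathop {\longrightarrow }\limits _{}^{2|0} m$.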
3.3.3 Dead GPU Elimination. The following example illustrates dead GPU elimination in our motivating example. This optimization removes the WaW dependences where possible.
In procedure $Q$ of Figure 4, pointer $q$ is defined in $\delta_{03}$ but is redefined in $\delta_{05}$, and hence GPU $q \mathop {\longrightarrow }\limits _{03}^{1|0} b$ is eliminated. Therefore, GPB $\delta_{03}$ becomes empty and is removed from $\Delta_Q$. Since GPU $q \mathop {\longrightarrow }\limits _{02}^{2|0} m$ defines not $q$ but its pointee, it is not killed by $\delta_{05}$ and is not eliminated from $\Delta_Q$.
For procedure $R$ in Figure 5, GPU ${q}\mathop {\longrightarrow }\limits _{07}^{1|0}{d}$ in $\text{$\delta $} _{07}$ is killed by GPU ${q}\mathop {\longrightarrow }\limits _{05}^{1|0}{e}$ in $\text{$\delta $} _{14}$. Hence, GPU ${q}\mathop {\longrightarrow }\limits _{07}^{1|0}{d}$ is eliminated from GPB $\text{$\delta $} _{07}$. Similarly, GPU ${e}\mathop {\longrightarrow }\limits _{04}^{1|1}{c}$ in GPB $\text{$\delta $} _{14}$ is removed because $e$ is redefined by GPU ${e}\mathop {\longrightarrow }\limits _{10}^{1|0}{o}$ in GPB $\text{$\delta $} _{10}$ (after strength reduction in $\text{$\Delta $} _R$). However, GPU ${d}\mathop {\longrightarrow }\limits _{08}^{1|0}{n}$ in GPB $\text{$\delta $} _{08}$ is not removed even though $\text{$\delta $} _{13}$ contains a definition of $d$ expressed by the GPU ${d}\mathop {\longrightarrow }\limits _{02}^{1|0}{m}$. This is because $\text{$\delta $} _{13}$ also contains GPU ${b}\mathop {\longrightarrow }\limits _{02}^{1|0}{m,}$ which defines $b$. Since statement 02 defines two sources, both of them are may-defined in $\text{$\delta $} _{13}$ (i.e., are included in $\text{$\mu $} _{13}$). Thus, the previous definition of $d$ cannot be killed, giving a weak update.
3.3.4 Control Flow Minimization. This step improves the compactness of a GPG by eliminating empty GPBs and then minimizing control flow by coalescing adjacent GPBs into a single GPB wherever there are no RaW or WaW dependences between them.
After eliminating GPU ${q}\mathop {\longrightarrow }\limits _{07}^{1|0}{d}$ from the GPG of procedure $R$ in Figure 5 (because it is dead), GPB $\text{$\delta $} _{07}$ becomes empty and is removed from the optimized GPG.
We eliminate control flow in the GPG by performing coalescing analysis (Section 6). It partitions the nodes of a GPG (into parts) such that all GPBs in a part are coalesced (i.e., the GPB of the coalesced node contains the union of the GPUs of all GPBs in the part) and control flow is retained only across the new GPBs representing the parts. Given a GPB $\text{$\delta $} _{\text{$n$}}$ in a part, a control flow successor $\text{$\delta $} _{\text{$m$}}$ can appear in the same part only if the control flow between them is redundant. This requires that the GPUs in $\text{$\delta $} _{\text{$m$}}$ do not have RaW or WaW dependence on the other GPUs in the part.
A GPB obtained after coalescing may contain GPUs belonging to multiple statements, and not all of them may be executed in a concrete run of the GPB. This requires determining the associated may-definition set for the coalesced node that identifies the sources that are may-defined to maintain the abstract semantics of a GPB (Section 3.1.1).
For procedure $Q$ in Figure 4, the GPBs $\text{$\delta $} _1$ and $\text{$\delta $} _2$ can be coalesced: there is no data dependence between their GPUs because GPU ${r}\mathop {\longrightarrow }\limits _{01}^{1|0}{a}$ in $\text{$\delta $} _1$ defines $r$ whose type is “$\tt int\, *\!*$,” whereas the GPUs in $\text{$\delta $} _2$ read the address of $m$, pointer $b$, and the pointee of $q$. The type of the latter two is “$\tt int\, *$.” Thus, a potential dependence between the GPUs in $\text{$\delta $} _1$ and $\text{$\delta $} _2$ is ruled out using types. However, GPUs ${q}\mathop {\longrightarrow }\limits _{02}^{2|0}{m}$ in $\text{$\delta $} _2$ and ${e}\mathop {\longrightarrow }\limits _{04}^{1|2}{p}$ in $\text{$\delta $} _4$ have a potential RaW dependence ($p$ and $q$ could be aliased in the caller) that is not ruled out by type information. Thus, we do not coalesce GPBs $\text{$\delta $} _2$ and $\text{$\delta $} _4$. Since there is no RaW dependence between the GPUs in the GPBs $\text{$\delta $} _4$ and $\text{$\delta $} _5,$ we coalesce them (a potential WaR dependence does not matter because all reads precede any write).
The GPB resulting from coalescing GPBs $\text{$\delta $} _1$ and $\text{$\delta $} _2$ is labeled $\text{$\delta $} _{11}$. Similarly, $\text{$\delta $} _{12}$ is the result of coalescing GPBs $\text{$\delta $} _4$ and $\text{$\delta $} _5$. The loop formed by the back edge $\text{$\delta $} _2 \rightarrow \text{$\delta $} _1$ in the GPG before coalescing now becomes a self-loop over $\text{$\delta $} _{11}$. Since, by definition, the GPUs in a GPB can never have a dependence between each other, the self-loop $\text{$\delta $} _{11} \rightarrow \text{$\delta $} _{11}$ is redundant and is hence removed.
For procedure $R$ in Figure 5, after performing dead GPU elimination, the remaining GPBs in its GPG are all coalesced into a single GPB $\text{$\delta $} _{15}$ because there is no data dependence between their GPUs.
As shown in Example 10, the GPUs ${b}\mathop {\longrightarrow }\limits _{02}^{1|0}{m}$ and ${q}\mathop {\longrightarrow }\limits _{02}^{2|0}{m}$ in procedure $Q$ cause the inclusion of the sources $(b,1)$ and $(q,2)$ in $\text{$\mu $} _{02}$, leading further to their inclusion in $\text{$\mu $} _{11}$ for the coalesced GPB $\text{$\delta $} _{11}$. Similarly, for procedure $R$, $(b,1)$ is may-defined in GPB $\text{$\delta $} _{15}$ but not $(d,1)$ because the latter is defined along all paths through procedure $R$ but not the former, as shown in Figure 5.
Figure 6 provides the big picture of GPG construction by listing specific abstractions, operations, dataflow analyses, and optimizations, and shows the dependences between them, along with the sections that define them. The optimizations use the results of dataflow analyses. The reaching GPUs analysis uses the GPU operations, which are defined in terms of key abstractions such as allocation sites, indirection lists, and the GPUs themselves.
This section begins with a motivation in Section 4.1. Section 4.2 defines GPU composition as a family of partial operations. Section 4.3 defines GPU reduction. Section 4.5 presents the reaching GPUs analysis without blocking, and Section 4.6 extends it to include blocking.
Strength reduction optimization uses the knowledge of a producer GPU ${\sf p}$ to simplify a consumer GPU ${\sf c}$ (on a control flow path from ${\sf p}$ to ${\sf c}$) through an operation called GPU composition, denoted ${\sf c} \circ {\sf p}$ (Section 4.2). A consumer GPU may require multiple GPU compositions to reduce it to an equivalent GPU with a smaller indirection level. This is achieved by GPU reduction (Section 4.3), which involves a series of GPU compositions with appropriate producer GPUs in $\mathcal {R}$ to simplify the consumer GPU ${\sf c}$ maximally. The set $\mathcal {R}$ of GPUs used for simplification provides a context for ${\sf c}$ and represents generalized points-to knowledge from previous GPBs. It is obtained by performing a dataflow analysis called the reaching GPUs analysis (Sections 4.5 and 4.6), which computes the sets ${\sf RGIn}_{\text{$n$}}$, ${\sf RGOut}_{\text{$n$}}$, ${\sf RGGen}_{\text{$n$}}$, and ${\sf RGKill}_{\text{$n$}}$ for every GPB $\text{$\delta $} _{\text{$n$}}$. These dataflow variables represent the GPUs reaching the entry of GPB $\text{$\delta $} _{\text{$n$}}$, its exit, the GPUs obtained through GPU reduction, and the GPUs whose propagation is killed by $\text{$\delta $} _{\text{$n$}}$, respectively. The set ${\sf RGGen}_{\text{$n$}}$ is semantically equivalent to $\text{$\delta $} _{\text{$n$}}$ in the context of ${\sf RGIn}_{\text{$n$}}$.
In some cases, the location read by ${\sf c}$ could be different from the location defined by ${\sf p}$ due to the presence of a GPU ${\sf b}$ (called a barrier) corresponding to an intervening assignment. This could happen because of a potential dependence between ${\sf b}$ and ${\sf p}$ (Section 2.2). In such a situation (characterized formally in Section 4.6.1), replacing $\text{$\delta $} _{\text{n}}$ by ${\sf RGGen}_{\text{$n$}}$ during strength reduction may be unsound. Hence we postpone the composition ${\sf c} \circ {\sf p}$. We model this postponement explicitly by eliminating those GPUs from $\mathcal {R}$ that are blocked by a barrier. After inlining, the knowledge of the calling context may allow a barrier GPU to be reduced so that it no longer blocks a postponed reduction.
We first present the intuition behind GPU composition before defining it formally.
4.2.1 The Intuition Behind GPU Composition. The composition of a consumer GPU ${\sf c}$ and a producer GPU ${\sf p}$ is possible when ${\sf c}$ has a RaW dependence on ${\sf p}$ through a common variable called the pivot of the composition. The pivot is the source of ${\sf p}$ but may be the source or the target of ${\sf c}$.
The type ${\tau }$ of a composition ${\sf c} \circ ^{\tau } {\sf p}$ indicates the name of the composition, which is one of ${\textrm {ts}}$ and ${\textrm {ss}}$: the first letter indicates the role of the pivot in ${\sf c}$, and the second letter indicates its role in ${\sf p}$. For a ${\textrm {ts}}$ composition, the pivot is the target of ${\sf c}$ and the source of ${\sf p}$; for an ${\textrm {ss}}$ composition, it is the source of both ${\sf c}$ and ${\sf p}$. Note that $\tau \ne {\textrm {st}}$ and $\tau \ne {\textrm {tt}}$ because the same variable cannot occur both in the RHS and LHS of an assignment in the case of pointers to scalars.7
Figure 7 illustrates these compositions. For a ${\textrm {ts}}$ composition, consider ${\sf c} \!:\! {z}\mathop {\longrightarrow }\limits _{\text{$t$}}^{i|j}{x}$ and ${\sf p} \!:\! {x}\mathop {\longrightarrow }\limits _{\text{$s$}}^{k|l}{y}$ with pivot $x,$ which is the target of ${\sf c}$ and the source of ${\sf p}$. The goal of the composition is to join the source $z$ of ${\sf c}$ and the target $y$ of ${\sf p}$ by using the pivot $x$ as a bridge. This requires the composition to view the base GPU ${\sf p}$ in its derived form ${x}\mathop {\longrightarrow }\limits ^{j|(l+j-k)}{y}$. This balances the indirection level of the pivot in ${\sf c}$ and ${\sf p}$.
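The balancing just described can be made concrete with a small sketch. The tuple encoding of GPUs and the function name below are our illustrative assumptions, not the paper's implementation.

```python
from collections import namedtuple

# Hypothetical encoding of a GPU  x --s:(i|j)--> y.
GPU = namedtuple("GPU", "src i tgt j stmt")

def ts_compose(c, p):
    """ts composition: the pivot is the target of c and the source of p.

    The base producer p: x --(k|l)--> y is viewed in its derived form
    x --(j|l+j-k)--> y so that the pivot's indirection level matches the
    target indirection level j of the consumer c: z --(i|j)--> x; the
    result then bridges the source of c to the target of p."""
    assert c.tgt == p.src, "pivot must be the target of c and the source of p"
    return GPU(c.src, c.i, p.tgt, p.j + c.j - p.i, c.stmt)

# q = *x (statement 04) composed with *x = &a (statement 02), as in Figure 9(b):
c = GPU("q", 1, "x", 2, 4)
p = GPU("x", 2, "a", 0, 2)
assert ts_compose(c, p) == GPU("q", 1, "a", 0, 4)   # q --(1|0)--> a
```

The assertion reproduces the composition discussed later for Figure 9(b): the derived producer adds $j-k = 0$ levels, so $q$ is found to point to $a$.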
4.2.2 Defining GPU Composition. Before we define the GPU composition formally, we need to establish the properties of a useful composition: a composition should not cause the indirection level of the resulting GPU to be greater than the indirection level of the consumer GPU ${\sf c}$. For the generic GPUs of Figure 7, this is ensured as follows.
Observe that when the pivot has the same indirection level in ${\sf c}$ and ${\sf p}$, ${\sf c}$ has a WaW dependence on ${\sf p}$ instead of a RaW dependence (Section 3.1.2) because ${\sf c}$ overwrites the location written by ${\sf p}$.
The indirection level of the resulting GPU does not exceed the corresponding indirection level of ${\sf c}$. This requires the producer GPU ${\sf p}$ and the consumer GPU ${\sf c}$ to satisfy the following constraints. In each constraint, the first term in the conjunct compares the indirection levels of the sources of ${\sf c}$ and ${\sf p}$, whereas the second term compares those of the targets (see Figure 7):
Consider the statement sequence $x=*y; z = x$. A ${\textrm {ts}}$ composition of the consumer GPU ${z}\mathop {\longrightarrow }\limits ^{1|1}{x}$ with the producer GPU ${x}\mathop {\longrightarrow }\limits ^{1|2}{y}$ gives the GPU ${z}\mathop {\longrightarrow }\limits ^{1|2}{y}$. Intuitively, this GPU is not useful for computing a points-to edge because its indirection level is “$1|2,$” which is greater than the indirection level of the consumer GPU, which is “$1|1.$” Formally, this composition is flagged as undesirable by the constraints. We take a conjunction of these constraints to characterize a useful composition.
Definition 3 defines GPU composition formally. It computes a simplified GPU by balancing the indirection level of the pivot in ${\sf c}$ and ${\sf p}$.
GPU reduction uses the GPUs in $\mathcal {R}$ (a set of data-dependence-free GPUs) to compute a set of GPUs whose indirection levels do not exceed that of the consumer GPU ${\sf c}$. During reduction, the consumer GPU is simplified progressively using the GPUs from $\mathcal {R}$ through a sequence of GPU compositions.
Formally, ${\sf Red}$ is the fixed point of the equation ${\sf Red} = {\sf \text{GPU} \!\!\_\text{reduction}} ({\sf Red}, \text{$\mathcal {R}$})$ with the initialization ${\sf Red} = \lbrace {\sf c} \rbrace$. Function ${\sf \text{GPU} \!\!\_\text{reduction}}$ (Definition 4) simplifies the GPUs in ${\sf Red}$ by composing them with those in $\mathcal {R}$. The resulting GPUs are accumulated in ${\sf Red}^{\prime },$ which is initially $\emptyset$. If a GPU $\text{$\gamma $} _1 \in {\sf Red}$ is simplified, its simplified GPU $r$ is included in temp, which is then added to ${\sf Red}^{\prime }$. However, if $\text{$\gamma $} _1$ cannot compose with any GPU in $\mathcal {R}$, then $\text{$\gamma $} _1$ itself is added to ${\sf Red}^{\prime }$. The GPUs in ${\sf Red}^{\prime }$ are then simplified in the next iteration of the fixed-point computation. The fixed point is achieved when no GPU in ${\sf Red}^{\prime }$ can be simplified any further.
Consider the consumer GPU ${x}\mathop {\longrightarrow }\limits _{23}^{1|2}{y}$ (for statement $x=*y$) with $\text{$\mathcal {R}$} = \lbrace {y}\mathop {\longrightarrow }\limits _{21}^{1|0}{a}, {a}\mathop {\longrightarrow }\limits _{22}^{1|0}{b}\rbrace$. The reduction involves two consecutive ${\textrm {ts}}$ compositions. The first composition, with ${y}\mathop {\longrightarrow }\limits _{21}^{1|0}{a}$, computes ${\sf Red}^{\prime } = \lbrace {x}\mathop {\longrightarrow }\limits _{23}^{1|1}{a}\rbrace$. Then, the reduced GPU ${x}\mathop {\longrightarrow }\limits _{23}^{1|1}{a}$ becomes the consumer GPU and is composed with ${a}\mathop {\longrightarrow }\limits _{22}^{1|0}{b}$ from $\mathcal {R}$, which results in ${\sf Red}^{\prime } = \lbrace {x}\mathop {\longrightarrow }\limits _{23}^{1|0}{b} \rbrace$. It cannot be reduced further, as it is already in the classical points-to form, and the computation has reached the fixed point.
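A minimal executable sketch of this fixed-point computation is given below. The tuple encoding of GPUs, the function names, and the simplified usefulness checks are our assumptions; the full constraints of Definition 3 are richer.

```python
from collections import namedtuple

# Hypothetical tuple encoding of a GPU  x --s:(i|j)--> y.
GPU = namedtuple("GPU", "src i tgt j stmt")

def compose(c, p):
    """Try a useful ts or ss composition of consumer c with producer p;
    return the simplified GPU, or None (simplified sketch of Definition 3)."""
    if c.tgt == p.src and p.i <= c.j:                  # ts: pivot is target of c
        j = p.j + c.j - p.i                            # balance the pivot's indlev
        if j <= c.j:                                   # result must not raise indlev
            return GPU(c.src, c.i, p.tgt, j, c.stmt)
    if c.src == p.src and p.i < c.i and p.j <= p.i:    # ss: pivot is source of both
        return GPU(p.tgt, p.j + c.i - p.i, c.tgt, c.j, c.stmt)
    return None

def reduce_gpu(c, R):
    """Red = GPU_reduction(Red, R) iterated from Red = {c} to a fixed point."""
    red = {c}
    while True:
        red2 = set()
        for g in red:
            simplified = {r for r in (compose(g, p) for p in R) if r is not None}
            red2 |= simplified if simplified else {g}
        if red2 == red:
            return red
        red = red2

# The example above: x = *y (statement 23) with
# R = { y --21:(1|0)--> a,  a --22:(1|0)--> b }.
c = GPU("x", 1, "y", 2, 23)
R = {GPU("y", 1, "a", 0, 21), GPU("a", 1, "b", 0, 22)}
assert reduce_gpu(c, R) == {GPU("x", 1, "b", 0, 23)}
```

The same `compose` also handles the ${\textrm {ss}}$ case used later for procedure $R$: composing ${q}\mathop {\longrightarrow }\limits _{10}^{2|0}{o}$ with ${q}\mathop {\longrightarrow }\limits _{05}^{1|0}{e}$ yields ${e}\mathop {\longrightarrow }\limits _{10}^{1|0}{o}$.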
GPU reduction requires the set $\mathcal {R}$ to satisfy the following properties:
These properties also hold for the sets of reaching GPUs computed by the dataflow analyses of Sections 4.5 and 4.6.
The convergence of reduction on a unique solution is guaranteed by the following:
Recall that for an indirect assignment ($*p=\&x$ say) as a consumer, GPU reduction typically returns a set of GPUs that define multiple abstract locations, leading to a weak update. Sometimes, however, we may discover that $p$ has a single pointee within the procedure and the assignment defines only one abstract location. In this case, we may, in general, perform a strong update. However, this condition, although necessary, is not sufficient for a strong update because the source of $p$ may not be defined along all paths—there may be a path along which the source of $p$ is not defined within the procedure (i.e., is live on entry to the procedure) and is defined in a caller. In the presence of such a definition-free path in a procedure, even if we find a single pointee of $p$ in the procedure, we cannot guarantee that a single abstract location is being defined. This makes it difficult to distinguish between strong and weak updates.
A control flow path $\text{$n$} _1,\text{$n$} _2,\ldots \text{$n$} _{k}$ in $\Delta$ is a definition-free path for source $(x,i)$ if
We identify the definition-free paths by introducing boundary definitions (explained in the following).
This ensures the property of completeness of reaching GPUs (Section 4.3) that guarantees that some definition of every source $(x,i)$ reaches every node, thereby enabling strength reduction and distinguishing between strong and weak updates.
The boundary definitions are of the form ${x}\mathop {\longrightarrow }\limits _{0}^{i|i}{x^{\prime },}$ where $x^{\prime }$ is a symbolic representation of the initial value of $x$ at the start of the procedure and $i$ ranges from 1 to the maximum depth of the indirection level that depends on the type of $x$ (e.g., for type (int $**$), $i$ ranges from 1 to 2). Variable $x^{\prime }$ is called the upwards-exposed [22] version of $x$. This is similar to Hoare-logic style specifications in which postconditions use (immutable) auxiliary variables $x^{\prime }$ to denote the original value of variable $x$ (which may have since changed). Our upwards-exposed versions serve a similar purpose; logically on entry to each procedure, the statement $x = x^{\prime }$ provides a definition of $x$. The rationale behind the label 0 in the boundary definitions is explained after the following example.
Consider a GPB $\text{$\delta $} _{\text{n}} = \lbrace {p}\mathop {\longrightarrow }\limits _{\text{$s$}}^{2|0}{a} \rbrace$ for statement $*p=\&a$. After the introduction of a boundary definition ${p}\mathop {\longrightarrow }\limits _{0}^{1|1}{p^{\prime }}$, if there is a definition-free path from the start of the procedure to $\text{$\delta $} _{\text{n}}$, the boundary definition reaches $\text{$\delta $} _{\text{n}}$ along it, and GPU reduction produces a GPU whose source involves the upwards-exposed version $p^{\prime }$. This records that the location defined by the assignment may be a pointee of $p$ supplied by a caller, and hence only a weak update is possible.
The boundary definitions are symbolic in that they are never contained in any GPB but are only contained in the set of producer GPUs that reach the GPBs. This allows us to use a synthetic label 0 in them because only the labels of consumer GPUs matter (because they identify a source-language statement); the labels of producer GPUs are irrelevant because they only provide information that is used for simplifying consumer GPUs labeled $s$ into one or more GPUs all labeled $s$. The boundary definitions participate in the GPU reduction algorithm (without requiring any change in GPU composition) like any other producer GPU. After GPU reduction, upwards-exposed versions of variables can appear in simplified GPUs.
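The generation of boundary definitions can be sketched as follows; the tuple encoding and the function name are assumptions for illustration.

```python
def boundary_definitions(var, ptr_depth):
    """Boundary GPUs  x --0:(i|i)--> x'  for i = 1..ptr_depth, where x' is
    the upwards-exposed version of x and 0 is the synthetic label
    (tuples (src, i, tgt, j, stmt); sketch)."""
    return {(var, i, var + "'", i, 0) for i in range(1, ptr_depth + 1)}

# For a variable of type int **, i ranges from 1 to 2:
assert boundary_definitions("x", 2) == {("x", 1, "x'", 1, 0),
                                        ("x", 2, "x'", 2, 0)}
```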
In this section, we define the reaching GPUs analysis, ignoring the effect of barriers.
The reaching GPUs analysis is an intraprocedural forward dataflow analysis in the spirit of the classical reaching definitions analysis. Its dataflow equations are presented in Definition 5. They compute the sets ${\sf RGIn}_{\text{$n$}}$ and ${\sf RGOut}_{\text{$n$}}$ of GPUs reaching the entry and exit of every GPB $\text{$\delta $} _{\text{$n$}}$.
If any of these conditions is violated, then $\text{$\gamma $}^{\prime }$ is excluded from ${\sf RGKill}_{\text{$n$}}$.
Figure 8 gives the final result of reaching GPUs analysis for procedure $Q$ of our motivating example. We have shown the boundary GPU ${q}\mathop {\longrightarrow }\limits _{00}^{1|1}{q^{\prime }}$ for $q$. Other boundary GPUs are not required for strong updates in this example and have been omitted. This result has been used to construct GPG $\text{$\Delta $} _Q$ shown in Figure 4. For procedure $R$, we do not show the complete result of the analysis but make some observations. The GPU ${q}\mathop {\longrightarrow }\limits _{10}^{2|0}{o}$ is composed with the GPU ${q}\mathop {\longrightarrow }\limits _{05}^{1|0}{e}$ to create a reduced GPU ${e}\mathop {\longrightarrow }\limits _{10}^{1|0}{o}$. Since only a single pointer $e$ is being defined by the assignment and source $(e,1)$ is not may-defined (i.e., not in $\text{$\mu $} _{10}$), this is a strong update and hence kills ${e}\mathop {\longrightarrow }\limits _{04}^{1|1}{c}$. The GPU to be killed is identified by ${\sf Match} ({e}\mathop {\longrightarrow }\limits _{10}^{1|0}{o}, {\sf RGIn}_{10}),$ which matches the source and its indirection level against the GPUs in ${\sf RGIn}_{10}$.
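The propagation itself has the familiar round-robin shape of reaching definitions. A sketch follows, under the assumption that per-node gen and kill sets are supplied directly (in the paper they arise from GPU reduction and ${\sf Match}$, which are elided here):

```python
def reaching_gpus(nodes, edges, gen, kill):
    """Forward analysis: RGIn(n) is the union of RGOut over predecessors,
    and RGOut(n) = gen(n) U (RGIn(n) - kill(n)).  Iterated to a fixed point."""
    preds = {n: [p for (p, q) in edges if q == n] for n in nodes}
    rg_in = {n: set() for n in nodes}
    rg_out = {n: set() for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            new_in = set().union(*[rg_out[p] for p in preds[n]]) if preds[n] else set()
            new_out = gen[n] | (new_in - kill[n])
            if (new_in, new_out) != (rg_in[n], rg_out[n]):
                rg_in[n], rg_out[n], changed = new_in, new_out, True
    return rg_in, rg_out

# A strong update at node 2 kills the GPU generated at node 1, so only the
# newer GPU reaches node 3 (labels here are illustrative strings):
rg_in, rg_out = reaching_gpus([1, 2, 3], {(1, 2), (2, 3)},
                              {1: {"e->c"}, 2: {"e->o"}, 3: set()},
                              {1: set(), 2: {"e->c"}, 3: set()})
assert rg_in[3] == {"e->o"}
```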
This section extends the reaching GPUs analysis to incorporate the effect of blocking by defining a dataflow analysis that computes $\overline{{\sf RGIn}}_{}$ and $\overline{{\sf RGOut}}_{}$ for the purpose.
Consider the possibility of the composition of a consumer GPU ${\sf c}$ that appears to have a RaW dependence on a producer GPU ${\sf p}$ because they have a pivot, but there is a barrier GPU ${\sf b}$ (Sections 2.2 and 4.1) between the two such that ${\sf b}$ has a potential WaW dependence on ${\sf p}$. This is possible if the indirection level of the source of ${\sf b}$ or of ${\sf p}$ is greater than 1. We call such a GPU an indirect GPU. The execution of ${\sf b}$ may alter the apparent dependence between ${\sf c}$ and ${\sf p}$, and hence the composition of ${\sf c}$ with ${\sf p}$ may be unsound.
Since this potential dependence between ${\sf b}$ and ${\sf p}$ cannot be resolved without the alias information in the calling context, we block such producer GPUs so that the GPU compositions leading to potentially unsound strength reduction optimization are postponed. Wherever possible, we use the type information to rule out some GPUs as barriers. After inlining the GPG in a caller, more information may become available, which may resolve the potential data dependence of a barrier with a producer. Then, if a consumer still has a RaW dependence on a producer, the composition that was earlier postponed may now safely be performed.
Consider the procedure in Figure 9(a). The composition of the GPUs for statements 02 and 04 is unsound if $x$ points to $y$ in a caller, because statement 03 would then redefine $y$. Since this alias information is not available during GPG construction and optimization, we postpone this composition to eliminate the possibility of unsoundness. Reaching GPUs analysis with blocking blocks the GPU ${y}\mathop {\longrightarrow }\limits _{02}^{1|0}{a}$ by a barrier ${x}\mathop {\longrightarrow }\limits _{03}^{2|0}{b}$. This corresponds to the first case described earlier.
For the second case, consider statement 02 of the procedure in Figure 9(b), which may indirectly define $y$ (if $x$ points to $y$). Statement 03 directly defines $y$. Thus, $q$ in statement 04 would point to $b$ if $x$ points to $y$; otherwise, it would point to $a$. We postpone the composition of ${q}\mathop {\longrightarrow }\limits _{04}^{1|2}{x}$ with ${x}\mathop {\longrightarrow }\limits _{02}^{2|0}{a}$ by blocking the GPU ${x}\mathop {\longrightarrow }\limits _{02}^{2|0}{a}$ (here, the GPU ${y}\mathop {\longrightarrow }\limits _{03}^{1|0}{b}$ acts as a barrier).
Consider a GPU ${\sf p}$ originally blocked by a barrier ${\sf b}$. After inlining the GPG in its callers and performing reductions in the calling contexts, the following situations could arise:
The preceding Case 1(a) could arise if $x$ points to $p$ in the calling context of the procedure in Figure 9(a). As a result, GPU ${y}\mathop {\longrightarrow }\limits _{02}^{1|0}{a}$ is killed by the barrier GPU ${y}\mathop {\longrightarrow }\limits _{03}^{1|0}{b}$ (which is the simplified version of the barrier GPU ${x}\mathop {\longrightarrow }\limits _{03}^{2|0}{b}$), and hence the composition is prohibited and $q$ points to $b$ for statement 04. Case 1(b) could arise if $x$ points to any location other than $y$ in the calling context. In this case, the composition between ${q}\mathop {\longrightarrow }\limits _{04}^{1|1}{y}$ and ${y}\mathop {\longrightarrow }\limits _{02}^{1|0}{a}$ is sound, and $q$ points to $a$ for statement 04. Case 2 could arise if the pointee of $x$ is not available even in the calling context. In this case, the barrier GPU ${x}\mathop {\longrightarrow }\limits _{03}^{2|0}{b}$ continues to block ${y}\mathop {\longrightarrow }\limits _{02}^{1|0}{a}$.
Our measurements (Section 10) show that situation 1(a) rarely arises in practice because it amounts to defining the same pointer multiple times through different aliases in the same context.
To see how reaching GPUs analysis with blocking helps, consider the example in Figure 9(b). The set of GPUs reaching the statement 04 is ${\sf RGIn}_{04} = \lbrace {x}\mathop {\longrightarrow }\limits _{02}^{2|0}{a}, {y}\mathop {\longrightarrow }\limits _{03}^{1|0}{b}\rbrace$. The GPU ${x}\mathop {\longrightarrow }\limits _{02}^{2|0}{a}$ is blocked by the barrier GPU ${y}\mathop {\longrightarrow }\limits _{03}^{1|0}{b,}$ and hence $\text{$\overline{{\sf RGIn}}_{04}$} = \lbrace {y}\mathop {\longrightarrow }\limits _{03}^{1|0}{b}\rbrace$. Thus, GPU reduction for $\text{$\gamma $} _1 \!:\! {q}\mathop {\longrightarrow }\limits _{04}^{1|2}{x}$ (in the context of $\text{$\overline{{\sf RGIn}}_{04}$}$) computes ${\sf Red} = \lbrace \text{$\gamma $} _1 \rbrace$, postponing the composition until the blocked GPU ${x}\mathop {\longrightarrow }\limits _{02}^{2|0}{a}$ has been simplified.
The following GPUs should be blocked as barriers:
Additionally, we use the type information to minimize blocking. We define a predicate $\text{$\overline{{\sf DDep}}$} (B, I)$ to check the presence of data dependence between the sets of GPUs $B$ and $I$ (Definition 6). When the types of the locations written by the GPUs in $B$ and the types of the locations defined or read by the GPUs in $I$ match, we assume the possibility of data dependence, and hence $B$ blocks $I$. ${\sf TDef} (B)$ is the set of types of locations being written by a barrier, whereas $\left({\sf TDef} (I) \cup {\sf TRef} (I)\right)$ represents the set of types of locations defined or read by the GPUs in $I,$ thereby checking for a potential WaW and WaR dependence of the GPUs in $B$ on those of $I$.
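The shape of this type-based check can be sketched as follows, with the type sets supplied directly; the helper name and string encoding of types are our assumptions.

```python
def ddep(tdef_B, tdef_I, tref_I):
    """Shape of the DDep predicate (Definition 6): the barrier set B has a
    potential WaW/WaR dependence on I when the types of locations written
    by B overlap the types of locations defined or read by I (sketch)."""
    return bool(tdef_B & (tdef_I | tref_I))

# Figure 9(a): the barrier *x = &b writes a location of type int *
# (typeof(x,2)), and y = &a writes y itself, also of type int *
# (typeof(y,1)); the types match, so blocking is assumed:
assert ddep({"int *"}, {"int *"}, set())
# A write of type int ** cannot clash with GPUs that only define or read
# int * locations, so no blocking is needed:
assert not ddep({"int **"}, {"int *"}, {"int *"})
```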
The dataflow equations in Definition 7 differ from those in Definition 5 as follows:
For the procedure in Figure 9(b), $\text{$\overline{{\sf RGIn}}_{02}$} = \emptyset$ and $\text{$\overline{{\sf RGGen}}_{02}$}$ is $\lbrace {x}\mathop {\longrightarrow }\limits _{02}^{2|0}{a}\rbrace$. Although $\overline{{\sf RGGen}}_{02}$ contains an indirect GPU, since no GPUs reach 02 (because it is the first statement), $\overline{{\sf RGOut}}_{02}$ is $\lbrace {x}\mathop {\longrightarrow }\limits _{02}^{2|0}{a}\rbrace ,$ indicating that no GPUs are blocked.
For statement 03, $\text{$\overline{{\sf RGIn}}_{03}$} = \lbrace {x}\mathop {\longrightarrow }\limits _{02}^{2|0}{a}\rbrace$ and $\text{$\overline{{\sf RGGen}}_{03}$} = \lbrace {y}\mathop {\longrightarrow }\limits _{03}^{1|0}{b}\rbrace$. $\overline{{\sf RGGen}}_{03}$ is non-empty and does not contain an indirect GPU, and thus $\text{$\overline{{\sf RGOut}}_{03}$} = \lbrace {y}\mathop {\longrightarrow }\limits _{03}^{1|0}{b}\rbrace$ according to the third case in the dataflow equation for $\overline{{\sf RGOut}}_{}$ in Definition 7.
For statement 04, $\text{$\overline{{\sf RGIn}}_{04}$} = \lbrace {y}\mathop {\longrightarrow }\limits _{03}^{1|0}{b}\rbrace$ and $\overline{{\sf RGGen}}_{04}$ is $\lbrace {q}\mathop {\longrightarrow }\limits _{04}^{1|2}{x}\rbrace$. For this statement, the composition $({q}\mathop {\longrightarrow }\limits _{04}^{1|2}{x} \text{$\ \circ \ $} ^{\textrm {ts}} {x}\mathop {\longrightarrow }\limits _{02}^{2|0}{a})$ is postponed because the GPU ${x}\mathop {\longrightarrow }\limits _{02}^{2|0}{a}$ is blocked. In this case, $\overline{{\sf RGGen}}_{04}$ does not contain an indirect GPU and $\text{$\overline{{\sf RGOut}}_{04}$} = \lbrace {y}\mathop {\longrightarrow }\limits _{03}^{1|0}{b}, {q}\mathop {\longrightarrow }\limits _{04}^{1|2}{x}\rbrace$.
In Figure 9(a), the GPU ${y}\mathop {\longrightarrow }\limits _{02}^{1|0}{a}$ is blocked by the barrier GPU ${x}\mathop {\longrightarrow }\limits _{03}^{2|0}{b}$ because ${\sf typeof} (y, 1)$ matches with ${\sf typeof} (x, 2)$. Hence, the composition $({q}\mathop {\longrightarrow }\limits _{04}^{1|1}{y} \text{$\ \circ \ $} ^{\textrm {ts}} {y}\mathop {\longrightarrow }\limits _{02}^{1|0}{a})$ is postponed.
In the GPG of procedure $Q$ (of our motivating example) shown in Figure 4, the GPUs ${r}\mathop {\longrightarrow }\limits _{01}^{1|0}{a}$ and ${q}\mathop {\longrightarrow }\limits _{03}^{1|0}{b}$ are not blocked by the GPU ${q}\mathop {\longrightarrow }\limits _{02}^{2|0}{m}$ because they have different types. However, the GPU ${e}\mathop {\longrightarrow }\limits _{04}^{1|2}{p}$ blocks the indirect GPU ${q}\mathop {\longrightarrow }\limits _{02}^{2|0}{m}$ because there is a possible WaW data dependence ($e$ and $q$ could be aliased in the callers of $Q$).
Example 21 shows the role of boundary definitions in ensuring completeness of reaching GPUs.
Let the GPUs of the statements 1, 2, and 3 on the right be denoted by $\text{$\gamma $} _1$, $\text{$\gamma $} _2$, and $\text{$\gamma $} _3$, respectively. Then, $\text{$\gamma $} _1$ is blocked by $\text{$\gamma $} _2$ because of a potential RaW dependence (if $p$ points to $x$ in the caller). Thus, $\text{$\overline{{\sf RGIn}}_{3}$}\! =\! \lbrace \text{$\gamma $} _1, \text{$\gamma $} _2 \rbrace$. Then, $\text{$\overline{{\sf RGGen}}_{3}$}\! =\! \lbrace {y}\mathop {\longrightarrow }\limits _{3}^{1|0}{a}\rbrace$. Replacing $\text{$\delta $} _3$ by $\overline{{\sf RGGen}}_{3}$ is unsound because if $p$ points to $x$ in the caller, then $y$ should also point to $b$. The problem arises because $\overline{{\sf RGIn}}_{3}$ does not have the source $(x,1)$ defined along the path 1-2-3 because of blocking in node 2. This violates the completeness of $\overline{{\sf RGIn}}_{3}$. Explicitly adding the boundary definition ${x}\mathop {\longrightarrow }\limits _{0}^{1|1}{x^{\prime }}$ in 2 ensures that $(x,1)$ is defined along both the paths, leading to $\text{$\overline{{\sf RGGen}}_{3}$} = \lbrace {y}\mathop {\longrightarrow }\limits _{3}^{1|0}{a}, {y}\mathop {\longrightarrow }\limits _{3}^{1|1}{x^{\prime }}\rbrace$. When the resulting GPG is inlined in the caller, $x^{\prime }$ is replaced by $x$ and the original consumer GPU representing $y=x$ is recovered. Thus, if $p$ points to $x,$ then after node 3, $y$ points to both $a$ and $b$. However, if $p$ does not point to $x,$ then after node 3, $y$ points to $a$ as expected.
For each node $n$, dead GPU elimination removes redundant GPUs—that is, those $\text{$\gamma $} \in \text{$\delta $} _{\text{$n$}}$ that are killed along every control flow path from $n$ to the ${\sf End}$ node of the GPG.
For the first requirement, we check that a GPU considered for dead GPU elimination does not belong to ${\sf RGOut}_{{\sf End}}$ or $\overline{{\sf RGOut}}_{{\sf End}}$ (i.e., it does not reach the ${\sf End}$ node of the GPG in either analysis).
In procedure $Q$ of Figure 4, pointer $q$ is defined in statement 03 but is redefined in statement 05, and hence the GPU ${q}\mathop {\longrightarrow }\limits _{03}^{1|0}{b}$ is killed and does not reach the ${\sf End}$ node; it is therefore removed.
Similarly, the GPUs ${q}\mathop {\longrightarrow }\limits _{07}^{1|0}{d}$ (in $\text{$\delta $} _{07}$) and ${e}\mathop {\longrightarrow }\limits _{04}^{1|1}{c}$ (in $\text{$\delta $} _{14}$) in the GPG of procedure $R$ (Figure 5) are eliminated from their corresponding GPBs.
For the procedure in Figure 9(a), the GPU ${y}\mathop {\longrightarrow }\limits _{02}^{1|0}{a}$ is blocked by the barrier ${x}\mathop {\longrightarrow }\limits _{03}^{2|0}{b}$; hence, it is present in ${\sf RGOut}_{05}$ but not in $\overline{{\sf RGOut}}_{05}$ (05 is the ${\sf End}$ node of the procedure). The postponed composition may be performed when the barrier ${x}\mathop {\longrightarrow }\limits _{03}^{2|0}{b}$ is reduced after call inlining (and ceases to block ${y}\mathop {\longrightarrow }\limits _{02}^{1|0}{a})$. Thus, it is not removed by dead GPU elimination.
To see the need of $\overline{{\sf RGOut}}_{{\sf End}}$, observe that ${q}\mathop {\longrightarrow }\limits _{04}^{1|1}{y}$ is reduced to ${q}\mathop {\longrightarrow }\limits _{04}^{1|0}{a}$ in the analysis without blocking, and hence ${q}\mathop {\longrightarrow }\limits _{04}^{1|1}{y}$ does not appear in ${\sf RGOut}_{{\sf End}}$. With blocking, the composition is postponed and ${q}\mathop {\longrightarrow }\limits _{04}^{1|1}{y}$ appears in $\overline{{\sf RGOut}}_{{\sf End}}$; without consulting $\overline{{\sf RGOut}}_{{\sf End}}$, this GPU would wrongly be considered dead.
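Under our reading of this criterion (a GPU survives if it reaches the ${\sf End}$ node in either analysis), the check can be sketched as follows; the encodings and names are assumptions for illustration.

```python
def dead_gpu_elimination(gpbs, rgout_end, rgout_end_blocked):
    """Remove a GPU only if it is in neither RGOut at End (analysis without
    blocking) nor its blocking-aware counterpart, i.e., it is killed along
    every path and cannot be revived by inlining (sketch; GPUs are opaque
    hashable values, gpbs maps node ids to sets of GPUs)."""
    live = rgout_end | rgout_end_blocked
    return {n: gpus & live for n, gpus in gpbs.items()}

# q->b (statement 03) is killed on every path and is removed; y->a
# (statement 02) still reaches End in the analysis without blocking and
# therefore survives:
out = dead_gpu_elimination({3: {"q->b@03"}, 2: {"y->a@02"}},
                           rgout_end={"y->a@02"},
                           rgout_end_blocked=set())
assert out == {3: set(), 2: {"y->a@02"}}
```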
We minimize control flow by empty GPB elimination and coalescing of GPBs. They improve the compactness of a GPG and reduce the repeated re-analysis of GPBs after inlining. Empty GPBs are eliminated by connecting their predecessors to their successors.
In the GPG of procedure $Q$ of Figure 4, the GPB $\text{$\delta $} _{03}$ becomes empty after dead GPU elimination. Hence, $\text{$\delta $} _{03}$ can be removed by connecting its predecessors to successors. This transforms the back edge $\text{$\delta $} _{03} \rightarrow \text{$\delta $} _{01}$ to $\text{$\delta $} _{02} \rightarrow \text{$\delta $} _{01}$. Similarly, the GPB $\text{$\delta $} _{07}$ is deleted from the GPG of procedure $R$ in Figure 5.
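Empty GPB elimination can be sketched over a set-of-edges encoding of the GPG's control flow (an illustrative assumption):

```python
def eliminate_empty_gpbs(edges, gpbs):
    """Remove empty GPBs, connecting each predecessor to each successor.
    edges is a set of (src, dst) node ids; gpbs maps node ids to GPU sets."""
    for n in [m for m, gpus in gpbs.items() if not gpus]:
        preds = {p for (p, q) in edges if q == n}
        succs = {q for (p, q) in edges if p == n}
        edges = {(p, q) for (p, q) in edges if n not in (p, q)}
        edges |= {(p, q) for p in preds for q in succs}
        del gpbs[n]
    return edges, gpbs

# Deleting the empty GPB 3 turns the back edge 3 -> 1 into 2 -> 1, as in
# the example above (node ids stand for delta_01, delta_02, delta_03):
edges, gpbs = eliminate_empty_gpbs({(1, 2), (2, 3), (3, 1)},
                                   {1: {"g1"}, 2: {"g2"}, 3: set()})
assert edges == {(1, 2), (2, 1)} and set(gpbs) == {1, 2}
```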
In the rest of this section, we explain coalescing of GPBs.
After strength reduction and dead GPU elimination, we coalesce multiple GPBs into a single GPB whenever possible to reduce the size of GPGs (in terms of control flow information). Coalescing relies on the elimination of data dependences by strength reduction and dead GPU elimination. This turns out to be the core idea for making GPGs a scalable technique for points-to analysis.
Strength reduction exploits and removes all definite RaW dependences, whereas dead GPU elimination removes all definite WaW dependences that are strict (Section 3.1.2). Only the potential dependences, definite WaR dependences, and definite non-strict WaW dependences remain. Recall that WaR dependences are preserved by GPBs; as we shall see in this section, definite non-strict WaW dependences are also preserved by coalesced GPBs. This makes much of the control flow redundant.
For a control flow edge $\text{$\delta $} _{\text{$n$} _1}\rightarrow \text{$\delta $} _{\text{$n$} _2}$, the decision to coalesce GPBs $\text{$\delta $} _{\text{$n$} _1}$ and $\text{$\delta $} _{\text{$n$} _2}$ is influenced not only by the dependence between the GPUs of $\text{$\delta $} _{\text{$n$} _1}$ and $\text{$\delta $} _{\text{$n$} _2}$ but also by the dependence of the GPUs of $\text{$\delta $} _{\text{$n$} _1}$ and $\text{$\delta $} _{\text{$n$} _2}$ with the GPUs in some other GPB as illustrated in the following example.
Let the GPUs of the statements 1, 2, and 3 on the right be denoted by $\text{$\gamma $} _1$, $\text{$\gamma $} _2$, and $\text{$\gamma $} _3$, respectively. Then, $\text{$\gamma $} _1$ cannot be coalesced with $\text{$\gamma $} _2$ because of a potential WaW dependence (if $p$ points to $x$ in the caller). Similarly, $\text{$\gamma $} _2$ cannot be coalesced with $\text{$\gamma $} _3$ because of a potential WaW dependence (if $p$ points to $y$ in the caller). There is no data dependence between $\text{$\gamma $} _1$ and $\text{$\gamma $} _3$. However, they cannot be coalesced together because doing so will create GPBs $\text{$\delta $} =\lbrace \text{$\gamma $} _1, \text{$\gamma $} _3\rbrace$ and $\text{$\delta $}^{\prime }=\lbrace \text{$\gamma $} _2\rbrace$ with control flow edges $\text{$\delta $} \rightarrow \text{$\delta $}^{\prime }$ and $\text{$\delta $}^{\prime }\rightarrow \text{$\delta $},$ leading to spurious potential data dependences.
The next example illustrates that a non-strict WaW dependence does not constrain coalescing.
Let the GPUs of the statements 1, 2, and 3 on the right be denoted by $\text{$\gamma $} _1$, $\text{$\gamma $} _2$, and $\text{$\gamma $} _3$, respectively. The WaW dependence between $\text{$\gamma $} _1$ and $\text{$\gamma $} _2$ is definite but not strict and is not removed by dead GPU elimination because $\text{$\gamma $} _1$ is not killed along the path 1,3. Thus, both $\text{$\gamma $} _1$ and $\text{$\gamma $} _2$ reach statement 3. Hence, although there is a WaW dependence between $\text{$\gamma $} _1$ and $\text{$\gamma $} _2$, they can be coalesced because the semantics of a GPB allows both of them to be executed in parallel without any data dependence between them. This enables both of them to reach statement 3.
There is no “best” coalescing operation: given three sequenced GPUs $\text{$\gamma $} _1$, $\text{$\gamma $} _2$, and $\text{$\gamma $} _3$, $\text{$\gamma $} _1$ may coalesce with $\text{$\gamma $} _2$ and separately $\text{$\gamma $} _2$ may coalesce with $\text{$\gamma $} _3$, but the GPUs $\text{$\gamma $} _1, \text{$\gamma $} _2, \text{$\gamma $} _3$ do not all coalesce.
Let the GPUs of the statements 1, 2, and 3 on the right be denoted by $\text{$\gamma $} _1$, $\text{$\gamma $} _2$, and $\text{$\gamma $} _3$, respectively. Let the type of pointers $x$ and $z$ be “int $*$ ” and that of $y$ be “float $*$ .” Then there is no data dependence between $\text{$\gamma $} _1$ and $\text{$\gamma $} _2$ because $x$ and $y$ are guaranteed to point to different locations based on types. Similarly, there is no data dependence between $\text{$\gamma $} _2$ and $\text{$\gamma $} _3$. However, there is a potential data dependence between $\text{$\gamma $} _1$ and $\text{$\gamma $} _3$. Thus, $\text{$\gamma $} _1$ and $\text{$\gamma $} _2$ can be coalesced and so can $\text{$\gamma $} _2$ and $\text{$\gamma $} _3$; however, all three of them cannot be coalesced.
Therefore, we formulate the coalescing operation on a GPG as a partition $\text{$\Pi $}$ on its nodes (Section 6.2), set out the correctness conditions the partition must satisfy (Section 6.3), and describe how we select one of the maximally coalescing partitions satisfying the conditions (Section 6.4).
Recall that a partition $\text{$\Pi $}$ of a set $S$ is a collection of the non-empty subsets of $S$ such that every element of $S$ is a member of exactly one element of $\text{$\Pi $}$. We call the elements of $\text{$\Pi $}$ parts and write $\text{$\Pi $} (x)$ for the part containing $x$. A partition induces an equivalence relation on $S$; thus, for example, $x \in \text{$\Pi $} (y)$ holds if and only if $y \in \text{$\Pi $} (x)$.
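As an illustration, a minimal partition abstraction supporting the $\text{$\Pi $} (x)$ lookup used in the text (hypothetical names; not tied to the implementation):

```python
# A minimal partition: parts are frozensets, and part_of(x) returns the
# unique part containing x (written Π(x) in the text).

class Partition:
    def __init__(self, parts):
        self._part_of = {}
        for part in parts:
            for x in part:
                assert x not in self._part_of, "parts must be disjoint"
                self._part_of[x] = frozenset(part)
        self.parts = {frozenset(p) for p in parts}

    def part_of(self, x):           # Π(x)
        return self._part_of[x]

pi = Partition([{1, 2}, {3}, {4, 5, 6}])
assert pi.part_of(2) == frozenset({1, 2})
# the induced equivalence: x ∈ Π(y) iff y ∈ Π(x)
assert (1 in pi.part_of(2)) == (2 in pi.part_of(1))
```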
Following a practice common for CFGs, we have previously conflated the idea of a node $n$ of a GPG with that of its GPB $\text{$\delta $} _{\text{n}}$ which is a set of GPUs. It is helpful to keep these separated when defining a partition, noting that, under coalescing, GPBs remain sets of GPUs while the definition of a node is changed.
Given a GPG $\text{$\Delta $}$ and a partition $\text{$\Pi $}$ on its nodes, we obtain a coalesced GPG, written $\text{$\Delta $}/\text{$\Pi $}$, in the following steps:
This is the natural definition of the quotient of a labeled graph, save that self-edges are removed as they serve no purpose. Due to strength reduction, a self-loop cannot represent a control flow edge with an unresolved data dependence between the GPUs across it. A self-loop can arise in two ways: either it exists in the original program or it results from empty GPB elimination and coalescing. In the former case, strength reduction, based on the fixed point of reaching GPUs analysis, ensures that the data dependence along the self-loop is eliminated (there is no blocking as the GPUs reached along the self-loop belong to an immediate successor). In the latter case, the reduction of a loop to a self-loop indicates that there are no indirect GPUs in the loop and hence no blocking. Thus, the data dependences in the loop are eliminated through strength reduction.
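The quotient construction can be sketched as follows. This is a simplification: GPBs are modeled as frozensets of opaque GPU labels, and `part_of` is assumed to come from a valid partition.

```python
# Sketch of Δ/Π: the nodes of a part collapse into a single node whose
# GPB is the union of the member GPBs; control flow edges are redirected
# between parts, and self-edges are dropped.

def quotient(nodes, edges, gpb, part_of):
    """nodes: iterable of node ids; edges: set of (src, dst) pairs;
    gpb: node -> frozenset of GPUs; part_of: node -> its part."""
    q_nodes = {part_of(n) for n in nodes}
    q_gpb = {p: frozenset().union(*(gpb[n] for n in p)) for p in q_nodes}
    q_edges = {(part_of(a), part_of(b))
               for (a, b) in edges
               if part_of(a) != part_of(b)}      # remove self-edges
    return q_nodes, q_edges, q_gpb

part = {1: frozenset({1, 2}), 2: frozenset({1, 2}), 3: frozenset({3})}
gpbs = {1: frozenset({"g1"}), 2: frozenset({"g2"}), 3: frozenset({"g3"})}
qn, qe, qg = quotient({1, 2, 3}, {(1, 2), (2, 3)}, gpbs, part.get)
assert qe == {(frozenset({1, 2}), frozenset({3}))}   # edge (1,2) became a self-edge and was dropped
```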
Observe that for every path in $\Delta$, there is a corresponding path in $\text{$\Delta $}/\text{$\Pi $}$. In the degenerate case, this path could well be a single node ${\hat{n}}$ if all nodes along a path are coalesced into the same part.
After finding a suitable partition, we revert to our previous abuse of notation and once again conflate nodes with their GPBs representing the sets of GPUs.
A partition $\Pi$ is valid for coalescing to construct $\text{$\Delta $}/\text{$\Pi $}$ if it preserves the semantics of $\text{$\Delta $}$. Validity is characterized by a set of conditions that ensure the following:
Assuming that dead GPU elimination and empty GPB elimination have been performed before coalescing, the validity of a coalescing partition is formalized as the following sufficient conditions:
Condition (S1) ensures that no RaW dependence is missed in $\text{$\Delta $}/\text{$\Pi $}$; condition (S2) ensures that no strict WaW dependence is spuriously included in $\text{$\Delta $}/\text{$\Pi $}$. Together, they ensure that every GPU reaching the
Conditions (P1) and (P2) ensure that killing is not underapproximated in $\text{$\Delta $}/\text{$\Pi $}$ by converting a strict WaW dependence into a non-strict dependence. Although definite strict WaW dependences with GPUs have been removed, we could still have a potential strict WaW dependence between GPUs or a definite strict WaW dependence with a boundary definition. Condition (P3) ensures that no spurious RaW dependence is included in $\text{$\Delta $}/\text{$\Pi $}$. Together, they ensure that no GPU that does not reach the
Note that coalescing only forbids nodes that have a potential RaW or WaW dependence from being coalesced if there is a control flow path between them; coalescing in the absence of data dependence (or in the presence of definite non-strict WaW dependence) is generally allowed.
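This rule can be sketched as an admissibility check on a proposed part. The sketch below is illustrative only: `potential_dep` is a hypothetical predicate abstracting the RaW/WaW tests, and reachability is computed naively.

```python
# A part is inadmissible if two of its nodes have a potential RaW/WaW
# dependence AND a control flow path connects them; without a connecting
# path (or without the dependence), coalescing is permitted.

from itertools import combinations

def reachable(edges, src, dst):
    seen, work = set(), [src]
    while work:
        n = work.pop()
        if n == dst:
            return True
        if n in seen:
            continue
        seen.add(n)
        work.extend(b for (a, b) in edges if a == n)
    return False

def part_is_admissible(part, edges, potential_dep):
    for a, b in combinations(part, 2):
        connected = reachable(edges, a, b) or reachable(edges, b, a)
        if connected and (potential_dep(a, b) or potential_dep(b, a)):
            return False
    return True
```

For instance, with edges $1 \rightarrow 2 \rightarrow 3$ and a potential dependence between nodes 1 and 3, the part $\lbrace 1, 3\rbrace$ is rejected while $\lbrace 1, 2\rbrace$ is allowed.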
Consider the GPG in Figure 10 and the proposed partitions for coalescing, assuming that the may-definition sets are $\emptyset$. GPU $\text{$\gamma $} _2$ has a potential RaW dependence on $\text{$\gamma $} _1$. Option A violates both soundness and precision, options B and D violate soundness, and only option C satisfies all conditions:
Two important characteristics of these conditions are the following:
Consider the statement sequence $x = \&a; {if}\, (c) *\!y = \&b$; in which there is a potential RaW dependence between the two pointer assignments (because $x$ could point to $y$ in a caller). This violates condition (S1), and yet coalescing these statements does not violate soundness or precision because no pointer-pointee association is missed, nor is a spurious association created by coalescing.
This section describes how we ensure that the conditions of validity of partitioning are satisfied.
6.4.1 Ensuring Soundness. We honor the conditions for soundness in the following manner:
This example illustrates why the preceding step (2) only considers the dependence between $\text{$n$} _1$ and $\text{$n$} _2 \in (\text{$\Pi $} (\text{$n$}) \cap \text{$\mathit {pred}^+$} (\text{$n$} _1))$ rather than between $\text{$n$} _1$ and $\text{$n$} _2 \in \text{$\Pi $} (\text{$n$})$. In Figure 11(a), nodes $\text{$n$} _1$, $\text{$n$} _2$, and $\text{$n$} _4$ can be included in the same part. Consider node $\text{$n$} _3$ for inclusion in this part: the GPU in $\text{$n$} _3$ appears to have RaW dependence with the GPU in node $\text{$n$} _4$ because variable $z^{\prime }$ will be replaced by $z$ after inlining $z^{\prime }$. However, there is no control flow from $\text{$n$} _4$ to $\text{$n$} _3$. Hence, the data dependence of the GPU in $\text{$n$} _3$ need only be checked with those in $\text{$n$} _1$ and $\text{$n$} _2$ and not with those in $\text{$n$} _4$. Thus, $\text{$n$} _3$ can also be included in the same part. Similarly, although it appears that there is a WaW dependence between $\text{$n$} _2$ and $\text{$n$} _5$, the latter can also be included in the same part.
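The predecessor-only check can be sketched as a greedy forward pass. This is a simplification that ignores the precision (coherence) conditions; `conflicts` is a hypothetical predicate for the dependences that forbid coalescing.

```python
# Greedy coalescing sketch: visit nodes in control flow order and extend
# the current part as long as the new node has no conflicting dependence
# with a predecessor already in the part (only predecessors matter, as
# the n3/n4 example above illustrates).

def greedy_coalesce(order, preds, conflicts):
    """order: nodes in topological order; preds: node -> set of its
    transitive predecessors; conflicts(a, b): True if a dependence
    forbids coalescing a with b. Returns a list of parts."""
    parts, current = [], set()
    for n in order:
        blocking = current & preds[n]       # only preds already in the part
        if any(conflicts(p, n) for p in blocking):
            parts.append(current)           # close the part, start a new one
            current = set()
        current.add(n)
    if current:
        parts.append(current)
    return parts
```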
6.4.2 Ensuring Precision. Define the external predecessors and successors of $\mathit {entry}$ and $\mathit {exit}$ nodes of a part ${\hat{n}}$ as follows:
To see the role of coherence in precision, consider the GPG in Figure 11(b). Nodes $\text{$n$} _1$, $\text{$n$} _2$, and $\text{$n$} _4$ can be considered for inclusion in the same part. Nodes $\text{$n$} _3$ and $\text{$n$} _5$ have potential dependences with the GPUs of the other nodes. Assuming that the types rule out the possibility of potential dependences, the part $\lbrace \text{$n$} _1, \text{$n$} _2, \text{$n$} _4\rbrace$ violates coherence because it has two exits ($\text{$n$} _2$ and $\text{$n$} _4$) that have different external successors. If we form $\text{$\Delta $}/\text{$\Pi $},$ we will have control flow from the GPU of $\text{$n$} _4$ to the GPU of $\text{$n$} _3,$ creating a spurious RaW dependence between them because of variable $z$ (the upwards-exposed version $z^{\prime }$ will be replaced by $z$ after inlining). Some examples of coherent partitions are $\text{$\Pi $} _1 = \lbrace \lbrace \text{$n$} _1\rbrace ,\lbrace \text{$n$} _2,\text{$n$} _3\rbrace ,\lbrace \text{$n$} _4,\text{$n$} _5\rbrace ,\lbrace \text{$n$} _6\rbrace \rbrace$, $\text{$\Pi $} _2 = \lbrace \lbrace \text{$n$} _1,\text{$n$} _2,\text{$n$} _3\rbrace ,\lbrace \text{$n$} _4,\text{$n$} _5,\text{$n$} _6\rbrace \rbrace$, and $\text{$\Pi $} _3 = \lbrace \lbrace \text{$n$} _1\rbrace ,\lbrace \text{$n$} _2,\text{$n$} _3,\text{$n$} _4,\text{$n$} _5\rbrace ,\lbrace \text{$n$} _6\rbrace \rbrace$.
6.4.3 A Greedy Algorithm for Coalescing. Instead of exploring all possible partitions, we use the following greedy algorithm that implements the preceding heuristics in three steps:
We define two interdependent dataflow analyses that
Unlike the usual dataflow variables that typically compute a set of facts,
The dataflow equations to compute
Unlike the usual dataflow equations, the dataflow variables
The incremental expansion of a part in a forward direction influences the flow of GPUs accumulated in a part leading to a forward dataflow analysis for computing the GPUs reaching node $n$ in $\text{$\Pi $} (\text{$n$})$ using dataflow variables
Figure 13 gives the dataflow information for the example of Figure 12. GPBs $\text{$\delta $} _1$ and $\text{$\delta $} _2$ can be coalesced because
GPU ${z}\mathop {\longrightarrow }\limits _{32}^{2|0}{o}$ has a definition-free path in $\text{$\delta $} _8$ because boundary definition ${z}\mathop {\longrightarrow }\limits _{0}^{2|2}{z}$ reaches the exit of part $\text{$\delta $} _8$ along the path $\text{$\delta $} _1 \rightarrow \text{$\delta $} _2 \rightarrow \text{$\delta $} _4 \rightarrow \text{$\delta $} _5$. No other GPU has a definition-free path.
Observe that some GPUs appear in multiple GPBs of a GPG (before coalescing) because we could have multiple calls to the same procedure. Thus, even though the GPBs are renumbered, the statement labels in the GPUs remain unchanged, resulting in repeated occurrences of a GPU. This is a design choice because it helps us accumulate the points-to information of a particular statement in all contexts.
In Figure 4, GPBs $\text{$\delta $} _{01}$ and $\text{$\delta $} _{02}$ can be coalesced because ${\sf DDep} (\text{$\delta $} _{01}, \text{$\delta $} _{02})$ returns
In Figure 4, $\text{$\mu $} _i = \emptyset$ for all nodes $i$ in the initial GPG. Strength reduction reduces the GPUs in $\text{$\delta $} _{02}$ and correspondingly updates $\text{$\mu $} _{02}$ to $\lbrace (b,1),(q,2)\rbrace$. After coalescing, the may-definition sets are computed to obtain $\text{$\mu $} _{11} = \lbrace (b,1),(q,2)\rbrace$ (because these sources have a definition-free path from the entry of $\text{$\delta $} _{01} \in \text{$\mathit {entry}$} (\text{$\delta $} _{11})$ to exit of $\text{$\delta $} _{02} \in \text{$\mathit {exit}$} (\text{$\delta $} _{11})$) and $\text{$\mu $} _{12} = \emptyset$.
For procedure $R$ (Figure 5), the boundary definition ${b}\mathop {\longrightarrow }\limits _{00}^{1|1}{b^{\prime }}$ reaches the exit of $\text{$\Delta $} _R,$ indicating that $b$ is
We explain call inlining by classifying calls into three categories: (a) callee is known and the call is non-recursive, (b) callee is known and the call is recursive, and (c) callee is not known.
In this case, the GPG of the callee can be constructed completely before the GPG of its callers if we traverse the call graph bottom up.
We inline the optimized GPGs of the callees at the call sites in the caller procedures by renumbering the GPB nodes; each inlining of a callee gives a fresh numbering to the nodes. This process does not change the statement labels within the GPUs. In addition, every upwards-exposed variable $x^{\prime }$ occurring in a callee's GPU inlined in the caller is substituted by the original variable $x$.
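A hedged sketch of this renumbering-and-substitution step, under a hypothetical tuple representation of GPUs (source, statement label, indirection levels, target):

```python
# Inlining sketch: node ids are freshly renumbered on every inlining,
# statement labels inside GPUs are preserved, and each upwards-exposed
# variable x' is substituted by the caller's x.

import itertools

_fresh = itertools.count(100)   # assumed fresh-id supply for inlined nodes

def inline(callee_nodes, callee_gpbs):
    """callee_gpbs: node -> set of GPUs (src, stmt, levels, tgt).
    Returns renumbered node ids and GPBs with primed names unprimed."""
    renumber = {n: next(_fresh) for n in callee_nodes}

    def unprime(v):
        return v[:-1] if v.endswith("'") else v

    new_gpbs = {renumber[n]: {(unprime(s), stmt, lv, unprime(t))
                              for (s, stmt, lv, t) in gpus}
                for n, gpus in callee_gpbs.items()}
    return set(renumber.values()), new_gpbs
```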
When inlining a callee's (optimized) GPG, we add two new GPBs, a predecessor to its
Consider Figure 14, in which procedure $P$ calls procedure $Q$ and $Q$ calls $P$. The GPG of $Q$ depends on that of $P$ and vice versa, leading to incomplete GPGs: the GPGs of the callees of some calls either have not been constructed or are incomplete. We handle this mutual dependency by successive refinement of incomplete GPGs of $P$ and $Q,$ which involves inlining GPGs of the callee procedures, followed by GPG optimizations, repeatedly until a fixed point is reached. The rest of the section explains how refinement is performed and how a fixed point is defined and detected.
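At its core, the refinement is a fixed-point loop over procedure summaries. A minimal sketch, assuming a hypothetical `summarize` function that stands in for "inline the given summary at the recursive call sites and recompute the procedure's summary":

```python
# The ⊤ summary kills all GPUs and generates none; it is modeled here as
# the empty set of GPUs reaching Exit. `summarize` must be monotonic on
# a finite lattice for the loop to terminate.

def refine_recursive(summarize, top_summary=frozenset()):
    current = top_summary
    while True:
        refined = summarize(current)
        if refined == current:      # no change: fixed point reached
            return current
        current = refined
```

For example, a procedure whose summary gains one GPU after the first inlining stabilizes in two iterations.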
A set of recursive procedures is represented by a strongly connected component in a call graph. We construct GPGs for a set of recursive procedures by visiting the procedures in a post order obtained through a topological sort of the call graph. Because of recursion, the GPGs of some callees of the leaf are not available initially. We handle such situations by using a special GPG $\text{$\Delta $} _{\top }$ that represents the effect of a call when the callee's GPG is not available. The GPG $\text{$\Delta $} _{\top }$ is the $\top$ element of the lattice of all possible procedure summaries. It kills all GPUs and generates none (thereby, when applied, it computes the $\top$ value—$\emptyset$—of the lattice for
We perform the reaching GPUs analyses over incomplete GPGs containing recursive calls by repeated inlining of callees starting with $\text{$\Delta $} _{\top }$ as their initial GPGs, until no further inlining is required. Let $\text{$\Delta $} ^1_P$ denote the GPG of procedure $P$ in which all of the calls to the procedures that are not part of the strongly connected component are inlined by their respective optimized GPGs. Note that the GPGs of these procedures have already been constructed because of the bottom-up traversal over the call graph. The calls to procedures that are part of the strongly connected component are retained in $\text{$\Delta $} ^1_P$. In each step of refinement, the recursive calls in $\text{$\Delta $} ^1_P$ are inlined either by
Thus, we compute a series of GPGs $\text{$\Delta $} ^i_P$, $i\gt 1$ for every procedure $P$ in a strongly connected component in the call graph until the termination of fixed-point computation. Once $\text{$\Delta $} ^i_P$ is constructed, we decide to construct $\text{$\Delta $} ^j_Q$ for a caller $Q$ of $P$ if the dataflow values of the
In the example of Figure 14, the sole strongly connected component contains procedures $P$ and $Q$. The GPG of procedure $Q$ is constructed first and $\text{$\Delta $} ^1_Q$ contains a single call to procedure $P$ whose GPG is not constructed yet, and hence the construction of $\text{$\Delta $} ^2_Q$ requires inlining of $\text{$\Delta $} _{\top }$. Since $\text{$\Delta $} _{\top }$ represents a procedure call that never returns, the GPB
We give an informal argument for termination of GPG construction in the presence of recursion. A formal and complete proof can be found in Gharat [10]. We first describe a property that holds for intraprocedural dataflow analysis over CFGs and then extend it to GPGs.
Consider a CFG $C_Q$ representing procedure $Q$ such that the flow functions associated with the nodes in $C_Q$ are monotonic and compute values in a finite lattice $L$. Let the dataflow value associated with the entry of
The preceding situation models call inlining in GPGs. From Section 3.1.3, the set of GPUs is finite, and from Gharat [10], they form a lattice with $\subseteq$ as the partial order. The flow function for a call GPB is initially assumed to be $f_\top ,$ and then the GPB is replaced by the GPG of the callee. The control flow surrounding this call remains the same. Let the effect of the callee GPG be described by a flow function $f$. Clearly, $f \sqsubseteq f_\top$ because $f_\top$ computes the $\top$ value. The process of successive refinements for handling recursion replaces call GPBs by the GPGs of the callees repeatedly. Consider a sequence of refinements, $ \text{$\Delta $} ^1_Q, \text{$\Delta $} ^2_Q, \ldots , \text{$\Delta $} ^i_Q$. It can be proved by induction on the length of the sequence that the GPUs reaching the
We model a call through a function pointer (say, fp) at call site $s$ as a use statement with a GPU (Section 8). The interleaving of strength reduction and call inlining reduces the GPU
and provides the pointees of fp. This is identical to computing points-to information (Section 8). Until the pointees become available, the GPU
acts as a barrier. Once the pointees become available, the indirect call converts to a set of direct calls (see Appendix C for an illustrative example). A naive approach to function pointer resolution would inline an indirect callee first into its immediate callers. This may require as many rounds of GPG construction as the maximum number of indirect calls in any call chain. Instead, we allow inlining directly in a transitive callee when a pointee of the function pointer of an indirect call becomes available. Hence, we can resolve all indirect calls in a call chain in a single round beginning with the indirect call closest to main. This is explained in Appendix C.
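The conversion of an indirect call into a set of direct calls, once pointees are known, can be sketched as follows (hypothetical call-site representation; not the implementation):

```python
# Sketch: an indirect call remains a barrier until the pointees of its
# function pointer become available; once available, it converts into
# one direct call per resolved target.

def resolve_indirect_call(call_site, pointees):
    """call_site: ("indirect", fp_name); pointees: fp_name -> set of
    function names resolved so far."""
    kind, fp = call_site
    targets = pointees.get(fp)
    if not targets:
        return [call_site]                 # barrier: not yet resolved
    return [("direct", f) for f in sorted(targets)]
```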
The second phase of a bottom-up approach, which uses procedure summaries created in the first phase, is redundant in our method. This is because our first phase computes the points-to information as a side effect of the construction of GPGs. Since statement labels in GPUs are unique across all procedures and are not renamed on inlining, the points-to edges computed across different contexts for a given statement can be back-annotated to the statements giving the flow- and context-sensitive points-to information for the statement.
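Back-annotation can be sketched as a simple accumulation keyed by statement labels (hypothetical edge representation; the actual GPU encoding carries indirection levels as well):

```python
# Because statement labels survive inlining unchanged, points-to edges
# computed for a statement in different calling contexts can be merged
# under that label, yielding per-statement points-to information.

from collections import defaultdict

def back_annotate(reduced_gpus):
    """reduced_gpus: iterable of (pointer, stmt_label, pointee) edges
    collected across all contexts. Returns stmt -> set of edges."""
    per_stmt = defaultdict(set)
    for ptr, stmt, pointee in reduced_gpus:
        per_stmt[stmt].add((ptr, pointee))
    return dict(per_stmt)

# y points to a via one call chain and to b via another (statement 4)
info = back_annotate([("y", 4, "a"), ("y", 4, "b")])
assert info[4] == {("y", "a"), ("y", "b")}
```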
Since we also need points-to information for statements that read pointers but do not define them, we model them as use statements. Consider a use of a pointer variable in a non-pointer assignment or an expression. We represent such a use with a GPU whose source is a fictitious node with
, whereas an integer assignment “$\tt *x = 5;$” is modeled as a GPB
.
Consider the assignment sequence $01\!: x=\&a; \; 02\!: *x=5;$. A client analysis would like to know the pointees of $x$ for statement 02. We model this use of pointee of $x$ as a GPU . This GPU can be composed with ${x}\mathop {\longrightarrow }\limits _{01}^{1|0}{a}$ to get a reduced GPU
indicating that the pointee of $x$ in statement 02 is $a$.
When a use involves multiple pointers such as “$\tt if \; (x == *y)$,” the corresponding GPB contains multiple GPUs. If the exact pointer-pointee relationship is required, rather than just the reduced form of the use (devoid of pointers), we need additional minor bookkeeping to record GPUs and the corresponding pointers that have been replaced by their pointees in the simplified GPUs.
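GPU reduction for such uses can be sketched as a source substitution that drops one indirection level. The tuple representation and level conventions below are assumptions for illustration (level 1 denotes the variable's own location, as in ${x}\mathop {\longrightarrow }\limits _{01}^{1|0}{a}$ for $x=\&a$, so the location $*x$ written by statement 02 is $x$ at level 2):

```python
# Sketch of composing a consumer GPU that writes/reads through x with a
# producer x --1|0--> a: x's pointee a stands in for x, and one
# indirection level is dropped.

def compose_source(consumer, producer):
    """GPUs as (src, src_lev, stmt, tgt, tgt_lev). If the producer
    defines the consumer's source pointer, substitute and reduce."""
    (cs, ci, cstmt, ct, cj) = consumer
    (ps, pi, pstmt, pt, pj) = producer
    if ps == cs and pi == 1 and pj == 0 and ci >= 1:
        return (pt, ci - 1, cstmt, ct, cj)
    return consumer                 # no reduction possible

# 01: x = &a;   02: *x = 5;  ("?" marks the fictitious non-pointer node)
use = ("x", 2, "02", "?", 0)
prod = ("x", 1, "01", "a", 0)
assert compose_source(use, prod) == ("a", 1, "02", "?", 0)
```

The reduced GPU records that the location accessed through $*x$ in statement 02 is $a$, which is exactly the per-statement points-to fact a client analysis needs.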
With the provision of a GPU for a use statement, the process of computing points-to information can be seen simply as a process of simplifying consumer GPUs (including those with a use node). The interleaving of strength reduction and call inlining gradually converts a GPU $x \mathop {\longrightarrow }\limits _{s}^{i|j} y$ to a set of points-to edges $\lbrace {a}\mathop {\longrightarrow }\limits _{\text{$s$}}^{1|0}{b} \mid a \text{ is the } i^{th} \text{ pointee of } x,\ b \text{ is the } j^{th} \text{ pointee of } y\rbrace$. This is achieved by propagating the use of a pointer (in a pointer assignment or a use statement) and its definitions to a common context. This may require propagating
The four variants of hoisting and
to a common procedure in the first phase of a bottom-up method are illustrated in the following with the help of Figure 16; effectively, they make the second phase redundant:
Thus, $y$ points to $a$ along the call from procedure $P,$ and it points to $b$ along the call from procedure $S$. Hence, the points-to information $\lbrace {y}\mathop {\longrightarrow }\limits _{4}^{1|0}{a}, {y}\mathop {\longrightarrow }\limits _{4}^{1|0}{b}\rbrace$ represents flow- and context-sensitive information for statement 4.
In this section, we prove the soundness and precision of GPG-based points-to analysis by comparing it with a classical top-down flow- and context-sensitive points-to analysis. We first describe our assumptions, review the classical points-to analysis, and then provide the main proof obligation. This is followed by a series of lemmas proving the soundness of our analyses and operations.
We do a whole-program analysis and assume that the entire source is available for analysis. Practically, there are very few library functions that influence the points-to relations of pointers to scalars in a C program. Library functions manipulating pointers into the heap can be manually represented by a GPB representing a sound overapproximation of their summaries.
For simplicity of reasoning, our proof does not talk about heap pointers. Our analysis computes a sound overapproximation of classical points-to analysis for heap pointers because (a) we use a simple allocation-site-based abstraction in which heap locations are not cloned context sensitively, and (b) we use $k$-limiting for heap pointers that are live on entry.
In the proof, we often talk about reaching GPUs analysis without making a distinction between reaching GPUs analysis with and without blocking. Blocking is discussed only in the proof of Lemma 9.11 because it is required to ensure soundness of GPU reduction.
Finally, our proofs use a simplistic model of programs where all variables are global and there is no parameter or return value mapping when making a call or when returning from a call. Including local variables, function parameters, and these mapping functions in the reasoning is a matter of detail and is not required for the spirit of the arguments made in the proof.
This section describes the top-down interprocedural flow- and context-sensitive points-to analysis. In keeping with the requirements of our proof of soundness (assumptions in Section 9.1), our formulation is restricted to global (non-structure) variables and direct procedure calls. Our formulation can be easily extended to support local variables, parameter mappings, return-value mappings, and structures; calls through function pointers can be handled using a standard approach of augmenting the call graph on the fly.
Our formulation is based on the classical Sharir-Pnueli tabulation method [34]. This method maintains pairs $(X,Y)$ of input-output dataflow values (hence the name tabulation method) for every procedure $Q$ where $X$ reaches
The value-contexts-based method is subtly different from the tabulation method in the following way. For each procedure $Q,$ this method creates a mapping represented as a set of pairs $(X,Y),$ with $X$ being a possible points-to graph reaching
For other statements, the generated points-to information is the cross product of the pointers being defined by the statement (${{\sf Updated\_Ptrs}_{n}}$) and the locations whose addresses are read by the pointers on the RHS (${{\sf RHS\_Pointees}_{n}}$). The points-to information is killed by a statement when a strong update is possible (Section 2.1.3), which is the case for every direct assignment because the pointer in the LHS is overwritten (${{\sf Overwritten\_Ptrs}_{n}}$). For an indirect assignment, when the pointer appearing on the LHS has exactly one pointee and the pointer is not live on entry to main, the pointer is overwritten and the earlier pointees are removed. This is possible only when there is no definition-free path for the pointer from
We need to name the different versions of a GPG as it undergoes optimizations, the analyses on these different versions, and the GPBs of a callee inlined in a caller's GPG.
9.3.1 Naming the GPG Versions and Analyses. Recall that GPG construction creates a series of GPGs by progressively transforming them. For the purpose of proof, it is convenient to use notation that distinguishes between them. We use the notation in Figure 17 for different versions of a GPG. These different versions can have different possibilities of analyses that we show are equivalent using the following notation:
Note that our implementation does not perform
9.3.2 Naming the GPBs After Call Inlining. Let procedure $P$ call procedure $Q$. Then, as illustrated in Figure 18, $\text{$\Delta $} _{P}^{\text{Call}}$ contains $\text{$\Delta $} _{Q}^{\text{Opt}}$ as a subgraph, obtained by expanding the GPB $\text{$\delta $} _{\text{m}}$ containing the call to $Q$ and connecting the predecessors of $\text{$\delta $} _{{m}}$ to the
Consider node $n$ in procedure $Q$. After $Q$ is inlined in its caller, say $P$, the label of the inlined instance of the node is a sequence $\text{$m$} \cdot \text{$n$},$ where $\text{$m$}$ is the label of the node in $P$ that contains the call to $Q$. When $P$ is inlined in a caller $R$, the label of the further inlined instance of the node becomes $\text{$l$} \cdot \text{$m$} \cdot \text{$n$},$ where node $\text{$l$}$ in $R$ calls $P$. Thus, the node labels are sequences of the labels of the call nodes, with the last element in the sequence identifying the actual node in the inlined callees. Letters $S$ and $E$ are used for distinguishing the inlined
We handle recursion by repeated inlining of recursive procedures. This process constructs a series of GPGs $\text{$\Delta $} ^i_P$, $i\gt 1$ for every procedure $P$ in a cycle of recursion. GPG $\text{$\Delta $} ^{i+1}_P$ is constructed by inlining GPGs $\text{$\Delta $} ^1_Q$ in $\text{$\Delta $} ^{i}_P$, for all callees $Q$ of $P$. As explained in Section 7.2, this sequence is bounded by some $k$ when $\text{$\Delta $} ^{k}_P$ is equivalent to $\text{$\Delta $} ^{k+1}_P$ in terms of the GPUs reaching the
For the purpose of reasoning in the proofs, we assume without any loss of generality that indirect recursion has been converted into self-recursion [18]. The resulting inlining has been illustrated in Figure 18. Note that the successors and predecessors of the call node after $k+1$ inlinings are disconnected (e.g., there is no control flow from $\text{$\delta $} _{\text{$m$} ^{k+1}\cdot \text{$l$}}$ to $\text{$\delta $} _{\text{$m$} ^{k+1}\cdot \text{$n$}}$ in Figure 18). For self-recursive procedures, we use the notation $\text{$\delta $} _{\text{$m$} ^k\cdot \text{$n$}}$ to denote the sequence $\text{$m$} \cdot \text{$m$} \ldots \text{$m$} \cdot S$ of $k$ occurrences of $m$ followed by $\text{$n$},$ where $n$ could be letter $S$ or $E$ apart from the usual node labels.
9.3.3 Naming the Dataflow Variables in Different Contexts of a Recursive Procedure. The top-down context-sensitive reaching GPUs analysis over $\text{$\Delta $} _{P}^{\text{Init}}$ computes the values of dataflow variables
We use the classical points-to analysis defined in Section 9.2 as the gold standard and show that the GPG-based points-to analysis computes identical information (except when $k$-limiting is used for bounding
Given the complete source of a C program, the GPG-based points-to analysis of the program computes the same points-to information that would be computed by a top-down fully flow- and context-sensitive classical points-to analysis of the program.
Figure 19 illustrates the proof outline listing the lemmas that prove some key results. Since $\text{$\Delta $} _{P}^{\text{Init}}$ is constructed by simple transliteration (Section 3.3.1), we assume that it is a sound representation of the CFG of procedure $P$ with respect to classical flow- and context-sensitive points-to analysis. Then, using the points-to relations created by static initializations as memory $M$ (or set of GPUs) for boundary information for main,
Recall that a top-down analysis uses the dataflow information reaching the call sites to compute the context-sensitive dataflow information within the callees. Thus, the information reaching the call sites is boundary information for an analysis of the callee procedures.
Consider points-to information represented by memory $M$ that defines all pointers that are live on entry in procedure $P$. Then, with $M$ as boundary information, .
Since all pointers are defined before their use, GPU reduction computes GPUs of the form ${x}\mathop {\longrightarrow }\limits _{\text{$s$}}^{1|0}{y}$. Thus, there are no potential dependences and hence no blocking. In such a situation, the dataflow equations in Definition 5 reduce to those of the classical flow-sensitive points-to analysis. Assuming that the two analyses maintain context sensitivity using the same mechanism, they would compute the same points-to information at the corresponding program points.
Let procedure $Q$ be a transitive callee of procedure $P$. Consider points-to information represented by memory $M$ that defines all pointers that are live on entry in procedure $P$. Then, with $M$ as boundary information, .
Similar to that of Lemma 9.2.
Lemma 9.4 argues about the GPUs that reach the GPBs for the statements in $P$ (and not the statements of the inlined callees), whereas Lemma 9.6 argues about the GPUs that reach the GPBs for the statements belonging to the (transitive) callees inlined in $\text{$\Delta $} _{P}^{\text{Call}}$.
Consider a non-recursive procedure $P$ such that all of its transitive callees are also non-recursive. For a given boundary information (possibly containing points-to information and boundary definitions), .
We prove the lemma by inducting on two levels. At the outer level, we use structural induction on the call structure rooted at $P$. To prove the inductive step of the outer induction, we use an inner induction on the iteration number in the Gauss-Seidel method of fixed-point computation (the dataflow values in iteration $i+1$ are computed only from those computed in iteration $i$):
This completes the proof of the lemma.
The claim of Lemma 9.4 also holds for recursive procedures.
We prove the lemma by showing that for all $0\lt i \le k$:
If the IN equivalence holds, then the OUT equivalence holds because by Lemma 9.9, the inlined version $\text{$\Delta $} _{P}^{\text{Opt}}$ is the same as $\text{$\Delta $} _{P}^{\text{SRed}}$, which is the same as $\text{$\Delta $} _{P}^{\text{Call}}$ by Lemma 9.8. Thus, our proof obligation reduces to showing the IN equivalence, which is easy to argue using induction on recursion depth $i$. The base case $i=1$ represents the first (i.e., “outermost”) recursive call. Since no recursive call has been encountered so far, it is easy to see that ${\sf RGIn}_{\text{$m$}}^{1} = {\sf RGIn}_{1\cdot S}$. Assuming that it holds for recursion depth $i$, it also holds for recursion depth $i+1$ because, as explained earlier, the effect of $\text{$\Delta $} _{P}^{\text{Opt}}$ is the same as that of $\text{$\Delta $} _{P}^{\text{Call}}$ by Lemmas 9.9 and 9.8. For $i=k$, since a fixed point has been reached in both $\text{$\Delta $} _{P}^{\text{Init}}$ and $\text{$\Delta $} _{P}^{\text{Call}}$, the absence of recursive call $k+1$ in $\text{$\Delta $} _{P}^{\text{Call}}$ does not matter.
In the following lemma, we need to consider the contexts of the calls within procedure $P$. For our reasoning, the way a context is defined does not matter, and we generically denote a context as $\sigma$. We assume that $\sigma$ denotes the full context without any approximation.
Consider a non-recursive procedure $P$ such that all of its transitive callees are also non-recursive. Assume that $P$ calls procedure $Q$ possibly transitively. For any boundary information (possibly containing points-to information and boundary definitions) for $P$, .
There could be multiple paths in the call graph from $P$ to $Q$. We assume without any loss of generality that these calls to $Q$ have different contexts and boundary information reaching them. Assume that there are $i$ calls to $Q$ with contexts $\sigma _i$, $i\gt 0$. Let the corresponding boundary information (the sets of GPUs reaching the calls) be $R_i$, $i\gt 0$. Then, ${\sf TRG} (\text{$\text{$\Delta $} _{P}^{\text{Init}}$},Q)$ analysis would analyze $Q$ separately for these contexts with the corresponding boundary information. Observe that $\text{$\Delta $} _{P}^{\text{Call}}$ contains $i$ separate instances of $\text{$\Delta $} _{Q}^{\text{Call}}$, which are analyzed independently by
Consider a particular call to $Q$ with context $\sigma$. We prove the lemma by induction on the length $j$ of the call chain from $P$ to $Q$ for this context. The base case is $j=1,$ representing the situation when $Q$ is a direct callee of $P$. Let this call be in the GPB $\text{$\delta $} _{{m}}$. Then, from Lemma 9.4,
For the inductive step, we assume that the lemma holds for length $j$ of the call chain. To prove the inductive step for length $j+1$, let the procedure that calls $Q$ be $Q^{\prime }$. Then, the length of the call chain from $P$ to $Q^{\prime }$ is $j$ and the lemma holds for $Q^{\prime }$ by the inductive hypothesis. We can argue about the call to $Q$ from within $Q^{\prime }$ in a manner similar to the base case described earlier. This proves the inductive step.
The claim of Lemma 9.6 also holds for recursive procedures.
The proof is essentially along the lines of Lemma 9.5 because all we need to argue is that the claim is preserved across the fixed-point iterations performed for recursive calls.
Let $Q$ denote procedure $P$ or its transitive callees. Then, for a given boundary information (possibly containing points-to information and boundary definitions) for procedure $P$:
It is easy to show that the intraprocedural reaching GPUs analysis over $\text{$\Delta $} _{P}^{\text{Call}}$ and $\text{$\Delta $} _{P}^{\text{SRed}}$ computes the same set of GPUs reaching the corresponding GPBs because the changes made by strength reduction are local in nature: there is no change in the control flow; only the GPUs in GPBs are replaced by equivalent GPUs. Thus, it is sufficient to argue that the effect of the GPUs in a GPB is preserved by strength reduction.
Although the GPBs are not renumbered by strength reduction, it is useful to distinguish between the GPBs before and after strength reduction for reasoning: let the GPB obtained after strength reduction of $\text{$\delta $} _{{m}}$ be denoted by $\text{$\delta $} _{{m}}^{\prime }$. Then, $\text{$\delta $} _{{m}}^{\prime } = \displaystyle \bigcup\nolimits _{\text{$\gamma $} \in \text{$\delta $} _{{m}}} \text{$\gamma $} \text{$\ \circ \ $} {\sf RGIn}_{m}$. Since reaching GPUs analysis is sound (Lemmas 9.12 and 9.13), all relevant producer GPUs reach each $\text{$\delta $} _{{m}}$. Hence, from Lemma 9.11, $\gamma$ is equivalent to $\text{$\gamma $} \text{$\ \circ \ $} {\sf RGIn}_{m}$. Thus, it follows that $\text{$\delta $} _{{m}}^{\prime }$ is equivalent to $\text{$\delta $} _{{m}}$, proving the lemma.
Let $Q$ denote procedure $P$ or its transitive callees. Then, for a given boundary information (possibly containing points-to information and boundary definitions) for procedure $P$:
GPGs $\text{$\Delta $} _{P}^{\text{SRed}}$ and $\text{$\Delta $} _{P}^{\text{Opt}}$ do not contain any call, and hence the following reasoning holds for all corresponding statements between them regardless of whether they belong to $P$ or a transitive callee of $P$. We prove the equivalence of $\text{$\Delta $} _{P}^{\text{SRed}}$ and $\text{$\Delta $} _{P}^{\text{Opt}}$ in three steps for the three optimizations:
This proves the lemma.
Recall that the abstract memory computed by a points-to analysis is a relation $\text{$M$} \subseteq {\sf L}_{\sf P} \times {\sf L}$, where
Consider GPU composition $\text{$\gamma $} _{r} = \text{$\gamma $} _{c} \text{$\ \circ \ $} \text{$\gamma $} _{p}$. Let the source of $\text{$\gamma $} _{p}$ be $(x,k)$. Then, if no other GPU with the source $(x,k^{\prime })$, $k^{\prime } \le k$, reaches $\text{$\gamma $} _{c}$, then the abstract executions of $\text{$\gamma $} _{c}$ and $\text{$\gamma $} _{r}$ are identical in the memory obtained after the abstract execution of $\text{$\gamma $} _{p}$.
Consider the picture on the right. The memory before the execution of $\text{$\gamma $} _{p}$ is $M$ with no constraint, whereas the memory obtained after the execution of $\text{$\gamma $} _{p}$ is $M$ with the constraint $C_p$. The memory obtained after the execution of $\text{$\gamma $} _{c}$ and $\text{$\gamma $} _{r}$ is $M$ with the constraints $C_c\wedge C_p$ and $C_r\wedge C_p$, respectively. Then, the lemma can be proved by showing that $C_c\wedge C_p$ and $C_r\wedge C_p$ are identical. We first consider the case when $\text{$\gamma $} _{c}$ causes a weak update.
Initially, assume that $\text{$\gamma $} _{c}$ causes a weak update; this assumption can be relaxed later. Let $\text{$\gamma $} _{c}$ and $\text{$\gamma $} _{p}$ be the GPUs illustrated in the first column of Figure 7. Since no other GPU with the source $(x,k^{\prime })$, $k^{\prime } \le k$, reaches $\text{$\gamma $} _{c}$, the constraint $C_p$ is $\text{$M$} ^k\lbrace x\rbrace = \text{$M$} ^l\lbrace y\rbrace$. The constraint $C_c$ is $\text{$M$} ^i\lbrace z\rbrace \supseteq \text{$M$} ^j\lbrace x\rbrace$, and $C_r$ is $\text{$M$} ^i\lbrace z\rbrace \supseteq \text{$M$} ^{(l+j-k)}\lbrace y\rbrace$.
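As an illustration, the constraints above can be read off a small executable model of GPU composition. The following Python sketch is ours and is not part of the implementation; it models only the case in which the producer's source fully covers the consumer's read (i.e., $j \ge k$). A GPU $x \xrightarrow{\,i|j\,} y$ is encoded as the constraint $M^i\lbrace x\rbrace \supseteq M^j\lbrace y\rbrace$, and the reduced indirection level is $l+j-k$:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GPU:
    # encodes the constraint M^i{src} ⊇ M^j{tgt}
    src: str
    i: int
    tgt: str
    j: int

def compose(c: GPU, p: GPU):
    """Reduce consumer c using producer p (which defines p.src at level p.i
    with pointee p.tgt at level p.j). Valid only when c reads p.src with at
    least p.i indirections; the reduced level is p.j + c.j - p.i."""
    if c.tgt == p.src and c.j >= p.i:
        return GPU(c.src, c.i, p.tgt, p.j + c.j - p.i)
    return None  # composition not applicable
```

For example, with the producer $x \xrightarrow{1|0} y$ (arising from `x = &y`) and the consumer $z \xrightarrow{1|2} x$ (arising from `z = *x`), composition yields $z \xrightarrow{1|1} y$, the classical points-to effect of `z = y`.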
Consider the set ${\sf Red} = \text{$\gamma $} \text{$\ \circ \ $} \text{$\mathcal {R}$}$. Let $M$ be the memory obtained after executing the GPUs in $\mathcal {R}$. Then, the execution of the GPUs in ${\sf Red}$ is equivalent to the execution of $\gamma$ in $M$.
Recall that ${\sf Red}$ is the fixed point of the function ${\sf \text{GPU} \!\!\_\text{reduction}} ({\sf Red},\text{$\mathcal {R}$})$ (Definition 4) with the initial value ${\sf Red} ^0 = \lbrace \text{$\gamma $} \rbrace$. As explained in Section 4.3, this computation is monotonic and is guaranteed to converge. Hence, this lemma can be proved by induction on step $i$ in the fixed-point iteration that computes ${\sf Red} ^i$. The base case is $i=0,$ which follows trivially because ${\sf Red} ^0 = \lbrace \text{$\gamma $} \rbrace$.
For the inductive hypothesis, assume that the lemma holds for ${\sf Red} ^i$. For the inductive step, we observe that ${\sf Red} ^{i+1}$ is computed by reducing the GPUs in ${\sf Red} ^i$ by composing them with those in $\mathcal {R}$. Consider the composition of a GPU $\text{$\gamma $} _1 \in {\sf Red} ^i$ with a GPU $\text{$\gamma $} \in \text{$\mathcal {R}$}$ such that $\text{$\gamma $} _2 = \text{$\gamma $} _1 \text{$\ \circ \ $} \text{$\gamma $}$; then, $\text{$\gamma $} _2 \in {\sf Red} ^{i+1}$. Let the source of $\text{$\gamma $} _1$ be $(x,i)$. Then, from Lemma 9.10, $\text{$\gamma $} _2$ can replace $\text{$\gamma $} _1$ if $\mathcal {R}$ does not contain any GPU $\text{$\gamma $} _1^{\prime }$ with a source $(x,i^{\prime }),$ where $i^{\prime }\le i$. Thus, there are two cases to consider:
Hence the lemma.
Lemma 9.12 shows the soundness of reaching GPUs analysis without blocking. The soundness of reaching GPUs analysis with blocking is shown in Lemma 9.13. Lemma 9.14 shows the precision of reaching GPUs analyses by arguing that every GPU that reaches a GPB is either generated by a GPB or is a boundary definition.
Consider a GPU $\text{$\gamma $} \!:x \mathop {\longrightarrow }\limits _{s}^{i|j} y$ obtained after the strength reduction of the GPUs in $\text{$\delta $} _{\text{l}}$ using the simplified GPUs reaching $\text{$\delta $} _{\text{l}}$. Assume that there is a control flow path from $\text{$\delta $} _{\text{l}}$ to $\text{$\delta $} _{{m}}$ along which the source $(x,i)$ is not strongly updated. Then, $\gamma$ reaches $\text{$\delta $} _{{m}}$.
We prove the lemma by induction on the number of nodes $k$ between $\text{$\delta $} _{\text{l}}$ and $\text{$\delta $} _{{m}}$. The basis is $k=0$ when $\text{$\delta $} _{{m}}$ is a successor of $\text{$\delta $} _{\text{l}}$. Since $\gamma$ has been obtained after strength reduction of the GPUs in $\text{$\delta $} _{\text{l}}$, $\text{$\gamma $} \in {\sf RGOut}_{l}$. Since $\text{$\delta $} _{{m}}$ is a successor of $\text{$\delta $} _{\text{l}}$, it follows that $\text{$\gamma $} \in {\sf RGIn}_{m}$, proving the base case.
For the inductive hypothesis, assume that the lemma holds when there are $k$ nodes between $\text{$\delta $} _{\text{l}}$ and $\text{$\delta $} _{{m}}$. To prove that it holds for $k+1$ nodes between them, let the $k^{th}$ node be $\text{$\delta $} _{{n}}$. Then, $\text{$\gamma $} \in {\sf RGIn}_{n}$ by the inductive hypothesis. Since $\text{$\delta $} _{{n}}$ does not strongly update the source $(x,i)$, it means that $\text{$\gamma $} \notin {\sf RGKill}_{n}$ and thus $\text{$\gamma $} \in {\sf RGOut}_{n}$. Since $\text{$\delta $} _{{m}}$ is a successor of $\text{$\delta $} _{{n}}$, it follows that $\text{$\gamma $} \in {\sf RGIn}_{m}$, proving the inductive step, and hence the lemma.
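The RGIn/RGOut reasoning above follows the usual forward may-dataflow pattern. The following worklist sketch is our own simplification: `gen` and `kill` are assumed to be precomputed per GPB, whereas in the analysis described here kill additionally depends on strong-update information derived during the analysis.

```python
from collections import deque

def reaching_gpus(succ, gen, kill, entry, boundary):
    """Forward may-analysis over GPBs:
    RGOut_n = (RGIn_n - RGKill_n) | RGGen_n, and RGIn_n accumulates
    RGOut of predecessors; boundary GPUs seed the entry node."""
    rg_in = {n: set() for n in succ}
    rg_out = {n: set() for n in succ}
    rg_in[entry] |= set(boundary)
    work = deque(succ)
    while work:
        n = work.popleft()
        out = (rg_in[n] - kill[n]) | gen[n]
        if out != rg_out[n]:
            rg_out[n] = out
            for s in succ[n]:
                if not out <= rg_in[s]:   # propagate only new GPUs
                    rg_in[s] |= out
                    work.append(s)
    return rg_in, rg_out
```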
Let GPU $\text{$\gamma $} \!:x \mathop {\longrightarrow }\limits _{s}^{i|j} y$ be obtained after the strength reduction of the GPUs in $\text{$\delta $} _{\text{l}}$ using the simplified GPUs reaching $\text{$\delta $} _{\text{l}}$. Assume that there is a control flow path from $\text{$\delta $} _{\text{l}}$ to $\text{$\delta $} _{{m}}$ along which the source $(x,i)$ is neither strongly updated nor blocked. Then, $\gamma$ reaches $\text{$\delta $} _{{m}}$.
The proof is similar to that of Lemma 9.12, but now we additionally reason about ${\sf Blocked} \,(I,G)$ (Definition 7).
If a GPU $\gamma$ reaches a GPB $\text{$\delta $} _{\text{l}}$, then there must be a GPB $\text{$\delta $} _{{m}}$ such that there is a control flow path from $\text{$\delta $} _{{m}}$ to $\text{$\delta $} _{\text{l}}$ that does not kill or block $\gamma$ and either $\text{$\delta $} _{{m}}$ is
Without any loss of generality, we generically use
For the inductive hypothesis, assume that the lemma holds for iteration $i$. Consider a GPU $\gamma$ in
The main motivation of our implementation was to evaluate the effectiveness of our optimizations in handling the following challenge for practical programs:
A procedure summary for flow- and context-sensitive points-to analysis needs to model the accesses of pointees defined in the callers and needs to maintain control flow. Thus, the size of a summary can potentially be large. Further, the transitive inlining of the summaries of the callee procedures can increase the size of a summary exponentially, thereby hampering the scalability of analysis.
Section 10.1 describes our implementation, Section 10.2 describes our measurements that include comparisons with client analyses, and Section 10.3 analyzes our observations and describes the lessons learned.
We implemented GPG-based points-to analysis in GCC 4.7.2 using the LTO framework and have carried out measurements on SPEC CPU2006 benchmarks on a machine with 16 GB of RAM with eight 64-bit Intel i7-7700 CPUs running at 4.20 GHz. The implementation can be downloaded from https://github.com/PritamMG/GPG-based-Points-to-Analysis.
10.1.1 Modeling Language Features. Our method eliminates all non-address-taken local variables using def-use chains explicated by the SSA-form; this generalizes the technique in Section 3.3.1 that removes compiler-added temporaries. If a GPU defining a global variable or a parameter reads a non-address-taken local variable, we identify the corresponding producer GPUs by traversing the def-use chains transitively. This eliminates the need for filtering out the local variables from the GPGs for inlining them in the callers. As a consequence, a GPG of a procedure consists only of GPUs that involve global variables, parameters of the procedure, and the return variable that is visible in the scope of its callers. All address-taken local variables in a procedure are treated as global variables because they can escape the scope of the procedure. However, these variables are not strongly updated because they could represent multiple locations.
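The transitive traversal of def-use chains can be sketched as follows. This is our own simplification: each GPU is abstracted to the variables it copies from, via a hypothetical `copied_from` map, and the sketch only resolves a local to the globals/parameters it transitively reads.

```python
def visible_sources(var, copied_from, is_local):
    """Transitively resolve a non-address-taken local to the globals and
    parameters it reads, following SSA def-use chains."""
    work, seen, out = [var], set(), set()
    while work:
        v = work.pop()
        if v in seen:
            continue
        seen.add(v)
        for s in copied_from.get(v, ()):
            if is_local(s):
                work.append(s)   # keep following the def-use chain
            else:
                out.add(s)       # visible in callers: global or parameter
    return out
```

For example, if local `t1` is copied from `t2`, which reads global `g` and local `t3`, which in turn reads parameter `p`, then `t1` resolves to `{g, p}`.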
| Program | kLoC (A) | No. of Statements Involving Pointers (B) | No. of Call Sites (C) | No. of Procedures (D) | No. of Calls (E): 2–5 | 6–10 | 11–20 | 21+ |
|---|---|---|---|---|---|---|---|---|
lbm | 0.9 | 370 | 30 | 19 | 5 | 0 | 0 | 0 |
mcf | 1.6 | 480 | 29 | 23 | 11 | 0 | 0 | 0 |
libquantum | 2.6 | 340 | 277 | 80 | 24 | 11 | 4 | 3 |
bzip2 | 5.7 | 1,650 | 288 | 89 | 35 | 7 | 2 | 1 |
milc | 9.5 | 2,540 | 782 | 190 | 60 | 15 | 9 | 1 |
sjeng | 10.5 | 700 | 726 | 133 | 46 | 20 | 5 | 6 |
hmmer | 20.6 | 6,790 | 1,328 | 275 | 93 | 33 | 22 | 11 |
h264ref | 36.1 | 17,770 | 2,393 | 566 | 171 | 60 | 22 | 16 |
gobmk | 158.0 | 212,830 | 9,379 | 2,699 | 317 | 110 | 99 | 134 |
Column E omits procedures with a single call. |
We approximate heap memory using context-insensitive allocation-site-based abstraction and by maintaining $k$-limited indirection lists of field dereferences for $k=3$ (see Appendix B) for heap locations that are live on entry to a procedure. An array is treated index insensitively. Since there is no kill owing to weak update, arrays are maintained flow insensitively by our analysis.
For pointer arithmetic involving a pointer to an array, we approximate the pointer being defined to point to every element of the array. For pointer arithmetic involving other pointers, we approximate the pointer being defined to point to every possible location.
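The $k$-limited indirection lists of field dereferences mentioned above can be sketched as a simple truncation. This is our own illustration; the summary marker `*` for all deeper accesses is an assumption of the sketch.

```python
def k_limit(deref_fields, k=3):
    """Truncate an indirection list of field dereferences at depth k;
    a trailing '*' summarizes all deeper accesses."""
    fields = tuple(deref_fields)
    return fields if len(fields) <= k else fields[:k] + ("*",)
```

For instance, an access path `h.f.n` is kept exactly, whereas `h.f.n.n.n` is summarized beyond depth 3.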
10.1.2 Variants of Points-to Analysis Implemented. For comparing the precision of our analysis, we implemented the following variants. For convenience, we implemented them using GPUs and not by using special data structures for efficient flow-insensitive analysis:
The third variant—that is, flow-sensitive and context-insensitive (FSCI) points-to analysis—can be modeled by constructing a supergraph by joining the CFGs of all procedures such that calls and returns are replaced by gotos. This amounts to a top-down approach (or a bottom-up approach with a single summary for the entire program instead of separate summaries for each procedure). For practical programs, this initial GPG is too large for our analysis to scale. Our analysis achieves scalability by keeping the GPGs as small as possible at each stage. Therefore, we did not implement this variant of points-to analysis. Note that the FICI variant is also not a bottom-up approach, because a separate summary is not constructed for every procedure. However, it was easy to implement because of a single store.
| Program | Call Graph Nodes (GPG) | Call Graph Nodes (SVF) | Call Graph Edges (GPG) | Call Graph Edges (SVF) | Monomorphic Calls (GPG) | Monomorphic Calls (SVF) | Polymorphic Calls (GPG) | Polymorphic Calls (SVF) |
|---|---|---|---|---|---|---|---|---|
lbm | 19 | 19 | 20 | 20 | 0 | 0 | 0 | 0 |
mcf | 23 | 23 | 26 | 26 | 0 | 0 | 0 | 0 |
libquantum | 80 | 80 | 156 | 156 | 0 | 0 | 0 | 0 |
bzip2 | 88 | 88 | 140 | 144 | 22 | 20 | 3 | 5 |
milc | 174 | 174 | 434 | 434 | 0 | 0 | 0 | 0 |
sjeng | 121 | 121 | 367 | 367 | 0 | 0 | 1 | 1 |
hmmer | 263 | 263 | 709 | 723 | 7 | 0 | 2 | 9 |
h264ref | 560 | 560 | 1,231 | 1,521 | 9 | 1 | 343 | 351 |
gobmk | 2,679 | 2,679 | 8,889 | 8,889 | 0 | 0 | 44 | 44 |
10.1.3 Client Analyses Implemented. We implemented mod-ref analysis and call-graph construction to measure the effectiveness of our points-to analysis. The mod-ref analysis computes interprocedural reference and modification side effects for each variable caused by a procedure on the callers of the procedure. The call graph represents caller-callee relationships between the procedures in a program. Standard compilers like GCC and LLVM construct call graphs for only direct calls and do not resolve calls through function pointers. We constructed a call graph that includes the effect of indirect calls using the points-to information computed for function pointers.
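Resolving indirect calls with the points-to sets of function pointers, and classifying them as monomorphic (a single target) or polymorphic (multiple targets) as reported earlier, can be sketched as follows. The function and parameter names are ours, not part of the implementation.

```python
def build_call_graph(direct_calls, indirect_calls, fp_pts):
    """direct_calls: set of (caller, callee) pairs.
    indirect_calls: set of (caller, fp_var) pairs for calls via function
    pointers. fp_pts: points-to sets computed for function pointers."""
    edges = set(direct_calls)
    mono, poly = [], []
    for caller, fp in sorted(indirect_calls):
        targets = fp_pts.get(fp, set())
        if len(targets) == 1:
            mono.append((caller, fp))   # indirect call with a unique target
        else:
            poly.append((caller, fp))   # zero or multiple possible targets
        for callee in targets:
            edges.add((caller, callee))
    return edges, mono, poly
```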
10.1.4 Comparison with Other Points-to Analyses. We also computed corresponding data for client analyses using static value flow analysis (SVF) [41]. SVF is used for comparison because its implementation is readily available. SVF is a static analysis framework implemented in LLVM that allows value-flow construction and context-insensitive pointer analysis to be performed in an iterative manner (sparse analysis performed in stages, from cheap overapproximate analysis to precise expensive analysis). It uses the points-to information from Andersen's analysis and constructs an interprocedural memory SSA (Static Single Assignment) form where def-use chains of both top-level (i.e., non-address-taken) and address-taken variables are included. The scalability and precision of the analysis is controlled by designing memory SSA that allows users to partition memory into a set of regions.
This section describes the evaluations made on SPEC CPU2006 benchmarks. The characteristics of the benchmark programs in terms of the number of procedures, the number of pointer assignments, and the number of call sites are given in Table 1.
We measured the data for the following two categories of evaluations:
| Program | Calls with Mod (GPG) | (SVF) | (RR) | Calls with Ref (GPG) | (SVF) | (RR) | Mods Across All Calls (GPG) | (SVF) | (RR) | Refs Across All Calls (GPG) | (SVF) | (RR) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
lbm | 2 | 6 | 0.33 | 7 | 7 | 1.00 | 3 | 6 | 0.50 | 12 | 16 | 0.75 |
mcf | 8 | 16 | 0.50 | 13 | 21 | 0.62 | 9 | 58 | 0.16 | 21 | 108 | 0.19 |
libquantum | 13 | 60 | 0.22 | 22 | 64 | 0.34 | 14 | 208 | 0.07 | 41 | 224 | 0.18 |
bzip2 | 16 | 30 | 0.53 | 61 | 181 | 0.34 | 30 | 150 | 0.20 | 36 | 128 | 0.28 |
milc | 31 | 63 | 0.49 | 18 | 94 | 0.19 | 36 | 228 | 0.16 | 30 | 650 | 0.05 |
sjeng | 32 | 42 | 0.76 | 39 | 70 | 0.56 | 93 | 291 | 0.32 | 128 | 455 | 0.28 |
hmmer | 32 | 152 | 0.21 | 126 | 207 | 0.61 | 173 | 896 | 0.19 | 1091 | 1384 | 0.79 |
h264ref | 183 | 204 | 0.90 | 178 | 402 | 0.44 | 2607 | 2722 | 0.96 | 4232 | 7342 | 0.58 |
gobmk | 105 | 622 | 0.17 | 261 | 681 | 0.38 | 10194 | 13426 | 0.76 | 5225 | 27842 | 0.19 |
Geometric mean | 0.39 | 0.41 | 0.27 | 0.27 | ||||||||
RR denotes the reduction ratio of GPG-based analysis over SVF-based analysis computed by dividing the counts for the GPG-based method by the counts for the SVF-based method. A value smaller than 1.00 indicates that GPG-based analysis is more precise than SVF-based analysis, and the smaller the value, the higher the precision. |
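The RR values and their geometric mean can be reproduced with the following trivial sketch (function names are ours). For example, for lbm, 2 calls with mod under GPG against 6 under SVF gives RR = 0.33, as in the table.

```python
from math import prod

def reduction_ratio(gpg_count, svf_count):
    """Counts for the GPG-based method divided by counts for SVF;
    values below 1.00 indicate higher precision of the GPG-based method."""
    return gpg_count / svf_count

def geometric_mean(ratios):
    """Geometric mean of a list of per-benchmark reduction ratios."""
    return prod(ratios) ** (1.0 / len(ratios))
```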
10.2.1 Comparison of GPG-Based and SVF Analyses. We compared the data for mod-ref analysis and call graph for GPG-based points-to analysis with that of SVF. However, the comparison is not straightforward, because the two implementations use two differently engineered intermediate representations of programs. The underlying compiler frameworks (GCC and LLVM) use different strategies for function inlining and function cloning (for creating specialized versions of the same functions), leading to a different number of procedures in the call graph for the same benchmark program. Although we suppressed the two optimizations in GCC with the appropriate flags, GCC continues to perform function inlining and cloning at a smaller scale, indicating that we do not have direct control over the IR. We therefore make the comparison only on the common part of the benchmark programs:
Program | No. of Proc. | No. of Stmts. | FSCS | FICI | FICS | SVF |
---|---|---|---|---|---|---|
lbm | 19 | 367 | 0.05 | 3.26 | 2.11 | 0.21 |
mcf | 23 | 484 | 0.63 | 8.13 | 7.39 | 0.92 |
libquantum | 80 | 396 | 0.12 | 3.99 | 2.42 | 0.28 |
bzip2 | 89 | 1,645 | 0.18 | 4.72 | 2.94 | 2.44 |
milc | 190 | 2,467 | 0.29 | 3.43 | 2.87 | 1.05 |
sjeng | 133 | 684 | 0.42 | 1.12 | 1.9 | 0.39 |
hmmer | 275 | 6,717 | 0.07 | 5.10 | 1.52 | 3.44 |
h264ref | 566 | 17,253 | 0.49 | 5.02 | 3.08 | 0.41 |
gobmk | 2,699 | 10,557 | 0.24 | 2.95 | 1.39 | 7.58 |
Geometric mean | 0.21 | 3.74 | 2.51 | 0.94 | ||
FSCS, FICI, FICS, and SVF analyses.
10.2.2 Data for Points-to Analysis Using GPGs. We describe our observations about the sizes of GPGs, GPG optimizations, and performance of the analysis. Data related to the time measurements are presented in Section 10.2.3. Section 10.3 discusses these observations by analyzing them. Our observations include the following:
Count of procedures in each bucket of percentages:

| Program | Dead GPUs (%): 0–20 | 21–40 | 41–60 | 61–80 | 81–100 | Empty GPBs Eliminated (%): 0–20 | 21–40 | 41–60 | 61–80 | 81–100 | GPBs Reduced Because of Coalescing (%): 0–20 | 21–40 | 41–60 | 61–80 | 81–100 | Back Edges Reduced Because of Coalescing (%): 0–20 | 21–40 | 41–80 | 81–100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
lbm | 4 | 0 | 0 | 0 | 0 | 4 | 0 | 15 | 0 | 0 | 19 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
mcf | 10 | 1 | 0 | 0 | 0 | 14 | 0 | 8 | 0 | 1 | 17 | 2 | 3 | 1 | 0 | 1 | 0 | 0 | 3 |
libquantum | 10 | 0 | 0 | 0 | 0 | 10 | 0 | 68 | 2 | 0 | 77 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
bzip2 | 11 | 0 | 0 | 0 | 0 | 23 | 0 | 58 | 0 | 0 | 75 | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
milc | 12 | 0 | 0 | 0 | 0 | 60 | 0 | 126 | 1 | 0 | 184 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
sjeng | 29 | 1 | 2 | 0 | 0 | 10 | 0 | 99 | 18 | 4 | 124 | 1 | 3 | 2 | 1 | 0 | 0 | 0 | 9 |
hmmer | 32 | 0 | 1 | 0 | 0 | 100 | 0 | 170 | 0 | 0 | 234 | 11 | 15 | 8 | 2 | 3 | 0 | 0 | 0 |
h264ref | 123 | 2 | 1 | 1 | 1 | 207 | 2 | 331 | 18 | 5 | 523 | 12 | 17 | 9 | 2 | 4 | 0 | 1 | 4 |
gobmk | 549 | 2 | 0 | 0 | 0 | 701 | 0 | 1,952 | 40 | 4 | 2,478 | 72 | 46 | 67 | 34 | 33 | 2 | 6 | 63 |
Geometric mean | 97.01% | 25.09% | 65.7% | 91.6% | |||||||||||||||
The geometric mean has been shown for the percentage of procedures in the buckets with the largest numbers. The percentages for dead GPU elimination are computed against a much smaller number of procedures (obtained by omitting the procedures that have zero GPUs).
| Program | Total No. of Proc. | No. of Stmts. | No. of Proc. with Zero GPUs | No. of Proc. Whose GPG is $\text{$\Delta $} _{\top }$ | No. of Proc. Containing Back Edges in the CFG | No. of Proc. Containing Back Edges in the GPG | No. of Queued GPUs Computed When GPU Compositions Were Postponed |
|---|---|---|---|---|---|---|---|
lbm | 19 | 367 | 15 | 0 | 10 | 0 | 0 |
mcf | 23 | 484 | 12 | 0 | 20 | 1 | 115 |
libquantum | 80 | 396 | 70 | 0 | 36 | 0 | 0 |
bzip2 | 89 | 1,645 | 70 | 8 | 43 | 0 | 0 |
milc | 190 | 2,467 | 175 | 3 | 92 | 0 | 0 |
sjeng | 133 | 684 | 99 | 2 | 65 | 0 | 0 |
hmmer | 275 | 6,717 | 237 | 5 | 153 | 0 | 9 |
h264ref | 566 | 17,253 | 435 | 3 | 308 | 8 | 15 |
gobmk | 2,699 | 10,557 | 2,146 | 2 | 464 | 45 | 3 |
| Program | No. of GPBs: 0 | 1–3 | 4–10 | 11–25 | 26–35 | 36+ | No. of GPUs: 0 | 1–3 | 4–6 | 7–10 | 11–30 | 31–50 | 51–70 | 71+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
lbm | 0 | 18 | 1 | 0 | 0 | 0 | 15 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
mcf | 0 | 22 | 1 | 0 | 0 | 0 | 12 | 6 | 2 | 2 | 0 | 0 | 0 | 1 |
libquantum | 0 | 80 | 0 | 0 | 0 | 0 | 70 | 8 | 2 | 0 | 0 | 0 | 0 | 0 |
bzip2 | 8 | 79 | 2 | 0 | 0 | 0 | 70 | 11 | 5 | 3 | 0 | 0 | 0 | 0 |
milc | 3 | 186 | 1 | 0 | 0 | 0 | 175 | 7 | 6 | 2 | 0 | 0 | 0 | 0 |
sjeng | 2 | 130 | 1 | 0 | 0 | 0 | 99 | 26 | 3 | 3 | 2 | 0 | 0 | 0 |
hmmer | 5 | 253 | 13 | 3 | 1 | 0 | 237 | 29 | 4 | 5 | 0 | 0 | 0 | 0 |
h264ref | 3 | 544 | 15 | 4 | 0 | 0 | 435 | 81 | 20 | 8 | 17 | 3 | 1 | 1 |
gobmk | 2 | 2,514 | 150 | 9 | 0 | 24 | 2,146 | 75 | 16 | 361 | 63 | 37 | 1 | 0 |
Geo. Mean | 95.06% | 77.63% | 18.08% | |||||||||||
The geometric mean of percentages of procedures with 1 to 3 GPBs is 95.06%, whereas that of procedures with 0 GPUs and 1 to 10 GPUs is 77.63% and 18.08%, respectively. Some procedures have zero GPBs because they have an exit node with no successors. |
A: Procedure count for different buckets of the ratio of GPBs in the GPG after inlining to BBs in the CFG

B: Procedure count for different buckets of the ratio of GPBs in the optimized GPG to BBs in the CFG

C: Procedure count for different buckets of the ratio of GPBs in the GPG before and after optimizations

| Program | A: 0–20 | 21–40 | 41–60 | 61–80 | 81+ | B: 0–20 | 21–40 | 41–60 | 61–80 | 81+ | C: 0–20 | 21–40 | 41–60 | 61–80 | 81+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
lbm | 9 | 3 | 3 | 1 | 3 | 11 | 5 | 3 | 0 | 0 | 2 | 1 | 15 | 0 | 1 |
mcf | 14 | 5 | 2 | 1 | 1 | 22 | 1 | 0 | 0 | 0 | 3 | 5 | 14 | 1 | 0 |
libquantum | 42 | 14 | 12 | 1 | 11 | 56 | 13 | 11 | 0 | 0 | 26 | 17 | 36 | 0 | 1 |
bzip2 | 53 | 16 | 10 | 4 | 6 | 71 | 12 | 6 | 0 | 0 | 13 | 4 | 70 | 1 | 1 |
milc | 115 | 20 | 14 | 7 | 34 | 134 | 22 | 34 | 0 | 0 | 10 | 6 | 169 | 1 | 4 |
sjeng | 87 | 17 | 7 | 3 | 19 | 105 | 9 | 19 | 0 | 0 | 19 | 13 | 99 | 1 | 1 |
hmmer | 205 | 34 | 18 | 1 | 17 | 239 | 19 | 16 | 0 | 1 | 62 | 32 | 164 | 8 | 9 |
h264ref | 401 | 71 | 49 | 10 | 35 | 476 | 51 | 38 | 1 | 0 | 46 | 79 | 412 | 17 | 12 |
gobmk | 2,336 | 275 | 24 | 6 | 58 | 2,610 | 29 | 56 | 1 | 3 | 210 | 163 | 2,038 | 235 | 53 |
Geo. mean | 63.3% | 79.13% | 69.31% | ||||||||||||
The geometric mean has been shown for the percentages of procedures in buckets with the largest numbers. |
A: Procedure count for different buckets of the ratio of GPUs in the GPG after inlining to statements in the CFG

B: Procedure count for different buckets of the ratio of GPUs in the optimized GPG to statements in the CFG

C: Procedure count for different buckets of the ratio of GPUs in a GPG before and after optimizations

| Program | A: 0–20 | 21–40 | 41–60 | 61–80 | 81+ | B: 0–20 | 21–40 | 41–60 | 61–80 | 81+ | C: 0–20 | 21–40 | 41–60 | 61–80 | 81+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
lbm | 16 | 3 | 0 | 0 | 0 | 19 | 0 | 0 | 0 | 0 | 18 | 0 | 0 | 1 | 0 |
mcf | 21 | 0 | 0 | 1 | 1 | 23 | 0 | 0 | 0 | 0 | 17 | 0 | 3 | 0 | 3 |
libquantum | 75 | 4 | 0 | 0 | 1 | 80 | 0 | 0 | 0 | 0 | 47 | 1 | 1 | 0 | 31 |
bzip2 | 89 | 0 | 0 | 0 | 0 | 89 | 0 | 0 | 0 | 0 | 85 | 0 | 0 | 0 | 4 |
milc | 189 | 1 | 0 | 0 | 0 | 190 | 0 | 0 | 0 | 0 | 185 | 0 | 0 | 0 | 5 |
sjeng | 131 | 0 | 2 | 0 | 0 | 133 | 0 | 0 | 0 | 0 | 105 | 0 | 1 | 2 | 25 |
hmmer | 273 | 0 | 1 | 0 | 1 | 275 | 0 | 0 | 0 | 0 | 266 | 6 | 1 | 0 | 2 |
h264ref | 540 | 12 | 10 | 1 | 3 | 563 | 2 | 1 | 0 | 0 | 505 | 3 | 1 | 1 | 56 |
gobmk | 2,688 | 4 | 2 | 0 | 5 | 2,697 | 1 | 1 | 0 | 0 | 2,189 | 0 | 4 | 7 | 499 |
Geo. mean | 95.59% | 99.93% | 84.14% | ||||||||||||
The geometric mean has been shown for the percentages of procedures in buckets with the largest numbers. |
A: Procedure count for different buckets of the ratio of control flow edges in the GPG after inlining to those in the CFG

B: Procedure count for different buckets of the ratio of control flow edges in the optimized GPG to those in the CFG

C: Procedure count for different buckets of the ratio of control flow edges in a GPG before and after optimizations

| Program | A: 0–20 | 21–40 | 41–60 | 61–80 | 81+ | B: 0–20 | 21–40 | 41–60 | 61–80 | 81+ | C: 0–20 | 21–40 | 41–60 | 61–80 | 81+ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
lbm | 13 | 4 | 2 | 0 | 0 | 19 | 0 | 0 | 0 | 0 | 18 | 0 | 0 | 0 | 1 |
mcf | 21 | 1 | 1 | 0 | 0 | 23 | 0 | 0 | 0 | 0 | 16 | 4 | 2 | 1 | 0 |
libquantum | 61 | 8 | 2 | 0 | 9 | 80 | 0 | 0 | 0 | 0 | 78 | 1 | 0 | 0 | 1 |
bzip2 | 72 | 9 | 2 | 2 | 4 | 89 | 0 | 0 | 0 | 0 | 79 | 7 | 1 | 1 | 1 |
milc | 180 | 3 | 5 | 0 | 2 | 189 | 0 | 1 | 0 | 0 | 182 | 1 | 3 | 0 | 4 |
sjeng | 124 | 5 | 1 | 0 | 3 | 133 | 0 | 0 | 0 | 0 | 130 | 2 | 0 | 0 | 1 |
hmmer | 246 | 24 | 3 | 0 | 2 | 274 | 1 | 0 | 0 | 0 | 252 | 8 | 1 | 5 | 9 |
h264ref | 509 | 26 | 13 | 1 | 17 | 562 | 0 | 2 | 1 | 1 | 516 | 15 | 14 | 11 | 10 |
gobmk | 2,572 | 72 | 31 | 1 | 23 | 2,693 | 1 | 2 | 1 | 2 | 2,336 | 43 | 92 | 162 | 66 |
Geo. mean | 86.13% | 99.8% | 89.97% | ||||||||||||
The geometric mean has been shown for the percentages of procedures in buckets with the largest numbers. |
10.2.3 Time Measurements. We measured the overall time (Table 11). We also measured the time taken by the SVF points-to analysis. We observe that our analysis takes less than 16 minutes on 445.gobmk, which is a large benchmark with 158 kLoC. Our current implementation does not scale beyond that. SVF is faster than all variants of points-to analysis that we implemented. This is expected because SVF is context insensitive.
| Program | FSCS (with Blocking) Time (s) | SVF Time (s) |
|---|---|---|
lbm | 0.070 | 0.01 |
mcf | 8.690 | 0.062 |
libquantum | 1.514 | 0.031 |
bzip2 | 1.066 | 0.534 |
milc | 1.133 | 0.236 |
sjeng | 3.702 | 0.131 |
hmmer | 4.961 | 2.032 |
h264ref | 73.779 | 1.852 |
gobmk | 938.949 | 17.959 |
Our experiments and empirical data lead us to some important lessons, described next:
We learned these lessons the hard way in the situations described in the rest of this section.
10.3.1 Handling a Large Size of Context-Dependent Information. Some GPGs had a large amount of context-dependent information (i.e., GPUs with upwards-exposed versions of variables), and the GPGs could not be optimized much. This caused the size of the caller GPGs to grow significantly, threatening the scalability of our analysis. Hence, we devised a heuristic threshold $t$ representing the number of GPUs containing upwards-exposed versions of variables. This threshold is used as follows. Let a GPG contain $x$ GPUs containing upwards-exposed versions:
This keeps the size of the caller GPG small and simultaneously allows reduction of the context-dependent GPUs in the calling context. Once all GPUs are reduced to classical points-to edges, we effectively get the procedure summary of the original callee procedure for that call chain. Since the reduction of context-dependent GPUs is different for different calling contexts, the process needs to be repeated for each call chain. This is similar to the top-down approach in which a procedure is analyzed multiple times. We used a threshold of 80% context-dependent GPUs in a GPG containing more than 10 GPUs. Thus, 8 context-dependent GPUs out of a total of 11 GPUs was below our threshold; 9 context-dependent GPUs out of a total of 9 GPUs also did not cross it, because such a GPG does not contain more than 10 GPUs.
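The threshold check described above can be sketched as follows (the function and parameter names are ours; `t = 0.80` and the 10-GPU floor are the values stated in the text):

```python
def exceeds_threshold(total_gpus, ctx_dep_gpus, t=0.80, min_gpus=10):
    """A symbolic (unoptimized) GPG is used only when the share of
    context-dependent GPUs crosses t in a sufficiently large GPG."""
    return total_gpus > min_gpus and ctx_dep_gpus > t * total_gpus
```

With these values, 8 context-dependent GPUs out of 11 do not cross the threshold (8 is at most 0.8 × 11), and 9 out of 9 do not either, because the GPG has no more than 10 GPUs.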
Note that in our implementation, we discovered very few cases (and only in large benchmarks) where the threshold was actually exceeded. The number of call chains that required multiple traversals is in single digits, and they are not very long. The important point to note is that we got the desired scalability only when we introduced this small twist of using symbolic GPGs.
10.3.2 Handling Arrays and SSA Form. Pointers to arrays were weakly updated, and hence we realized early on that maintaining this information flow sensitively prevented the analysis from scaling. This was particularly true for large arrays with static initializations. Similarly, GPUs involving SSA versions of variables were not required to be maintained flow sensitively. This allowed us to reduce the propagation of data across control flow without any loss in precision.
10.3.3 Making Coalescing More Effective. Unlike dead GPU elimination, coalescing proved to be a very significant optimization for boosting the scalability of the analysis. The points-to analysis failed to scale in the absence of this optimization. However, this optimization was effective (i.e., coalesced many GPBs) only when we brought in the concept of types. In cases where the data dependence between the GPUs was unknown because of the dependency on the context information, we used type-based non-aliasing to enable coalescing.
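The type-based non-aliasing check that enables coalescing can be sketched as follows. This is our own simplification: locations are represented only by their declared types, and locations of distinct types are assumed not to alias (in the style of C's strict-aliasing rules).

```python
def types_may_alias(t1, t2):
    """Heuristic: locations of distinct declared types are assumed
    non-aliasing; identical types may alias."""
    return t1 == t2

def can_coalesce(first_writes, second_reads):
    """Two adjacent GPBs may be merged when no type of a location
    written by the first can alias the type of a location read by the
    second (i.e., no data dependence is possible between them)."""
    return all(not types_may_alias(w, r)
               for w in first_writes for r in second_reads)
```

For instance, a GPB writing only through `int *` pointers can be coalesced with a successor reading only through `float *` pointers, but not with one that also reads through `int *`.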
Many investigations reported in the literature have described the popular points-to analysis methods and have presented a comparative study of the methods with respect to scalability and precision [14, 15, 17, 24, 35, 38]. Instead of discussing these methods, we devise a metric of features that influence the precision and efficiency/scalability of points-to analysis. This metric can be used for identifying important characteristics of any points-to analysis at an abstract level.
Figure 20 presents our metric. At the top level, we have language features and analysis features. The analysis features have been divided further based on whether their primary influence is on the precision or efficiency/scalability of points-to analysis. The categorization of language features is obvious. Here we describe our categorization of analysis features.
11.1.1 Features Influencing Precision. Two important sources of imprecision in an analysis are approximation of data dependence and abstraction of data.
11.1.2 Features Influencing Efficiency and Scalability. Different methods use different techniques to achieve scalability. We characterize them based on the following three criteria:
11.1.3 Interaction Between the Features. In this section, we explain the interaction between the features indicated by the arrows in Figure 20:
11.1.4 Our Work in the Context of the Big Picture of Points-to Analysis. GPG-based points-to analysis preserves data dependence by being flow- and context-sensitive. It is path insensitive and uses SSA form for non-address-taken local variables. Unlike the approaches that overapproximate control flow indiscriminately, we discard control flow as much as possible, but only when there is a guarantee that it does not overapproximate data dependence.
Our analysis is field sensitive. It overapproximates arrays by treating all of their elements alike. We use context-insensitive allocation-site-based abstraction for representing heap locations and use $k$-limiting for summarizing the unbounded accesses of the heap where allocation sites are not known.
As in every bottom-up approach, points-to information is computed once all of the information in the calling context is available. Our analysis computes points-to information for all pointers.
There is a large body of work on flow-insensitive or context-insensitive points-to (or alias) analysis. In addition, the literature abounds with investigations exploring the analysis of Java programs. Finally, a large number of investigations focus on demand-driven methods. We restrict ourselves to exhaustive flow- and context-sensitive points-to analysis of primarily C programs and mention Java-related works that are directly related to our ideas.
Most of the top-down approaches to flow- and context-sensitive pointer analysis of C programs have not scaled [8, 21, 31], with the largest program successfully analyzed by them consisting of 35 kLoC [21]. It is no surprise, then, that the literature on flow- and context-sensitive points-to analyses is dominated by bottom-up approaches. Our work also belongs to this category, and hence we focus on such approaches in this section by classifying them into the MTF or STF approach (Section 2.3).
11.2.1 MTF Approach for Bottom-Up Summaries. In this approach [16, 43, 46, 47], control flow between memory updates need not be recorded. This is because the data dependence between memory updates (even those that access unknown pointers) can be resolved using either the alias information or the points-to information from the calling context. These approaches construct symbolic procedure summaries by computing preconditions and corresponding postconditions (in terms of aliases or points-to information). A calling context is matched against a precondition, and the corresponding postcondition gives the result.
Two approaches that stand out among these from the viewpoint of scalability are bootstrapping [16] and level-by-level analysis [46]. The bootstrapping approach partitions the pointers using flow-insensitive analyses such that each subset is an equivalence class with respect to alias information, and then analyses are performed in a cascaded fashion as a series $A_0, A_1, \ldots , A_k$, where analysis $A_i$ uses the points-to information computed by analysis $A_{i-1}$. In addition, the analyses of different equivalence classes at any level can be performed in parallel. This process involves constructing MTF procedure summaries using a top-down traversal of the call graph to compute alias information using a flow-sensitive, context-insensitive (FSCI) analysis. This may cause some imprecision in the computed summaries. However, it is not clear whether the precision loss is significant, as there are no formal guarantees of precision, nor does the work provide an empirical evaluation of precision—the focus being solely on scalability. The analysis is reported to scale to 128 kLoC.
Level-by-level analysis [46] constructs a procedure summary with multiple interprocedural conditions. It matches the calling context against these conditions and chooses the appropriate summary for the given context. This method partitions the pointer variables in a program into different levels based on Steensgaard's points-to graph for the program. It constructs a procedure summary for each level (starting with the highest level), using the points-to information from the previous level. The method constructs interprocedural def-use chains using an extended SSA form. When used in conjunction with conditions based on points-to information from calling contexts, the chains become context sensitive. This method is claimed to scale to 238 kLoC; however, as with the bootstrapping method, there are no formal guarantees or empirical evaluation of precision.
Since these approaches depend on the number of aliases/points-to pairs in the calling contexts, the procedure summaries are not context independent. Thus, this approach may not be useful for constructing summaries for library functions, which have to be analyzed without the benefit of specific calling contexts. Saturn [12] creates sound summaries, but they may not be precise across applications because of their dependence on context information.
Relevant context inference [5] constructs a procedure summary by inferring the relevant potential aliasing between unknown pointees that are accessed in the procedure. Although it does not use the information from the context, it has multiple versions of the summary depending on the alias and the type context. This analysis could be inefficient if the inferred possibilities of aliases and types do not actually occur in the program. It also overapproximates the alias and the type context as an optimization, thereby being only partially context sensitive.
11.2.2 STF Approach for Bottom-Up Summaries. This approach makes no assumptions about the calling contexts [4, 6, 23, 25, 26, 27, 33, 39, 42, 48] and uses multiple placeholders for distinct accesses of the pointees of the same pointer (Section 2.3). This tends to increase the size of the resulting procedure summaries. The problem is mitigated by choosing carefully where placeholders are required [39, 42], by employing optimizations that merge placeholders [26], by maintaining restricted control flow [4], by overapproximating the control flow through flow insensitivity [23], or by a combination of these [27]. In some cases, the overapproximation arises only in the application of procedure summaries even though they are constructed flow sensitively [27]. Many of these approaches scale to millions of lines of code.
Although attempts to minimize the number of placeholders prohibit the killing of points-to information for pointer variables in C/C++ programs, this has little adverse impact on Java programs. This is because all local variables in Java have SSA versions, thanks to the absence of indirect assignments to variables (there is no address-of operator). In addition, Java programs have few static variables, and the absence of kill for them may not matter much.
Lattner et al. [23] proposed a heap-cloning-based context-sensitive points-to analysis. To achieve a scalable implementation, they made several algorithmic and engineering design choices, including a flow-insensitive, unification-based analysis and sacrificing context sensitivity across recursive procedures.
Cheng and Hwu [6] proposed a modular interprocedural pointer analysis based on access paths for C programs. They illustrate that access paths can reduce the overhead of representing context-sensitive transfer functions. The abstraction of access paths is similar to the indirection lists used in our approach.
Access fetch graphs [4] are another representation of procedure summaries for points-to analysis. This approach presents two versions of a summary: a flow-insensitive version and a flow-aware version that augments the flow-insensitive version by encoding control flow using a total order. The latter is sound and more precise than the flow-insensitive version but less precise than a fully flow-sensitive version.
Note that the MTF approach is precise even though no control flow in the procedure summaries is recorded because the information from calling context obviates the need for control flow.
11.2.3 The Hybrid Approach. Hybrid approaches use customized summaries and combine the top-down and bottom-up analyses to construct summaries [47]. This choice is controlled by the number of times a procedure is called. If this number exceeds a fixed threshold, a summary is constructed using the information of the calling contexts that have been recorded for that procedure. A new calling context may lead to generating a new precondition and hence a new summary. If the threshold is set to zero, then a summary is constructed for every procedure and hence we have a pure bottom-up approach. If the threshold is set to a very large number, then we have a pure top-down approach and no procedure summary is constructed.
Additionally, we can set a threshold on the size of procedure summary or the percentage of context-dependent information in the summary or a combination of these choices. In our implementation, we used the percentage of context-dependent information as a threshold—when a procedure has a significant amount of context-dependent information, it is better to introduce a small touch of top-down analysis (Section 10.3.1). If this threshold is set to 0%, our method becomes a purely bottom-up approach; if it is set to 100%, our method becomes a top-down approach.
Constructing compact procedure summaries for flow- and context-sensitive points-to analysis seems hard because it (a) must model accesses of unknown pointees that are defined in the callers, (b) must remain sound and precise for the different aliasing patterns that may hold in the calling contexts, and (c) must incorporate the effects of the summaries of the callees.
These issues have been handled in the past as follows. The first has been handled by modeling accesses of unknown pointees using placeholders; however, this may require a large number of placeholders. The second has been handled by constructing multiple versions of a procedure summary for different aliases in the calling contexts. The third can be handled only by inlining the summaries of the callees, which can increase the size of a summary exponentially, thereby hampering the scalability of the analysis.
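For example (a schematic sketch in our own notation, not any particular system's API), a bottom-up summary for a procedure body containing `x = *p` cannot name `p`'s pointees and must introduce placeholders for them:

```python
# Schematic sketch of the placeholder technique: for "x = *p", the pointees
# of p are unknown in the procedure, so fresh placeholders phi1 and phi2
# stand for the first- and second-level pointees of p. The names
# summarize_load, phi1, phi2 are illustrative, not from the literature.
def summarize_load(lhs, rhs_pointer, fresh):
    """Summarize x = *p: assume p -> phi1 and phi1 -> phi2 hold in the
    callers, and record the summary effect x -> phi2."""
    phi1, phi2 = fresh(), fresh()
    assumptions = [(rhs_pointer, phi1), (phi1, phi2)]
    effect = [(lhs, phi2)]
    return assumptions, effect

counter = iter(range(1, 100))
fresh = lambda: f"phi{next(counter)}"
print(summarize_load("x", "p", fresh))
# -> ([('p', 'phi1'), ('phi1', 'phi2')], [('x', 'phi2')])
```

Since every distinct access of an unknown pointee may need its own placeholder, the number of placeholders can grow quickly, which is precisely the size problem noted above.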
We handled the first issue by proposing the concept of GPUs that track indirection levels. Simple arithmetic on indirection levels allows composition of GPUs to create new GPUs with smaller indirection levels; this reduces them progressively to classical points-to edges.
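To make this arithmetic concrete: in the paper's notation, the statements x = &y, x = y, x = *y, and *x = y give GPUs with indirection levels (1|0), (1|1), (1|2), and (2|1), respectively. The following is a minimal sketch of one such composition in our own encoding (the tuple layout and function name are ours; the rule shown is a simplified rendering of composition through a pivot):

```python
# Hypothetical sketch of GPU composition on indirection levels (indlev).
# A GPU (src, i, dst, j) encodes the paper's x -(i|j)-> y, a generalized
# memory update between src and dst with indirection levels i and j.
def ts_compose(consumer, producer):
    """Compose c = x -(i|j)-> y with p = y -(k|l)-> z through pivot y,
    giving x -(i | l+(j-k))-> z, provided j >= k so that the consumer's
    read of y is covered by the producer's definition of y."""
    cs, ci, cd, cj = consumer
    ps, pk, pd, pl = producer
    assert cd == ps and cj >= pk, "pivot mismatch or insufficient indlev"
    return (cs, ci, pd, pl + (cj - pk))

# y = &a gives y -(1|0)-> a; x = *y gives x -(1|2)-> y.
# Composition reduces the indirection level of the consumer:
p = ("y", 1, "a", 0)
c = ("x", 1, "y", 2)
print(ts_compose(c, p))  # -> ('x', 1, 'a', 1), i.e., the copy x = a
```

The result has a strictly smaller indirection level than the consumer, which is the sense in which repeated composition progressively reduces GPUs toward classical points-to edges of the form x -(1|0)-> y.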
To handle the second issue, we maintain control flow within a GPG and perform optimizations of strength reduction and control flow minimization. Together, these optimizations reduce the indirection levels of GPUs, eliminate data dependences between GPUs, and significantly reduce control flow. These optimizations also mitigate the impact of the third issue.
We achieved the preceding by devising novel dataflow analyses such as two variants of reaching GPUs analysis and coalescing analysis. Interleaved call inlining and strength reduction of GPGs facilitated a novel optimization that computes flow- and context-sensitive points-to information in the first phase of a bottom-up approach. This obviates the need for the usual second phase.
Our measurements on SPEC benchmarks show that GPGs are small enough to scale fully flow- and context-sensitive exhaustive points-to analysis to C programs as large as 158 kLoC. Our work differs from most other investigations exploring scalable exhaustive flow- and context-sensitive points-to analysis of C in the following ways:
Two important takeaways from our empirical evaluation are the following:
Our empirical measurements show that most of the GPGs are acyclic even if they represent procedures that have loops or are recursive.
As a possible direction of future work, it would be useful to explore the possibility of scaling the implementation to larger programs; we suspect that this would be centered around examining the control flow in the GPGs and optimizing it still further. In addition, it would be interesting to explore the possibility of restricting GPG construction to live pointer variables [21] for scalability. It would also be useful to extend the scope of the implementation to C++ and Java programs.
The concept of GPG provides a useful abstraction of memory and memory transformers involving pointers by directly modeling load, store, and copy of memory addresses. Any client program analysis that uses these operations may be able to use GPGs by combining them with the original abstractions of the analysis. In particular, we expect to integrate this method into an in-house bounded model checking infrastructure being developed at IIT Bombay.
The appendix is available at https://github.com/PritamMG/GPG-based-Points-to-Analysis.
We would like to thank the anonymous referees for their incisive comments on the earlier draft of the article. In particular, their insistence on a soundness proof forced us to investigate more deeply. In the process, many concepts became more rigorous, and we could formally show that our method is equivalent to a top-down flow- and context-sensitive classical points-to analysis, proving both soundness and precision.
We would like to thank Akshat Garg and N. Venkatesh for the empirical measurements of points-to information computed by SVF analysis, as well as the data for the client analyses.
P. Gharat was partially supported by a TCS Research Fellowship.
Authors’ addresses: P. M. Gharat and U. P. Khedker (corresponding author), Indian Institute of Technology Bombay, India; emails: pritamg@cse.iitb.ac.in, uday@cse.iitb.ac.in; A. Mycroft, University of Cambridge, UK; email: Alan.Mycroft@cl.cam.ac.uk.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
©2020 Association for Computing Machinery.
0164-0925/2020/05-ART8 $15.00
DOI: https://doi.org/10.1145/3382092
Publication History: Received January 2018; revised October 2019; accepted January 2020