
Reordering Functions in Mobile Apps for Reduced Size and Faster Start-Up

Published: 10 June 2024

Abstract

Function layout, also known as function reordering or function placement, is one of the most effective profile-guided compiler optimizations. By reordering functions in a binary, compilers can improve the performance of large-scale applications or reduce the compressed size of mobile applications. Although the technique has been extensively studied in the context of large-scale binaries, no study has thoroughly investigated function layout algorithms on mobile applications.
In this article, we develop the first principled solution for optimizing function layouts in the mobile space. To this end, we identify two key optimization goals: reducing the compressed code size and improving the cold start-up time of a mobile application. Then, we propose a formal model for the layout problem, whose objective closely matches our goals, and a novel algorithm for optimizing the layout. The method is inspired by the classic balanced graph partitioning problem. We have carefully engineered and implemented the algorithm in an open-source compiler, Low-level Virtual Machine (LLVM). An extensive evaluation of the new method on large commercial mobile applications demonstrates improvements in start-up time and compressed size compared to the state-of-the-art approach.

1 Introduction

Mobile applications have become an essential part of everyday life, making it crucial to improve their speed, size, and reliability. Profile-guided optimization (PGO) is a critical component in modern compilers for improving the performance and size of applications; it enables the development and delivery of new app features for mobile devices that have limited storage and low memory. The technique, also known as feedback-driven optimization, leverages the program’s dynamic behavior to generate optimized applications. Currently, PGO is a standard feature in most commercial and open-source compilers.
Modern PGO has been successful in speeding up server workloads [7, 18, 41] by providing a double-digit percentage boost in performance. This is accomplished through a combination of several compiler optimizations, such as function inlining and code layout. PGO relies on execution profiles of a program, such as the execution frequencies of basic blocks and function invocations, to guide compilers in selectively and efficiently optimizing critical paths. Typically, server-side PGO aims to improve CPU and cache utilization during the steady state of the program execution, resulting in higher server throughput. Applying PGO for mobile applications poses a unique challenge, as mobile applications are largely I/O bound and lack a well-defined steady-state performance due to their user-interactive nature. Instead, the download speed and the launch time of an app are crucial to its success, as they directly impact user experience, and therefore, user retention [5, 35].
In this article, we revisit a classic PGO technique, function layout, and show how to successfully apply it in the context of mobile applications. We emphasize that most of the earlier compiler optimizations focus on a single objective, such as the performance or the size of a binary. However, function layout might impact multiple key metrics of a mobile application. We show how to place functions in a binary to simultaneously improve its (compressed) size and start-up performance. The former objective is directly related to the app download speed and has been extensively discussed in recent works on compiler optimizations for mobile applications [5, 32, 33, 35, 46, 48]. The latter received considerably less attention but nevertheless is of prime importance in the mobile space [10, 20, 21, 37].
Function layout, along with basic block reordering and inlining, is one of the most impactful PGOs. The seminal work of Pettis and Hansen [43] introduced a heuristic for function placement that co-locates functions frequently executed together. Improving the code locality reduces instruction translation lookaside buffer (I-TLB) misses, which results in an optimized steady-state performance of large-scale binaries. The follow-up work of Ottoni and Maher [40] is based on the same model and further improved the placement scheme by considering the performance of the processor instruction cache (I-cache). The two heuristics are utilized in the majority of modern compilers and binary optimizers [30, 40, 41, 50]. However, such locality-based optimizations are not used in mobile development, and the corresponding layout algorithms have not been thoroughly studied. One primary reason is that improving the utilization of instruction caches does not affect the key mobile app metrics, including size, download speed, or responsiveness. The recent work of Lee, Hoag, and Tillmann [32] is, to the best of our knowledge, the only study that mentions a technique for function placement in native mobile applications. The work lacks, however, a thorough explanation of why the heuristic impacts the metrics of interest and does not provide an implementation that scales to the largest applications.
With this in mind, we initiate a formal study of function layout algorithms in the context of mobile applications. We provide the first comprehensive investigation of various reordering techniques. First, in Section 1.1, we explain how function layout impacts the compressed app size, which in turn affects its download speed. Then, in Section 1.2, we describe how an optimized function placement can improve start-up time. Finally, Section 1.3 highlights our main contributions, a unified optimization model to tackle these two seemingly unrelated objectives and a novel algorithm for the problem based on the recursive balanced graph partitioning.

1.1 Function Layout for App Download Speed

As mobile apps continue to grow rapidly, reducing the binary size is crucial for application developers [23, 32, 35]. Smaller apps can be downloaded faster, which directly impacts user experience [45]. For example, a recent study [5] establishes a strong correlation between app size and user engagement. Furthermore, mobile app distribution platforms may impose size limitations for downloads that use cellular data. For example, in the Apple App Store, users will not receive timely updates that include critical security improvements if an app’s size exceeds a certain threshold, unless they are connected to a Wi-Fi network.
Mobile apps are distributed to users in a compressed form via mobile app platforms. Using compression schemes tailored to a specific app could lead to smaller apps. Unfortunately, application developers typically do not have control over the compression technique used by the platforms. However, Lee et al. [32] demonstrated that modifying the function layout of a binary can lead to gains in compressed size. Specifically, co-locating “similar” functions in the binary can improve the compression ratio achieved by popular compression algorithms such as ZSTD [58] or LZFSE [57]. A similar technique is used in a bytecode Android optimizer, Redex [22].
Why does function layout affect compression ratios? Most modern lossless compression tools rely on the Lempel-Ziv (LZ) scheme [56]. Such algorithms try to identify long repeated sequences in the data and substitute them with pointers to their previous occurrences. If the pointer is represented using fewer bits than the actual data, then the substitution results in a compressed-size win. That is, the shorter the distance between the repeated sequences, the higher the compression ratio. To make the computation effective, LZ-based algorithms search for common sequences inside a sliding window, which is typically much shorter than the actual data. Therefore, function layouts in which repeated instructions are grouped together lead to smaller (compressed) mobile apps; see Figure 1.
Fig. 1. Placing similar (same-patterned) functions nearby in the binary leads to higher compression rates achieved by Lempel-Ziv algorithms. Functions are considered similar when they share common sequences of instructions that can be encoded by short references.

1.2 Function Layout for App Launch Time

Start-up time is one of the key metrics for mobile applications, since a quick launch ensures that users have a good first impression [54]. According to a study in [37], \(20\%\) of users abandon an app after one use, and \(80\%\) of users give poorly performing apps at most three chances before uninstalling them. Start-up time is the time between a click on an application icon and the display of the first frame after rendering. There are several start-up scenarios: cold start, warm start, and hot start [10, 20, 21]. Switching back and forth between different apps leads to a hot/warm start and typically does not incur significant delays. In contrast, starting an app from scratch or resuming it after a memory-intensive process is referred to as cold start. Our focus is on improving the cold start scenario, which is usually the key performance metric.
Unlike server workloads, where code layout algorithms optimize the cache utilization [39, 40, 41], start-up performance is mostly dictated by memory page faults [12]. When an app is launched, its code is transferred from the permanent storage device to the main memory. Function layout can affect the performance, because the transfer happens at the granularity of memory pages. As illustrated in Figure 2, interleaving cold functions that are never executed with hot functions results in more memory pages being fetched from the storage device. While simply grouping hot functions in the binary is an attractive solution, we note that some mobile apps have a user base of billions of daily active users across a wide range of devices and platforms. As a result, optimizing the layout for a particular usage scenario can lead to a suboptimal performance for other scenarios. The main challenge is to produce a single function layout that optimizes the start-up performance across all use cases.
Fig. 2. Grouping hot (round) functions together, separately from cold (rectangular) functions, leads to a reduction in page faults. The hotness of the functions and their order of execution might depend on the usage scenario (trace), and the task is to find a single optimized function layout.

1.3 Contributions

We model the problem of computing an optimized function layout for mobile apps as the balanced graph partitioning problem [15]. This approach enables a single algorithm to enhance both app start-up time (which impacts user experience) and app size (which impacts download speed). However, while the layout algorithm is the same for the two objectives, it operates with different datasets collected during profiling. For the sake of clarity, we call the optimizations Balanced Partitioning for Start-up Optimization (bps) and Balanced Partitioning for Compression Optimization (bpc). Algorithm 1 outlines our implementation.
The former optimization, bps, is applied to hot functions in the binary that are executed during app start-up, while the latter optimization, bpc, is applied to the remaining cold functions. In our experiments, we found that approximately \(15\%\) of functions are hot, allowing us to improve the overall start-up performance while simultaneously reordering most of the functions in a “compression-friendly” manner. Compared to the prior work [32], which describes a heuristic for function placement in mobile apps, we achieve an average start-up time improvement of \(4\%\) and a compressed size reduction of up to \(2\%\), while speeding up the function layout phase by 30 times for SocialApp (iOS), one of the largest mobile apps in the world. We summarize the contributions of the article as follows.
We formally define the function layout problem in the context of mobile applications. To this end, we identify and formalize two optimization objectives, based on the application start-up time and the compressed size.
Next, we present the Balanced Partitioning algorithm, which takes as input a bipartite graph between function and utility vertices, and outputs an order of the function vertices. We demonstrate how to reduce the objectives of bpc and bps to an instance of the balanced graph partitioning problem.
Finally, we extensively evaluate the compressed size, the start-up performance, and the runtime of the new algorithms with large commercial applications on iOS and Android platforms and an open-source standalone binary.
The rest of the article is organized as follows. Section 2 builds an optimization model for compression and start-up performance, respectively. Then, Section 3 introduces the recursive balanced graph partitioning algorithm, which forms the foundation for effectively solving the two optimization problems. Next, in Section 4, we describe our implementation of the technique in an open-source compiler, LLVM. Section 5 presents an evaluation on real-world mobile applications. We conclude the article with a discussion of related works in Section 6 and possible future directions in Section 7.

2 Building an Optimization Model

We model the function layout problem with a bipartite graph, denoted \(G=(F \cup U, E)\), where F and U are disjoint sets of vertices and E is the set of edges between the vertices. The set F is a collection of all functions in a binary, and the goal is to find a permutation (also called an order or a layout) of F. The set U represents auxiliary utility vertices that are used to define an objective for optimization. Every utility vertex \(u \in U\) is adjacent to a subset of functions, \(f_1, \dots , f_k \in F\), so that \((u, f_1), \dots , (u, f_k) \in E\) for some integer \(k \ge 2\). Intuitively, the goal of the layout algorithm is to place all functions so that \(f_1, \dots , f_k\) are nearby in the resulting order, for each utility vertex u. That is, the utility vertex encodes a locality preference for the adjacent functions. Next, we formalize the intuition for each of the two objectives.

2.1 Compression

As explained in Section 1.1, the compression ratio of a Lempel-Ziv-based algorithm can be improved if similar functions are placed nearby in the binary. This observation is based on earlier theoretical studies [27, 44] and has been verified empirically [13, 34] in the context of lossless data compression. These studies define (sometimes, implicitly) a proxy metric that correlates an order of functions with the compression achieved by an LZ scheme. Suppose we are given some data to compress, e.g., a sequence of bytes that represents the instructions in a binary. Define a k-mer to be a contiguous substring in the data of length k, which is a small constant. Let w be the size of the sliding window utilized by the compression algorithm; typically, w is much smaller than the length of the data. The compression ratio achieved by an LZ-based compression algorithm is determined by the number of distinct k-mers in the data within each sliding window of size w. Equivalently, the compressed size of the data is minimized when each k-mer occurs in as few windows of size w as possible.
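To make the proxy concrete, the following sketch computes it for a byte string. It is only an illustration of the metric described above: for brevity it uses non-overlapping rather than sliding windows, and the function name and defaults are ours.

```python
def distinct_kmer_proxy(data: bytes, k: int = 8, w: int = 64 * 1024) -> int:
    # Sum, over windows of size w, of the number of distinct k-mers in
    # the window: a layout whose repeated k-mers are confined to few
    # windows scores lower and is expected to compress better.
    total = 0
    for start in range(0, len(data), w):
        window = data[start:start + w]
        total += len({window[i:i + k] for i in range(len(window) - k + 1)})
    return total
```

Two candidate layouts of the same .text section can then be compared by concatenating the function bodies in each order and comparing the resulting scores.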
To validate the intuition, we computed and plotted the number of distinct 8-mers within 64 KB windows on a set of functions from SocialApp and ChatApp; see Figure 3. To obtain a data point for the plots, we fixed a specific layout of functions in the binary and extracted its .text section to a string, by concatenating the instructions. Then for every (contiguous) substring of length w, we count the number of distinct k-mers in the substring. This number serves as the proxy metric for predicting the compressed size of the data. We then apply a compression algorithm to the entire string and measure the compressed size. To get multiple points on Figure 3, we repeat the process by starting with a different function layout, which was achieved by randomly permuting some of the functions. The results in Figure 3 reveal a strong correlation between the actual compression ratio achieved on the data and the predicted value based on k-mers. We record a Pearson correlation coefficient \(\rho \gt 0.95\) between the two quantities. Interestingly, such a high correlation is observed for various values of k (in our evaluation, \(4 \le k \le 12\)), different window sizes (4 KB \(\le w \le\) 128 KB), and various compression tools. In particular, we experimented with ZSTD (which combines a dictionary-matching stage with a fast entropy-coding stage), LZ4 (which belongs to the LZ77 family of byte-oriented compression schemes), and LZMA (which uses dictionary compression within the xz tool).
Fig. 3. Correlation between the number of distinct k-mers ( \(k=8\) ) in a sliding window of size \(w=64\) KB in a binary and its compressed size after applying a Lempel-Ziv-based compression algorithm.
Given the remarkable predictive power of this simple proxy metric and the fact that it can easily be computed from a given ordering, we suggest optimizing the function layout in a binary to minimize this metric. We represent each function, \(f \in F\), as a sequence of instructions. For every instruction in the binary that occurs in at least two functions, we create a utility vertex \(u \in U\). The bipartite graph, \(G=(F \cup U, E)\), contains an edge \((f, u) \in E\) if function f contains instruction u; refer to Figure 4 for an illustration of the process. The goal is to co-locate functions that share many utility vertices so that the corresponding instructions can be efficiently encoded.
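A minimal sketch of this construction follows, assuming each function is given as a sequence of already-canonicalized instructions (in practice, 64-bit stable instruction hashes; see Section 4.3). The names and data layout are illustrative, not LLVM's API.

```python
from collections import defaultdict

def build_bpc_graph(functions):
    # `functions` maps a function name to its sequence of instructions.
    # Every instruction occurring in at least two functions becomes a
    # utility vertex; an edge (f, u) means function f contains u.
    occurrences = defaultdict(set)
    for f, instructions in functions.items():
        for inst in set(instructions):  # duplicates within f are ignored
            occurrences[inst].add(f)
    utilities = {u: fs for u, fs in occurrences.items() if len(fs) >= 2}
    edges = [(f, u) for u, fs in utilities.items() for f in fs]
    return list(functions), list(utilities), edges
```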
Fig. 4. Modeling compression-aware function layout (bpc) with a bipartite graph.

2.2 Start-up

To optimize cold start, we develop a simplified memory model. Initially, we assume that the application code is not present in the main memory. When the application starts, the code needs to be fetched from the disk to the main memory at the granularity of memory pages. We assume that the pages are never evicted from the memory. That is, when a function is executed for the first time, its page should be present in the memory to avoid a start-up delay caused by page faults. The goal is to find a function layout that results in the fewest possible page faults.
In this model, the start-up performance is affected only by the first execution of a function; all subsequent executions do not result in page faults. Hence, we record, for each function \(f \in F\), the timestamp when it was first executed, and collect the sequence of functions ordered by the timestamps. Such a sequence of functions is called a function trace. The traces list the functions participating in the cold start and may differ from each other depending on the user or the usage scenario of the application. Next, we assume that we have a representative collection of traces, which we denote by S.
Given an order of functions, we can determine which memory page each function belongs to, assuming we know the memory page size and the size of each function. Then for every start-up trace, \(\sigma \in S\), and an index \(t \le |\sigma |\), we define \(p_{\sigma }(t)\) to be the number of page faults during the execution of the first t functions in \(\sigma\). Similarly, for a set of traces S, we define the evaluation curve as the average number of page faults for each \(\sigma \in S\), that is, \(p(t) := \sum _{\sigma \in S} p_{\sigma }(t) / |S|\).
To gain an intuition, consider what happens when there is a single trace \(\sigma \in S\) (or equivalently, when all traces are identical). Then the optimal layout is to use the order induced by \(\sigma\) , in which case the evaluation curve looks linear in t. In contrast, a random permutation of functions results in fetching most pages early in the execution and the corresponding evaluation curve looks like a step function. We refer the reader to Figure 5(b) for a concrete example of how different layouts lead to different evaluation curves. In practice, we have a diverse set of traces, and the goal is to create a function order whose evaluation curve is as flat as possible.
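The evaluation curve is straightforward to compute from a candidate layout. The sketch below follows the never-evict page model defined above; all names and the 16 KB default are illustrative.

```python
def evaluation_curve(layout, sizes, traces, page_size=16 * 1024):
    # Average number of page faults p(t) after the first t function
    # executions: pages are fetched on first touch and never evicted.
    page_of, offset = {}, 0
    for f in layout:                      # assign functions to pages
        page_of[f] = offset // page_size
        offset += sizes[f]
    max_t = max(len(tr) for tr in traces)
    curve = [0.0] * (max_t + 1)
    for trace in traces:
        touched, faults = set(), 0
        for t in range(1, max_t + 1):
            # Faults plateau once a (shorter) trace is exhausted.
            if t <= len(trace) and page_of[trace[t - 1]] not in touched:
                touched.add(page_of[trace[t - 1]])
                faults += 1
            curve[t] += faults / len(traces)
    return curve
```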
Fig. 5. Modeling start-up-aware function layout (bps).
We remark that while traces have the same length if all functions are executed eventually, the length of the prefix of each trace corresponding to the start-up phase may vary due to diverging execution paths specific to the device and the user. Hence, instead of optimizing the value of \(p(t)\) for a particular t, we aim to minimize the area under the curve \(p(t)\) by selecting a discrete set of threshold values \(t_1, t_2, \ldots , t_k\), and using the bipartite graph \(G = (F \cup U, E)\) with utility vertices
\begin{equation*} U = \lbrace (\sigma , t_i) : \sigma \in S \text{ and } 1 \le i \le k \rbrace , \end{equation*}
and edge set
\begin{equation*} E = \lbrace (f, (\sigma , t_i)) : \sigma ^{-1}(f) \le t_i \rbrace , \end{equation*}
where \(\sigma ^{-1}(f)\) is the index of function f in \(\sigma\) . That way, the algorithm builds an order of F in which the first \(t_i\) positions of every \(\sigma \in S\) occur, as much as possible, consecutively.
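In code, the construction of U and E amounts to connecting every function in each threshold-prefix of a trace to the corresponding utility vertex. A sketch, with illustrative names:

```python
def build_bps_graph(traces, thresholds):
    # Utility vertices are (trace, threshold) pairs; function f is
    # connected to (s, t) when f occurs among the first t functions
    # of trace s, i.e., sigma^{-1}(f) <= t.
    utilities, edges = [], []
    for s, trace in enumerate(traces):
        for t in thresholds:
            utilities.append((s, t))
            for f in trace[:t]:
                edges.append((f, (s, t)))
    return utilities, edges
```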

3 Recursive Balanced Graph Partitioning

Our algorithm for function layout utilizes the recursive balanced graph partitioning scheme. Recall that the input is an undirected bipartite graph \(G=(F \cup U, E)\) , where F and U are disjoint sets of functions and utilities, respectively, and E are the edges between them; see Figure 6(a). The goal of the algorithm is to find a permutation of F that optimizes a specific objective.
Fig. 6. Recursive balanced graph partitioning.
For a high-level overview of our method, refer to Algorithm 1. The algorithm combines recursive graph bisection with a local search optimization at each step. Given an input graph G with \(|F|=n\), we apply the bisection algorithm to obtain two disjoint sets of (approximately) equal cardinality, \(F_1, F_2 \subseteq F\), where \(|F_1|=\lfloor n/2 \rfloor\) and \(|F_2|=\lceil n/2 \rceil\). We lay out \(F_1\) on the set \(\lbrace 1, \dots , \lfloor n/2 \rfloor \rbrace\) and \(F_2\) on the set \(\lbrace \lfloor n/2 \rfloor + 1, \dots , n\rbrace\). By doing so, we divide the problem into two sub-problems, each of half the size, and recursively compute orders for the two subgraphs induced by vertices \(F_1\) and \(F_2\), adjacent utility vertices, and incident edges. Naturally, when the graph contains only one function, the order is trivial; see Figure 6(b) for an overview of the recursive scheme.
Every bisection step of Algorithm 1 is a variant of the local search optimization inspired by the popular Kernighan-Lin heuristic [26] for the graph bisection problem. We start by splitting F into two sets, \(F_1\) and \(F_2\), of roughly equal size. Then, we iteratively exchange pairs of vertices between \(F_1\) and \(F_2\) to improve a certain cost. To this end, we compute, for every function \(f \in F\), the move gain, that is, the difference in the cost after moving f from its current set to the other one. Then the vertices of \(F_1\) (respectively, \(F_2\)) are sorted in decreasing order of the gains to produce a list \(S_1\) (respectively, \(S_2\)). Finally, the lists \(S_1\) and \(S_2\) are traversed in order, exchanging pairs of vertices when the sum of their move gains is positive. Note that unlike the classic graph bisection heuristic [26], we do not update move gains after every swap, which enables an efficient implementation. The process is repeated until a convergence criterion is met or the maximum number of iterations is reached. The final order of the functions is obtained by concatenating the two recursively computed orders for \(F_1\) and \(F_2\).
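The following sketch illustrates one such refinement iteration. It is a simplification of Algorithm 1 with illustrative names: move gains are treated as precomputed, and the random move-skipping of Section 3.1 is applied per pair rather than per vertex for brevity.

```python
import random

def refinement_iteration(F1: set, F2: set, gains: dict,
                         skip_prob: float = 0.1,
                         rng: random.Random = random.Random(0)):
    # Sort both parts by decreasing move gain and exchange pairs while
    # the combined gain is positive; unlike Kernighan-Lin, gains are
    # not recomputed after each swap.
    S1 = sorted(F1, key=lambda f: gains[f], reverse=True)
    S2 = sorted(F2, key=lambda f: gains[f], reverse=True)
    for f1, f2 in zip(S1, S2):
        if gains[f1] + gains[f2] <= 0:
            break                  # later pairs cannot improve the cost
        if rng.random() < skip_prob:
            continue               # occasionally skip to escape minima
        F1.remove(f1); F2.remove(f2)
        F1.add(f2); F2.add(f1)
    return F1, F2
```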
Optimization objective. An important aspect of our algorithm is the objective to optimize at each bisection step. The goal is to find a layout in which functions sharing many utility vertices are co-located in the order. We capture this with the cost of a given partition of F into \(F_1\) and \(F_2\) :
\begin{equation} cost(F_1, F_2) := \sum _{u \in U} cost\big (L(u), R(u)\big), \end{equation}
(1)
where \(L(u)\) and \(R(u)\) are the numbers of functions adjacent to utility vertex u in parts \(F_1\) and \(F_2\) , respectively; see Figure 6(a). Observe that \(L(u)+R(u)\) is the degree of vertex u, and thus, it is independent of the split. The objective, which we try to minimize, is the summation of the individual contributions to the cost over the utilities. The contribution of one utility vertex, \(cost\big (L(u), R(u)\big)\) , is minimized when \(L(u)=0\) or \(R(u)=0\) , that is, when all functions of u belong to the same part; in that case, the algorithm might be able to group the functions in the final order. In contrast, when \(L(u) \approx R(u)\) , the cost takes its highest value, as the functions will likely be spread out in the order. Of course, it is easy to minimize the cost for one utility vertex (by placing its functions to one of the parts). However, minimizing \(cost(F_1, F_2)\) for all utilities simultaneously is a challenging task, due to the constraint on the sizes of \(F_1\) and \(F_2\) .
There are multiple candidate functions that we could use to define \(cost\big (L(u), R(u)\big)\) so that the conditions above are satisfied. After an extensive empirical evaluation of various candidates, the following objective was chosen as the winner:
\begin{equation} cost\big (L(u), R(u)\big) = -L(u) \log {\big (L(u)+1\big)} - R(u) \log {\big (R(u)+1\big)}. \end{equation}
(2)
The definition is inspired by the so-called uniform log-gap cost previously used in the context of index compression [8, 11]. We refer the reader to Section 5.4 for a description of the other alternatives considered and the details of the empirical evaluation.
Finally, we mention that to implement \(ComputeMoveGain(f)\) in Algorithm 1, we simply traverse the edges \((u, f) \in E\) , computing the change in \(cost\big (L(u), R(u)\big)\) that would result from moving f to the other part, and summing up over all neighboring u to get the overall change in objective for such a move.
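Concretely, with the log-gap cost of Equation (2), the move gain of a function can be sketched as follows. Here L[u] and R[u] denote u's adjacent-function counts in \(F_1\) and \(F_2\); the names are ours, not those of the LLVM implementation.

```python
import math

def cost(l: int, r: int) -> float:
    # Uniform log-gap cost of a single utility vertex, Equation (2).
    return -l * math.log2(l + 1) - r * math.log2(r + 1)

def compute_move_gain(f, in_F1: bool, utilities_of, L, R) -> float:
    # Decrease in total cost if f moves to the other part.
    gain = 0.0
    for u in utilities_of[f]:
        if in_F1:   # moving f from F1 to F2: L(u) -= 1, R(u) += 1
            gain += cost(L[u], R[u]) - cost(L[u] - 1, R[u] + 1)
        else:       # moving f from F2 to F1: L(u) += 1, R(u) -= 1
            gain += cost(L[u], R[u]) - cost(L[u] + 1, R[u] - 1)
    return gain
```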
Computational complexity. To estimate the computational complexity of Algorithm 1 and predict its running time, denote \(|F| = n\) and \(|E| = m\) . Suppose that at each bisection step, we apply a constant number of refinement steps (referred to as the iteration limit in the pseudocode). There are \(\lceil \log n \rceil\) levels of recursion, and we assume that every call of \(\mathsf {ReorderBP}\) splits the graph into two equal-sized parts with \(n/2\) vertices and \(m/2\) edges. Each call of the graph bisection consists of computing move gains and sorting two arrays with n elements. The former can be done in \({\mathcal {O}}(m)\) steps, while the latter takes \({\mathcal {O}}(n \log n)\) steps. Therefore, the total number of steps, \(T(n, m)\) , is expressed as the following recurrence:
\begin{equation*} T(n, m) = c (m + n \log n) + 2 \cdot T(n / 2, m / 2), \end{equation*}
where c is the constant hidden inside the asymptotic upper bound on the running time of a graph bisection step. Since there are \(\lceil \log n \rceil\) levels of recursion and the total work across the subproblems of each level is \({\mathcal {O}}(m + n \log n)\), summing over all levels yields \(T(n, m) = {\mathcal {O}}(m \log n + n \log ^2 n)\). We complete the discussion by noticing that it is possible to reduce the worst-case bound to \({\mathcal {O}}(m \log n)\) via a modified procedure for performing swaps [36], but the strategy does not result in a runtime reduction on our datasets. The next section provides important details of our implementation.

3.1 Algorithm Engineering

While implementing Algorithm 1 in an open-source compiler, we developed a few modifications improving certain aspects of the technique. In particular, we implement the algorithm in a parallel manner and limit the depth of the recursion to reduce the running time. Next, we enhance the procedure for the initial split of functions into two parts and the way swaps are performed between the parts, which is beneficial for the quality of the identified solutions. Finally, we propose a sampling technique to reduce the space requirements of the algorithm.
Improving the running time. Due to the simplicity of the algorithm, it can be implemented to run in parallel. Since the two subgraphs resulting from the bisection step are disjoint, the two recursive calls can be processed concurrently. To this end, we use the fork-join computation model, where small enough graphs are processed sequentially, while larger graphs are solved in parallel. To speed up the algorithm further, we cap the maximum depth of the recursion tree (16 in our implementation) and limit the number of local search iterations per split (20 in our implementation). If the recursion reaches the maximum depth and there are still unordered functions, then we fall back to the original relative order of the functions provided by the compiler.
Finally, we observe that our objective cost requires repeated computation of \(\log (x+1)\) expressions for integer arguments. To avoid costly floating-point logarithm evaluations, we pre-compute a table of values for \(0 \le x \lt 2^{14}\), where the upper bound is chosen small enough for the table to fit in the processor data cache. That way, we replace most of the logarithm evaluations with a table lookup, saving approximately \(10\%\) of the total runtime.
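A sketch of the lookup-table idea, in Python for brevity (the production implementation is C++ inside LLVM):

```python
import math

# Pre-computed log2(x + 1) for 0 <= x < 2^14; the bound is chosen so
# that the table fits in the processor data cache.
LOG_TABLE = [math.log2(x + 1) for x in range(1 << 14)]

def fast_log(x: int) -> float:
    # Table lookup for small arguments; fall back to the library call.
    return LOG_TABLE[x] if x < (1 << 14) else math.log2(x + 1)
```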
Optimizing the quality. One ingredient of Algorithm 1 is how the two initial sets, \(F_1\) and \(F_2\), are initialized. Arguably, the initialization procedure might affect the quality of the final vertex order, since it serves as the starting point for the local search optimization. To initialize the bisection, we consider two alternatives. The simpler one, outlined in the pseudocode, is to randomly split F into two (approximately) equal-sized sets. A more involved strategy is to employ minwise hashing [8, 11] to order the functions by similarity and then assign the first \(\lfloor n/2 \rfloor\) functions to \(F_1\) and the last \(\lceil n/2 \rceil\) to \(F_2\). As discussed in Section 5.4, splitting the vertices randomly is our preferred option due to its simplicity.
An interesting aspect of Algorithm 1 is the way functions are exchanged between the two sets. Recall that we pair the functions in \(F_1\) with functions in \(F_2\) based on the computed move gains, which are positive when a function should be moved to the other set and negative when a function should stay in its current set. We observed that it is beneficial to skip some of the moves. To this end, we introduce a fixed probability (0.1 in our implementation) of skipping the move for a vertex that would otherwise have been moved to a new set. Intuitively, this adjustment prevents the optimization from becoming stuck at a local minimum. It is also beneficial for avoiding redundant swapping cycles, which might occur in the algorithm; refer to References [36, 53] for a discussion of the problem in the context of graph reordering.
Reducing the space complexity. One potential downside of our start-up function layout algorithm is the need to collect full traces during profiling. If too many executions are profiled, then the storage requirements may become impractical. To address the issue, we cap the number of stored traces by a fixed integer \(\ell\). If the profiling process generates more than \(\ell\) traces, then we select a representative random sample of size \(\ell\) using reservoir sampling [52]. When the ith trace arrives, if \(i \le \ell\), then we keep the trace; otherwise, with probability \(1- \ell / i\), we ignore the trace, and with the complementary probability, we pick uniformly at random one of the stored traces and swap it out with the new one. The process yields a sample of \(\ell\) traces chosen uniformly at random from the stream of traces. If the reservoir is too small, then we run the risk of the samples not being representative and the solution found being of low quality. However, as we add more and more samples, the improvement in solution quality diminishes. In Section 5.2, we show empirically that the benefit of using a reservoir larger than \(\ell =300\) is negligible. Thus, our implementation uses \(\ell =300\).
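The sampling step is standard reservoir sampling; a minimal sketch with illustrative names:

```python
import random

def sample_traces(stream, ell=300, rng=random.Random(0)):
    # Keep a uniform random sample of `ell` traces from an arbitrarily
    # long stream: the i-th trace replaces a random stored trace with
    # probability ell / i, matching the policy described above.
    reservoir = []
    for i, trace in enumerate(stream, start=1):
        if i <= ell:
            reservoir.append(trace)
        elif rng.random() < ell / i:
            reservoir[rng.randrange(ell)] = trace
    return reservoir
```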

4 Implementation in LLVM

Both bpc and bps use profile data to guide function layout. Ideally, profile data should accurately represent common real-world scenarios. The classic instrumentation in LLVM [28] produces an instrumented binary with large size and performance overhead due to added instrumentation instructions, added metadata sections, and changes in optimization passes. This is particularly problematic for mobile devices, where increased code size can lead to performance regressions and alter the behavior of the application. Profiles collected from these instrumented binaries might not accurately represent our target scenarios. To address these issues, Machine Intermediate Representation (IR) Profile (MIP) [32] aims to minimize the binary size and the performance overhead for instrumented binaries. This is achieved by extracting instrumentation metadata from the binary and using it to post-process the profiles offline.
MIP collects profile data that are relevant for optimizing mobile apps. It records function call counts used to identify functions as either hot or cold. Within each function, MIP can derive coverage data for each basic block. MIP has an optional mode, called return address sampling, which adds probes to callees to collect a sample of their call-sites. This can be used to construct a dynamic call graph that includes dynamically dispatched calls. Furthermore, MIP collects function timestamps by recording and incrementing a global timestamp for each function when it is called for the first time. We sort the functions by their initial call timestamp to construct a function trace. To collect raw profiles at runtime, we run instrumented apps under normal usage and dump raw profiles to the disk, which are then uploaded to a database. These raw profiles are later merged offline into a single indexed profile.

4.1 Overview of the Build Pipeline

Figure 7 shows an overview of our build pipeline. We collect thousands of raw profile data files from various uses and periodically perform offline post-processing to generate a single indexed profile. This profile is used to categorize functions as either hot or cold based on whether they have been executed in any raw profile. During post-processing, bps determines the optimized order of hot functions that were profiled, including both start-up and non-start-up functions. Our apps are built with link-time optimization (LTO or ThinLTO). At the end of LTO, bpc orders cold functions to achieve a highly compressed binary size. These two orders of functions are concatenated and passed to the linker, which finalizes the function layout in the binary. We have chosen to use two separate optimization passes for bps and bpc, since applying them jointly at the end of LTO would require carrying a large amount of trace data through the build pipeline.
Fig. 7. Overview of the build pipeline with the optimized function layout.

4.2 Hot Function Layout

As shown in Figure 7, we first merge the raw profiles into the indexed profile with instrumentation metadata during post-processing. For the block coverage and dynamic call graph data, we simply accumulate them into the indexed profile as we go along. However, to run bps, we need to keep the function timestamps from each raw profile. We encode the sequence of indices to the functions participating in the cold start-up and append them to a separate section of the indexed profile.
The bps algorithm, described in Section 2.2, uses function traces with thresholds to define utility vertices, and produces an optimized order for start-up functions. Once bps is completed, the embedded function traces are no longer needed and can be removed from the indexed profile.
Third-party library functions and outlined functions that appear after instrumentation in the compilation pipeline might not be instrumented. To order such functions, we first check if their call sites are profiled using block coverage data. If that is the case, then these functions inherit the order of their first caller. For example, if an uninstrumented outlined function, \(f_{outlined}\), is called from the profiled functions, \(f_A\) and \(f_B\), and bps orders \(f_A\) followed by \(f_B\), then we insert \(f_{outlined}\) after \(f_A\); this results in the layout \(f_A, f_{outlined}, f_B\).
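A sketch of this caller-based placement (names are ours; the real pass operates on the compiler's internal representation):

```python
def insert_uninstrumented(order, callers_of):
    # Each uninstrumented function inherits the position of its first
    # profiled caller: it is inserted right after the earliest caller
    # appearing in the bps order. For f_outlined called from f_A and
    # f_B with f_A ordered first, this yields f_A, f_outlined, f_B.
    position = {f: i for i, f in enumerate(order)}
    result = list(order)
    for f, callers in callers_of.items():
        profiled = [c for c in callers if c in position]
        if profiled:
            first = min(profiled, key=lambda c: position[c])
            result.insert(result.index(first) + 1, f)
    return result
```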

4.3 Cold Function Layout

We execute bpc after optimization and code generation are finished, without using an IR, as shown in Figure 7. During code generation, each function publishes a set of hashes that represent its contents, which are meaningful across modules. We use one 64-bit stable hash for each instruction by combining hashes of its opcode and operands, resulting in an 8-mer, a substring of length 8, for every instruction. When computing stable hashes, we omit hashes of pointers and instead use hashes of the contents of their targets. Unlike outliners that need to match instruction sequences, we do not consider the order or duplicates of the hashes. We only keep track of the unique stable hashes per function as the input to bpc.
Since hot functions are already ordered, we filter them out before applying bpc. It is worth noting that outliners can optimistically produce many identical functions, which will eventually be folded by the linker. To efficiently handle deduplication, bpc groups functions that have identical sets of hashes and runs on the set of unique cold functions.
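The following sketch illustrates the per-function input to bpc and the deduplication step. Here, stable_hash is an illustrative stand-in for LLVM's per-instruction stable hashing, not the actual implementation.

```python
import hashlib
from collections import defaultdict

def stable_hash(opcode: str, operands: tuple) -> int:
    # Stand-in for a 64-bit stable instruction hash combining opcode
    # and operand hashes; in the real pass, pointer operands are
    # replaced by hashes of the contents of their targets.
    h = hashlib.blake2b(digest_size=8)
    h.update(opcode.encode())
    for op in operands:
        h.update(str(op).encode())
    return int.from_bytes(h.digest(), "little")

def unique_cold_functions(cold_funcs):
    # cold_funcs maps a function name to its list of (opcode, operands)
    # pairs. Each function is keyed by its *set* of instruction hashes
    # (order and duplicates ignored), so functions with identical sets,
    # e.g., identical outlined bodies, are grouped and laid out once.
    groups = defaultdict(list)
    for name, instructions in cold_funcs.items():
        key = frozenset(stable_hash(op, ops) for op, ops in instructions)
        groups[key].append(name)
    return groups
```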

5 Evaluation

We design our experiments to answer two primary questions: (i) How well does the new function layout impact real-world binaries in comparison with alternative techniques? (ii) How do various parameters of the algorithm contribute to the solution, and what is the best set of parameters? We also investigate the scalability of our algorithm.

5.1 Experimental Setup

We evaluated our approach on three commercial iOS applications, three commercial Android applications, and an open-source compiler; refer to Table 1 for basic properties of the apps. Among the iOS applications, SocialApp is one of the largest mobile applications in the world, with a total size of over 250 MB; it provides a variety of usage scenarios, which makes it an attractive target for compiler optimizations. ChatApp is a medium-sized mobile app with a total size of over 50 MB. AdsApp is another large mobile app with a total size of 190 MB. Similarly, the three Android apps are large mobile applications whose sizes approach 80 MB. Each of the applications consists of a number of binaries built individually and packaged together. Observe that the Android applications consist of several hundred shared native binaries, while iOS ChatApp and AdsApp contain only 7 and 4 binaries, respectively. Therefore, individual iOS binaries are much larger in size than the Android binaries. For this reason, the iOS instances are built with ThinLTO, while the Android ones use (Full)LTO. We highlight that iOS applications are primarily developed in Objective-C and Swift, whereas Android applications utilize C/C++ for their native components to provide JNI support. Finally, Clang (release 15.0) is an open-source C++ standalone program built on MacOS with (Full)LTO.
Table 1.

| App       | Platform | Binaries in app package | .text size (MB) | App size (MB) | Total func. | Hot func. | Blocks per func. (p50/p95/p99) |
|-----------|----------|-------------------------|-----------------|---------------|-------------|-----------|--------------------------------|
| SocialApp | iOS      | 61  | 119 | 259 | 856K | 154K | 1 / 11 / 29  |
| ChatApp   | iOS      | 7   | 35  | 58  | 202K | 44K  | 3 / 24 / 70  |
| AdsApp    | iOS      | 4   | 87  | 194 | 682K | 41K  | 1 / 18 / 62  |
| SocialApp | Android  | 369 | 45  | 73  | 215K | 16K  | 3 / 34 / 107 |
| ChatApp   | Android  | 393 | 49  | 80  | 229K | 17K  | 3 / 36 / 110 |
| AdsApp    | Android  | 297 | 41  | 79  | 210K | 22K  | 3 / 36 / 109 |
| Clang     | MacOS    | 1   | 61  | 69  | 134K | 32K  | 3 / 26 / 93  |
Table 1. Basic Properties of Evaluated Applications
The last three columns show percentile statistics about the number of blocks per function in a given app (e.g., p50 is the median number of blocks per function).
All the algorithms discussed in the article are implemented on top of release_14 of LLVM. The binaries are built on a Linux-based server with a dual-node 28-core 2.4 GHz Intel Xeon E5-2680 (Broadwell) with 256 GB of RAM.
Next, in Section 5.2, we demonstrate the impact of function reordering on the start-up performance. These experiments are conducted on four applications (SocialApp and ChatApp for iOS and Android), which are integrated with a large-scale automatic measurement system deployed in the production environment. Then, in Section 5.3, we conduct experiments with the compressed app size, which do not require production deployment and hence are analyzed for all the instances from the benchmark.

5.2 Start-up Performance

Here, we present the impact of function layout on the start-up performance of mobile apps. Recall that the start-up performance, often referred to as the cold start time, is the duration from the moment an app is launched until it loads and displays its first feed, thereby becoming ready for user interaction. The new algorithm, which we refer to as bps, is compared with the following alternatives:
baseline is the original ordering of functions as dictated by the compiler; the function layout follows the order of object files that are passed into the linker;
random is the result of randomly permuting the hot functions;
order-avg is a heuristic for ordering hot functions suggested in [32]; it orders functions by their average start-up timestamps computed across all traces.
We first evaluate the start-up performance of iOS apps. In the production environment for iOS apps, only a single application version can be shipped. Hence, to compare the new bps algorithm with alternatives, we analyze two consecutive releases of the apps. Release N corresponds to a version with the order-avg algorithm, while release \(N+1\) utilizes bps. We acknowledge that the differences observed may result from multiple optimizations shipped simultaneously with bps, as the applications are being actively developed. To account for this, we repeated the alternation three times in consecutive releases and reported the average improvements. Table 2 provides the measurements for start-up time reductions for SocialApp (\(2.9\%\)) and ChatApp (\(5.7\%\)), where the 99% confidence intervals estimated from the three collected data-points are within \(0.5\%\). To understand the reason behind the reduced start-up time, we record the number of major page faults during start-up. Table 3 presents the detailed results for the average and the 99th percentile number of page faults observed for millions of samples published in production. On average, bps reduced the number of major page faults by \(6.9\%\) and \(16.9\%\) for SocialApp and ChatApp, respectively. To double-check that the average is not being affected by outliers, we inspect the improvement in the 99th percentile of major page faults; at \(4.6\%\) for SocialApp and \(9.1\%\) for ChatApp, the numbers indicate that most of the improvement comes from typical executions. We emphasize that less effective algorithms (such as baseline or random) cannot be shipped without regressing the performance for all users, and thus, the corresponding experiments are omitted.
Table 2.

| App                 | Comparison            | Start-up Time |
|---------------------|-----------------------|---------------|
| SocialApp (iOS)     | bps vs order-avg      | \(2.9\%\)  |
| ChatApp (iOS)       | bps vs order-avg      | \(5.7\%\)  |
| SocialApp (Android) | order-avg vs baseline | \(0.54\%\) |
| SocialApp (Android) | bps vs baseline       | \(0.85\%\) |
| ChatApp (Android)   | order-avg vs baseline | \(0.60\%\) |
| ChatApp (Android)   | bps vs baseline       | \(1.26\%\) |
Table 2. Relative Improvements of the Start-up Performance of Various Function Layout Algorithms on iOS and Android Applications
Table 3.

| App             | Algorithm | Release     | Average | p99 |
|-----------------|-----------|-------------|---------|-----|
| SocialApp (iOS) | order-avg | N           | \(3.4\text{K}\) | \(7.6\text{K}\) |
| SocialApp (iOS) | bps       | \(N+1\)     | \(3.1\text{K}~(6.9\%)\)  | \(7.2\text{K}~(4.6\%)\) |
| ChatApp (iOS)   | order-avg | N           | \(1.7\text{K}\) | \(10.3\text{K}\) |
| ChatApp (iOS)   | bps       | \(N+1\)     | \(1.4\text{K}~(16.9\%)\) | \(9.3\text{K}~(9.1\%)\) |
Table 3. Number of Major Page Faults Measured for order-avg and bps Shipped in Consecutive Releases for iOS Apps.
The relative improvement of bps over order-avg is shown in parentheses. The column labeled average compares the average number of page faults across all executions, while column labeled p99 indicates the 99th-percentile of the number of page faults across all executions.
An interesting observation from Table 3 is that function layout has a greater impact on the start-up performance of ChatApp (iOS) than on that of SocialApp (iOS). This is likely because SocialApp consists of dozens of native binaries. Since function layout is applied to each binary individually, the overall impact on the start-up performance of SocialApp is reduced. In contrast, ChatApp is composed of only 7 binaries, and among them, only the largest one is responsible for the start-up; thus, an optimized function layout can directly impact the performance.
Next, we discuss the start-up performance of two Android apps. Unlike the experiments with iOS, there is currently no accurate way of measuring page faults; however, there exists an option to deliver different versions of an app to different users (via Google PlayStore), and thus, to perform A/B experiments. We estimate the impact of function layout on the start-up performance for order-avg and bps over baseline. As illustrated in Table 2, bps results in the largest reduction of the start-up time; the 99% confidence intervals for the A/B experiment are within \(0.1\%\). We notice that the start-up time reductions for Android apps appear smaller than the corresponding values for iOS apps: \(0.85\%\) and \(1.26\%\) for SocialApp (Android) and ChatApp (Android), respectively. Our explanation is that Android apps consist of a much larger number of binaries yet have similar or smaller code size; refer to Table 1. Hence, the opportunity for bps to reduce the number of page faults is smaller.
Finally, we investigate the start-up performance using the optimization model developed in Section 2.2. Note that using the model enables a fine-grained analysis, which is impossible in production environment. Figure 8(a) illustrates the mean number of page faults simulated during start-up with up to 20K time steps, utilizing different function layout algorithms, on SocialApp (iOS). As expected, random immediately suffers from many page faults early in the execution, as the randomized placement of functions likely spans many pages. Note that baseline outperforms random; this is likely due to the natural co-location of related functions in the source code. Then, order-avg improves the evaluation curve over baseline, and bps stretches the curve even closer to the linear line, further minimizing the expected number of page faults.
Fig. 8. (a) Simulated start-up performance of various function layout algorithms, and (b) the number of sampled traces on SocialApp (iOS); the page size is assumed to be 16 KB with a total of 265 memory pages.
Figure 8(b) shows the average number of page faults for different numbers of function traces on SocialApp (iOS). Ideally, we want to build a function layout utilizing as few traces as possible. In practice, the start-up scenarios are fairly consistent across different usages, and one can reduce the need to capture all raw function traces. We observe that running bps with more than 300 traces does not improve the page faults on our dataset. As described in Section 3.1, we use this value by default for bps to reduce the space complexity of post-processing.

5.3 Compressed App Size

We now present the results of size optimizations on the selected applications. In addition to the baseline and random function layout algorithms, we compare bpc with the following heuristic:
greedy is an approach for ordering cold functions suggested in [32]. It is a procedure that iteratively builds an order by appending one function at a time. On each step, the most recently placed function is compared (based on its instructions) with the not-yet-selected functions, and the one with the highest similarity score is appended to the order. To avoid an expensive \({\mathcal {O}}(n^2)\)-computation of the scores, several pruning rules are applied to reduce the set of candidates; see [32] for details.
Table 4 summarizes the app size reduction from each function layout algorithm, where the improvements are computed on top of baseline (that is, the original order of functions generated by the compiler). The compressed size reduction is measured in three modes: the size of the .text section of the binary, which is directly impacted by our optimization; the size of the executables excluding resource files such as images and videos; and the total app size in a compressed package. For iOS apps, we observe that bpc reduces the size of .text by \(3\%\), \(1.8\%\), and \(5.4\%\) for SocialApp, ChatApp, and AdsApp, respectively. Since this section is the largest in the binary (responsible for \(2/3\) of the compressed app size), this translates into overall \(1.9\%\), \(1.3\%\), and \(3.9\%\) improvements. At the same time, the impact of all the tested algorithms on the uncompressed size of a binary is minimal (within \(0.1\%\)), which is mainly due to differences in code alignment. Compared to SocialApp and ChatApp, AdsApp exhibits a larger size reduction. This is likely because each individual binary in AdsApp has more functions. Notably, the execution of greedy on the app was unsuccessful due to an out-of-memory failure.
Table 4.

| App                 | Algorithm | Text | Executables | App Size |
|---------------------|-----------|------|-------------|----------|
| SocialApp (iOS)     | random    | \(-5.3\%\)  | \(-4.6\%\)  | \(-3.7\%\) |
| SocialApp (iOS)     | greedy    | \(1.6\%\)   | \(1.3\%\)   | \(1.1\%\)  |
| SocialApp (iOS)     | bpc       | \(3.0\%\)   | \(2.3\%\)   | \(\boldsymbol {1.9\%}\) |
| ChatApp (iOS)       | random    | \(-4.9\%\)  | \(-4.4\%\)  | \(-3.8\%\) |
| ChatApp (iOS)       | greedy    | \(1.3\%\)   | \(1.0\%\)   | \(0.8\%\)  |
| ChatApp (iOS)       | bpc       | \(1.8\%\)   | \(1.6\%\)   | \(\boldsymbol {1.3\%}\) |
| AdsApp (iOS)        | random    | \(-6.1\%\)  | \(-10.3\%\) | \(-8.2\%\) |
| AdsApp (iOS)        | greedy    | OOM         | OOM         | OOM        |
| AdsApp (iOS)        | bpc       | \(5.4\%\)   | \(4.9\%\)   | \(\boldsymbol {3.9\%}\) |
| SocialApp (Android) | random    | \(-4.3\%\)  | \(-3.7\%\)  | \(-1.6\%\) |
| SocialApp (Android) | greedy    | \(1.3\%\)   | \(0.7\%\)   | \(0.3\%\)  |
| SocialApp (Android) | bpc       | \(2.3\%\)   | \(1.4\%\)   | \(\boldsymbol {0.6\%}\) |
| ChatApp (Android)   | random    | \(-2.2\%\)  | \(-2.0\%\)  | \(-1.0\%\) |
| ChatApp (Android)   | greedy    | \(0.4\%\)   | \(0.1\%\)   | \(0.1\%\)  |
| ChatApp (Android)   | bpc       | \(0.9\%\)   | \(0.5\%\)   | \(\boldsymbol {0.2\%}\) |
| AdsApp (Android)    | random    | \(-4.2\%\)  | \(-4.1\%\)  | \(-1.6\%\) |
| AdsApp (Android)    | greedy    | \(1.4\%\)   | \(0.4\%\)   | \(0.2\%\)  |
| AdsApp (Android)    | bpc       | \(2.3\%\)   | \(1.0\%\)   | \(\boldsymbol {0.4\%}\) |
| Clang (MacOS)       | random    | \(-11.5\%\) | \(-8.1\%\)  | N/A        |
| Clang (MacOS)       | greedy    | \(3.6\%\)   | \(3.1\%\)   | N/A        |
| Clang (MacOS)       | bpc       | \(6.6\%\)   | \(5.0\%\)   | N/A        |
Table 4. Compressed Size Improvements of Various Function Layout Algorithms Over baseline; Negative Values Indicate Regressions
AdsApp (iOS) has no data for greedy, as the algorithm failed to run due to an out-of-memory failure (indicated by OOM).
An interesting observation is the behavior of random on the dataset, which worsens the compression ratios by approximately \(5\%\) in comparison to baseline. The explanation is that similar functions are naturally clustered in the source code. For example, functions within the same object file tend to have many local calls, making the corresponding call instructions good candidates for a compact LZ-based encoding. Yet bpc can improve the instruction locality beyond the natural clustering by reordering functions across different object files.
Next, we assess the reduction in compressed size for Android applications. It is important to note that the total app size is measured using the Android Package Kit (apk), which includes not only native binaries but also Android Dex bytecode files [22]. Hence, the reported app size improvements are noticeably smaller than for the .text sections or the executables. For all the Android apps, two factors play a role. On the one hand, the application packages are comprised of a large number of individual binaries, each containing a relatively small number of functions. This diminishes the impact of function reordering. On the other hand, the applications are implemented in C/C++, which results in fewer call-sites in comparison to Objective-C on iOS. Function call instructions encode their call targets with relative offsets whose values differ for each call-site, making C++ programs more layout-sensitive from the compression point of view. Overall, we record \(2.3\%\), \(0.9\%\), and \(2.3\%\) .text size reductions for SocialApp, ChatApp, and AdsApp, respectively, by applying bpc on top of baseline; the corresponding sizes of apk are reduced by \(0.6\%\), \(0.2\%\), and \(0.4\%\).
Similar to the Android apps, Clang is developed in C++. Unlike the commercial applications, it is a single binary, which yields a larger size improvement of \(6.6\%\) for its .text section. As the binary is not bundled with other resource files, the concept of app size is not applicable in this context.

5.4 Further Analysis of Balanced Partitioning

The new Algorithm 1 has a number of parameters that can affect its quality and performance. In the following, we discuss some of the parameters and explain our choice of their default values. We emphasize that such a detailed analysis requires building and evaluating hundreds of app versions, which is feasible only for synthetic data in a simulated environment.
As discussed in Section 3, the central component of the algorithm is the objective to optimize at a bisection step, which we refer to as the cost of a partition of the set of functions into two disjoint parts. Equation (1) provides a general form of the objective, where the summation is taken over the utility vertices. For a given utility vertex having \(L(u)\) functions adjacent in one part and \(R(u)\) functions adjacent in the other part, \(cost\big (L(u), R(u)\big)\) can be an arbitrary objective that is minimized when \(L(u) = 0\) or \(R(u) = 0\) and maximized when \(L(u) = R(u)\). Besides the uniform log-gap defined by Equation (2), we consider two alternatives. The first one is probabilistic fanout, defined as follows:
\begin{equation*} cost\big (L(u), R(u)\big) = 1 - p^{L(u)} + 1 - p^{R(u)}, \end{equation*}
where \(p \in [0, 1]\) is a constant. The objective is motivated by partitioning graphs in the context of database sharding [24], where p represents the probability that a query to a database accesses a certain shard. The second utilized objective represents the absolute difference between \(L(u)\) and \(R(u)\) :
\begin{equation*} cost\big (L(u), R(u)\big) = L(u) + R(u) - |L(u) - R(u)|. \end{equation*}
It is easy to verify that both objectives satisfy the requirements mentioned above.
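For reference, the three candidate objectives can be written side by side; the sketch below (with illustrative names) also checks the two required properties on a small example.

```python
import math

def log_gap(l: int, r: int) -> float:
    # Uniform log-gap cost, Equation (2): the objective used by default.
    return -l * math.log2(l + 1) - r * math.log2(r + 1)

def fanout(l: int, r: int, p: float = 0.9) -> float:
    # Probabilistic fanout, motivated by database sharding [24].
    return (1 - p ** l) + (1 - p ** r)

def abs_diff(l: int, r: int) -> float:
    # Absolute difference: rewards placing all of u's functions in one part.
    return l + r - abs(l - r)

# Each objective is minimized when all d neighbors of u are in one part
# and takes larger values for a balanced split:
d = 10
for cost in (log_gap, fanout, abs_diff):
    assert cost(d, 0) <= cost(d // 2, d - d // 2)
```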
The plots in Figure 9 illustrate the impact of the optimization objective on the quality of resulting function layouts, that is, the corresponding compressed size and the (estimated) number of page faults during start-up. For the fanout objective, we utilize \(p=0.9\), as it typically results in the best outcomes in the evaluation. We observe that the uniform log-gap cost results in the smallest binaries, outperforming fanout and absolute difference by \(1.7\%\) and \(2.3\%\), respectively. Similarly, this objective is the best choice for start-up optimization, where it yields fewer page faults by \(4.1\%\) and \(11.5\%\) on average, respectively. Thus, we consider the uniform log-gap the preferred option for the optimization. However, function orders produced by the algorithm coupled with the other two objectives are still meaningfully better than the alternatives investigated in Sections 5.2 and 5.3.
Fig. 9. Impact of the depth of recursion, the iteration limit, and the optimization objective on the quality of resulting function orders evaluated on ChatApp (iOS).
Next, we experiment with two parameters affecting Algorithm 1: the number of refinement iterations and the maximum depth of the recursion. Figure 9 (top) illustrates the impact of the latter on the quality of the result evaluated on ChatApp (iOS). For every \(0 \le d \lt 20\) (that is, when the input graph is split into \(2^d\) parts), we stop the algorithm at depth d and measure the quality of the order respecting the computed partition. It turns out that further bisection stops being beneficial once the parts contain only a few tens of vertices. Therefore, we limit the depth of the recursion to \(\min (\lceil \log _2 n \rceil , 16)\) levels. To investigate the effect of the maximum number of refinement iterations, we apply the algorithm with various values in the range between 1 and 45; see Figure 9 (bottom). We observe an improvement of the quality up to iteration 20, which serves as the default limit in our implementation.
Finally, we explore the choice of the initial split of F into \(F_1\) and \(F_2\) in Algorithm 1. Arguably, the initialization procedure might affect the quality of the final order, as it provides the starting point for the subsequent local search. To verify the hypothesis, we implemented three initialization techniques that bisect a given graph: (i) random splitting, as outlined in the pseudocode; (ii) similarity-based minwise hashing [8, 11]; and (iii) an input-based strategy that splits the functions according to their relative order in the compiler. In our experiments we found no consistent winner among the three options; therefore, we recommend the simplest approach, (i), in the implementation. A sketch of option (i) is given below.
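The sketch below illustrates the recommended random split; the function name and the representation of F as an index range are assumptions of this sketch, not part of the LLVM implementation.
```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Option (i): split the functions of F (identified by indices 0..n-1)
// uniformly at random into two halves F1 and F2. Options (ii) and (iii)
// would instead order the indices by a minwise-hash signature or by their
// relative order in the compiler before splitting in the middle.
std::pair<std::vector<int>, std::vector<int>> randomSplit(int n,
                                                          std::mt19937 &rng) {
  std::vector<int> idx(n);
  std::iota(idx.begin(), idx.end(), 0);
  std::shuffle(idx.begin(), idx.end(), rng);
  std::vector<int> F1(idx.begin(), idx.begin() + n / 2);
  std::vector<int> F2(idx.begin() + n / 2, idx.end());
  return {std::move(F1), std::move(F2)};
}
```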

5.5 Build-time Analysis

Finally, we discuss the impact of function layout on the build time of the applications. The time overhead of running bpc is minimal in comparison with the overall build time: it takes around 20 s for SocialApp (iOS) and less than 1 s for the smaller apps. In contrast, the greedy approach leads to a noticeable slowdown, increasing the overall build time of SocialApp by around 10 min, which accounts for more than \(10\%\) of the total build time of approximately 100 min. Furthermore, greedy, with its quadratic complexity, fails to process the largest binary in the AdsApp (iOS) package, running out of memory. Refer to Table 5 for build-time measurements on the largest apps from the benchmark. Observe that for Android apps, which consist of smaller binaries, the build-time impact of any function layout is insignificant.
Application       baseline   greedy   bpc
SocialApp (iOS)   5,400      6,000    5,420
AdsApp (iOS)      4,530      OOM      4,580
Clang (MacOS)     350        400      370

Table 5. Impact of Various Function Layout Algorithms on the Build Time (in Seconds) of the Largest Applications
The worst-case time complexity of our implementation is bounded by \({\mathcal {O}}(m \log n + n \log ^2 n)\), where n is the total number of functions and m is the number of function-utility edges. The estimate aligns with Figure 10(a), which plots the runtime as a function of the number of functions in the binary. We emphasize that the measurements are done in a multi-threaded environment in which distinct subgraphs (arising from the recursive computation) are processed in parallel. To assess the speedup from parallelism, we limit the number of threads available to the computation; see Figure 10(b). Observe that using two threads provides approximately a 2\(\times\) speedup, whereas four threads yield a 2.5\(\times\) speedup over the single-threaded implementation. Increasing the number of threads further does not yield measurable runtime improvements. However, it is likely that for larger instances with more recursive subgraphs, utilizing more threads can be beneficial.
Fig. 10. Impact of the number of functions and the depth of recursion on the runtime of function layout evaluated on SocialApp (iOS).
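To illustrate why the recursion parallelizes so naturally, the sketch below processes the two subgraphs of each bisection step concurrently. Graph and bisect are hypothetical stand-ins for the actual data structures and routines, which are not shown here.
```cpp
#include <functional>
#include <future>
#include <utility>

// Hypothetical stand-in for the bipartite graph of functions and utilities.
struct Graph { /* functions, utility vertices, adjacency lists */ };

// Placeholder: the real routine splits the part in two and refines the split
// by local search to minimize the cost of Equation (1).
std::pair<Graph, Graph> bisect(const Graph &) { return {}; }

// After a bisection step, the two subgraphs share no function vertices, so
// they can be processed by independent threads. A real implementation would
// cap the number of spawned threads (the evaluation above uses up to four).
void reorderRecursively(const Graph &g, int depth, int maxDepth) {
  if (depth >= maxDepth)
    return; // leaf reached: keep the current order of this part
  auto [left, right] = bisect(g);
  auto task = std::async(std::launch::async, reorderRecursively,
                         std::cref(left), depth + 1, maxDepth);
  reorderRecursively(right, depth + 1, maxDepth);
  task.wait();
}
```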

6 Related Work

There exists a rich literature on profile-guided compiler optimizations. Here, we discuss previous works that are closely related to PGO in the mobile space, code layout techniques, and our algorithmic contributions.
PGO. Most compiler optimizations for mobile applications aim at reducing code size. Such techniques include algorithms for function inlining and outlining [9, 33], merging similar functions [47, 48], loop optimization [46], unreachable code elimination, and many others. In addition, some works describe performance improvements for mobile applications, achieved by improving responsiveness, memory management, and start-up time [32, 54]. The optimizations can be applied at compile time, link time [35], or post-link time [41, 50]. Our approach is complementary to these works and can be applied in combination with the existing optimizations.
Code Layout. The work by Pettis and Hansen [43] serves as the basis for most modern code reordering techniques for server workloads. The goal of their basic block reordering is to create chains of blocks that are frequently executed together. Many variants of the technique have been suggested in the literature and implemented in various tools [30, 38, 39, 40, 41, 49, 50]. The state-of-the-art approach for basic block layout is due to Newell and Pupyrev [39]. Alternative models have been studied in several papers [16, 25, 31], where a temporal-relation graph is taken into account. Temporal affinities between code instructions can also be utilized for reducing conflict cache misses [17].
Code reordering at the function level was also pioneered by Pettis and Hansen [43], whose algorithm is implemented in many compilers and binary optimization tools [41, 50]. This approach greedily merges chains of functions with the primary goal of reducing I-TLB misses. An improvement by Ottoni and Maher [40] operates on a directed call graph to reduce I-cache misses. As discussed in Section 1, these approaches are designed to improve the steady-state performance of server workloads and cannot be applied to mobile apps. The recent work of Lee, Hoag, and Tillmann [32] is the only study discussing an approach for function layout in the mobile space; our novel algorithm is both more efficient and more effective than their heuristic.
Algorithms. Our model for function layout relies on the balanced graph partitioning problem [2, 15, 26], on which there exists a rich literature from both theoretical and practical points of view [3, 4]. The work most closely related to our study is on graph reordering [11, 36], which utilizes recursive graph bisection for creating “compression-friendly” inverted indices. While our algorithm shares some similarities with these works, our objectives and application area are different.
The general problem of optimizing memory performance has also been studied from a theoretical point of view. One classic line of work designs cache eviction policies that minimize cache misses [14, 51, 55]. A more recent line computes a suitable data layout for a given cache eviction policy [6, 29, 42]. Our setting for start-up optimization is closest to the latter; a major difference, however, is that the short time horizon of the start-up means that page evictions do not play a significant role. Therefore, we cannot rely on the previous methods.

7 Discussion

In this article, we have presented and evaluated the first function layout algorithm designed for mobile compiler optimizations. The algorithm is carefully engineered to scale to the largest instances, processing them within a matter of seconds. We have successfully applied the optimization to several large commercial mobile applications, achieving improvements in start-up performance and reductions in compressed app size.
An important contribution of the work is a formal model for function layout optimizations. We believe that the model utilizing the bipartite graph with utility vertices is general enough to be applicable in various contexts. In our current implementation, each function is optimized either for start-up or for size, but not for both at the same time. However, it might be possible to relax this constraint and design an approach that unifies the two objectives. Our early experiments show that reordering all functions with bpc could result in up to a \(0.3\%\) size reduction, but this may come at the cost of a longer start-up time. Unifying the optimizations is a promising direction for future work.
From a theoretical point of view, our work is related to the computationally hard problem of balanced graph partitioning [2]. While the problem is hard in theory, real-world instances exhibit certain characteristics that may simplify the analysis of algorithms. For example, control-flow and call graphs arising from modern programming languages have constant treewidth, a standard notion measuring how close a graph is to a tree [1, 6, 38]. Many NP-hard optimization problems can be solved efficiently on graphs of small treewidth, and therefore, exploring function layout algorithms parameterized by treewidth is of interest.

Acknowledgment

We thank Nikolai Tillmann for fruitful discussions of the problem.

Footnotes

1. The work is a substantially extended version of a paper by some of the authors at the LCTES workshop [19].
2. Refer to https://reviews.llvm.org/D147812 for the open-source implementation of the algorithm.

References

[1]
Ali Ahmadi, Majid Daliri, Amir Kafshdar Goharshady, and Andreas Pavlogiannis. 2022. Efficient approximations for cache-conscious data placement. In Proceedings of the International Conference on Programming Language Design and Implementation, Ranjit Jhala and Isil Dillig (Eds.). ACM, 857–871.
[2]
Konstantin Andreev and Harald Räcke. 2006. Balanced graph partitioning. Theory Comput. Syst. 39, 6 (2006), 929–939.
[3]
Charles-Edmond Bichot and Patrick Siarry. 2013. Graph Partitioning. John Wiley & Sons.
[4]
Aydin Buluç, Henning Meyerhenke, Ilya Safro, Peter Sanders, and Christian Schulz. 2016. Recent advances in graph partitioning. In Algorithm Engineering—Selected Results and Surveys. Springer, Cham, 117–158.
[5]
Milind Chabbi, Jin Lin, and Raj Barik. 2021. An experience with code-size optimization for production iOS mobile applications. In Proceedings of the International Symposium on Code Generation and Optimization, Jae W. Lee, Mary Lou Soffa, and Ayal Zaks (Eds.). IEEE, 363–377.
[6]
Krishnendu Chatterjee, Amir Kafshdar Goharshady, Nastaran Okati, and Andreas Pavlogiannis. 2019. Efficient parameterized algorithms for data packing. Proc. ACM Program. Lang. 3, POPL (2019), 1–28.
[7]
Dehao Chen, Tipp Moseley, and David Xinliang Li. 2016. AutoFDO: Automatic feedback-directed optimization for warehouse-scale applications. In Proceedings of the International Symposium on Code Generation and Optimization. ACM, 12–23.
[8]
Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, Michael Mitzenmacher, Alessandro Panconesi, and Prabhakar Raghavan. 2009. On compressing social networks. In Proceedings of the Conference on Knowledge Discovery and Data Mining. ACM, 219–228.
[9]
Thaís Damásio, Vinícius Pacheco, Fabrício Goes, Fernando Pereira, and Rodrigo Rocha. 2021. Inlining for code size reduction. In Proceedings of the 25th Brazilian Symposium on Programming Languages. ACM, 17–24.
[10]
Google Developers. 2022. App Startup Time. Retrieved from https://developer.android.com/topic/performance/vitals/launch-time
[11]
Laxman Dhulipala, Igor Kabiljo, Brian Karrer, Giuseppe Ottaviano, Sergey Pupyrev, and Alon Shalita. 2016. Compressing graphs and indexes with recursive graph bisection. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16). ACM, New York, NY, 1535–1544.
[12]
Malcolm C. Easton and Ronald Fagin. 1978. Cold-start vs. warm-start miss ratios. Commun. ACM 21, 10 (1978), 866–872.
[13]
Paolo Ferragina and Giovanni Manzini. 2010. On compressing the textual web. In Proceedings of the International Conference on Web Search and Data Mining. ACM, New York, NY, 391–400.
[14]
Amos Fiat, Richard M. Karp, Michael Luby, Lyle A. McGeoch, Daniel D. Sleator, and Neal E. Young. 1991. Competitive paging algorithms. J. Algor. 12, 4 (1991), 685–699.
[15]
Michael R. Garey, David S. Johnson, and Larry Stockmeyer. 1974. Some simplified NP-complete problems. In Proceedings of the 6th Annual ACM Symposium on Theory of Computing. ACM, New York, NY, 47–63.
[16]
Nikolas Gloy and Michael D. Smith. 1999. Procedure placement using temporal-ordering information. Trans. Program. Lang. Syst. 21, 5 (1999), 977–1027.
[17]
Amir H. Hashemi, David R. Kaeli, and Brad Calder. 1997. Efficient procedure mapping using cache line coloring. SIGPLAN Notices 32, 5 (1997), 171–182.
[18]
Wenlei He, Julián Mestre, Sergey Pupyrev, Lei Wang, and Hongtao Yu. 2022. Profile inference revisited. Proc. ACM Program. Lang. 6, POPL (2022), 1–24.
[19]
Ellis Hoag, Kyungwoo Lee, Julián Mestre, and Sergey Pupyrev. 2023. Optimizing function layout for mobile applications. In Proceedings of the International Conference on Languages, Compilers, and Tools for Embedded Systems. ACM, New York, NY, 52–63.
[20]
Apple Inc. 2022. Reducing Your App’s Launch Time. Retrieved from https://developer.apple.com/documentation/xcode/reducing-your-app-s-launch-time
[21]
Facebook Inc. 2015. Optimizing Facebook for iOS Start Time. Retrieved from https://engineering.fb.com/2015/11/20/ios/optimizing-facebook-for-ios-start-time
[22]
Facebook Inc. 2021. Redex: A Bytecode Optimizer for Android Apps. Retrieved from https://fbredex.com
[23]
Facebook Inc. 2021. Superpack: Pushing the Limits of Compression in Facebook’s Mobile Apps. Retrieved from https://engineering.fb.com/2021/09/13/core-data/superpack/
[24]
Igor Kabiljo, Brian Karrer, Mayank Pundir, Sergey Pupyrev, Alon Shalita, Yaroslav Akhremtsev, and Alessandro Presta. 2017. Social hash partitioner: A scalable distributed hypergraph partitioner. Proc. VLDB Endow. 10, 11 (2017), 1418–1429.
[25]
J. Kalamationos and David R. Kaeli. 1998. Temporal-based procedure reordering for improved instruction cache performance. In Proceedings of the Conference on High-Performance Computer Architecture. IEEE Computer Society, 244–253.
[26]
Brian W. Kernighan and Shen Lin. 1970. An efficient heuristic procedure for partitioning graphs. Bell Syst. Tech. J. 49, 2 (1970), 291–307.
[27]
Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. 2022. Toward a definitive compressibility measure for repetitive sequences. IEEE Trans. Info. Theory 69, 4 (2022), 2074–2092.
[28]
Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE Computer Society, 75.
[29]
Rahman Lavaee. 2016. The hardness of data packing. In Proceedings of the Symposium on Principles of Programming Languages. ACM, 232–242.
[30]
Rahman Lavaee, John Criswell, and Chen Ding. 2019. Codestitcher: Inter-procedural basic block layout optimization. In Proceedings of the 28th International Conference on Compiler Construction, José Nelson Amaral and Milind Kulkarni (Eds.). ACM, 65–75.
[31]
Rahman Lavaee and Chen Ding. 2014. ABC Optimizer: Affinity Based Code Layout Optimization. Technical Report. University of Rochester.
[32]
Kyungwoo Lee, Ellis Hoag, and Nikolai Tillmann. 2022. Efficient profile-guided size optimization for native mobile applications. In Proceedings of the International Conference on Compiler Construction. ACM, 243–253.
[33]
Kyungwoo Lee, Manman Ren, and Shane Nay. 2022. Scalable size inliner for mobile applications (WIP). In Proceedings of the International Conference on Languages, Compilers, and Tools for Embedded Systems. ACM, 116–120.
[34]
Xing Lin, Guanlin Lu, Fred Douglis, Philip Shilane, and Grant Wallace. 2014. Migratory compression: Coarse-grained data reordering to improve compressibility. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’14). USENIX, Santa Clara, CA, USA, 257–271.
[35]
Gai Liu, Umar Farooq, Chengyan Zhao, Xia Liu, and Nian Sun. 2023. Linker code size optimization for native mobile applications. In Proceedings of the International Conference on Compiler Construction. ACM, New York, NY, 168–179.
[36]
Joel Mackenzie, Matthias Petri, and Alistair Moffat. 2023. Tradeoff options for bipartite graph partitioning. IEEE Trans. Knowl. Data Eng. 35, 8 (2023), 8644–8657.
[37]
Nezar Mansour. 2020. Understanding Cold, Hot, and Warm App Launch Time. Retrieved from https://blog.instabug.com/understanding-cold-hot-and-warm-app-launch-time/
[38]
Julián Mestre, Sergey Pupyrev, and Seeun William Umboh. 2021. On the extended TSP problem. In Proceedings of the 32nd International Symposium on Algorithms and Computation (LIPIcs, Vol. 212), Hee-Kap Ahn and Kunihiko Sadakane (Eds.). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 42:1–42:14.
[39]
Andy Newell and Sergey Pupyrev. 2020. Improved basic block reordering. IEEE Trans. Comput. 69, 12 (2020), 1784–1794.
[40]
Guilherme Ottoni and Bertrand Maher. 2017. Optimizing function placement for large-scale data-center applications. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE Press, 233–244.
[41]
Maksim Panchenko, Rafael Auler, Bill Nell, and Guilherme Ottoni. 2019. BOLT: A practical binary optimizer for data centers and beyond. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE, Washington, DC, 2–14.
[42]
Erez Petrank and Dror Rawitz. 2005. The hardness of cache conscious data placement. Nordic J. Comput. 12, 3 (2005), 275–307.
[43]
Karl Pettis and Robert C. Hansen. 1990. Profile guided code positioning. SIGPLAN Notices 25, 6 (1990), 16–27.
[44]
Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, and Adam Smith. 2013. Sublinear algorithms for approximating string compressibility. Algorithmica 65, 3 (2013), 685–709.
[45]
Peter Reinhardt. 2016. Effect of Mobile App Size on Downloads. Retrieved from https://segment.com/blog/mobile-app-size-effect-on-downloads/
[46]
Rodrigo C. O. Rocha, Pavlos Petoumenos, Björn Franke, Pramod Bhatotia, and Michael O’Boyle. 2022. Loop rolling for code size reduction. In Proceedings of the International Symposium on Code Generation and Optimization, Jae W. Lee, Sebastian Hack, and Tatiana Shpeisman (Eds.). IEEE, 217–229.
[47]
Rodrigo C. O. Rocha, Pavlos Petoumenos, Zheng Wang, Murray Cole, Kim Hazelwood, and Hugh Leather. 2021. HyFM: Function merging for free. In Proceedings of the International Conference on Languages, Compilers, and Tools for Embedded Systems, Jörg Henkel and Xu Liu (Eds.). ACM, 110–121.
[48]
Rodrigo C. O. Rocha, Pavlos Petoumenos, Zheng Wang, Murray Cole, and Hugh Leather. 2020. Effective function merging in the SSA form. In Proceedings of the International Conference on Programming Language Design and Implementation, Alastair F. Donaldson and Emina Torlak (Eds.). ACM, 854–868.
[49]
Benjamin Schwarz, Saumya Debray, Gregory Andrews, and Matthew Legendre. 2001. PLTO: A link-time optimizer for the Intel IA-32 architecture. In Proceedings of the Workshop on Binary Rewriting. 1–7.
[50]
Han Shen, Krzysztof Pszeniczny, Rahman Lavaee, Snehasish Kumar, Sriraman Tallam, and Xinliang David Li. 2023. Propeller: A profile guided, relinking optimizer for warehouse-scale applications. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 617–631.
[51]
Daniel D. Sleator and Robert E. Tarjan. 1985. Amortized efficiency of list update and paging rules. Commun. ACM 28, 2 (1985), 202–208.
[52]
Jeffrey Scott Vitter. 1985. Random sampling with a reservoir. ACM Trans. Math. Softw. 11, 1 (1985), 37–57.
[53]
Qi Wang and Torsten Suel. 2019. Document reordering for faster intersection. Proc. VLDB Endow. 12, 5 (2019), 475–487.
[54]
Tingxin Yan, David Chu, Deepak Ganesan, Aman Kansal, and Jie Liu. 2012. Fast app launching for mobile devices using predictive user context. In Proceedings of the 10th International Conference on Mobile Systems, Applications, and Services. ACM, 113–126.
[55]
Neal E. Young. 2016. Online paging and caching. In Encyclopedia of Algorithms. Springer New York, New York, NY, 1457–1461.
[56]
Jacob Ziv and Abraham Lempel. 1977. A universal algorithm for sequential data compression. IEEE Trans. Info. Theory 23, 3 (1977), 337–343.
[57]
LZFSE compression algorithm. 2024. Retrieved from https://en.wikipedia.org/wiki/LZFSE
[58]
ZSTD compression algorithm. 2024. Retrieved from https://en.wikipedia.org/wiki/Zstd
