
Reordering Functions in Mobile Apps for Reduced Size and Faster Start-Up

Published: 10 June 2024

Abstract

Function layout, also known as function reordering or function placement, is one of the most effective profile-guided compiler optimizations. By reordering functions in a binary, compilers can improve the performance of large-scale applications or reduce the compressed size of mobile applications. Although the technique has been extensively studied in the context of large-scale binaries, no study has thoroughly investigated function layout algorithms on mobile applications.
In this article, we develop the first principled solution for optimizing function layouts in the mobile space. To this end, we identify two key optimization goals: reducing the compressed code size and improving the cold start-up time of a mobile application. Then, we propose a formal model for the layout problem, whose objective closely matches our goals, and a novel algorithm for optimizing the layout. The method is inspired by the classic balanced graph partitioning problem. We have carefully engineered and implemented the algorithm in an open-source compiler, Low-level Virtual Machine (LLVM). An extensive evaluation of the new method on large commercial mobile applications demonstrates improvements in start-up time and compressed size compared to the state-of-the-art approach.

1 Introduction

Mobile applications have become an essential part of everyday life, making it crucial to improve their speed, size, and reliability. Profile-guided optimization (PGO) is a critical component in modern compilers for improving the performance and size of applications; it enables the development and delivery of new app features for mobile devices that have limited storage and low memory. The technique, also known as feedback-driven optimization, leverages the program’s dynamic behavior to generate optimized applications. Currently, PGO is a standard feature in most commercial and open-source compilers.
Modern PGO has been successful in speeding up server workloads [7, 18, 41] by providing a double-digit percentage boost in performance. This is accomplished through a combination of several compiler optimizations, such as function inlining and code layout. PGO relies on execution profiles of a program, such as the execution frequencies of basic blocks and function invocations, to guide compilers in selectively and efficiently optimizing critical paths. Typically, server-side PGO aims to improve CPU and cache utilization during the steady state of the program execution, resulting in higher server throughput. Applying PGO for mobile applications poses a unique challenge, as mobile applications are largely I/O bound and lack a well-defined steady-state performance due to their user-interactive nature. Instead, the download speed and the launch time of an app are crucial to its success, as they directly impact user experience, and therefore, user retention [5, 35].
In this article, we revisit a classic PGO technique, function layout, and show how to successfully apply it in the context of mobile applications. We emphasize that most of the earlier compiler optimizations focus on a single objective, such as the performance or the size of a binary. However, function layout might impact multiple key metrics of a mobile application. We show how to place functions in a binary to simultaneously improve its (compressed) size and start-up performance. The former objective is directly related to the app download speed and has been extensively discussed in recent works on compiler optimizations for mobile applications [5, 32, 33, 35, 46, 48]. The latter received considerably less attention but nevertheless is of prime importance in the mobile space [10, 20, 21, 37].
Function layout, along with basic block reordering and inlining, is one of the most impactful PGOs. The seminal work of Pettis and Hansen [43] introduced a heuristic for function placement that co-locates functions frequently executed together. Improving the code locality reduces instruction translation lookaside buffer (I-TLB) misses, which results in an optimized steady-state performance of large-scale binaries. The follow-up work of Ottoni and Maher [40] is based on the same model and further improved the placement scheme by considering the performance of the processor instruction cache (I-cache). The two heuristics are utilized in the majority of modern compilers and binary optimizers [30, 40, 41, 50]. However, such locality-based optimizations are not used in mobile development, and the corresponding layout algorithms have not been thoroughly studied. One primary reason is that improving the utilization of instruction caches does not affect the key mobile app metrics, including size, download speed, or responsiveness. The recent work of Lee, Hoag, and Tillmann [32] is, to the best of our knowledge, the only study that mentions a technique for function placement in native mobile applications. The work lacks, however, a thorough explanation of why the heuristic impacts the metrics of interest and does not provide an implementation that scales to the largest applications.
With this in mind, we initiate a formal study of function layout algorithms in the context of mobile applications. We provide the first comprehensive investigation of various reordering techniques. First, in Section 1.1, we explain how function layout impacts the compressed app size, which in turn affects its download speed. Then, in Section 1.2, we describe how an optimized function placement can improve start-up time. Finally, Section 1.3 highlights our main contributions, a unified optimization model to tackle these two seemingly unrelated objectives and a novel algorithm for the problem based on the recursive balanced graph partitioning.

1.1 Function Layout for App Download Speed

As mobile apps continue to grow rapidly, reducing the binary size is crucial for application developers [23, 32, 35]. Smaller apps can be downloaded faster, which directly impacts user experience [45]. For example, a recent study [5] establishes a strong correlation between app size and user engagement. Furthermore, mobile app distribution platforms may impose size limitations for downloads that use cellular data. For example, in the Apple App Store, users will not receive timely updates that include critical security improvements if an app’s size exceeds a certain threshold, unless they are connected to a Wi-Fi network.
Mobile apps are distributed to users in a compressed form via mobile app platforms. Using compression schemes tailored to a specific app could lead to smaller apps. Unfortunately, application developers typically do not have control over the compression technique used by the platforms. However, Lee et al. [32] demonstrated that modifying the function layout of a binary can lead to gains in compressed size. Specifically, co-locating “similar” functions in the binary can improve the compression ratio achieved by popular compression algorithms such as ZSTD [58] or LZFSE [57]. A similar technique is used in a bytecode Android optimizer, Redex [22].
Why does function layout affect compression ratios? Most modern lossless compression tools rely on the Lempel-Ziv (LZ) scheme [56]. Such algorithms try to identify long repeated sequences in the data and substitute them with pointers to their previous occurrences. If the pointer is represented using fewer bits than the actual data, then the substitution results in a compressed-size win. That is, the shorter the distance between the repeated sequences, the higher the compression ratio. To make the computation effective, LZ-based algorithms search for common sequences inside a sliding window, which is typically much shorter than the actual data. Therefore, function layouts in which repeated instructions are grouped together lead to smaller (compressed) mobile apps; see Figure 1.
Fig. 1. Placing similar (same-patterned) functions nearby in the binary leads to higher compression rates achieved by Lempel-Ziv algorithms. Functions are considered similar when they share common sequences of instructions that can be encoded by short references.

1.2 Function Layout for App Launch Time

Start-up time is one of the key metrics for mobile applications, since a quick launch ensures that users have a good first impression [54]. According to a study in [37], \(20\%\) of users abandon an app after one use, and \(80\%\) of users give poorly performing apps at most three chances before uninstalling them. Start-up time is the time between a click on an application icon and the display of the first frame after rendering. There are several start-up scenarios: cold start, warm start, and hot start [10, 20, 21]. Switching back and forth between different apps leads to a hot/warm start and typically does not incur significant delays. In contrast, starting an app from scratch or resuming it after a memory-intensive process is referred to as cold start. Our focus is on improving the cold start scenario, which is usually the key performance metric.
Unlike server workloads, where code layout algorithms optimize the cache utilization [39, 40, 41], start-up performance is mostly dictated by memory page faults [12]. When an app is launched, its code is transferred from the permanent storage device to the main memory. Function layout can affect the performance, because the transfer happens at the granularity of memory pages. As illustrated in Figure 2, interleaving cold functions that are never executed with hot functions results in more memory pages being fetched from the storage device. While simply grouping hot functions in the binary is an attractive solution, we note that some mobile apps have a user base of billions of daily active users across a wide range of devices and platforms. As a result, optimizing the layout for a particular usage scenario can lead to a suboptimal performance for other scenarios. The main challenge is to produce a single function layout that optimizes the start-up performance across all use cases.
Fig. 2. Grouping hot (round) functions together, separately from cold (rectangular) functions, leads to a reduction in page faults. The hotness of the functions and their order of execution might depend on the usage scenario (trace), and the task is to find a single optimized function layout.

1.3 Contributions

We model the problem of computing an optimized function layout for mobile apps as the balanced graph partitioning problem [15]. This approach enables a single algorithm to enhance both app start-up time (which impacts user experience) and app size (which impacts download speed). However, while the layout algorithm is the same for the two objectives, it operates with different datasets collected during profiling. For the sake of clarity, we call the optimizations Balanced Partitioning for Start-up Optimization (bps) and Balanced Partitioning for Compression Optimization (bpc). Algorithm 1 outlines our implementation.
The former optimization, bps, is applied to hot functions in the binary that are executed during app start-up, while the latter optimization, bpc, is applied to the remaining cold functions. In our experiments, we found that approximately \(15\%\) of functions are hot, allowing us to improve the overall start-up performance while simultaneously reordering most of the functions in a “compression-friendly” manner. Compared to the prior work [32], which describes a heuristic for function placement in mobile apps, we achieve an average start-up time improvement of \(4\%\) and a compressed size reduction of up to \(2\%\), while speeding up the function layout phase by 30 times for SocialApp (iOS), one of the largest mobile apps in the world. We summarize the contributions of the article as follows.
We formally define the function layout problem in the context of mobile applications. To this end, we identify and formalize two optimization objectives, based on the application start-up time and the compressed size.
Next, we present the Balanced Partitioning algorithm, which takes as input a bipartite graph between function and utility vertices, and outputs an order of the function vertices. We demonstrate how to reduce the objectives of bpc and bps to an instance of the balanced graph partitioning problem.
Finally, we extensively evaluate the compressed size, the start-up performance, and the runtime of the new algorithms with large commercial applications on iOS and Android platforms and an open-source standalone binary.
The rest of the article is organized as follows. Section 2 builds an optimization model for compression and start-up performance, respectively. Then, Section 3 introduces the recursive balanced graph partitioning algorithm, which forms the foundation for effectively solving the two optimization problems. Next, in Section 4, we describe our implementation of the technique in an open-source compiler, LLVM. Section 5 presents an evaluation on real-world mobile applications. We conclude the article with a discussion of related works in Section 6 and possible future directions in Section 7.

2 Building an Optimization Model

We model the function layout problem with a bipartite graph, denoted \(G=(F \cup U, E)\), where F and U are disjoint sets of vertices and E is the set of edges between the vertices. The set F is a collection of all functions in a binary, and the goal is to find a permutation (also called an order or a layout) of F. The set U represents auxiliary utility vertices that are used to define an objective for optimization. Every utility vertex \(u \in U\) is adjacent to a subset of functions, \(f_1, \dots , f_k \in F\), so that \((u, f_1), \dots , (u, f_k) \in E\) for some integer \(k \ge 2\). Intuitively, the goal of the layout algorithm is to place all functions so that \(f_1, \dots , f_k\) are nearby in the resulting order, for each utility vertex u. That is, the utility vertex encodes a locality preference for the adjacent functions. Next, we formalize the intuition for each of the two objectives.

2.1 Compression

As explained in Section 1.1, the compression ratio of a Lempel-Ziv-based algorithm can be improved if similar functions are placed nearby in the binary. This observation is based on earlier theoretical studies [27, 44] and has been verified empirically [13, 34] in the context of lossless data compression. These studies define (sometimes, implicitly) a proxy metric that correlates an order of functions with the compression achieved by an LZ scheme. Suppose we are given some data to compress, e.g., a sequence of bytes that represents the instructions in a binary. Define a k-mer to be a contiguous substring in the data of length k, which is a small constant. Let w be the size of the sliding window utilized by the compression algorithm; typically, w is much smaller than the length of the data. The compression ratio achieved by an LZ-based compression algorithm is determined by the number of distinct k-mers in the data within each sliding window of size w. Equivalently, the compressed size of the data is minimized when each k-mer occurs in as few windows of size w as possible.
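To make the proxy concrete, the following sketch computes it for a byte string. It is only an illustration of the metric described above: for brevity it uses non-overlapping rather than sliding windows, and the function name and defaults are ours.

```python
def distinct_kmer_proxy(data: bytes, k: int = 8, w: int = 64 * 1024) -> int:
    # Sum, over windows of size w, of the number of distinct k-mers in
    # the window: a layout whose repeated k-mers are confined to few
    # windows scores lower and is expected to compress better.
    total = 0
    for start in range(0, len(data), w):
        window = data[start:start + w]
        total += len({window[i:i + k] for i in range(len(window) - k + 1)})
    return total
```

Two candidate layouts of the same .text section can then be compared by concatenating the function bodies in each order and comparing the resulting scores.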
To validate the intuition, we computed and plotted the number of distinct 8-mers within 64 KB windows on a set of functions from SocialApp and ChatApp; see Figure 3. To obtain a data point for the plots, we fixed a specific layout of functions in the binary and extracted its .text section to a string, by concatenating the instructions. Then for every (contiguous) substring of length w, we count the number of distinct k-mers in the substring. This number serves as the proxy metric for predicting the compressed size of the data. We then apply a compression algorithm to the entire string and measure the compressed size. To get multiple points on Figure 3, we repeat the process by starting with a different function layout, which was achieved by randomly permuting some of the functions. The results in Figure 3 reveal a strong correlation between the actual compression ratio achieved on the data and the predicted value based on k-mers. We record a Pearson correlation coefficient \(\rho \gt 0.95\) between the two quantities. Interestingly, such a high correlation is observed for various values of k (in our evaluation, \(4 \le k \le 12\)), different window sizes (4 KB \(\le w \le\) 128 KB), and various compression tools. In particular, we experimented with ZSTD (which combines a dictionary-matching stage with a fast entropy-coding stage), LZ4 (which belongs to the LZ77 family of byte-oriented compression schemes), and LZMA (which uses dictionary compression within the xz tool).
Fig. 3. Correlation between the number of distinct k-mers ( \(k=8\) ) in a sliding window of size \(w=64\) KB in a binary and its compressed size after applying a Lempel-Ziv-based compression algorithm.
Given the remarkable predictive power of this simple proxy metric and the fact that it can easily be computed from a given ordering, we suggest optimizing the function layout in a binary to minimize this metric. We represent each function, \(f \in F\), as a sequence of instructions. For every instruction in the binary that occurs in at least two functions, we create a utility vertex \(u \in U\). The bipartite graph, \(G=(F \cup U, E)\), contains an edge \((f, u) \in E\) if function f contains instruction u; refer to Figure 4 for an illustration of the process. The goal is to co-locate functions that share many utility vertices so that the corresponding instructions can be efficiently encoded.
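A minimal sketch of this construction follows, assuming each function is given as a sequence of already-canonicalized instructions (in practice, 64-bit stable instruction hashes; see Section 4.3). The names and data layout are illustrative, not LLVM's API.

```python
from collections import defaultdict

def build_bpc_graph(functions):
    # `functions` maps a function name to its sequence of instructions.
    # Every instruction occurring in at least two functions becomes a
    # utility vertex; an edge (f, u) means function f contains u.
    occurrences = defaultdict(set)
    for f, instructions in functions.items():
        for inst in set(instructions):  # duplicates within f are ignored
            occurrences[inst].add(f)
    utilities = {u: fs for u, fs in occurrences.items() if len(fs) >= 2}
    edges = [(f, u) for u, fs in utilities.items() for f in fs]
    return list(functions), list(utilities), edges
```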
Fig. 4. Modeling compression-aware function layout (bpc) with a bipartite graph.

2.2 Start-up

To optimize cold start, we develop a simplified memory model. Initially, we assume that the application code is not present in the main memory. When the application starts, the code needs to be fetched from the disk to the main memory at the granularity of memory pages. We assume that the pages are never evicted from the memory. That is, when a function is executed for the first time, its page should be present in the memory to avoid a start-up delay caused by page faults. The goal is to find a function layout that results in the fewest possible page faults.
In this model, the start-up performance is affected only by the first execution of a function; all subsequent executions do not result in page faults. Hence, we record, for each function \(f \in F\), the timestamp when it was first executed, and collect the sequence of functions ordered by the timestamps. Such a sequence of functions is called a function trace. The traces list the functions participating in the cold start and may differ from each other depending on the user or the usage scenario of the application. Next, we assume that we have a representative collection of traces, which we denote by S.
Given an order of functions, we can determine which memory page each function belongs to, assuming we know the memory page size and the size of each function. Then for every start-up trace, \(\sigma \in S\), and an index \(t \le |\sigma |\), we define \(p_{\sigma }(t)\) to be the number of page faults during the execution of the first t functions in \(\sigma\). Similarly, for a set of traces S, we define the evaluation curve as the average number of page faults for each \(\sigma \in S\), that is, \(p(t) := \sum _{\sigma \in S} p_{\sigma }(t) / |S|\).
To gain an intuition, consider what happens when there is a single trace \(\sigma \in S\) (or equivalently, when all traces are identical). Then the optimal layout is to use the order induced by \(\sigma\) , in which case the evaluation curve looks linear in t. In contrast, a random permutation of functions results in fetching most pages early in the execution and the corresponding evaluation curve looks like a step function. We refer the reader to Figure 5(b) for a concrete example of how different layouts lead to different evaluation curves. In practice, we have a diverse set of traces, and the goal is to create a function order whose evaluation curve is as flat as possible.
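The evaluation curve is straightforward to compute from a candidate layout. The sketch below follows the never-evict page model defined above; all names and the 16 KB default are illustrative.

```python
def evaluation_curve(layout, sizes, traces, page_size=16 * 1024):
    # Average number of page faults p(t) after the first t function
    # executions: pages are fetched on first touch and never evicted.
    page_of, offset = {}, 0
    for f in layout:                      # assign functions to pages
        page_of[f] = offset // page_size
        offset += sizes[f]
    max_t = max(len(tr) for tr in traces)
    curve = [0.0] * (max_t + 1)
    for trace in traces:
        touched, faults = set(), 0
        for t in range(1, max_t + 1):
            # Faults plateau once a (shorter) trace is exhausted.
            if t <= len(trace) and page_of[trace[t - 1]] not in touched:
                touched.add(page_of[trace[t - 1]])
                faults += 1
            curve[t] += faults / len(traces)
    return curve
```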
Fig. 5. Modeling start-up-aware function layout (bps).
We remark that while traces have the same length if all functions are executed eventually, the length of the prefix of each trace corresponding to the start-up phase may vary due to diverging execution paths specific to the device and the user. Hence, instead of optimizing the value of \(p(t)\) for a particular t, we aim to minimize the area under the curve \(p(t)\) by selecting a discrete set of threshold values \(t_1, t_2, \ldots , t_k\), and using the bipartite graph \(G = (F \cup U, E)\) with utility vertices
\begin{equation*} U = \lbrace (\sigma , t_i) : \sigma \in S \text{ and } 1 \le i \le k \rbrace , \end{equation*}
and edge set
\begin{equation*} E = \lbrace (f, (\sigma , t_i)) : \sigma ^{-1}(f) \le t_i \rbrace , \end{equation*}
where \(\sigma ^{-1}(f)\) is the index of function f in \(\sigma\) . That way, the algorithm builds an order of F in which the first \(t_i\) positions of every \(\sigma \in S\) occur, as much as possible, consecutively.
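In code, the construction of U and E amounts to connecting every function in each threshold-prefix of a trace to the corresponding utility vertex. A sketch, with illustrative names:

```python
def build_bps_graph(traces, thresholds):
    # Utility vertices are (trace, threshold) pairs; function f is
    # connected to (s, t) when f occurs among the first t functions
    # of trace s, i.e., sigma^{-1}(f) <= t.
    utilities, edges = [], []
    for s, trace in enumerate(traces):
        for t in thresholds:
            utilities.append((s, t))
            for f in trace[:t]:
                edges.append((f, (s, t)))
    return utilities, edges
```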

3 Recursive Balanced Graph Partitioning

Our algorithm for function layout utilizes the recursive balanced graph partitioning scheme. Recall that the input is an undirected bipartite graph \(G=(F \cup U, E)\) , where F and U are disjoint sets of functions and utilities, respectively, and E are the edges between them; see Figure 6(a). The goal of the algorithm is to find a permutation of F that optimizes a specific objective.
Fig. 6. Recursive balanced graph partitioning.
For a high-level overview of our method, refer to Algorithm 1. The algorithm combines recursive graph bisection with a local search optimization at each step. Given an input graph G with \(|F|=n\), we apply the bisection algorithm to obtain two disjoint sets of (approximately) equal cardinality, \(F_1, F_2 \subseteq F\), where \(|F_1|=\lfloor n/2 \rfloor\) and \(|F_2|=\lceil n/2 \rceil\). We lay out \(F_1\) on the set \(\lbrace 1, \dots , \lfloor n/2 \rfloor \rbrace\) and \(F_2\) on the set \(\lbrace \lfloor n/2 \rfloor + 1, \dots , n\rbrace\). By doing so, we divide the problem into two sub-problems, each of half the size, and recursively compute orders for the two subgraphs induced by vertices \(F_1\) and \(F_2\), adjacent utility vertices, and incident edges. Naturally, when the graph contains only one function, the order is trivial; see Figure 6(b) for an overview of the recursive scheme.
Every bisection step of Algorithm 1 is a variant of the local search optimization inspired by the popular Kernighan-Lin heuristic [26] for the graph bisection problem. We start by splitting F into two sets, \(F_1\) and \(F_2\), of roughly equal size. Then, we iteratively exchange pairs of vertices between \(F_1\) and \(F_2\) to improve a certain cost. To this end, we compute, for every function \(f \in F\), the move gain, that is, the difference in the cost after moving f from its current set to the other one. Then the vertices of \(F_1\) (respectively, \(F_2\)) are sorted in decreasing order of the gains to produce a list \(S_1\) (respectively, \(S_2\)). Finally, the lists \(S_1\) and \(S_2\) are traversed in order, exchanging pairs of vertices when the sum of their move gains is positive. Note that unlike the classic graph bisection heuristic [26], we do not update move gains after every swap, which enables an efficient implementation. The process is repeated until a convergence criterion is met or the maximum number of iterations is reached. The final order of the functions is obtained by concatenating the two recursively computed orders for \(F_1\) and \(F_2\).
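The following sketch illustrates one such refinement iteration. It is a simplification of Algorithm 1 with illustrative names: move gains are treated as precomputed, and the random move-skipping of Section 3.1 is applied per pair rather than per vertex for brevity.

```python
import random

def refinement_iteration(F1: set, F2: set, gains: dict,
                         skip_prob: float = 0.1,
                         rng: random.Random = random.Random(0)):
    # Sort both parts by decreasing move gain and exchange pairs while
    # the combined gain is positive; unlike Kernighan-Lin, gains are
    # not recomputed after each swap.
    S1 = sorted(F1, key=lambda f: gains[f], reverse=True)
    S2 = sorted(F2, key=lambda f: gains[f], reverse=True)
    for f1, f2 in zip(S1, S2):
        if gains[f1] + gains[f2] <= 0:
            break                  # later pairs cannot improve the cost
        if rng.random() < skip_prob:
            continue               # occasionally skip to escape minima
        F1.remove(f1); F2.remove(f2)
        F1.add(f2); F2.add(f1)
    return F1, F2
```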
Optimization objective. An important aspect of our algorithm is the objective to optimize at each bisection step. The goal is to find a layout in which functions sharing many utility vertices are co-located in the order. We capture this with the cost of a given partition of F into \(F_1\) and \(F_2\) :
\begin{equation} cost(F_1, F_2) := \sum _{u \in U} cost\big (L(u), R(u)\big), \end{equation}
(1)
where \(L(u)\) and \(R(u)\) are the numbers of functions adjacent to utility vertex u in parts \(F_1\) and \(F_2\) , respectively; see Figure 6(a). Observe that \(L(u)+R(u)\) is the degree of vertex u, and thus, it is independent of the split. The objective, which we try to minimize, is the summation of the individual contributions to the cost over the utilities. The contribution of one utility vertex, \(cost\big (L(u), R(u)\big)\) , is minimized when \(L(u)=0\) or \(R(u)=0\) , that is, when all functions of u belong to the same part; in that case, the algorithm might be able to group the functions in the final order. In contrast, when \(L(u) \approx R(u)\) , the cost takes its highest value, as the functions will likely be spread out in the order. Of course, it is easy to minimize the cost for one utility vertex (by placing its functions to one of the parts). However, minimizing \(cost(F_1, F_2)\) for all utilities simultaneously is a challenging task, due to the constraint on the sizes of \(F_1\) and \(F_2\) .
There are multiple candidate functions that we could use to define \(cost\big (L(u), R(u)\big)\) so that the conditions above are satisfied. After an extensive empirical evaluation of various candidates, the following objective was chosen as the winner:
\begin{equation} cost\big (L(u), R(u)\big) = -L(u) \log {\big (L(u)+1\big)} - R(u) \log {\big (R(u)+1\big)}. \end{equation}
(2)
The definition is inspired by the so-called uniform log-gap cost previously used in the context of index compression [8, 11]. We refer the reader to Section 5.4 for a description of the other alternatives considered and the details of the empirical evaluation.
Finally, we mention that to implement \(ComputeMoveGain(f)\) in Algorithm 1, we simply traverse the edges \((u, f) \in E\) , computing the change in \(cost\big (L(u), R(u)\big)\) that would result from moving f to the other part, and summing up over all neighboring u to get the overall change in objective for such a move.
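Concretely, with the log-gap cost of Equation (2), the move gain of a function can be sketched as follows. Here L[u] and R[u] denote u's adjacent-function counts in \(F_1\) and \(F_2\); the names are ours, not those of the LLVM implementation.

```python
import math

def cost(l: int, r: int) -> float:
    # Uniform log-gap cost of a single utility vertex, Equation (2).
    return -l * math.log2(l + 1) - r * math.log2(r + 1)

def compute_move_gain(f, in_F1: bool, utilities_of, L, R) -> float:
    # Decrease in total cost if f moves to the other part.
    gain = 0.0
    for u in utilities_of[f]:
        if in_F1:   # moving f from F1 to F2: L(u) -= 1, R(u) += 1
            gain += cost(L[u], R[u]) - cost(L[u] - 1, R[u] + 1)
        else:       # moving f from F2 to F1: L(u) += 1, R(u) -= 1
            gain += cost(L[u], R[u]) - cost(L[u] + 1, R[u] - 1)
    return gain
```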
Computational complexity. To estimate the computational complexity of Algorithm 1 and predict its running time, denote \(|F| = n\) and \(|E| = m\) . Suppose that at each bisection step, we apply a constant number of refinement steps (referred to as the iteration limit in the pseudocode). There are \(\lceil \log n \rceil\) levels of recursion, and we assume that every call of \(\mathsf {ReorderBP}\) splits the graph into two equal-sized parts with \(n/2\) vertices and \(m/2\) edges. Each call of the graph bisection consists of computing move gains and sorting two arrays with n elements. The former can be done in \({\mathcal {O}}(m)\) steps, while the latter takes \({\mathcal {O}}(n \log n)\) steps. Therefore, the total number of steps, \(T(n, m)\) , is expressed as the following recurrence:
\begin{equation*} T(n, m) = c (m + n \log n) + 2 \cdot T(n / 2, m / 2), \end{equation*}
where c is the constant hidden inside the asymptotic upper bound on the running time of a graph bisection step. Since there are \(\lceil \log n \rceil\) levels of recursion and the total work across the subproblems of each level is \({\mathcal {O}}(m + n \log n)\), summing over all levels yields \(T(n, m) = {\mathcal {O}}(m \log n + n \log ^2 n)\). We complete the discussion by noticing that it is possible to reduce the worst-case bound to \({\mathcal {O}}(m \log n)\) via a modified procedure for performing swaps [36], but the strategy does not result in a runtime reduction on our datasets. The next section provides important details of our implementation.

3.1 Algorithm Engineering

While implementing Algorithm 1 in an open-source compiler, we developed a few modifications improving certain aspects of the technique. In particular, we implement the algorithm in a parallel manner and limit the depth of the recursion to reduce the running time. Next, we enhance the procedure for the initial split of functions into two parts and the way swaps are performed between the parts, which is beneficial for the quality of the identified solutions. Finally, we propose a sampling technique to reduce the space requirements of the algorithm.
Improving the running time. Due to the simplicity of the algorithm, it can be implemented to run in parallel. Since the two subgraphs resulting from the bisection step are disjoint, the two recursive calls can be processed concurrently. To this end, we use the fork-join computation model, where small enough graphs are processed sequentially, while larger graphs are solved in parallel. To speed up the algorithm further, we cap the maximum depth of the recursion tree (16 in our implementation) and limit the number of local search iterations per split (20 in our implementation). If the recursion reaches the maximum depth and there are still unordered functions, then we fall back to the original relative order of the functions provided by the compiler.
Finally, we observe that our objective cost requires repeated computation of \(\log (x+1)\) expressions for integer arguments. To avoid costly floating-point logarithm evaluations, we pre-compute a table of values for \(0 \le x \lt 2^{14}\), where the upper bound is chosen small enough for the table to fit in the processor data cache. That way, we replace most of the logarithm evaluations with a table lookup, saving approximately \(10\%\) of the total runtime.
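A sketch of the lookup-table idea, in Python for brevity (the production implementation is C++ inside LLVM):

```python
import math

# Pre-computed log2(x + 1) for 0 <= x < 2^14; the bound is chosen so
# that the table fits in the processor data cache.
LOG_TABLE = [math.log2(x + 1) for x in range(1 << 14)]

def fast_log(x: int) -> float:
    # Table lookup for small arguments; fall back to the library call.
    return LOG_TABLE[x] if x < (1 << 14) else math.log2(x + 1)
```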
Optimizing the quality. One ingredient of Algorithm 1 is how the two initial sets, \(F_1\) and \(F_2\), are initialized. Arguably, the initialization procedure might affect the quality of the final vertex order, since it serves as the starting point for the local search optimization. To initialize the bisection, we consider two alternatives. The simpler one, outlined in the pseudocode, is to randomly split F into two (approximately) equal-sized sets. A more involved strategy is to employ minwise hashing [8, 11] to order the functions by similarity and then assign the first \(\lfloor n/2 \rfloor\) functions to \(F_1\) and the last \(\lceil n/2 \rceil\) to \(F_2\). As discussed in Section 5.4, splitting the vertices randomly is our preferred option due to its simplicity.
An interesting aspect of Algorithm 1 is the way functions are exchanged between the two sets. Recall that we pair the functions in \(F_1\) with functions in \(F_2\) based on the computed move gains, which are positive when a function should be moved to the other set and negative when a function should stay in its current set. We observed that it is beneficial to skip some of the moves. To this end, we introduce a fixed probability (0.1 in our implementation) of skipping the move for a vertex that would otherwise have been moved to a new set. Intuitively, this adjustment prevents the optimization from becoming stuck at a local minimum. It is also beneficial for avoiding redundant swapping cycles, which might occur in the algorithm; refer to References [36, 53] for a discussion of the problem in the context of graph reordering.
Reducing the space complexity. One potential downside of our start-up function layout algorithm is the need to collect full traces during profiling. If too many executions are profiled, then the storage requirements may become impractical. To address the issue, we cap the number of stored traces by a fixed integer \(\ell\). If the profiling process generates more than \(\ell\) traces, then we select a representative random sample of size \(\ell\) using reservoir sampling [52]. When the ith trace arrives, if \(i \le \ell\), then we keep the trace; otherwise, with probability \(1- \ell / i\), we ignore the trace, and with the complementary probability, we pick uniformly at random one of the stored traces and swap it out with the new one. The process yields a sample of \(\ell\) traces chosen uniformly at random from the stream of traces. If the reservoir is too small, then we run the risk of the samples not being representative and the solution found being of low quality. However, as we add more and more samples, the improvement in solution quality diminishes. In Section 5.2, we show empirically that the benefit of using a reservoir larger than \(\ell =300\) is negligible. Thus, our implementation uses \(\ell =300\).
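The sampling step is standard reservoir sampling; a minimal sketch with illustrative names:

```python
import random

def sample_traces(stream, ell=300, rng=random.Random(0)):
    # Keep a uniform random sample of `ell` traces from an arbitrarily
    # long stream: the i-th trace replaces a random stored trace with
    # probability ell / i, matching the policy described above.
    reservoir = []
    for i, trace in enumerate(stream, start=1):
        if i <= ell:
            reservoir.append(trace)
        elif rng.random() < ell / i:
            reservoir[rng.randrange(ell)] = trace
    return reservoir
```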

4 Implementation in LLVM

Both bpc and bps use profile data to guide function layout. Ideally, profile data should accurately represent common real-world scenarios. The classic instrumentation in LLVM [28] produces an instrumented binary with large size and performance overhead due to added instrumentation instructions, added metadata sections, and changes in optimization passes. This is particularly problematic for mobile devices, where increased code size can lead to performance regressions and alter the behavior of the application. Profiles collected from these instrumented binaries might not accurately represent our target scenarios. To address these issues, Machine Intermediate Representation (IR) Profile (MIP) [32] aims to minimize the binary size and the performance overhead for instrumented binaries. This is achieved by extracting instrumentation metadata from the binary and using it to post-process the profiles offline.
MIP collects profile data that are relevant for optimizing mobile apps. It records function call counts used to identify functions as either hot or cold. Within each function, MIP can derive coverage data for each basic block. MIP has an optional mode, called return address sampling, which adds probes to callees to collect a sample of their call-sites. This can be used to construct a dynamic call graph that includes dynamically dispatched calls. Furthermore, MIP collects function timestamps by recording and incrementing a global timestamp for each function when it is called for the first time. We sort the functions by their initial call timestamp to construct a function trace. To collect raw profiles at runtime, we run instrumented apps under normal usage and dump raw profiles to the disk, which are then uploaded to a database. These raw profiles are later merged offline into a single indexed profile.

4.1 Overview of the Build Pipeline

Figure 7 shows an overview of our build pipeline. We collect thousands of raw profile data files from various uses and periodically perform offline post-processing to generate a single indexed profile. This profile is used to categorize functions as either hot or cold based on whether they have been executed in any raw profile. During post-processing, bps determines the optimized order of hot functions that were profiled, including both start-up and non-start-up functions. Our apps are built with link-time optimization (LTO or ThinLTO). At the end of LTO, bpc orders cold functions to achieve a highly compressed binary size. These two orders of functions are concatenated and passed to the linker, which finalizes the function layout in the binary. We have chosen to use two separate optimization passes for bps and bpc, since applying them jointly at the end of LTO would require carrying a large amount of trace data through the build pipeline.
Fig. 7. Overview of the build pipeline with the optimized function layout.

4.2 Hot Function Layout

As shown in Figure 7, we first merge the raw profiles into the indexed profile with instrumentation metadata during post-processing. For the block coverage and dynamic call graph data, we simply accumulate them into the indexed profile as we go along. However, to run bps, we need to keep the function timestamps from each raw profile. We encode the sequence of indices to the functions participating in the cold start-up and append them to a separate section of the indexed profile.
The bps algorithm, described in Section 2.2, uses function traces with thresholds to define utility vertices, and produces an optimized order for start-up functions. Once bps is completed, the embedded function traces are no longer needed and can be removed from the indexed profile.
Third-party library functions and outlined functions that appear after instrumentation in the compilation pipeline might not be instrumented. To order such functions, we first check if their call sites are profiled using block coverage data. If that is the case, then these functions inherit the order of their first caller. For example, if an uninstrumented outlined function, \(f_{outlined}\), is called from the profiled functions, \(f_A\) and \(f_B\), and bps orders \(f_A\) followed by \(f_B\), then we insert \(f_{outlined}\) after \(f_A\); this results in the layout \(f_A, f_{outlined}, f_B\).
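A sketch of this caller-based placement (names are ours; the real pass operates on the compiler's internal representation):

```python
def insert_uninstrumented(order, callers_of):
    # Each uninstrumented function inherits the position of its first
    # profiled caller: it is inserted right after the earliest caller
    # appearing in the bps order. For f_outlined called from f_A and
    # f_B with f_A ordered first, this yields f_A, f_outlined, f_B.
    position = {f: i for i, f in enumerate(order)}
    result = list(order)
    for f, callers in callers_of.items():
        profiled = [c for c in callers if c in position]
        if profiled:
            first = min(profiled, key=lambda c: position[c])
            result.insert(result.index(first) + 1, f)
    return result
```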

4.3 Cold Function Layout

We execute bpc after optimization and code generation are finished, without using an IR, as shown in Figure 7. During code generation, each function publishes a set of hashes that represent its contents, which are meaningful across modules. We use one 64-bit stable hash for each instruction by combining hashes of its opcode and operands, resulting in an 8-mer, a substring of length 8, for every instruction. When computing stable hashes, we omit hashes of pointers and instead use hashes of the contents of their targets. Unlike outliners that need to match instruction sequences, we do not consider the order or duplicates of the hashes. We only keep track of the unique stable hashes per function as the input to bpc.
Since hot functions are already ordered, we filter them out before applying bpc. It is worth noting that outliners can optimistically produce many identical functions, which will eventually be folded by the linker. To efficiently handle deduplication, bpc groups functions that have identical sets of hashes and runs on the set of unique cold functions.
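The following sketch illustrates the per-function input to bpc and the deduplication step. Here, stable_hash is an illustrative stand-in for LLVM's per-instruction stable hashing, not the actual implementation.

```python
import hashlib
from collections import defaultdict

def stable_hash(opcode: str, operands: tuple) -> int:
    # Stand-in for a 64-bit stable instruction hash combining opcode
    # and operand hashes; in the real pass, pointer operands are
    # replaced by hashes of the contents of their targets.
    h = hashlib.blake2b(digest_size=8)
    h.update(opcode.encode())
    for op in operands:
        h.update(str(op).encode())
    return int.from_bytes(h.digest(), "little")

def unique_cold_functions(cold_funcs):
    # cold_funcs maps a function name to its list of (opcode, operands)
    # pairs. Each function is keyed by its *set* of instruction hashes
    # (order and duplicates ignored), so functions with identical sets,
    # e.g., identical outlined bodies, are grouped and laid out once.
    groups = defaultdict(list)
    for name, instructions in cold_funcs.items():
        key = frozenset(stable_hash(op, ops) for op, ops in instructions)
        groups[key].append(name)
    return groups
```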

5 Evaluation

We design our experiments to answer two primary questions: (i) How well does the new function layout impact real-world binaries in comparison with alternative techniques? (ii) How do various parameters of the algorithm contribute to the solution, and what is the best set of parameters? We also investigate the scalability of our algorithm.

5.1 Experimental Setup

We evaluated our approach on three commercial iOS applications, three commercial Android applications, and an open-source compiler; refer to Table 1 for basic properties of the apps. Among the iOS applications, SocialApp is one of the largest mobile applications in the world, with a total size of over 250 MB; it provides a variety of usage scenarios, which makes it an attractive target for compiler optimizations. ChatApp is a medium-sized mobile app with a total size of over 50 MB. AdsApp is another large mobile app with a total size of 190 MB. Similarly, the three Android apps are large mobile applications whose sizes approach 80 MB. Each of the applications consists of a number of binaries built individually and packaged together. Observe that the Android applications consist of several hundred shared native binaries, while iOS ChatApp and AdsApp contain only 7 and 4 binaries, respectively. Therefore, individual iOS binaries are much larger in size than the Android binaries. For this reason, the iOS instances are built with ThinLTO, while the Android ones use (Full)LTO. We highlight that iOS applications are primarily developed in Objective-C and Swift, whereas Android applications utilize C/C++ for their native components to provide JNI support. Finally, Clang (release 15.0) is an open-source C++ standalone program built on MacOS with (Full)LTO.
Table 1.

| App       | Platform | Binaries in app package | .text size (MB) | App size (MB) | Total func. | Hot func. | Blocks per func. (p50/p95/p99) |
|-----------|----------|-------------------------|-----------------|---------------|-------------|-----------|--------------------------------|
| SocialApp | iOS      | 61  | 119 | 259 | 856K | 154K | 1 / 11 / 29  |
| ChatApp   | iOS      | 7   | 35  | 58  | 202K | 44K  | 3 / 24 / 70  |
| AdsApp    | iOS      | 4   | 87  | 194 | 682K | 41K  | 1 / 18 / 62  |
| SocialApp | Android  | 369 | 45  | 73  | 215K | 16K  | 3 / 34 / 107 |
| ChatApp   | Android  | 393 | 49  | 80  | 229K | 17K  | 3 / 36 / 110 |
| AdsApp    | Android  | 297 | 41  | 79  | 210K | 22K  | 3 / 36 / 109 |
| Clang     | MacOS    | 1   | 61  | 69  | 134K | 32K  | 3 / 26 / 93  |
Table 1. Basic Properties of Evaluated Applications
The last three columns show percentile statistics about the number of blocks per function in a given app (e.g., p50 is the median number of blocks per function).
All the algorithms discussed in the article are implemented on top of release_14 of LLVM. The binaries are built on a Linux-based server with a dual-node 28-core 2.4 GHz Intel Xeon E5-2680 (Broadwell) with 256 GB of RAM.
Next, in Section 5.2, we demonstrate the impact of function reordering on the start-up performance. These experiments are conducted on four applications (SocialApp and ChatApp for iOS and Android), which are integrated with a large-scale automatic measurement system deployed in the production environment. Then, in Section 5.3, we conduct experiments with the compressed app size, which do not require production deployment and hence are analyzed for all the instances from the benchmark.

5.2 Start-up Performance

Here, we present the impact of function layout on the start-up performance of mobile apps. Recall that the start-up performance, often referred to as the cold start time, is the duration from the moment an app is launched until it loads and displays its first feed, thereby becoming ready for user interaction. The new algorithm, which we refer to as bps, is compared with the following alternatives:
baseline is the original ordering of functions as dictated by the compiler; the function layout follows the order of object files that are passed into the linker;
random is the result of randomly permuting the hot functions;
order-avg is a heuristic for ordering hot functions suggested in [32]; it orders functions by their average start-up timestamps computed across all traces.
We first evaluate the start-up performance of iOS apps. In the production environment for iOS apps, only a single application version can be shipped. Hence, to compare the new bps algorithm with alternatives, we analyze two consecutive releases of the apps. Release N corresponds to a version with the order-avg algorithm, while release \(N+1\) utilizes bps. We acknowledge that the differences observed may result from multiple optimizations shipped simultaneously with bps, as the applications are being actively developed. To account for this, we repeated the alternation three times in consecutive releases and reported the average improvements. Table 2 provides the measurements for start-up time reductions for SocialApp (\(2.9\%\)) and ChatApp (\(5.7\%\)), where the 99% confidence intervals estimated from the three collected data-points are within \(0.5\%\). To understand the reason behind the reduced start-up time, we record the number of major page faults during start-up. Table 3 presents the detailed results for the average and the 99th percentile number of page faults observed for millions of samples published in production. On average, bps reduced the number of major page faults by \(6.9\%\) and \(16.9\%\) for SocialApp and ChatApp, respectively. To double-check that the average is not being affected by outliers, we inspect the improvement in the 99th percentile of major page faults; at \(4.6\%\) for SocialApp and \(9.1\%\) for ChatApp, the numbers indicate that most of the improvement comes from typical executions. We emphasize that less effective algorithms (such as baseline or random) cannot be shipped without regressing the performance for all users, and thus, the corresponding experiments are omitted.
Table 2.

| App                 | Comparison            | Start-up Time |
|---------------------|-----------------------|---------------|
| SocialApp (iOS)     | bps vs order-avg      | \(2.9\%\)  |
| ChatApp (iOS)       | bps vs order-avg      | \(5.7\%\)  |
| SocialApp (Android) | order-avg vs baseline | \(0.54\%\) |
| SocialApp (Android) | bps vs baseline       | \(0.85\%\) |
| ChatApp (Android)   | order-avg vs baseline | \(0.60\%\) |
| ChatApp (Android)   | bps vs baseline       | \(1.26\%\) |
Table 2. Relative Improvements of the Start-up Performance of Various Function Layout Algorithms on iOS and Android Applications
Table 3.

| App             | Algorithm | Release     | Average | p99 |
|-----------------|-----------|-------------|---------|-----|
| SocialApp (iOS) | order-avg | N           | \(3.4\text{K}\) | \(7.6\text{K}\) |
| SocialApp (iOS) | bps       | \(N+1\)     | \(3.1\text{K}~(6.9\%)\)  | \(7.2\text{K}~(4.6\%)\) |
| ChatApp (iOS)   | order-avg | N           | \(1.7\text{K}\) | \(10.3\text{K}\) |
| ChatApp (iOS)   | bps       | \(N+1\)     | \(1.4\text{K}~(16.9\%)\) | \(9.3\text{K}~(9.1\%)\) |
Table 3. Number of Major Page Faults Measured for order-avg and bps Shipped in Consecutive Releases for iOS Apps.
The relative improvement of bps over order-avg is shown in parentheses. The column labeled average compares the average number of page faults across all executions, while column labeled p99 indicates the 99th-percentile of the number of page faults across all executions.
An interesting observation from Table 3 is that function layout has a greater impact on the start-up performance of ChatApp (iOS) than on that of SocialApp (iOS). This is likely because SocialApp consists of dozens of native binaries. Since function layout is applied to each binary individually, the overall impact on the start-up performance of SocialApp is reduced. In contrast, ChatApp is composed of only 7 binaries, and among them, only the largest one is responsible for the start-up; thus, an optimized function layout can directly impact the performance.
Next, we discuss the start-up performance of two Android apps. Unlike the experiments with iOS, there is currently no accurate way of measuring page faults; however, there exists an option to deliver different versions of an app to different users (via Google PlayStore), and thus, to perform A/B experiments. We estimate the impact of function layout on the start-up performance for order-avg and bps over baseline. As illustrated in Table 2, bps results in the largest reduction of the start-up time; the 99% confidence intervals for the A/B experiment are within \(0.1\%\). We notice that the start-up time reductions for Android apps appear smaller than the corresponding values for iOS apps: \(0.85\%\) and \(1.26\%\) for SocialApp (Android) and ChatApp (Android), respectively. Our explanation is that Android apps consist of a much larger number of binaries yet have similar or smaller code size; refer to Table 1. Hence, the opportunity for bps to reduce the number of page faults is smaller.
Finally, we investigate the start-up performance using the optimization model developed in Section 2.2. Note that using the model enables a fine-grained analysis, which is impossible in production environment. Figure 8(a) illustrates the mean number of page faults simulated during start-up with up to 20K time steps, utilizing different function layout algorithms, on SocialApp (iOS). As expected, random immediately suffers from many page faults early in the execution, as the randomized placement of functions likely spans many pages. Note that baseline outperforms random; this is likely due to the natural co-location of related functions in the source code. Then, order-avg improves the evaluation curve over baseline, and bps stretches the curve even closer to the linear line, further minimizing the expected number of page faults.
Fig. 8. (a) Simulated start-up performance of various function layout algorithms, and (b) the number of sampled traces on SocialApp (iOS); the page size is assumed to be 16 KB with a total of 265 memory pages.
Figure 8(b) shows the average number of page faults for different numbers of function traces on SocialApp (iOS). Ideally, we want to build a function layout utilizing as few traces as possible. In practice, the start-up scenarios are fairly consistent across different usages, and one can reduce the need to capture all raw function traces. We observe that running bps with more than 300 traces does not improve the page faults on our dataset. As described in Section 3.1, we use this value by default for bps to reduce the space complexity of post-processing.

5.3 Compressed App Size

We now present the results of size optimizations on the selected applications. In addition to the baseline and random function layout algorithms, we compare bpc with the following heuristic:
greedy is an approach for ordering cold functions suggested in [32]. It is a procedure that iteratively builds an order by appending one function at a time. On each step, the most recently placed function is compared (based on its instructions) with the not-yet-selected functions, and the one with the highest similarity score is appended to the order. To avoid an expensive \({\mathcal {O}}(n^2)\)-computation of the scores, several pruning rules are applied to reduce the set of candidates; see [32] for details.
Table 4 summarizes the app size reduction from each function layout algorithm, where the improvements are computed on top of baseline (that is, the original order of functions generated by the compiler). The compressed size reduction is measured in three modes: the size of the .text section of the binary, which is directly impacted by our optimization; the size of the executables excluding resource files such as images and videos; and the total app size in a compressed package. For iOS apps, we observe that bpc reduces the size of .text by \(3\%\), \(1.8\%\), and \(5.4\%\) for SocialApp, ChatApp, and AdsApp, respectively. Since this section is the largest in the binary (responsible for \(2/3\) of the compressed app size), this translates into overall \(1.9\%\), \(1.3\%\), and \(3.9\%\) improvements. At the same time, the impact of all the tested algorithms on the uncompressed size of a binary is minimal (within \(0.1\%\)), which is mainly due to differences in code alignment. Compared to SocialApp and ChatApp, AdsApp exhibits a larger size reduction. This is likely because each individual binary in AdsApp has more functions. Notably, the execution of greedy on the app was unsuccessful due to an out-of-memory failure.
Table 4.

| App                 | Algorithm | Text | Executables | App Size |
|---------------------|-----------|------|-------------|----------|
| SocialApp (iOS)     | random    | \(-5.3\%\)  | \(-4.6\%\)  | \(-3.7\%\) |
| SocialApp (iOS)     | greedy    | \(1.6\%\)   | \(1.3\%\)   | \(1.1\%\)  |
| SocialApp (iOS)     | bpc       | \(3.0\%\)   | \(2.3\%\)   | \(\boldsymbol {1.9\%}\) |
| ChatApp (iOS)       | random    | \(-4.9\%\)  | \(-4.4\%\)  | \(-3.8\%\) |
| ChatApp (iOS)       | greedy    | \(1.3\%\)   | \(1.0\%\)   | \(0.8\%\)  |
| ChatApp (iOS)       | bpc       | \(1.8\%\)   | \(1.6\%\)   | \(\boldsymbol {1.3\%}\) |
| AdsApp (iOS)        | random    | \(-6.1\%\)  | \(-10.3\%\) | \(-8.2\%\) |
| AdsApp (iOS)        | greedy    | OOM         | OOM         | OOM        |
| AdsApp (iOS)        | bpc       | \(5.4\%\)   | \(4.9\%\)   | \(\boldsymbol {3.9\%}\) |
| SocialApp (Android) | random    | \(-4.3\%\)  | \(-3.7\%\)  | \(-1.6\%\) |
| SocialApp (Android) | greedy    | \(1.3\%\)   | \(0.7\%\)   | \(0.3\%\)  |
| SocialApp (Android) | bpc       | \(2.3\%\)   | \(1.4\%\)   | \(\boldsymbol {0.6\%}\) |
| ChatApp (Android)   | random    | \(-2.2\%\)  | \(-2.0\%\)  | \(-1.0\%\) |
| ChatApp (Android)   | greedy    | \(0.4\%\)   | \(0.1\%\)   | \(0.1\%\)  |
| ChatApp (Android)   | bpc       | \(0.9\%\)   | \(0.5\%\)   | \(\boldsymbol {0.2\%}\) |
| AdsApp (Android)    | random    | \(-4.2\%\)  | \(-4.1\%\)  | \(-1.6\%\) |
| AdsApp (Android)    | greedy    | \(1.4\%\)   | \(0.4\%\)   | \(0.2\%\)  |
| AdsApp (Android)    | bpc       | \(2.3\%\)   | \(1.0\%\)   | \(\boldsymbol {0.4\%}\) |
| Clang (MacOS)       | random    | \(-11.5\%\) | \(-8.1\%\)  | N/A        |
| Clang (MacOS)       | greedy    | \(3.6\%\)   | \(3.1\%\)   | N/A        |
| Clang (MacOS)       | bpc       | \(6.6\%\)   | \(5.0\%\)   | N/A        |
Table 4. Compressed Size Improvements of Various Function Layout Algorithms Over baseline; Negative Values Indicate Regressions
AdsApp (iOS) has no data for greedy, as the algorithm failed to run due to an out-of-memory failure (indicated by OOM).
An interesting observation is the behavior of random on the dataset, which worsens the compression ratios by approximately \(5\%\) in comparison to baseline. The explanation is that similar functions are naturally clustered in the source code. For example, functions within the same object file tend to have many local calls, making the corresponding call instructions good candidates for a compact LZ-based encoding. Yet bpc can improve the instruction locality beyond the natural clustering by reordering functions across different object files.
Next, we assess the reduction in compressed size for Android applications. It is important to note that the total app size is measured using the Android Package Kit (apk), which includes not only native binaries but also Android Dex bytecode files [22]. Hence, the reported app size improvements are noticeably smaller than for the .text sections or the executables. For all the Android apps, two factors play a role. On the one hand, the application packages are comprised of a large number of individual binaries, each containing a relatively small number of functions. This diminishes the impact of function reordering. On the other hand, the applications are implemented in C/C++, which results in fewer call-sites in comparison to Objective-C on iOS. Function call instructions encode their call targets with relative offsets whose values differ for each call-site, making C++ programs more layout-sensitive from the compression point of view. Overall, we record \(2.3\%\), \(0.9\%\), and \(2.3\%\) .text size reductions for SocialApp, ChatApp, and AdsApp, respectively, by applying bpc on top of baseline; the corresponding sizes of apk are reduced by \(0.6\%\), \(0.2\%\), and \(0.4\%\).
Similar to the Android apps, Clang is developed in C++. Unlike the commercial applications, it is a single binary, which yields a larger size improvement of \(6.6\%\) for its .text section. As the binary is not bundled with other resource files, the concept of app size is not applicable in this context.

5.4 Further Analysis of Balanced Partitioning

The new Algorithm 1 has a number of parameters that can affect its quality and performance. In the following, we discuss some of the parameters and explain our choice of their default values. We emphasize that such a detailed analysis requires building and evaluating hundreds of app versions, which is feasible only for synthetic data in a simulated environment.
As discussed in Section 3, the central component of the algorithm is the objective to optimize at a bisection step, which we refer to as the cost of a partition of the set of functions into two disjoint parts. Equation (1) provides a general form of the objective, where the summation is taken over the utility vertices. For a given utility vertex having \(L(u)\) functions adjacent in one part and \(R(u)\) functions adjacent in the other part, \(cost\big (L(u), R(u)\big)\) can be an arbitrary objective that is minimized when \(L(u) = 0\) or \(R(u) = 0\) and maximized when \(L(u) = R(u)\). Besides the uniform log-gap defined by Equation (2), we consider two alternatives. The first one is probabilistic fanout, defined as follows:
\begin{equation*} cost\big (L(u), R(u)\big) = 1 - p^{L(u)} + 1 - p^{R(u)}, \end{equation*}
where \(p \in [0, 1]\) is a constant. The objective is motivated by partitioning graphs in the context of database sharding [24], where p represents the probability that a query to a database accesses a certain shard. The second utilized objective represents the absolute difference between \(L(u)\) and \(R(u)\) :
\begin{equation*} cost\big (L(u), R(u)\big) = L(u) + R(u) - |L(u) - R(u)|. \end{equation*}
It is easy to verify that both objectives satisfy the requirements mentioned above.
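For reference, the three candidate objectives can be written side by side; the sketch below (with illustrative names) also checks the two required properties on a small example.

```python
import math

def log_gap(l: int, r: int) -> float:
    # Uniform log-gap cost, Equation (2): the objective used by default.
    return -l * math.log2(l + 1) - r * math.log2(r + 1)

def fanout(l: int, r: int, p: float = 0.9) -> float:
    # Probabilistic fanout, motivated by database sharding [24].
    return (1 - p ** l) + (1 - p ** r)

def abs_diff(l: int, r: int) -> float:
    # Absolute difference: rewards placing all of u's functions in one part.
    return l + r - abs(l - r)

# Each objective is minimized when all d neighbors of u are in one part
# and takes larger values for a balanced split:
d = 10
for cost in (log_gap, fanout, abs_diff):
    assert cost(d, 0) <= cost(d // 2, d - d // 2)
```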
The plots in Figure 9 illustrate the impact of the optimization objective on the quality of resulting function layouts, that is, the corresponding compressed size and the (estimated) number of page faults during start-up. For the fanout objective, we utilize \(p=0.9\), as it typically results in the best outcomes in the evaluation. We observe that the uniform log-gap cost results in the smallest binaries, outperforming fanout and absolute difference by \(1.7\%\) and \(2.3\%\), respectively. Similarly, this objective is the best choice for start-up optimization, where it yields fewer page faults by \(4.1\%\) and \(11.5\%\) on average, respectively. Thus, we consider the uniform log-gap the preferred option for the optimization. However, function orders produced by the algorithm coupled with the other two objectives are still meaningfully better than the alternatives investigated in Sections 5.2 and 5.3.
Fig. 9. Impact of the depth of recursion, the iteration limit, and the optimization objective on the quality of resulting function orders evaluated on ChatApp (iOS).
Next, we experiment with two parameters affecting Algorithm 1: the number of refinement iterations and the maximum depth of the recursion. Figure 9 (top) illustrates the impact of the latter on the quality of the result evaluated on ChatApp (iOS). For every \(0 \le d \lt 20\) (that is, when the input graph is split into \(2^d\) parts), we stop the algorithm at depth d and measure the quality of the order respecting the computed partition. It turns out that further bisection stops being beneficial once the parts contain only a few tens of vertices. Therefore, we limit the depth of the recursion to \(\min (\lceil \log _2 n \rceil , 16)\) levels. To investigate the effect of the maximum number of refinement iterations, we apply the algorithm with various values in the range between 1 and 45; see Figure 9 (bottom). We observe an improvement of the quality up to iteration 20, which serves as the default limit in our implementation.
Finally, we explore the choice of the initial split of F into \(F_1\) and \(F_2\) in Algorithm 1. Arguably, the initialization procedure might affect the quality of the final order, as it provides the starting point for the subsequent local search. To verify the hypothesis, we implemented three initialization techniques that bisect a given graph: (i) random splitting, as outlined in the pseudocode; (ii) similarity-based minwise hashing [8, 11]; and (iii) an input-based strategy that splits the functions according to their relative order in the compiler. In our experiments we found no consistent winner among the three options; therefore, we recommend the simplest approach, (i), in the implementation. A sketch of option (i) is given below.
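The sketch below illustrates the recommended random split; the function name and the representation of F as an index range are assumptions of this sketch, not part of the LLVM implementation.
```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Option (i): split the functions of F (identified by indices 0..n-1)
// uniformly at random into two halves F1 and F2. Options (ii) and (iii)
// would instead order the indices by a minwise-hash signature or by their
// relative order in the compiler before splitting in the middle.
std::pair<std::vector<int>, std::vector<int>> randomSplit(int n,
                                                          std::mt19937 &rng) {
  std::vector<int> idx(n);
  std::iota(idx.begin(), idx.end(), 0);
  std::shuffle(idx.begin(), idx.end(), rng);
  std::vector<int> F1(idx.begin(), idx.begin() + n / 2);
  std::vector<int> F2(idx.begin() + n / 2, idx.end());
  return {std::move(F1), std::move(F2)};
}
```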

5.5 Build-time Analysis

Finally, we discuss the impact of function layout on the build time of the applications. The time overhead of running bpc is minimal in comparison with the overall build time: it takes around 20 s for SocialApp (iOS) and less than 1 s for the smaller apps. In contrast, the greedy approach leads to a noticeable slowdown, increasing the overall build time of SocialApp by around 10 min, which accounts for more than \(10\%\) of the total build time of approximately 100 min. Furthermore, greedy, with its quadratic complexity, fails to process the largest binary in the AdsApp (iOS) package, running out of memory. Refer to Table 5 for build-time measurements on the largest apps from the benchmark. Observe that for Android apps, which consist of smaller binaries, the build-time impact of any function layout is insignificant.
Application       baseline   greedy   bpc
SocialApp (iOS)   5,400      6,000    5,420
AdsApp (iOS)      4,530      OOM      4,580
Clang (MacOS)     350        400      370

Table 5. Impact of Various Function Layout Algorithms on the Build Time (in Seconds) of the Largest Applications
The worst-case time complexity of our implementation is bounded by \({\mathcal {O}}(m \log n + n \log ^2 n)\), where n is the total number of functions and m is the number of function-utility edges. The estimate aligns with Figure 10(a), which plots the runtime as a function of the number of functions in the binary. We emphasize that the measurements are done in a multi-threaded environment in which distinct subgraphs (arising from the recursive computation) are processed in parallel. To assess the speedup from parallelism, we limit the number of threads available to the computation; see Figure 10(b). Observe that using two threads provides approximately a 2\(\times\) speedup, whereas four threads yield a 2.5\(\times\) speedup over the single-threaded implementation. Increasing the number of threads further does not yield measurable runtime improvements. However, it is likely that for larger instances with more recursive subgraphs, utilizing more threads can be beneficial.
Fig. 10. Impact of the number of functions and the depth of recursion on the runtime of function layout evaluated on SocialApp (iOS).
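To illustrate why the recursion parallelizes so naturally, the sketch below processes the two subgraphs of each bisection step concurrently. Graph and bisect are hypothetical stand-ins for the actual data structures and routines, which are not shown here.
```cpp
#include <functional>
#include <future>
#include <utility>

// Hypothetical stand-in for the bipartite graph of functions and utilities.
struct Graph { /* functions, utility vertices, adjacency lists */ };

// Placeholder: the real routine splits the part in two and refines the split
// by local search to minimize the cost of Equation (1).
std::pair<Graph, Graph> bisect(const Graph &) { return {}; }

// After a bisection step, the two subgraphs share no function vertices, so
// they can be processed by independent threads. A real implementation would
// cap the number of spawned threads (the evaluation above uses up to four).
void reorderRecursively(const Graph &g, int depth, int maxDepth) {
  if (depth >= maxDepth)
    return; // leaf reached: keep the current order of this part
  auto [left, right] = bisect(g);
  auto task = std::async(std::launch::async, reorderRecursively,
                         std::cref(left), depth + 1, maxDepth);
  reorderRecursively(right, depth + 1, maxDepth);
  task.wait();
}
```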

6 Related Work

There exists a rich literature on profile-guided compiler optimizations. Here, we discuss previous works that are closely related to PGO in the mobile space, code layout techniques, and our algorithmic contributions.
PGO. Most compiler optimizations for mobile applications aim at reducing code size. Such techniques include algorithms for function inlining and outlining [9, 33], merging similar functions [47, 48], loop optimization [46], unreachable code elimination, and many others. In addition, some works describe performance improvements for mobile applications, achieved by improving responsiveness, memory management, and start-up time [32, 54]. The optimizations can be applied at compile time, link time [35], or post-link time [41, 50]. Our approach is complementary to these works and can be applied in combination with the existing optimizations.
Code Layout. The work by Pettis and Hansen [43] serves as the basis for most modern code reordering techniques for server workloads. The goal of their basic block reordering is to create chains of blocks that are frequently executed together. Many variants of the technique have been suggested in the literature and implemented in various tools [30, 38, 39, 40, 41, 49, 50]. The state-of-the-art approach for basic block layout is due to Newell and Pupyrev [39]. Alternative models have been studied in several papers [16, 25, 31], where a temporal-relation graph is taken into account. Temporal affinities between code instructions can also be utilized for reducing conflict cache misses [17].
Code reordering at the function level was also pioneered by Pettis and Hansen [43], whose algorithm is implemented in many compilers and binary optimization tools [41, 50]. This approach greedily merges chains of functions with the primary goal of reducing I-TLB misses. An improvement by Ottoni and Maher [40] operates on a directed call graph to reduce I-cache misses. As discussed in Section 1, these approaches are designed to improve the steady-state performance of server workloads and cannot be applied to mobile apps. The recent work of Lee, Hoag, and Tillmann [32] is the only study discussing an approach for function layout in the mobile space; our novel algorithm is both more efficient and more effective than their heuristic.
Algorithms. Our model for function layout relies on the balanced graph partitioning problem [2, 15, 26], on which there exists a rich literature from both theoretical and practical points of view [3, 4]. The work most closely related to our study is on graph reordering [11, 36], which utilizes recursive graph bisection for creating “compression-friendly” inverted indices. While our algorithm shares some similarities with these works, our objectives and application area are different.
The general problem of optimizing memory performance has also been studied from a theoretical point of view. One classic line of work designs cache eviction policies that minimize cache misses [14, 51, 55]. A more recent line computes a suitable data layout for a given cache eviction policy [6, 29, 42]. Our setting for start-up optimization is closest to the latter; a major difference, however, is that the short time horizon of the start-up means that page evictions do not play a significant role. Therefore, we cannot rely on the previous methods.

7 Discussion

In this article, we have presented and evaluated the first function layout algorithm designed for mobile compiler optimizations. The algorithm is carefully engineered to scale to the largest instances, processing them within a matter of seconds. We have successfully applied the optimization to several large commercial mobile applications, achieving improvements in start-up performance and reductions in compressed app size.
An important contribution of the work is a formal model for function layout optimizations. We believe that the model utilizing the bipartite graph with utility vertices is general enough to be applicable in various contexts. In our current implementation, each function is optimized either for start-up or for size, but not for both at the same time. However, it might be possible to relax this constraint and design an approach that unifies the two objectives. Our early experiments show that reordering all functions with bpc could result in up to a \(0.3\%\) size reduction, but this may come at the cost of a longer start-up time. Unifying the optimizations is a promising direction for future work.
From a theoretical point of view, our work is related to the computationally hard problem of balanced graph partitioning [2]. While the problem is hard in theory, real-world instances exhibit certain characteristics that may simplify the analysis of algorithms. For example, control-flow and call graphs arising from modern programming languages have constant treewidth, a standard notion measuring how close a graph is to a tree [1, 6, 38]. Many NP-hard optimization problems can be solved efficiently on graphs of small treewidth, and therefore, exploring function layout algorithms parameterized by treewidth is of interest.

Acknowledgment

We thank Nikolai Tillmann for fruitful discussions of the problem.

Footnotes

1. The work is a substantially extended version of a paper by some of the authors at the LCTES workshop [19].
2. Refer to https://reviews.llvm.org/D147812 for the open-source implementation of the algorithm.

References

[1]
Ali Ahmadi, Majid Daliri, Amir Kafshdar Goharshady, and Andreas Pavlogiannis. 2022. Efficient approximations for cache-conscious data placement. In Proceedings of the International Conference on Programming Language Design and Implementation, Ranjit Jhala and Isil Dillig (Eds.). ACM, 857–871.
[2]
Konstantin Andreev and Harald Räcke. 2006. Balanced graph partitioning. Theory Comput. Syst. 39, 6 (2006), 929–939.
[3]
Charles-Edmond Bichot and Patrick Siarry. 2013. Graph Partitioning. John Wiley & Sons.
[4]
Aydin Buluç, Henning Meyerhenke, Ilya Safro, Peter Sanders, and Christian Schulz. 2016. Recent advances in graph partitioning. In Algorithm Engineering—Selected Results and Surveys. Springer, Cham, 117–158.
[5]
Milind Chabbi, Jin Lin, and Raj Barik. 2021. An experience with code-size optimization for production iOS mobile applications. In Proceedings of the International Symposium on Code Generation and Optimization, Jae W. Lee, Mary Lou Soffa, and Ayal Zaks (Eds.). IEEE, 363–377.
[6]
Krishnendu Chatterjee, Amir Kafshdar Goharshady, Nastaran Okati, and Andreas Pavlogiannis. 2019. Efficient parameterized algorithms for data packing. Proc. ACM Program. Lang. 3, POPL (2019), 1–28.
[7]
Dehao Chen, Tipp Moseley, and David Xinliang Li. 2016. AutoFDO: Automatic feedback-directed optimization for warehouse-scale applications. In Proceedings of the International Symposium on Code Generation and Optimization. ACM, 12–23.
[8]
Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, Michael Mitzenmacher, Alessandro Panconesi, and Prabhakar Raghavan. 2009. On compressing social networks. In Proceedings of the Conference on Knowledge Discovery and Data Mining. ACM, 219–228.
[9]
Thaís Damásio, Vinícius Pacheco, Fabrício Goes, Fernando Pereira, and Rodrigo Rocha. 2021. Inlining for code size reduction. In Proceedings of the 25th Brazilian Symposium on Programming Languages. ACM, 17–24.
[10]
Google Developers. 2022. App Startup Time. Retrieved from https://developer.android.com/topic/performance/vitals/launch-time
[11]
Laxman Dhulipala, Igor Kabiljo, Brian Karrer, Giuseppe Ottaviano, Sergey Pupyrev, and Alon Shalita. 2016. Compressing graphs and indexes with recursive graph bisection. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16). ACM, New York, NY, 1535–1544.
[12]
Malcolm C. Easton and Ronald Fagin. 1978. Cold-start vs. warm-start miss ratios. Commun. ACM 21, 10 (1978), 866–872.
[13]
Paolo Ferragina and Giovanni Manzini. 2010. On compressing the textual web. In Proceedings of the International Conference on Web Search and Data Mining. ACM, New York, NY, 391–400.
[14]
Amos Fiat, Richard M. Karp, Michael Luby, Lyle A. McGeoch, Daniel D. Sleator, and Neal E. Young. 1991. Competitive paging algorithms. J. Algor. 12, 4 (1991), 685–699.
[15]
Michael R. Garey, David S. Johnson, and Larry Stockmeyer. 1974. Some simplified NP-complete problems. In Proceedings of the 6th Annual ACM Symposium on Theory of Computing. ACM, New York, NY, 47–63.
[16]
Nikolas Gloy and Michael D. Smith. 1999. Procedure placement using temporal-ordering information. Trans. Program. Lang. Syst. 21, 5 (1999), 977–1027.
[17]
Amir H. Hashemi, David R. Kaeli, and Brad Calder. 1997. Efficient procedure mapping using cache line coloring. SIGPLAN Notices 32, 5 (1997), 171–182.
[18]
Wenlei He, Julián Mestre, Sergey Pupyrev, Lei Wang, and Hongtao Yu. 2022. Profile inference revisited. Proc. ACM Program. Lang. 6, POPL (2022), 1–24.
[19]
Ellis Hoag, Kyungwoo Lee, Julián Mestre, and Sergey Pupyrev. 2023. Optimizing function layout for mobile applications. In Proceedings of the International Conference on Languages, Compilers, and Tools for Embedded Systems. ACM, New York, NY, 52–63.
[20]
Apple Inc. 2022. Reducing Your App’s Launch Time. Retrieved from https://developer.apple.com/documentation/xcode/reducing-your-app-s-launch-time
[21]
Facebook Inc. 2015. Optimizing Facebook for iOS Start Time. Retrieved from https://engineering.fb.com/2015/11/20/ios/optimizing-facebook-for-ios-start-time
[22]
Facebook Inc. 2021. Redex: A Bytecode Optimizer for Android Apps. Retrieved from https://fbredex.com
[23]
Facebook Inc. 2021. Superpack: Pushing the Limits of Compression in Facebook’s Mobile Apps. Retrieved from https://engineering.fb.com/2021/09/13/core-data/superpack/
[24]
Igor Kabiljo, Brian Karrer, Mayank Pundir, Sergey Pupyrev, Alon Shalita, Yaroslav Akhremtsev, and Alessandro Presta. 2017. Social hash partitioner: A scalable distributed hypergraph partitioner. Proc. VLDB Endow. 10, 11 (2017), 1418–1429.
[25]
J. Kalamationos and David R. Kaeli. 1998. Temporal-based procedure reordering for improved instruction cache performance. In Proceedings of the Conference on High-Performance Computer Architecture. IEEE Computer Society, 244–253.
[26]
Brian W. Kernighan and Shen Lin. 1970. An efficient heuristic procedure for partitioning graphs. Bell Syst. Tech. J. 49, 2 (1970), 291–307.
[27]
Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. 2022. Toward a definitive compressibility measure for repetitive sequences. IEEE Trans. Info. Theory 69, 4 (2022), 2074–2092.
[28]
Chris Lattner and Vikram Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE Computer Society, 75.
[29]
Rahman Lavaee. 2016. The hardness of data packing. In Proceedings of the Symposium on Principles of Programming Languages. ACM, 232–242.
[30]
Rahman Lavaee, John Criswell, and Chen Ding. 2019. Codestitcher: Inter-procedural basic block layout optimization. In Proceedings of the 28th International Conference on Compiler Construction, José Nelson Amaral and Milind Kulkarni (Eds.). ACM, 65–75.
[31]
Rahman Lavaee and Chen Ding. 2014. ABC Optimizer: Affinity Based Code Layout Optimization. Technical Report. University of Rochester.
[32]
Kyungwoo Lee, Ellis Hoag, and Nikolai Tillmann. 2022. Efficient profile-guided size optimization for native mobile applications. In Proceedings of the International Conference on Compiler Construction. ACM, 243–253.
[33]
Kyungwoo Lee, Manman Ren, and Shane Nay. 2022. Scalable size inliner for mobile applications (WIP). In Proceedings of the International Conference on Languages, Compilers, and Tools for Embedded Systems. ACM, 116–120.
[34]
Xing Lin, Guanlin Lu, Fred Douglis, Philip Shilane, and Grant Wallace. 2014. Migratory compression: Coarse-grained data reordering to improve compressibility. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’14). USENIX, Santa Clara, CA, USA, 257–271.
[35]
Gai Liu, Umar Farooq, Chengyan Zhao, Xia Liu, and Nian Sun. 2023. Linker code size optimization for native mobile applications. In Proceedings of the International Conference on Compiler Construction. ACM, New York, NY, 168–179.
[36]
Joel Mackenzie, Matthias Petri, and Alistair Moffat. 2023. Tradeoff options for bipartite graph partitioning. IEEE Trans. Knowl. Data Eng. 35, 8 (2023), 8644–8657.
[37]
Nezar Mansour. 2020. Understanding Cold, Hot, and Warm App Launch Time. Retrieved from https://blog.instabug.com/understanding-cold-hot-and-warm-app-launch-time/
[38]
Julián Mestre, Sergey Pupyrev, and Seeun William Umboh. 2021. On the extended TSP problem. In Proceedings of the 32nd International Symposium on Algorithms and Computation (LIPIcs, Vol. 212), Hee-Kap Ahn and Kunihiko Sadakane (Eds.). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 42:1–42:14.
[39]
Andy Newell and Sergey Pupyrev. 2020. Improved basic block reordering. IEEE Trans. Comput. 69, 12 (2020), 1784–1794.
[40]
Guilherme Ottoni and Bertrand Maher. 2017. Optimizing function placement for large-scale data-center applications. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE Press, 233–244.
[41]
Maksim Panchenko, Rafael Auler, Bill Nell, and Guilherme Ottoni. 2019. BOLT: A practical binary optimizer for data centers and beyond. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE, Washington, DC, 2–14.
[42]
Erez Petrank and Dror Rawitz. 2005. The hardness of cache conscious data placement. Nordic J. Comput. 12, 3 (2005), 275–307.
[43]
Karl Pettis and Robert C. Hansen. 1990. Profile guided code positioning. SIGPLAN Notices 25, 6 (1990), 16–27.
[44]
Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, and Adam Smith. 2013. Sublinear algorithms for approximating string compressibility. Algorithmica 65, 3 (2013), 685–709.
[45]
Peter Reinhardt. 2016. Effect of Mobile App Size on Downloads. Retrieved from https://segment.com/blog/mobile-app-size-effect-on-downloads/
[46]
Rodrigo C. O. Rocha, Pavlos Petoumenos, Björn Franke, Pramod Bhatotia, and Michael O’Boyle. 2022. Loop rolling for code size reduction. In Proceedings of the International Symposium on Code Generation and Optimization, Jae W. Lee, Sebastian Hack, and Tatiana Shpeisman (Eds.). IEEE, 217–229.
[47]
Rodrigo C. O. Rocha, Pavlos Petoumenos, Zheng Wang, Murray Cole, Kim Hazelwood, and Hugh Leather. 2021. HyFM: Function merging for free. In Proceedings of the International Conference on Languages, Compilers, and Tools for Embedded Systems, Jörg Henkel and Xu Liu (Eds.). ACM, 110–121.
[48]
Rodrigo C. O. Rocha, Pavlos Petoumenos, Zheng Wang, Murray Cole, and Hugh Leather. 2020. Effective function merging in the SSA form. In Proceedings of the International Conference on Programming Language Design and Implementation, Alastair F. Donaldson and Emina Torlak (Eds.). ACM, 854–868.
[49]
Benjamin Schwarz, Saumya Debray, Gregory Andrews, and Matthew Legendre. 2001. PLTO: A link-time optimizer for the Intel IA-32 architecture. In Proceedings of the Workshop on Binary Rewriting. 1–7.
[50]
Han Shen, Krzysztof Pszeniczny, Rahman Lavaee, Snehasish Kumar, Sriraman Tallam, and Xinliang David Li. 2023. Propeller: A profile guided, relinking optimizer for warehouse-scale applications. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 617–631.
[51]
Daniel D. Sleator and Robert E. Tarjan. 1985. Amortized efficiency of list update and paging rules. Commun. ACM 28, 2 (1985), 202–208.
[52]
Jeffrey Scott Vitter. 1985. Random sampling with a reservoir. ACM Trans. Math. Softw. 11, 1 (1985), 37–57.
[53]
Qi Wang and Torsten Suel. 2019. Document reordering for faster intersection. Proc. VLDB Endow. 12, 5 (2019), 475–487.
[54]
Tingxin Yan, David Chu, Deepak Ganesan, Aman Kansal, and Jie Liu. 2012. Fast app launching for mobile devices using predictive user context. In Proceedings of the 10th International Conference on Mobile Systems, Applications, and Services. ACM, 113–126.
[55]
Neal E. Young. 2016. Online paging and caching. In Encyclopedia of Algorithms. Springer New York, New York, NY, 1457–1461.
[56]
Jacob Ziv and Abraham Lempel. 1977. A universal algorithm for sequential data compression. IEEE Trans. Info. Theory 23, 3 (1977), 337–343.
[57]
LZFSE compression algorithm. 2024. Retrieved from https://en.wikipedia.org/wiki/LZFSE
[58]
ZSTD compression algorithm. 2024. Retrieved from https://en.wikipedia.org/wiki/Zstd
