1 Introduction
With the increasing popularity of virtual machines and the diversity of Instruction Set Architectures (ISAs), dynamic binary translation is becoming ubiquitous. Dynamic binary translation enables applications built for a guest ISA to run on a host ISA machine, and it serves several purposes. First, it can translate legacy or existing ISAs to enable migration into emerging ISA ecosystems where the guest and host ISAs differ. Second, tools like DynamoRIO [12] and Pin [37] can instrument applications to obtain runtime information. Third, it can profile and optimize hot paths, as in Dynamo [6].
Regardless of its purpose, translation efficiency is the primary design metric for all dynamic binary translation systems, and extensive research focuses on optimizing it. Software techniques include register mapping [35, 60], indirect branch target lookup [16], arithmetic flag reduction [22, 38], enhanced translation rules [55, 61], and multi-threaded LLVM optimization [10, 64]. Hardware optimizations include Very Long Instruction Word (VLIW) designs [9, 21, 33] and ISA extensions [27, 36, 65]. These works identify specific types of translation overhead and significantly improve efficiency. Consequently, same-ISA dynamic binary translation systems like DynamoRIO and Pin demonstrate near-native efficiency, as do similar-ISA systems like LATM [67] and MamBox64 [16, 17]. However, diverse-ISA translation, especially from Complex Instruction Set Computer (CISC) to Reduced Instruction Set Computer (RISC) architectures, as in ExaGear [28], Rosetta2 [1, 2], XTA [39], and LATX [67], still incurs noticeable overhead that prevents near-native efficiency. Our study aligns with prior work [7, 11] showing that Dynamic Binary Translators (DBTs) like DynamoRIO, Pin, ExaGear, Rosetta2, LATX, Box64, FEX, and QEMU spend more than 98.9% of execution time on translated code for computational workloads. As Figure 1 shows, more than 99% of DynamoRIO's time is devoted to the execution of translated code. Less than 0.2% of the time involves DBT tasks like translation, disassembly, instrumentation (only for instrumentation tools), guest memory management, internal data management (e.g., branch tables), and guest syscall emulation. This indicates that the main overhead stems from translated code.
Since most execution time involves dynamically translated code, conventional tools can find hot code segments but struggle to determine their origin. To elucidate the origin of overhead in dynamically translated code, we use the term inflation to describe the phenomenon wherein one guest instruction is translated into multiple host instructions. Overheads in translated code fall into two categories: the instruction semantic gap (e.g., floating-point translation) and limitations of the DBT mechanism (e.g., indirect branch table lookup), both of which lead to one-to-multiple translation. Consequently, inflation encompasses both categories of overhead.
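Throughout this article, dynamic instruction inflation is the average number of host instructions executed per guest instruction:

\[
I = \frac{N_{\text{host}}}{N_{\text{guest}}},
\]

where \(N_{\text{host}}\) and \(N_{\text{guest}}\) denote the dynamic host and guest instruction counts. For example, an inflation of 1.46 means that, on average, each guest instruction is emulated by 1.46 host instructions.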
We analyzed dynamic instruction inflation and performance across eight DBTs: the commercial ExaGear, Rosetta2, LATX, and Pin, and the open source DynamoRIO, Box64 [46], FEX [23], and QEMU [8]. Figure 2(a) shows the performance slowdown (DBT execution time / native execution time) per system. DynamoRIO and Pin perform best by running x86_64 guest code natively, incurring little translation overhead. Still, they suffer more than 1.2x slowdowns due to limitations of the DBT mechanism. The commercial DBTs ExaGear, Rosetta2, and LATX exhibit relatively minor slowdowns, whereas the open source Box64 and FEX display moderate slowdowns, and QEMU shows a substantial one. Figure 2(b) illustrates the dynamic instruction inflation for these eight DBT systems, which is correlated with the performance slowdown. Linear regression in Figure 2(c) establishes the correlation, showing that higher inflation indicates greater performance overhead.
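Concretely, the fit in Figure 2(c) has the linear form

\[
\text{slowdown} \approx \alpha \cdot I + \beta, \qquad \alpha > 0,
\]

where \(I\) is the dynamic instruction inflation and \(\alpha\) and \(\beta\) are fitted coefficients; the positive slope is what indicates that higher inflation implies greater performance overhead.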
Although cross-ISA DBTs achieve relatively low inflation, it remains greater than or equal to 1.46, meaning that one guest instruction is translated into at least 1.46 host instructions on average. This 46% instruction inflation highlights the persistence of significant inflation, even in commercial DBT products.
However, previous high-level studies on translation overhead provide an inadequate understanding of DBT efficiency and limited inspiration for potential optimization techniques. Thus, a comprehensive analysis methodology is needed to accurately characterize the overhead introduced by translated code. The key challenges are as follows:
— There is no off-the-shelf methodology to analyze DBT inflation at the instruction level. Instruction-level analysis is complicated by the intricate nature of DBT translation rules and optimizations.
— Analyzing the overhead of commercial DBTs is limited by restricted access to their source code, as commercial DBTs are typically closed source. Without access to the source code, it is difficult to determine which host instructions a specific guest instruction is translated into, preventing further inflation analysis.
— Due to the extensive variety of x86_64 instructions, it is time consuming and error prone to analyze every potential instruction. A modern disassembler [13] reveals more than 1,500 types of operation code in x86_64, with possible variants for each.
In this work, we address the research problem of a DBT inflation analysis methodology at the instruction level. We also seek an open source solution to support those who cannot access the source code of commercial DBTs. To solve this problem, we present Deflater, an open source framework for analyzing DBT instruction inflation. This framework consists of a mathematical model, a collection of black-box unit tests named BenchMIAOes [40], and a trace-based simulator called InflatSim [41]. The mathematical model calculates the overall inflation based on the inflation of individual instructions and Translation Block (TB) optimizations (a sketch of the model's general shape follows the contribution list below). BenchMIAOes extract the model parameters from DBTs without accessing their source code. InflatSim implements the model with the extracted parameters to simulate the behavior of a given DBT. With Deflater, we simulate three commercial DBTs with inflation errors of 5.63%, 5.15%, and 3.44%, and gain insights from the simulation. In addition, using Deflater, we simulate and optimize an open source DBT with a 4.65% inflation error and a 5.47x performance improvement. The contributions of this work include the following:
— We propose representing inflation with a mathematical model to enable instruction-level analysis. To demonstrate this, we developed the trace-based simulator InflatSim to facilitate deeper insights into DBT inflation and guide further efforts in its reduction.
— We devised a series of meticulously designed black-box unit tests, named BenchMIAOes (BenchMarks for Inflation Analysis and Optimizations), which help ascertain the model parameters of commercial DBTs. To efficiently analyze the extensive variety of x86_64 instructions, BenchMIAOes are tailored along two orthogonal dimensions: basic x86_64 instructions and variant instructions.
— We simulated the inflation of three commercial DBTs and gained insights from the simulation results. Our insights encompass x86_64-DBT-friendly ISA features, efficient translation rules, and TB optimizations.
— Furthermore, we applied Deflater to a practical development process to optimize an open source DBT, QEMU. Deflater efficiently simulated QEMU's dynamic instruction inflation, effectively guided optimizations, and achieved substantial performance improvements.
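As noted above, the following is a minimal sketch of the general shape such a model can take, assuming overall inflation aggregates per-instruction inflations weighted by dynamic frequency; the notation here is illustrative, and the precise formulation, including how TB optimizations enter, is given in Section 3:

\[
I_{\text{overall}} = \frac{\sum_i n_i I_i - S_{\text{TB}}}{\sum_i n_i},
\]

where \(n_i\) is the dynamic count of guest instruction type \(i\), \(I_i\) is its per-instruction inflation, and \(S_{\text{TB}}\) is the number of host instructions eliminated by TB optimizations.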
The rest of this article is organized as follows. Section 2 provides an outline of DBTs. Section 3 introduces Deflater, including the mathematical model, BenchMIAOes, and InflatSim. Section 4 presents the evaluation of Deflater's simulation results. Section 5 demonstrates the utilization of Deflater as a guide for QEMU optimizations. Section 6 provides an overview of related work on the analysis of DBT overhead. Section 7 discusses the limitations of Deflater and presents directions for future research. Section 8 summarizes this work.
2 Background
In this section, we provide a brief introduction to the functioning of a DBT. A DBT is a type of software that enables the execution of applications designed for one ISA on another ISA platform. The term guest refers to the ISA platform emulated by the DBT, and the term host refers to the ISA platform on which the DBT runs. There are two types of DBTs: user-level DBTs and system-level DBTs. User-level DBTs target applications as guests, whereas system-level DBTs target Operating Systems (OSes) as guests. Since our analysis focuses on user-level DBTs, all subsequent mentions of DBTs in this article specifically pertain to user-level DBTs.
Figure 3(a) illustrates the four main components of a DBT: a disassembler, a translator, an optimizer, and a translated code cache. The DBT operates through a loop involving these four components (a C-style sketch of this loop follows the list):
(1) The disassembler disassembles the guest executable and creates data structures representing guest instructions.
(2) The translator converts each disassembled guest instruction into corresponding host instructions. As shown in Figure 3(b), the translator uses a switch-case statement to determine the translation based on the instruction's type. Similar guest instruction types, like \({\tt add}\) and \({\tt sub}\), may utilize the same translation function. Each guest instruction is translated into one or more host instructions. The host instructions are organized into basic blocks called TBs [8, 16, 53], which have single entries and exits to facilitate optimization analysis.
(3) The optimizer improves the performance of host instructions within a TB and across multiple TBs.
(4) The optimized host instructions are executed on the host OS and stored in the translated code cache for efficient re-execution. After executing a TB, the DBT looks up the next TB in the translated code cache. If found, execution continues seamlessly; otherwise, the preceding processing flow is repeated.
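The sketch below illustrates this loop and the switch-case translator dispatch of Figure 3(b). All type and function names are illustrative placeholders, not the code of any particular DBT:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { int opcode; /* operands and other decoded fields elided */ } GuestInsn;
typedef struct TB TB;        /* buffer of emitted host instructions */

enum { OP_ADD, OP_SUB, OP_JMP_INDIRECT /* , ... */ };

/* Provided by the DBT implementation; declarations only in this sketch. */
extern TB      *tb_alloc(void);
extern TB      *code_cache_lookup(uint64_t guest_pc);
extern void     code_cache_insert(uint64_t guest_pc, TB *tb);
extern size_t   disassemble_block(uint64_t guest_pc, GuestInsn *out, size_t max);
extern void     translate_arith(const GuestInsn *g, TB *tb);
extern void     translate_indirect_branch(const GuestInsn *g, TB *tb);
extern void     translate_generic(const GuestInsn *g, TB *tb);
extern void     optimize_tb(TB *tb);
extern uint64_t execute_tb(TB *tb);  /* returns the next guest PC */

/* Translator dispatch (cf. Figure 3(b)): similar guest instruction
   types, such as add and sub, can share one translation function. */
static void translate_insn(const GuestInsn *g, TB *tb) {
    switch (g->opcode) {
    case OP_ADD:
    case OP_SUB:
        translate_arith(g, tb);
        break;
    case OP_JMP_INDIRECT:
        translate_indirect_branch(g, tb);
        break;
    default:
        translate_generic(g, tb);
    }
}

/* The disassembly-translation-optimization-execution loop of Figure 3(a). */
void dbt_run(uint64_t guest_pc) {
    enum { MAX_TB_INSNS = 64 };
    for (;;) {
        TB *tb = code_cache_lookup(guest_pc);
        if (tb == NULL) {                                /* cache miss */
            GuestInsn insns[MAX_TB_INSNS];
            size_t n = disassemble_block(guest_pc, insns, MAX_TB_INSNS); /* (1) */
            tb = tb_alloc();
            for (size_t i = 0; i < n; i++)
                translate_insn(&insns[i], tb);           /* (2) */
            optimize_tb(tb);                             /* (3) */
            code_cache_insert(guest_pc, tb);
        }
        guest_pc = execute_tb(tb);                       /* (4) re-execution hits the cache */
    }
}
```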
DBT implementations may vary slightly depending on optimization strategies and purposes. For example, the translated code cache can be made persistent [63], translation and optimization can be offloaded to separate threads [25], and the scope of optimization can expand from a TB to a trace or region [14, 57]. Despite these variations, the disassembly-translation-optimization-execution loop depicted in Figure 3(a) represents the overall processing flow of DBT implementations.
5 Case Study: The Application of Deflater
To showcase Deflater’s capabilities, we have integrated it into our real-world development workflow to facilitate the optimization of the open source DBT, QEMU. Deflater efficiently constructs an inflation model and provides essential insights into potential inflation reduction. Thus, Deflater can save developers valuable time before they embark on typically time-consuming optimization efforts.
The left side of Figure 14 illustrates a typical DBT optimization workflow. Since the translated code is dynamically generated, conventional performance analysis tools such as perf can identify the hot translated code but cannot determine its source. Consequently, DBT developers usually devise potential optimizations based on their experience. Implementing these optimizations can then take substantial time, potentially several weeks. With luck, the developer may succeed in improving performance on the initial attempt. However, without proper performance analysis tools, debugging and reimplementing complex DBT optimizations can be an unguided, time-consuming process, potentially lasting weeks or months.
In contrast, using Deflater streamlines the DBT optimization workflow, as depicted on the right side of Figure 14. To demonstrate Deflater's optimization guidance, we optimized QEMU, with RISC-V as the guest architecture and LoongArch as the host. We found that, compared with native execution, QEMU running SPEC CPU 2017 suffered more than a 10x slowdown and 15x dynamic instruction inflation. To optimize it, we followed the workflow presented next.
First, we designed a series of RISC-V BenchMIAOes, extracted model parameters from QEMU, and constructed an InflatSim model for QEMU within days. SPEC CPU 2017 results reveal actual inflations of 9.87 for CINT and 28.50 for CFP, whereas the simulated inflations are 9.97 for CINT and 28.63 for CFP, a 6.61% inflation error.
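To give a flavor of what such a unit test can look like, the following is a minimal sketch of a microbenchmark isolating a single guest instruction; it is an illustrative reconstruction, not an actual BenchMIAO. The per-instruction inflation can be estimated by counting dynamic host instructions (e.g., with perf) while the guest binary runs under the DBT, subtracting a baseline run with the asm statement removed, and dividing by the iteration count:

```c
/* Hypothetical BenchMIAO-style unit test for one RISC-V instruction.
   Build for the guest ISA (RISC-V) and run under the DBT; comparing
   host dynamic instruction counts against a baseline binary without
   the asm statement isolates the instruction under test. */
#define ITERS 100000000L

int main(void) {
    long acc = 0;
    for (long i = 0; i < ITERS; i++) {
        /* The guest instruction under test (here: integer add). */
        __asm__ volatile("add %0, %0, %1" : "+r"(acc) : "r"(i));
    }
    return (int)acc;  /* keep acc live so the loop is not optimized away */
}
```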
Second, the InflatSim model indicates that QEMU experiences significant inflation in nearly all instruction translations. This is primarily due to QEMU's use of a two-stage translation process to accommodate multiple ISAs: guest instructions are first converted to an intermediate representation known as TCG, and TCG instructions are then translated into host instructions. If TCG lacks support for specific guest instructions, such as floating-point instructions, it resorts to invoking C-written helper functions to simulate the guest semantics, resulting in notable inflation. Hence, transitioning from this two-stage translation to a direct guest-to-host translation could be a viable optimization.
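The contrast between the two pipelines can be sketched as follows; all type and function names are illustrative placeholders rather than QEMU's actual internals, and the helper-call fallback is shown only to indicate where inflation arises:

```c
typedef struct { int opcode; /* decoded fields elided */ } GuestInsn;
typedef struct { int op;     /* IR operation; operands elided */ } IROp;
typedef struct TB TB;        /* emitted host instruction buffer */

/* Provided elsewhere; declarations only in this sketch. */
extern int  guest_to_ir(const GuestInsn *g, IROp *out);   /* < 0 if unsupported */
extern void ir_to_host(const IROp *op, TB *tb);
extern void emit_helper_call(TB *tb, const GuestInsn *g); /* call into a C helper */
extern void emit_host(TB *tb, const GuestInsn *g);        /* direct translation */

/* Two-stage pipeline (guest -> IR -> host): guest instructions the IR
   cannot express fall back to a helper call, which emits many host
   instructions for argument setup, the call, and state synchronization. */
void translate_two_stage(const GuestInsn *g, TB *tb) {
    IROp ir[8];
    int n = guest_to_ir(g, ir);
    if (n < 0) {
        emit_helper_call(tb, g);   /* fallback: a major source of inflation */
        return;
    }
    for (int i = 0; i < n; i++)
        ir_to_host(&ir[i], tb);
}

/* Direct pipeline (guest -> host): one mapping step, fewer emitted
   instructions, and far fewer helper-call fallbacks. */
void translate_direct(const GuestInsn *g, TB *tb) {
    emit_host(tb, g);
}
```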
Third, we assessed the effectiveness of this optimization by modifying the inflation parameters of the InflatSim model, a task accomplished within hours by editing about 400 lines of code. The simulation results indicate that the inflation for CINT and CFP would decrease substantially, by 84.2% and 94.4%, to 1.58 and 1.60, respectively, which is highly appealing.
Fourth, we eliminated the TCG IR, achieving end-to-end translation from RISC-V to LoongArch. During this process, we also implemented two optimizations mentioned earlier: compare and conditional jump fusion, and register mapping. The implementation required several weeks and the editing of about 8,000 lines of code. Experimental results reveal that our implementation yields inflation values of 1.62 and 1.55 for CINT and CFP, respectively, with a 4.65% inflation error. Notably, the optimized QEMU achieved an overall 5.47x performance improvement, with 2.99x for CINT and 7.12x for CFP.
In summary, we modeled QEMU using Deflater and identified multiple feasible optimization approaches. We rapidly assessed the optimization impact using Deflater and then proceeded with specific implementations. Deflater reduced the trial-and-error costs, expediting the implementation of QEMU optimization algorithms.
7 Discussion and Future Work
Deflater is limited to simulating DBTs that preserve instruction boundaries. This limitation arises from Deflater's mathematical model, which is grounded in Observation 1: real DBT products typically refrain from breaking instruction boundaries to ensure the guest's precise exception handling. However, an increasing amount of research explores DBTs that use aggressive optimizers like LLVM, disregarding precise exception handling to enable more optimizations. Analyzing the overhead of these highly optimized DBTs requires a new approach, such as developing a statistical model, which we leave to future work.
Our analysis reveals that the primary overhead in computational workloads lies within the translated code. Hence, Deflater proves to be well suited for characterizing this computational workload overhead. However, DBT workloads vary, with performance affected by other overhead factors beyond the translated code. For instance, workloads with poor code locality may experience performance degradation due to frequent code generation. The performance of large codebases may be impacted by software code cache management policies.
Moreover, Deflater does not model hardware overhead. With multi-threaded workloads, if the guest employs a stronger memory model (e.g., x86's TSO memory model) than the host (e.g., ARM's weak memory model), the DBT must emit fence instructions for correctness. Prior research [24, 49] focuses on efficient and accurate inter-thread memory access translation. Additionally, translating from a guest with a coherent L1 instruction cache (e.g., x86) to a non-coherent one (e.g., ARM) requires careful handling of dynamically generated code [31, 62]. Currently, Deflater lacks BenchMIAOes tailored to identifying inflation from multi-threading and dynamically generated code, as these aspects are not typically found in computational workloads like the SPEC CPU benchmarks. Nevertheless, both multi-threading and dynamically generated code represent significant DBT research areas and will be subject to future analysis.
While BenchMIAOes successfully identified DBT optimizations, this currently relies on empirical evidence. Unlike the BenchMIAOes for the various instruction types and variants, which leverage dynamic instruction frequency (as depicted in Table 1) and ISA semantics, identifying potential optimizations involves practical experience and a trial-and-error approach. Furthermore, interpreting the measured inflation necessitates DBT expertise, as Section 4.4 demonstrates. Future work will investigate automating BenchMIAOes creation to facilitate optimization identification.
8 Conclusion
To gain insights into translation inflation in DBTs, we presented Deflater, an inflation analysis framework. Deflater consists of three components: a mathematical model for calculating DBT inflation, a collection of black-box unit tests named BenchMIAOes, and a trace-based simulator, InflatSim, that models DBT inflation. Utilizing Deflater, we analyzed three commercial x86_64-to-RISC DBTs, ExaGear, Rosetta2, and LATX, with low inflation errors of 5.63%, 5.15%, and 3.44%, respectively. Our experimental results also revealed that the primary sources of inflation are memory access, immediate load, address calculation, sub-register access, and indirect branch. Moreover, we employed Deflater in a practical development process to optimize the open source DBT QEMU. Deflater efficiently simulated QEMU's dynamic instruction inflation and suggested optimizations that significantly reduce inflation, with a 4.65% inflation error. Our implementation of the suggested optimizations validates the effectiveness of Deflater's guidance, resulting in an approximately 90% reduction in inflation and a 5.47x performance improvement.