Open access

A Hierarchical Classification Method for High-accuracy Instruction Disassembly with Near-field EM Measurements

Authors:

Vishnuvardhan V. Iyer,

Aditya Thimmaiah,

Michael Orshansky,

Andreas Gerstlauer,

Ali E. Yilmaz

ACM Transactions on Embedded Computing Systems, Volume 23, Issue 1

Article No.: 10, Pages 1 - 21

https://doi.org/10.1145/3629167

Published: 10 January 2024

Abstract

Electromagnetic (EM) fields have been extensively studied as potent side-channel tools for testing the security of hardware implementations. In this work, a low-cost side-channel disassembler that uses fine-grained EM signals to predict a program's execution trace with high accuracy is proposed. Unlike conventional side-channel disassemblers, the proposed disassembler does not require extensive randomized instantiations of instructions to profile them, instead relying on leakage-model-informed sub-sampling of potential architectural states resulting from instruction execution, which is further augmented by using a structured hierarchical approach. The proposed disassembler consists of two phases: (i) In the feature-selection phase, signals are collected with a relatively small EM probe, performing high-resolution scans near the chip surface, as profiling codes are executed. The measured signals from the numerous probe configurations are compiled into a hierarchical database by storing the min-max envelopes of the probed EM fields and differential signals derived from them, a novel dimension that increases the potency of the analysis. The envelope-to-envelope distances are evaluated throughout the hierarchy to identify optimal measurement configurations that maximize the distance between each pair of instruction classes. (ii) In the classification phase, signals measured for unknown instructions using optimal measurement configurations identified in the first phase are compared to the envelopes stored in the database to perform binary classification with majority voting, identifying candidate instruction classes at each hierarchical stage. Both phases of the disassembler rely on a four-stage hierarchical grouping of instructions by their length, size, operands, and functions. The proposed disassembler is shown to recover ∼97–99% of instructions from several test and application benchmark programs executed on the AT89S51 microcontroller.

1 Introduction

On-chip computations impact the electromagnetic (EM) fields emanated as well as the power consumed by embedded systems [1–10], causing information about the operations they execute to leak through these side channels. By probing these fields and exploiting variations in the measured signals, side-channel analysis (SCA) attacks can non-invasively recover information about target processes even in embedded processors that execute general-purpose programs. At the highest fidelity, EM SCA can potentially disassemble a program's execution trace from a device under test (DUT) at the instruction level. Although such instruction-level disassemblers based on power SCA are well documented [3–5], only a few attempts based on EM SCA are reported in the literature [6–8].

Disassemblers using relatively large EM [6] or power [3–5] probes aggregate the fields emanated or power consumed by many/all system components throughout the DUT. Thus, any potential features in the measured signals that can distinguish instructions are heavily obfuscated by algorithmic noise from uncorrelated processes in addition to measurement noise from the environment and the sensor setup [10]. Such coarse-grained EM/power SCA setups generally require extensive measurements to quantify and filter out noise [3–6]. Contrarily, fine-grained EM SCA setups [7, 9, 10], which use relatively small probes, are sensitive to the fields emanated by a subset of system components near the probes, because EM emanations decay rapidly with distance and are polarized. Indeed, when probes are appropriately positioned and oriented, fine-grained EM SCA can improve the success rate of disassembly [7]. Thus, fine-grained EM SCA attacks first scan for effective measurement configurations that have high signal-to-noise ratios and then use these low-noise configurations to actually extract information [7, 9]. However, the “acquisition cost” of finding optimal configurations in existing fine-grained approaches can be prohibitively large [10]. The efficiency of a disassembler directly relates to how well the instructions are profiled during the initial acquisition phase, which dictates the acquisition cost in terms of measurement time and storage requirements. A naïve profiling approach involves instantiating each instruction with all possible combinations of different operands, addresses, and data present in architectural registers, such as program counters, stack, and so on [3–6]. To feasibly profile instructions, conventional SCA-based disassemblers typically sub-sample this space of architectural states by randomly instantiating instructions several times with different operand values and machine states. 
This approach has limited feasibility for fine-grained EM SCA-based disassemblers because of the high acquisition cost of searching a five-dimensional (5D) space of potential optimal measurement configurations, spanning the possible probe locations (3D), orientations (1D), and observation times (1D), as the DUT executes many instantiations of each instruction [10]; e.g., the setup used in this article would require \(\sim\!5,\!000 \times\) more signals to be collected compared to using a single probe configuration. The scalability of such methods degrades further as the size of the instruction set \(N\) increases. Indeed, fine-grained EM SCA approaches using the random instantiations method for profiling instructions [7] have been limited to small instruction sets. Random instantiations may also miss critical corner cases, which can lead to misclassifications in the classification phase.

In this article, a novel scalable and effective instruction disassembler using fine-grained EM signals is proposed. As in previous SCA-based disassemblers [3–7], the proposed method has two phases. The feature-selection phase identifies optimal measurement configurations and corresponding signal features. After this phase, the classification phase identifies instructions from signals measured as the DUT executes an arbitrary code. It collects signals using only the selected set of configurations and evaluates them according to the features identified in the first phase. To support large instruction sets, the disassembly is performed hierarchically: a four-stage hierarchy (Figure 1), consisting of an instruction's cycle length, size, operands used, and functions implemented, is used. The feature-selection phase is performed bottom-up, while the classification phase is performed top-down through the hierarchy. A hierarchical classification allows evaluators to identify distinct leakage-model-informed features pertinent to each stage. Furthermore, ensuring a high classification success rate in the upper hierarchical levels enables evaluators to still recover key information about the executed instructions even if the accuracy in separating details at lower levels is reduced.

Fig. 1.

The hierarchical classification is combined with a leakage-model-informed sub-sampling of potential architectural states to profile instructions and identify optimal features for each stage in a feasible and scalable manner. The feature-selection phase uses a Hamming weight (HW) leakage model to design “profiling codes” consisting of a condensed set of test instructions such that, if there were no noise and the leakage model were valid, the signals measured as the DUT executes these codes would min-max bound the signals that would be measured as the DUT executes all possible instantiations of the profiled instructions. The min-max signal envelopes for each instruction class are collected and stored in the hierarchical database as the profiling codes are executed. Configurations where pairs of instruction classes can most easily be separated are identified. The signals measured at these configurations are the “features” that are used to classify instructions using binary classification with majority voting [5] in the next phase.

In addition to measured signals, this work also uses novel “differential signals” derived from them to improve success rates. These signals capture the impact of an instruction on the architectural state over multiple cycles. The capabilities of the disassembler are further augmented by assuming branches taken and not-taken as separate instruction classes, enabling control-flow prediction. The proposed method enables high-resolution measurements at a low acquisition cost, efficiently identifying highly potent features within a large search space. As a result of the leakage-model-informed feature selection, and hierarchical classification, improved success rates are observed for application benchmarks, compared to alternative methods [4, 7].

The contributions of this work can be summarized as follows:

—

Fine-grained EM SCA-based disassembly is performed by identifying optimal probe configurations and corresponding signal envelopes during the feature-selection phase.

—

In addition to directly probed signals, novel differential signals derived from them are used as features.

—

Control-flow leakage prediction is enabled with input-constrained analysis of branch instructions.

—

Success rates of ∼99% and ∼97% are observed when the proposed method is used to disassemble test codes and application benchmarks from the Dalton project [14], respectively, executed by an AT89S51 microcontroller unit implementing the i8051 instruction set [12] (\(N = 90\) instructions).

The rest of the article is organized as follows: Section 2 compares various disassemblers with the proposed approach. Section 3 presents relevant background for the proposed experiments. Section 4 details the feature-selection method. Section 5 describes the classification method. Section 6 presents the measurement results. Section 7 concludes the work.

2 Overview

This section reviews previous SCA-based disassemblers and presents an overview of the proposed approach.

2.1 Related Work

Various SCA-based methods exist for recovering information about target processes on embedded systems. Code-monitoring with SCA is most often used to identify fixed instruction sequences, separate basic blocks, and predict control flow [1, 2] based on some a priori knowledge of an evaluated benchmark. Using SCA to disassemble individual instructions from an arbitrary unknown code as in References [4–8] is far more challenging, in part because each instruction impacts a multitude of architectural blocks differently. Disassemblers can be compared based on their success rates and their acquisition cost. While the success rate is simply the ratio of the number of correctly identified instructions to the total number of executed instructions, the acquisition cost is a function of the number of sensor configurations used during profiling \({N}_{{\rm{pc}}}\), the number of instantiations performed to characterize each instruction \({\bar{N}}_{{\rm{inst}}}\), and the number of samples collected for each of these measurements \({N}_{\rm{t}}\). The acquisition cost in this work only accounts for samples stored post measurement collection and does not quantify repeated measurements and averaging performed by the oscilloscope software.¹

Instruction disassembly based on coarse-grained EM or power SCA setups [4–6] uses a single sensor configuration ( \({N}_{{\rm{pc}}} = 1)\) and requires significant post-processing of the signals measured as the DUT executes an extensive set of test instructions. In Reference [4], a power SCA-based disassembler, using principal component analysis (PCA) for feature selection and a multivariate Gaussian classifier, was proposed to evaluate a small instruction set ( \(N = 33\) ). It correctly recognized ∼71% and ∼51% of instructions in test code and application benchmarks, respectively. The method in Reference [4] assumes some a priori knowledge of the code, however, as it applies hidden Markov models to blocks of the executed code. In Reference [6], a coarse-grained EM SCA-based disassembler, using PCA with frequency-domain signals for feature selection and AdaBoost, support vector machine, and other methods for classification, was proposed. It was able to distinguish two instructions with a 100% success rate. Unfortunately, the method's performance for the remaining instructions was not evaluated in Reference [6]. A larger instruction set ( \(N > 100\) ) was evaluated in Reference [5] with a power SCA-based disassembler, using Kullback–Leibler (KL) divergence for feature selection and quadratic discriminant analysis for classification. The method disassembled a test code with ∼99% success rate. Although Reference [5] used hierarchical classification, included an extra method to improve success rates for application benchmarks, and recovered two instructions implemented in one such code with 92% success rate, the method was not evaluated comprehensively on real-world application benchmarks. 
In Reference [27], an instruction disassembler targeting a Cortex M0 processor was proposed, implementing KL divergence for feature selection and classification algorithms demonstrated in Reference [5], which was further enhanced by using models based on multi-layer perceptron and convolutional neural network. While the method recognized ∼99% and ∼88% of instructions in test code and application benchmarks, respectively, the disassembly was limited to a small subset of the full instruction set ( \(N = 17\) ).

Instruction disassembly based on fine-grained EM SCA was demonstrated in References [7, 8]. A small instruction set (\(N = 33\)) was evaluated in Reference [7] using linear discriminant analysis for feature selection and a k-Nearest Neighbor algorithm for classification. While the disassembler recognized ∼96% of the instructions in a test code and ∼88% of them in application benchmarks, the approach in Reference [7] is an invasive method that requires decapsulation of the DUT to constrain the search space of configurations during feature selection. A similar fine-grained setup in Reference [8] targeted a slightly larger instruction set (\(N = 50\)) by performing bit-level disassembly of opcodes, training quadratic discriminant analysis-based classifiers to identify individual bit transitions as instructions are pre-fetched. Although the disassembler recognized 95% of instructions in test codes, it was not evaluated on real benchmarks.

While the methods proposed in References [4–8, 27] (Table 1) have very high success rates when disassembling test codes that follow the same structure/template as the profiling codes they use to select features, their success rates either decrease markedly or are unknown when disassembling application benchmarks; moreover, the methods in References [4, 6, 7, 27], which were developed and tested with only a limited number of instructions, may not scale well as \(N\), the instruction set's size, increases. Another issue common to the methods in References [4–8] is that they do not elaborate on the disassembly of conditional branches; such branches require careful consideration during both phases of disassembly and can enable the detection of possible transitions to different parts of the code and the evaluation of control flow for comprehensive disassembly. Finally, the methods in References [4–7] extensively instantiate instructions with randomized operands, in different sequences, and so on; they instantiate each instruction from 200 [6] to 3,000 [5] times. These methods cannot be directly extended to fine-grained EM SCA, because their acquisition costs would be infeasibly high, especially if the number of possible instructions and measurement configurations is large. By contrast, our proposed method aims to (i) improve the success rate of disassembly for application codes, (ii) identify if branches were taken/not taken during execution, and (iii) maintain a feasible acquisition cost even for large instruction sets and high-resolution EM probing.

Table 1.

	[4]	[6]	[7]	[5]	[8]	[27]	This Work
DUT	PIC16F687	ATMega328	PIC16F687	ATMega328P	PIC16F15376	Cortex M0	AT89S51
# of Instr. (\(N\))	33	2	33	112	50	17	90
Side-Channel	Power	Coarse-grained EM	Fine-grained EM	Power	Fine-grained EM	Power	Fine-grained EM
# of Samples Measured per Instr. (\({N}_{{\rm{pc}}} \times {N}_{\rm{t}} \times {\bar{N}}_{{\rm{inst}}}\))	\(\sim 2 \times 10^6\) (\(1 \times 1000 \times 2000\))	\(\sim 2 \times 10^4\) (\(1 \times 100 \times 200\))	\(\sim 1.2 \times 10^8\) (\(20 \times 2500 \times 2350\))	\(\sim 1.5 \times 10^5\) (\(1 \times 50 \times 3000\))	\(\sim 3.2 \times 10^7\) (\(400 \times 2000 \times 40\))	\(\sim 1.1 \times 10^7\) (\(1 \times 6000 \times 1768\))	\(\sim 4.7 \times 10^7\) (\(5200 \times 1000 \times 9\))
Success (test code)	∼70.1%	100%	∼96.2%	∼99.0%	∼95.0%	∼99.0%	∼99.3%
Success (application code)	∼50.8%	–	∼87.7%	–	–	∼88.2%	∼97.3%

Table 1. Comparison of Relevant Work

2.2 Proposed Approach

As mentioned in the Introduction, the proposed method consists of two phases (Figure 2). In the feature-selection phase, EM fields emanated from the DUT are collected for all instructions by designing and using profiling codes that instantiate each instruction for multiple specific machine states, chosen according to the HW leakage model [9, 15]. The signals are collected with all measurement configurations in a 5D search space consisting of the probe location, probe orientation, and time interval. Next, the min-max bounds of signals—directly probed fields, as well as differential signals derived from them—are found for each instruction, and these signal envelopes are compiled within a hierarchical database. The database stores for each instruction—at the bottom stage of the hierarchy—real-valued envelopes that are multivariate functions of the measurement configuration, i.e., they are functions of five variables. For the upper stages of the hierarchy, instructions are grouped using certain instruction attributes (Figure 1), and the database is compiled bottom-up, i.e., the envelopes for the instruction classes in the upper stages are constructed using envelopes for instruction classes compiled in the lower stages.

Fig. 2.

Once the database is constructed, it is used to identify optimal measurement configurations and features for binary classification. During feature selection, the envelopes for each instruction class are compared pairwise (one at a time) to those of other classes at the same stage; the comparison identifies \(M\) configurations, where the pair's signal envelopes are most distant; i.e., these are the optimal values of the five variables to distinguish the pair from each other. The signals obtained with the optimal measurement configurations, i.e., the selected features, and the envelopes of the two classes corresponding to them are recorded for use in the next phase. In the classification phase, signals measured while the DUT executes arbitrary codes are categorized hierarchically starting from the top stage. At each stage, candidate classes are identified given the class selected in the previous stage, using binary classification with majority voting [5].
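The pairwise comparison and voting described above can be sketched in Python as follows. This is only an illustration, not the authors' implementation: the gap measure between envelopes, the midpoint-based vote, and all names (`envelope_gap`, `select_features`, `majority_vote`) are assumptions, and the measurement configurations (location, orientation, time) are flattened into a single axis.

```python
import numpy as np

def envelope_gap(env_a, env_b):
    """Gap between two min-max envelopes at every configuration.

    env_a, env_b: arrays of shape (2, n_config) holding (min, max).
    Returns 0 where the envelopes overlap (hypothetical distance measure).
    """
    lo_a, hi_a = env_a
    lo_b, hi_b = env_b
    return np.maximum(0.0, np.maximum(lo_a - hi_b, lo_b - hi_a))

def select_features(env_a, env_b, m):
    """Indices of the m configurations where the pair is most distant."""
    gap = envelope_gap(env_a, env_b)
    return np.argsort(gap)[::-1][:m]

def majority_vote(signal, env_a, env_b, features):
    """Each selected configuration votes for the class whose envelope
    midpoint is closer to the measured signal (assumed voting rule)."""
    mid_a = env_a.mean(axis=0)[features]
    mid_b = env_b.mean(axis=0)[features]
    s = np.asarray(signal, dtype=float)[features]
    votes_a = np.sum(np.abs(s - mid_a) < np.abs(s - mid_b))
    return "a" if votes_a > len(features) / 2 else "b"

# Toy envelopes over four flattened configurations.
env_a = np.array([[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
env_b = np.array([[0.5, 3.0, 1.2, 5.0], [1.5, 4.0, 2.0, 6.0]])
features = select_features(env_a, env_b, m=2)
label = majority_vote([0.7, 0.4, 0.9, 0.6], env_a, env_b, features)
```

In this toy example the two configurations with the widest gaps are selected, and a signal lying inside the first class's envelope at those configurations is assigned to that class.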

3 Background

This section describes the DUT's measurement setup, the SCA threat model, the hierarchical grouping of the instruction set, and the signals used in the proposed method.

3.1 Measurement Setup

To demonstrate the proposed method, this article uses the AT89S51 microcontroller, which implements 111 instructions, differing in function, size, length, addressing mode, source and destination operands, and so on [12]. The setup used for the measurements is shown in Figure 3. The DUT was operated at 2 MHz. Fields were sensed using a 1-mm H-field probe, positioned at a fixed height of 0.5 mm and at various points on an equally spaced 51 × 51 grid over the DUT's surface (area ∼8 × 8 mm²) using Riscure's probe positioner. Measurements were performed using both x- and y-oriented probes. Therefore, \({N}_{{\rm{pc}}} = 51 \times 51 \times 2 = 5,\!202\) probe configurations were used for constructing the database. Signals were collected and analyzed using a Keysight DSOS054A oscilloscope at a sampling rate of 2 GS/s (\({N}_{\rm{t}} = 1,\!000\) samples); the signals were collected 50 times and averaged to minimize measurement noise. For comparison and validation, measurements using the coarse-grained EM SCA setup were also performed, using a 10-mm H-field probe. HEX files for programs, generated using Keil's 8051 emulator, were uploaded to the program memory of the chip using an Arduino as an interface. These codes included start/end markers to simplify measurements, implemented via a general-purpose I/O pin. The probe positioning, data acquisition, and subsequent data storage were automated to save experiment time. To reduce storage requirements, samples were saved as single-precision floating-point numbers in binary file format. More information on the setup can be found in References [15, 16]. Only \(N = 90\) instructions were considered for the following analyses; instructions that use external and indirect addressing modes were excluded, because such instructions are seldom used by compilers for general-purpose codes, unless access to external memory is required, and because the focus of this article is on EM emanations arising from on-chip switching activity.

Fig. 3.
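For illustration, the set of probe configurations described in Section 3.1 can be enumerated with a short Python sketch. The function name `probe_configurations`, the choice of grid origin, and the assumption that the grid spans the full 8 mm side including both edges are hypothetical; only the grid size, height, and the two orientations come from the text.

```python
import numpy as np
from itertools import product

def probe_configurations(side_mm=8.0, n_grid=51, height_mm=0.5,
                         orientations=("x", "y")):
    """List (x, y, h, o) tuples for an equally spaced scan grid.

    Assumes the grid starts at 0 mm and includes both edges of the
    scanned area; the actual origin used in the setup is not stated here.
    """
    coords = np.linspace(0.0, side_mm, n_grid)
    return [(float(x), float(y), height_mm, o)
            for x, y, o in product(coords, coords, orientations)]

configs = probe_configurations()
n_pc = len(configs)                    # 51 * 51 * 2 = 5202
step_mm = configs[2][1] - configs[0][1]  # spacing between adjacent y points
```

Under these assumptions the grid spacing is 0.16 mm, which is much finer than the 1-mm probe diameter.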

3.2 Threat Model

Different threat models are assumed in the feature-selection and classification phase experiments. To allow accurate profiling, limited restrictions are placed on evaluators during the first phase. As in previous works [4–8], the feature-selection phase assumes that evaluators can control a clone of the DUT, or the DUT itself, such that they can send known profiling codes to the device and observe the internal architectural state of the microcontroller as each instruction is executed. Further, the evaluators are assumed to be able to repeat such codes as many times as desired, allowing field measurements to be averaged to minimize measurement noise. In contrast to this transparent “white-box” model of the feature-selection phase, a more restrictive “gray-box” model [17] is used in the classification phase. In this model, the code being executed, the inputs, and the internal operations of the DUT are assumed not to be visible to the evaluators, but the evaluators are assumed to still be able to repeat the codes being targeted, similarly to the setup used by other fine-grained EM works that combine measurements from multiple locations to increase success rates of disassembling instructions [7, 8] or to identify an instruction's functional units [18].

3.3 Hierarchical Grouping of Instructions

Attempting to directly classify measured signals within a large set of candidate instructions increases the odds of misclassification. Hierarchical classification can decrease the misclassification risk by reducing the number of possible candidates in each stage, assuming the stages in the hierarchy are appropriately chosen for the DUT (poor groupings can result in potentially more misclassifications at the upper stages). In Reference [5], a two-stage hierarchy was used: the instructions were separated into eight groups based on operands and into sub-groups based on their function. That grouping is not suitable for microcontrollers that have a large number of possible operands (>30 for AT89S51). Instead, in this article, two higher stages, where instructions are grouped according to length and size, are added to the hierarchy. In Stages III and IV of the hierarchy, instructions are grouped based on operands and their functions as in Reference [5], resulting in four stages of hierarchy (Figure 1). These four attributes of each instruction \(ins\) are represented with the label \({\rm{I}}{{\rm{D}}}_{ins} = ( {L,S,Op,Fn} )\). Here \(L\) denotes the length, \(S\) the size, \(Op\) the operands, and \(Fn\) the function of the instruction, i.e., the number of cycles it takes to complete execution, the number of bytes fetched from program memory for it, the memory locations of the data values it uses, and the operations it performs, respectively. In AT89S51, instructions require \(L \in \{ {1,2,4} \}\) cycles for execution, are of size \(S \in \{ {1,2,3} \}\) bytes, have 30 possible operands, and implement 45 functions. Table 2 shows the resulting hierarchy. In the following, cycle lengths and sizes are represented with the suffixes C and B; e.g., the label for the 1-cycle, 1-byte instruction INC Acc is \({\rm{I}}{{\rm{D}}}_{INC\ Acc} =\) (1C, 1B, Acc, INC).

Table 2.

Length	Size	Operands	Functions
1Cycle (51 ins)	1Byte (25 ins)	Acc¹	INC; DEC; RR; RRC; RL; RLC; SWAP; DA; CPL; CLR
		Acc, Reg	ADD; ADDC; SUBB; ORL; XRL; ANL; MOV; XCH
		C-bit²	SETB; CLR; CPL
		Reg³	INC; DEC
		Reg, Acc	MOV
		No ops.	NOP
	2Byte (26 ins)	Acc, Imm⁴	ADD; ADDC; SUBB; ORL; XRL; ANL; MOV
		Acc, Dir	ADD; ADDC; ORL; ANL; XRL; SUBB; MOV; XCH
		Dir⁵	INC; DEC
		C-bit, Bit	MOV
		Bit⁶	CLR; CPL; SETB
		Reg, Imm	MOV
		Dir, Acc	ORL; ANL; XRL; MOV
2Cycle (37 ins)	1Byte (5 ins)	Acc, Dptr⁷	MOVC
		Acc, PC⁸	JMP; MOVC
		No ops.	RET; RETI
	2Byte (17 ins)	Addr⁹	ACALL; AJMP
		C, Bit	ANL; ORL
		Reg, Off¹⁰	DJNZ
		Off	JZ; JNZ; JC; JNC; SJMP
		C, /Bit	ANL; ORL
		Dir	PUSH; POP
		Reg, Dir	MOV
		Dir, Reg	MOV
		Bit, C-bit	MOV
	3Byte (15 ins)	Dir, Imm	MOV; ANL; ORL; XRL
		Bit, Off	JB; JBC; JNB
		Addr	LCALL; LJMP
		Acc, Imm, Off	CJNE
		Acc, Dir, Off	CJNE
		Reg, Imm, Off	CJNE
		Dir, Off	DJNZ
		Dir, Dir	MOV
		Dptr, Imm	MOV
4Cycle (2 ins)	1Byte (2 ins)	Acc, B¹¹	MUL; DIV

Table 2. Instruction Classes

¹ Accumulator, ² Carry Bit, ³ General Purpose Registers, ⁴ Immediate Value, ⁵ Direct RAM Address, ⁶ Register Bit, ⁷ Data Pointer, ⁸ Program Counter, ⁹ Branch Address, ¹⁰ Branch Offset, ¹¹ B Register.

3.4 Observed Signals and Target Processes

Signals collected by a near-field probe above a DUT are functions of five variables in the measurement setup used (Figure 3): the probe's configuration \(pc\) (its transverse location \(( {x,y} )\), height \(h\), and orientation \(o\) relative to the DUT) and the time of observation \(t\). Thus, the probed fields can be represented as five-dimensional functions \(V( {pc,t} )\). Of course, the measured signal also depends on the processes \(pr\) that the DUT is executing, i.e., the state of the microcontroller. These processes are performed at specific time intervals within a DUT's machine cycle, localizing features temporally. The processes can be abstracted as a combination of a target process \(Tp{r}_i\) and one or more background processes \(Bp{r}_j\), where the subscripts \(i\) and \(j\) represent versions within these processes [9]; e.g., if the entire instruction opcode is considered the target process, then the 90 target versions are \(Tp{r}_1 \equiv\) INC Acc, \(Tp{r}_2 \equiv\) DEC Acc, …, \(Tp{r}_{90} \equiv\) DIV Acc, B, and the background processes include data operations in various architectural registers. The background processes can be represented using the state of architectural registers \(X \in \{ {{\rm{X}}}_1, \ldots, {{\rm{X}}}_{{N}_{\rm{x}}} \}\), where each state \({{\rm{X}}}_k\) represents a unique data value in registers (RAM, stack, program counter, etc.) and \({N}_{\rm{x}}\) is the number of combinations of register contents. Thus, the signals can also be represented as seven-dimensional functions \(V( {pc,t,Tp{r}_i,Bp{r}_j} )\). Using the notation in Reference [9], a signal's dependence on the measurement configuration and the processes executed on the DUT is highlighted with super/sub-scripts; e.g., \(V_{Tp{r}_i,Bp{r}_j}^{pc,t}\).

In addition to the probed fields \(V_{Tp{r}_i,Bp{r}_j}^{pc,t}\) , the differential signal

\begin{equation} {\rm{\Delta }}V_{Tp{r}_i,Bp{r}_j}^{pc,t} = \left| {V_{Tp{r}_i,Bp{r}_j}^{pc,t + {\rm{\Delta }}t} - V_{Tp{r}_i,Bp{r}_j}^{pc,t}} \right| \tag{1} \end{equation}

is introduced. Here \(\Delta t\) is the product of the cycle length \(L\) of the target process \(Tp{r}_i\) and the clock period \({T}_{{\rm{clk}}}\). In this work, the differential signals are computed between the corresponding clock cycles of adjacent instructions. While traditional differential side-channel analysis assumes that the signals observed in a single clock cycle represent the transition between different machine states, the differential signal introduced in this article computes differences in fields over multiple clock cycles, i.e., it captures the change in fields measured from before an instruction is executed to after it is executed. This is a useful quantity for separating instructions that modify the contents of architectural blocks shared across the instruction set, such as program counters or the pre-fetched architectural registers. For instance, the 8051 reserves certain sub-cycles to operate on the accumulator or certain RAM registers [11], irrespective of the executed instruction, enabling easier identification of instructions impacting these registers with differential signals. Example signals are plotted in Figure 4.

Fig. 4.
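Equation (1) can be illustrated with a minimal NumPy sketch. The function name `differential_signal` and the parameter `samples_per_cycle` (the number of oscilloscope samples per clock period) are assumptions introduced here; the sketch simply shifts a sampled trace by \(\Delta t = L \cdot T_{\rm clk}\) and takes the absolute difference.

```python
import numpy as np

def differential_signal(v, cycle_length, samples_per_cycle):
    """Compute |V(t + dt) - V(t)| with dt = cycle_length clock periods.

    v: sampled trace(s), time along the last axis.
    cycle_length: L, the number of cycles the target instruction takes.
    samples_per_cycle: samples per clock period (assumed known).
    """
    v = np.asarray(v, dtype=float)
    shift = cycle_length * samples_per_cycle
    if not 0 < shift < v.shape[-1]:
        raise ValueError("trace is too short for the requested shift")
    return np.abs(v[..., shift:] - v[..., :-shift])

# Toy trace: V(t) = t^2 sampled at 12 points, two samples per cycle.
trace = np.arange(12.0) ** 2
delta_v = differential_signal(trace, cycle_length=1, samples_per_cycle=2)
```

The output is shorter than the input by one shift, since the last \(\Delta t\) worth of samples has no later counterpart to subtract from.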

If a single-stage disassembler were used, then the target process would be the complete instruction opcode. Thus, each version of the target process from \(Tp{r}_1\) to \(Tp{r}_{90}\) would represent a candidate opcode for disassembling the observed signals. The large set of candidates poses major issues in feature selection and classification; e.g., a total of \(\binom{90}{2} = 4,\!005\) classifiers are required for binary classification [5]. In contrast, the proposed four-stage hierarchical disassembler constructs only 281 classifiers, because there are relatively small numbers of candidate classes in each stage. What constitutes target and background processes, however, changes at each stage of the hierarchy. The target process in each stage is a different attribute of the opcode, identified by the label \({\rm{I}}{{\rm{D}}}_{ins} = ( {L,S,Op,Fn} )\). Because classification in each stage distinguishes instructions based on only one attribute, the remaining attributes of the opcode are assumed to be part of the background: In Stage I, the target instruction length can take values from the set \(L \in \{ {1{\rm{C}},\ 2{\rm{C}},\ 4{\rm{C}}} \}\). Here \(Bpr\) for \(L = 1{\rm{C}}\) instructions includes any combination of the architectural state \(X\) and the 51 groups of \(( {1{\rm{C}},S,Op,Fn} )\) in Table 2. The hierarchy then enables independent analysis within each branch in the following stages; e.g., in Stage II, the instruction size is analyzed separately for 1C instructions (for which \(S \in \{ {1{\rm{B}},\ 2{\rm{B}}} \}\)) and 2C ones (for which \(S \in \{ {1{\rm{B}},\ 2{\rm{B}},\ 3{\rm{B}}} \}\)). Although the attributes \(( {S,Op,Fn} )\) are assumed to be “background” processes here, they are still constrained by the target process versions being evaluated, unlike the state of background architectural registers, which is unrestricted.
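The classifier counts quoted above can be reproduced with a short Python sketch, assuming one binary classifier is built for every pair of sibling classes at each stage. The nested dictionary below transcribes the number of functions per operand class from Table 2; the names `TABLE2`, `count_instructions`, and `count_classifiers` are introduced here for illustration.

```python
from math import comb

# Functions per operand class, transcribed from Table 2:
# length -> size -> list with one entry per operand class.
TABLE2 = {
    "1C": {"1B": [10, 8, 3, 2, 1, 1],
           "2B": [7, 8, 2, 1, 3, 1, 4]},
    "2C": {"1B": [1, 2, 2],
           "2B": [2, 2, 1, 5, 2, 2, 1, 1, 1],
           "3B": [4, 3, 2, 1, 1, 1, 1, 1, 1]},
    "4C": {"1B": [2]},
}

def count_instructions(table):
    """Total number of instructions (Stage IV leaves)."""
    return sum(n for sizes in table.values()
               for operand_classes in sizes.values()
               for n in operand_classes)

def count_classifiers(table):
    """Pairwise classifiers among sibling classes, summed over all stages."""
    total = comb(len(table), 2)                      # Stage I: lengths
    for sizes in table.values():
        total += comb(len(sizes), 2)                 # Stage II: sizes
        for operand_classes in sizes.values():
            total += comb(len(operand_classes), 2)   # Stage III: operands
            total += sum(comb(n, 2) for n in operand_classes)  # Stage IV
    return total

n_instructions = count_instructions(TABLE2)          # 90
flat_classifiers = comb(n_instructions, 2)           # 4005
hierarchical_classifiers = count_classifiers(TABLE2)  # 281
```

Under this pairing assumption the per-stage counts are 3, 4, 111, and 163, which sum to the 281 classifiers stated in the text.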

4 Phase I: Feature Selection

This section details the database construction, the profiling codes, and the feature-selection method in the first phase of disassembly.

4.1 Database Construction

Each instruction class is characterized by four signal envelopes in the database; these envelopes are five-dimensional functions (of \(pc\) , \(t\) ). The hierarchical database is constructed as follows (see Figure 1 for stage definitions). First, the Stage IV portion of the database is compiled for the 90 instructions. For each instruction \(Tp{r}_i\) , multiple instantiations are executed (see Section 4.2), the EM fields are probed using all possible probe configurations, and the min-max envelopes of probed fields and differential signals are stored in the database:

\begin{equation} {{\bf env}}_{Tp{r}_i}^{pc,t} = \left[ {\min V,\max V,\min {\rm{\Delta }}V,\max {\rm{\Delta }}V} \right]\ . \end{equation}

(2)

Here the minima and maxima are found among all instantiations of the instruction, i.e., \(\forall Bp{r}_j \in Bpr\). Next, these 90 instructions are grouped according to their operand class, as per Table 2. The envelopes for each of the 35 operand classes in Stage III are constructed by computing the min-max bounds of the envelopes of all the instructions with that operand. Similarly, Stage II (I) portions of the database are compiled from its Stage III (II) portions. Figure 5 shows an example computation of the min-max envelopes.

Fig. 5.
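The bottom-up construction of the database can be sketched in a few lines of Python. This is a minimal illustration at a single probe configuration, with hypothetical trace values and function names. It shows only the probed-field bounds, on the assumption that the differential-signal bounds in Equation (2) are built the same way.

```python
def envelope(traces):
    """Pointwise min-max envelope of equal-length traces.

    Returns (lower, upper), each a list with one value per time sample.
    """
    lower = [min(column) for column in zip(*traces)]
    upper = [max(column) for column in zip(*traces)]
    return lower, upper

def merge_envelopes(envelopes):
    """Envelope of a parent class, bounding all of its children's envelopes."""
    lowers = [env[0] for env in envelopes]
    uppers = [env[1] for env in envelopes]
    return ([min(column) for column in zip(*lowers)],
            [max(column) for column in zip(*uppers)])

# Hypothetical traces: two instantiations each of two Stage IV instructions
# that share an operand class (three time samples per trace).
instr_a = [[0.1, 0.4, 0.2], [0.3, 0.5, 0.1]]
instr_b = [[0.0, 0.6, 0.2], [0.2, 0.7, 0.3]]

env_a = envelope(instr_a)   # ([0.1, 0.4, 0.1], [0.3, 0.5, 0.2])
env_b = envelope(instr_b)   # ([0.0, 0.6, 0.2], [0.2, 0.7, 0.3])

# Stage III envelope of the shared operand class.
env_operand = merge_envelopes([env_a, env_b])
print(env_operand)          # ([0.0, 0.4, 0.1], [0.3, 0.7, 0.3])
```

Applying `merge_envelopes` again to the Stage III results would give the Stage II envelopes, and so on up the hierarchy.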

4.2 Profiling Codes

One approach to finding the signal envelopes is to collect an extensive set of signals, e.g., by instantiating the architectural registers \(X\) with random values. For instance, Reference [5] used \(3,\!000\) such instantiations per instruction for feature selection. While this can improve classification accuracy for coarse-grained EM/power SCA setups, the acquisition cost for fine-grained EM setups quickly becomes intractable when so many instantiations are used: For \(N = 90\) instructions, if \({N}_{\rm{t}} = 50\) time samples of signals are measured as in Reference [5] with a single probe configuration ( \({N}_{{\rm{pc}}} = 1\) ), then a total of \(13.5\ \times {10}^6\) samples would be acquired. If they are measured with the fine-grained EM SCA setup in this work, with \({N}_{{\rm{pc}}} \sim 5{,}200\) probe configurations (Section 6.1), then a total of \(70\ \times {10}^9\) samples would be acquired. Storing these samples as single-precision floating-point numbers would require ∼50 MB of space for the former and ∼280 GB for the latter setup. Additional storage may be required during feature selection, e.g., to transform time-domain data to frequency domain.
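The sample counts and storage figures quoted above can be reproduced with the short Python check below. The function name is illustrative, and 4 bytes per sample is assumed for single-precision storage.

```python
BYTES_PER_SAMPLE = 4  # single-precision floating point

def acquisition_cost(n_instr, n_inst, n_pc, n_t):
    """Return (number of samples, storage in bytes) for a profiling campaign."""
    samples = n_instr * n_inst * n_pc * n_t
    return samples, samples * BYTES_PER_SAMPLE

# Random profiling as in Reference [5]: 3,000 instantiations, 50 time samples.
coarse = acquisition_cost(90, 3000, 1, 50)
fine = acquisition_cost(90, 3000, 5200, 50)

print(coarse)  # (13500000, 54000000)          i.e., ~50 MB
print(fine)    # (70200000000, 280800000000)   i.e., ~280 GB
```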

A smaller set of signals can be collected by modeling the leakage as if it depends only on HWs of data in architectural registers, a common approach in processor security evaluations [9, 11]; e.g., signals for 256 data values can be bounded by those for extreme instantiations of data 0x00 (HW 0) and 0xFF (HW 8). Then, the data dependency of each instruction (except conditional branch instructions) can be bounded by using at most four instantiations by setting operands and result to data values 0x00 and 0xFF. For example, consider the instruction ADD Acc, Imm. To bound its data dependence, the data values in the Accumulator register and the Immediate value in program memory are chosen from the set {(0x00,0x00), (0x00,0xFF), (0xFF,0x00), (0xFF,0xFF)}. Further, to improve coverage of background processes, all 128 bytes of RAM, including stack registers, are instantiated as either 0x00 or 0xFF. Therefore, eight instantiations are used to characterize each instruction in the profiling codes. Code snippets used to profile this instruction are shown in Figure 6. In addition to the instruction instantiations, extra instructions are used to support measurements, such as a general-purpose pin triggering the oscilloscope for ease of experiment.

Fig. 6.
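The instantiation set for a two-operand instruction such as ADD Acc, Imm can be enumerated as follows. This Python sketch assumes one reading of the text, namely that the eight instantiations are the four extreme operand pairs crossed with the two RAM fill values. The dictionary keys are illustrative names, not identifiers from the profiling codes.

```python
from itertools import product

EXTREMES = (0x00, 0xFF)  # Hamming weight 0 and Hamming weight 8

def profiling_instantiations():
    """Enumerate (RAM fill, Accumulator, Immediate) settings at the HW extremes."""
    return [
        {"ram_fill": ram, "acc": acc, "imm": imm}
        for ram, acc, imm in product(EXTREMES, repeat=3)
    ]

instantiations = profiling_instantiations()
print(len(instantiations))  # 8
for inst in instantiations:
    print({name: f"0x{value:02X}" for name, value in inst.items()})
```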

Because conditional branches perform different functions depending on the result of the condition evaluation, branches taken and not taken for the same instruction are considered as separate classes in Stage IV, i.e., they have the same instruction length, size, and operands but different functions. Introducing 12 additional instruction classes for the conditional branch instructions in Table 2, control-flow prediction is enabled in the final stage of disassembly. Using 16 instantiations for conditional branch instructions and 8 for other instructions, the proposed profiling codes contain a total of \(N{\bar{N}}_{{\rm{inst}}} = 12\ \times \ 16 + 78\ \times \ 8 = 816\) specially designed test instructions (in addition to miscellaneous instructions used as markers for measurement, and various instructions needed to clear flag registers, data memory, or stack). These profiling codes are used to acquire the following total number of samples to construct the database:

\begin{equation} {N}_{{\rm{samp}}} = N{\bar{N}}_{{\rm{inst}}}{N}_{{\rm{pc}}}{N}_{\rm{t}}\ {\rm{(\# \ of\ Samples\ Acquired}}). \end{equation}

(3)

Here \({N}_{{\rm{pc}}}\) is the number of probe configurations, \({N}_{\rm{t}}\) is the number of time samples, \(N\) is the number of instructions, and \({\bar{N}}_{{\rm{inst}}}\) is the average number of instantiations used to profile each instruction. While \({N}_{{\rm{pc}}}{N}_{\rm{t}}\) depends on the measurement setup, \({\bar{N}}_{{\rm{inst}}}\) depends on the profiling method.
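Equation (3) can be evaluated directly for the proposed profiling codes. The Python sketch below uses the instantiation counts given above together with the \({N}_{{\rm{pc}}} = 5202\) and \({N}_{\rm{t}} = 1000\) values reported in Section 6.1; the function and variable names are illustrative.

```python
def n_samp(total_instantiations, n_pc, n_t):
    """Eq. (3): samples = (N * average N_inst) * N_pc * N_t."""
    return total_instantiations * n_pc * n_t

# 12 conditional branches with 16 instantiations each,
# 78 other instructions with 8 instantiations each.
total_instantiations = 12 * 16 + 78 * 8
average_per_instruction = total_instantiations / 90

samples = n_samp(total_instantiations, n_pc=5202, n_t=1000)

print(total_instantiations)                # 816
print(round(average_per_instruction, 2))   # 9.07
print(samples)                             # 4244832000
```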

4.3 Selecting the Features

Feature selection identifies optimal measurement configurations where envelopes (and therefore signals) are easily separable when compared pairwise. Here, as well as in Section 5, the process is presented for two instruction classes \(a\) and \(b\) at the same stage of the hierarchy. First, the “average distance” between the pair’s envelopes is computed:

\begin{equation} Dist_{a,b}^{pc,t} = \frac{{\left| {\left( {{{\bf env}}_a^{pc,t}\left[ 1 \right] + {{\bf env}}_a^{pc,t}\left[ 2 \right]} \right) - \left( {{{\bf env}}_b^{pc,t}\left[ 1 \right] + {{\bf env}}_b^{pc,t}\left[ 2 \right]} \right)} \right|}}{2}. \end{equation}

(4)
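As a small numerical illustration of Equation (4), the Python sketch below evaluates the distance at a single measurement configuration. The envelope bounds are hypothetical values in arbitrary units, and each envelope is passed as a (min V, max V) pair.

```python
def envelope_distance(env_a, env_b):
    """Eq. (4) at one (pc, t): |(min_a + max_a) - (min_b + max_b)| / 2."""
    min_a, max_a = env_a
    min_b, max_b = env_b
    return abs((min_a + max_a) - (min_b + max_b)) / 2

print(envelope_distance((1.0, 3.0), (4.0, 8.0)))  # 4.0
print(envelope_distance((1.0, 3.0), (0.0, 4.0)))  # 0.0
```

The second call shows that the metric is the distance between envelope midpoints: two envelopes centered on the same value give zero distance even when their widths differ.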

While feature selection in Stages II–IV directly uses this quantity, a pre-processing step is required in Stage I, because signals with different time lengths are compared. It is assumed that the first cycle of multi-cycle instructions is similar to a single-cycle instruction, due to the presence of opcode fetch-related processes. Consequently, in Stage I feature selection, signals for multi-cycle instructions are partitioned into multiple single-cycle windows, similarly to Reference [4]. The partitioned windows are then compared separately to single-cycle instructions, assuming the cycles that follow the first cycle will show sufficient differences to allow their length-based classification. Figure 7 shows an example of the distance between single-cycle instructions and the second cycle of two-cycle instructions. The distance \({\rm{\Delta }}Dist_{a,b}^{pc,t}\) between the differential signal envelopes is computed similarly. As demonstrated in Figure 8, some instruction classes are potentially more separable using differential signals. Prediction of a program's control flow can be achieved in Stage IV of the disassembly, as shown in Figure 9.

Fig. 7.

Fig. 8.

Fig. 9.
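The Stage I pre-processing step, in which multi-cycle signals are partitioned into single-cycle windows, can be sketched as below. This Python illustration assumes that every machine cycle spans the same number of time samples. The trace values and the function name are hypothetical.

```python
def split_into_cycles(trace, samples_per_cycle):
    """Partition a trace into consecutive single-cycle windows."""
    if samples_per_cycle <= 0 or len(trace) % samples_per_cycle != 0:
        raise ValueError("trace length must be a whole number of cycles")
    return [
        trace[start:start + samples_per_cycle]
        for start in range(0, len(trace), samples_per_cycle)
    ]

# Hypothetical two-cycle trace sampled at 4 points per cycle.
two_cycle = [0.1, 0.3, 0.2, 0.0, 0.5, 0.9, 0.4, 0.1]
windows = split_into_cycles(two_cycle, 4)

print(len(windows))  # 2
print(windows[1])    # [0.5, 0.9, 0.4, 0.1]
```

Each window can then be compared with single-cycle instruction envelopes using the same distance metric as in the other stages.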

Next, optimal measurement configurations that maximize the distance between signal envelopes are identified. For each pairwise comparison, \(M = 10\) optimal probe configurations—5 each for direct and differential signals—and the corresponding 10 optimal time instances are stored in the arrays \({{\bf pc}}_{a,b}^{{\rm{opt}}}\) and \({{\bf t}}_{a,b}^{{\rm{opt}}}\) . The signals at these optimal measurement configurations are the selected features that will be compared with the stored envelopes to classify instructions.
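The selection step can be sketched as a top-k search over the distance maps. The Python example below is a minimal illustration that ranks (probe configuration, time) pairs jointly. Whether the five configurations per signal type must be distinct probe configurations is not stated here, so that choice is an assumption, as are the function names and the toy distance values.

```python
import heapq

def top_configurations(dist, k=5):
    """Return the k (pc, t) keys with the largest distance, best first.

    dist maps (pc, t) -> envelope-to-envelope distance for one class pair.
    """
    return heapq.nlargest(k, dist, key=dist.get)

def select_features(dist_direct, dist_diff, k=5):
    """Build pc_opt and t_opt: k direct entries followed by k differential ones."""
    configs = top_configurations(dist_direct, k) + top_configurations(dist_diff, k)
    pc_opt = [pc for pc, _ in configs]
    t_opt = [t for _, t in configs]
    return pc_opt, t_opt

# Toy distance maps over 2 probe configurations and 2 time samples.
dist_direct = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.5, (1, 1): 0.3}
dist_diff = {(0, 0): 0.7, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.8}

pc_opt, t_opt = select_features(dist_direct, dist_diff, k=2)
print(pc_opt)  # [0, 1, 1, 0]
print(t_opt)   # [1, 0, 1, 0]
```

With `k=5`, the two returned arrays have \(M = 10\) entries each, with the direct-signal entries in the first half and the differential-signal entries in the second half.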

5 Phase II: Classification

During classification, the probed field \(V^{pc,t}\) and differential signal \(\Delta V^{pc,t}\) are compared to the signal envelopes in the database. The deviations of the evaluated signals from the envelopes of candidate classes \(a\) and \(b\) in the database are computed as

\begin{equation} Dev_{a/b}^{pc,t} = {\rm{Max}}\left\{ {V - {{\bf env}}_{a/b}^{pc,t}\left[ 2 \right],0} \right\} + {\rm{Max}}\left\{ {{{\bf env}}_{a/b}^{pc,t}\left[ 1 \right] - V,0} \right\}{\rm{\ }}. \end{equation}

(5)

This metric is 0 if the evaluated signal is within the stored envelope. The deviation of a probed field from the envelopes in Figure 7 is shown in Figure 10. A corresponding metric \({\rm{\Delta }}Dev_{a/b}^{pc,t}\) is computed for the differential signals.

Fig. 10.
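Equation (5) amounts to measuring how far a sample lies outside an interval. The Python sketch below evaluates it at a single measurement configuration, using hypothetical envelope bounds passed as a (min, max) pair.

```python
def deviation(v, env):
    """Eq. (5): distance by which v falls outside the envelope [min, max]."""
    env_min, env_max = env
    return max(v - env_max, 0.0) + max(env_min - v, 0.0)

print(deviation(2.0, (1.0, 3.0)))   # 0.0  (inside the envelope)
print(deviation(3.5, (1.0, 3.0)))   # 0.5  (above the upper bound)
print(deviation(0.25, (1.0, 3.0)))  # 0.75 (below the lower bound)
```

At most one of the two terms is nonzero for any valid envelope, since a sample cannot be above the maximum and below the minimum at once.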

During binary classification, the net deviation of the evaluated signal from the two candidates \(a\) and \(b\) is computed only with the \(M\) optimal measurement configurations for separating them:

\begin{equation} NetDe{v}_{a/b} = \mathop \sum \limits_{m = 1}^{M/2} Dev_{a/b}^{\ {{\bf pc}}_{a,b}^{{\rm{opt}}}\left[ m \right],{{\bf t}}_{a,b}^{{\rm{opt}}}\left[ m \right]} + \ \mathop \sum \limits_{m = M/2 + 1}^M {\rm{\Delta }}Dev_{a/b}^{\ {{\bf pc}}_{a,b}^{{\rm{opt}}}\left[ m \right],{{\bf t}}_{a,b}^{{\rm{opt}}}\left[ m \right]}. \end{equation}

(6)

The instruction class with the smaller net deviation is considered the more likely candidate for the evaluated signal. To classify among multiple candidates, the binary classification is implemented with a majority voting method [5],

\begin{equation} \begin{array}{@{}*{1}{c}@{}} {vot{e}_{a,b} = \left\{ {\begin{array}{@{}*{1}{c}@{}} { + 1,\ {\rm{if}}\ NetDe{v}_a \le NetDe{v}_b}\\ { - 1,\ {\rm{if}}\ NetDe{v}_a > NetDe{v}_b} \end{array}} \right.}\\ {\ {a}^* = \mathop {{\rm{argmax}}}\limits_a \mathop \sum \limits_{b = 1\ \left( {b \ne a} \right)}^{{N}_{\rm{c}}} vot{e}_{a,b}} \end{array}. \end{equation}

(7)

Here \({a}^*\) is the most likely candidate class and \({N}_{\rm{c}}\) is the number of candidate classes.
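The pairwise comparison and vote counting can be sketched as follows. This Python example is a minimal illustration under stated assumptions: in each pair, the class with the smaller net deviation receives the +1 vote (as the prose rule states), ties go to the first class, and direct and differential features share one dictionary keyed by a hypothetical feature label. The envelopes and the measured signal are toy values.

```python
def deviation(v, env_min, env_max):
    """Distance by which v falls outside [env_min, env_max]."""
    return max(v - env_max, 0.0) + max(env_min - v, 0.0)

def net_deviation(signal, envelopes, features):
    """Sum of deviations from one class's envelopes over the selected features."""
    return sum(deviation(signal[f], *envelopes[f]) for f in features)

def classify(signal, class_envelopes, pair_features):
    """One-vs-one majority voting; returns (winning class, vote totals)."""
    classes = list(class_envelopes)
    votes = {c: 0 for c in classes}
    for i, a in enumerate(classes):
        for b in classes[i + 1:]:
            features = pair_features[(a, b)]
            dev_a = net_deviation(signal, class_envelopes[a], features)
            dev_b = net_deviation(signal, class_envelopes[b], features)
            winner, loser = (a, b) if dev_a <= dev_b else (b, a)
            votes[winner] += 1
            votes[loser] -= 1
    return max(votes, key=votes.get), votes

# Toy database: three classes, two features, envelopes as (min, max).
class_envelopes = {
    "A": {"f0": (0.0, 1.0), "f1": (0.0, 1.0)},
    "B": {"f0": (2.0, 3.0), "f1": (0.0, 1.0)},
    "C": {"f0": (0.0, 1.0), "f1": (4.0, 5.0)},
}
# Features selected per pair (the optimal configurations for that pair).
pair_features = {("A", "B"): ["f0"], ("A", "C"): ["f1"], ("B", "C"): ["f0", "f1"]}

signal = {"f0": 2.5, "f1": 0.5}
best, votes = classify(signal, class_envelopes, pair_features)
print(best)   # B
print(votes)  # {'A': 0, 'B': 2, 'C': -2}
```

Each pair is judged only on the features selected for that pair, so a class never has to be separable from all other classes at one common configuration.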

6 Experiments and Results

To test the proposed disassembler, first, each instruction is instantiated 100 times with random operand values. In this test set, each instruction is padded with a NOP instruction, and before the instantiations the RAM registers are cleared, similarly to the profiling codes shown in Figure 6. A total of 10,200 instructions are evaluated in this test set. This evaluation is similar to the test sets that follow the templates of profiling codes, used in References [4–7]. For conditional branch instructions, two separate test sets are used for the branch “taken” and “not-taken” cases. The operands in both cases are randomized with constraints to ensure the functions are correctly executed; e.g., for the jump-if-not-zero instruction's branch “taken” case, the operand is allowed to take all values other than 0.

Second, a more robust and complete evaluation of the proposed disassembler is performed by using a set of four application codes from Dalton benchmarks [14], which are specifically designed to optimize the performance of 8051 cores: the greatest common divisor (GCD), Fibonacci (FIB), sort, and square root (SQRT) codes. As their names indicate, the codes compute the GCD of two numbers, generate the first 10 Fibonacci numbers, sort 10 specified integers in ascending order, and find the square root of a specified floating-point number. The compiled codes were first disassembled using Keil's 8051 emulator, providing a reference assembly code to judge the accuracy of the proposed disassembler.

Third, the potency of the fine-grained EM SCA approach is evaluated by implementing the proposed feature-selection and classification methodology using a coarse-grained EM SCA setup (with a relatively large probe [6]) and comparing the success rates of the two approaches. Here the measurement configurations are optimized only over the time dimension, as there is a single fixed probe location and orientation.

6.1 Feature-selection Results

To construct the database with the proposed profiling codes, a total of \({N}_{{\rm{samp}}} = N{\bar{N}}_{{\rm{inst}}}{N}_{{\rm{pc}}}{N}_{\rm{t}}\ = 816\ \times {\rm{\ }}5202\ \times {\rm{\ }}1,\!000\sim4.2\ \times {10}^9\) samples (after they were averaged 50 times by the oscilloscope) were acquired. For comparison, consider applying the methods presented in References [4–7] directly to the presented fine-grained EM SCA setup: Assuming \({N}_{{\rm{pc}}}\) and \(N\) are the same as in this work, but using the same \({\bar{N}}_{{\rm{inst}}}\) and \({N}_{\rm{t}}\) values as in the previous works, the methods would require ∼222 × [4], ∼17 × [5], ∼2.2 × [6], and ∼650 × [7] more samples than the proposed method.

Results for the feature-selection phase are exemplified in Figure 11, which shows that the envelope-to-envelope distances decrease across space and time at the lower stages of the hierarchy. This behavior is expected for well-designed hierarchies that progressively refine the granularity of the recovered instructions. It was also observed that the spatio-temporal distributions of distances for each stage were different, i.e., each stage of the hierarchy impacted the probed fields differently. Further, it was observed that features for all classifiers were limited to the region marked with white in Figure 11. Consequently, measurements for the classification phase were limited to this region (25 × 25 locations).

Fig. 11.

6.2 Classification Results

First, the test codes with 100 randomized instantiations of each instruction were disassembled, and the recovered results were compared to the reference assembly code line by line. The accuracy is then simply computed as a ratio of correctly recovered instructions to the total number of instructions. The success rate of the disassembly was 10,130 of 10,200 instructions (∼99.3%). Evaluating accuracy stagewise showed that the disassembled instructions had 100% accuracy for all instructions in Stages I–III, i.e., all misclassifications were in Stage IV. Therefore, the incorrectly recovered instructions still contained some relevant information. It was also observed that all conditional branches were correctly identified, including if the branch was taken or not. Such high success rates are to be expected, because these codes follow a similar template to the profiling codes.

Results for the disassembly of application benchmarks are shown in Table 3. The total accuracy for the fine-grained setup was found to be ∼97%, with less than \(\pm 2\%\) variation among the four benchmarks. Similarly to the evaluation of the test codes, no misclassifications were observed in the first three stages, and a 100% accuracy was observed in identifying conditional branch instructions. While a slight decrease in the disassembly accuracy was observed for the benchmarks, the difference is minimal compared to the disassemblers demonstrated in Reference [4] and Reference [7]. Finally, the most misidentified instruction for both test codes and benchmarks was the \({\rm{ADDC}}\ {\rm{Acc}},{\rm{Reg}}\), commonly misclassified as instruction \({\rm{ADD}}\ {\rm{Acc}},{\rm{Reg}}\) (misclassified in 22 of 123 instances). Potential reasons for the misclassification have to do with the close functional relation between the ADD and ADDC (i.e., add with carry) instructions, since in the absence of a carry bit, identical operations are performed by the microarchitecture. The opcodes of these instructions in the ISA are also very similar, including how they are decoded. Similar misclassifications were also observed for rotate and rotate with carry instructions that only differ minimally in functionality and operation. However, these instructions are not frequently used by the compiler thereby limiting inaccuracies and misclassification rates in large benchmarks.

Table 3. Measurement Results

Benchmark	Code Size (bytes)	# of Instructions	Fine-grained EM: # of Correct Instructions	Fine-grained EM: Accuracy (%)	Coarse-grained EM: # of Correct Instructions	Coarse-grained EM: Accuracy (%)
GCD	55	111	108	∼97.3	71	∼64.0
FIB	303	804	794	∼98.7	531	∼66.0
sort	572	2665	2556	∼95.9	1702	∼63.9
SQRT	1167	2006	1972	∼98.3	1327	∼66.1
Total	2097	5586	5430	∼97.2	3631	∼65.0
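The totals in Table 3 follow from the per-benchmark counts, with accuracy defined as the ratio of correctly recovered instructions to the total number of instructions. The short Python check below recomputes them; the variable names are illustrative.

```python
# (instructions, correct with fine-grained EM, correct with coarse-grained EM)
table3 = {
    "GCD": (111, 108, 71),
    "FIB": (804, 794, 531),
    "sort": (2665, 2556, 1702),
    "SQRT": (2006, 1972, 1327),
}

def accuracy(correct, total):
    """Percentage of correctly recovered instructions."""
    return 100.0 * correct / total

total_instr, total_fine, total_coarse = (sum(col) for col in zip(*table3.values()))

print(total_instr, total_fine, total_coarse)          # 5586 5430 3631
print(round(accuracy(total_fine, total_instr), 1))    # 97.2
print(round(accuracy(total_coarse, total_instr), 1))  # 65.0
```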

The disassembler implemented using the coarse-grained EM SCA setup showed a success rate of only ∼70% when disassembling the test codes and ∼65% when disassembling the benchmarks (Table 3). In contrast to the fine-grained measurement setup, misclassifications were observed in Stages II, III, and IV. Clearly, the fine-grained EM SCA setup resulted in a more potent disassembler. An example demonstrating the differences between database envelopes for the fine-grained and coarse-grained EM setups is shown in Figure 12. It was observed that envelopes from the fine-grained setup were narrower and had sharper signal variations compared to the envelopes from the coarse-grained setup. Consequently, the min-max envelopes from the coarse-grained setup overlap for multiple classes at the selected configurations, leading to misclassifications, even when the distance predicted between instruction classes is high (Figure 12). Further, the overlap is also observed to increase in the coarse-grained case as the classification moves to the lower hierarchical levels.

Fig. 12.

7 Conclusions and Future Work

A fine-grained EM SCA-based disassembler was proposed to recover instructions executed on a general-purpose microcontroller. The proposed method uses a hierarchical framework to improve feature selection and classification. It identifies optimal measurement configurations that distinguish instruction classes in the first phase by (i) executing model-based profiling codes to efficiently collect probed fields in a database and (ii) finding envelopes that bound the probed fields and, a novel quantity, differential signals derived from them. In the second phase, measured signals with these optimal measurement configurations are classified by comparing them to the signal envelopes of instruction classes one pair at a time. The comparisons were performed by quantifying the deviation of the measured signals from the signal envelopes. The proposed disassembler was shown to successfully and feasibly recover ∼97% to ∼99% of instructions from application benchmarks and test codes executed on an AT89S51 microcontroller. Further, all conditional branch executions were correctly identified, enabling control-flow leakage prediction. It was also observed that the fine-grained EM SCA was significantly more potent than a coarse-grained EM SCA.

The proposed disassembler can potentially detect malware within basic blocks [19], as well as those impacting control flow integrity [20–22]. Combined with appropriate tools quantifying vulnerabilities in side channels [15, 23–25], the disassembler can further enable programmers to optimize programs to minimize leakage. Finally, the instruction level granularity of the disassembler enables detection of small-scale hardware trojans that are more challenging to address compared to malicious code [26].

The DUT used in this article simplifies the disassembly significantly because of its low-complexity multi-cycle architecture; additional work is required to extend the proposed work to more complex embedded processors. For instance, in Reference [27], randomized instructions were introduced based on the number of pipeline stages, while profiling individual instruction classes. A similar extension can be proposed for the fine-grained disassembler in this work; e.g., the feature-selection phase in heavily pipelined processors can be split into two sub-phases: The first sub-phase can implement the feature-selection methodology, using a few select instructions padded with NOPs (Section 4.2). Once a sufficiently small set of potent probe configurations is identified, the NOP instructions can be replaced with randomized instructions and operands for reduction, depending on the number of pipeline stages. Additional datasets can also be created for groups with a large number of instructions, to improve their disassembly, similarly to Reference [27].

The disassembly can be improved further by recovering data values of operands [9], in addition to instructions. There is also potential to improve disassembly with higher-resolution probes. A more optimal method of combining features from multiple configurations can also reduce misclassifications, with the potential to re-examine predicted results and observe anomalies. Further, differential signals are a novel quantity that requires further exploration, potentially being used to observe changes across multiple pipeline stages as the instruction is executed, adding a new dimension to the analysis. Finally, imposing more restrictions on evaluators in the classification phase, similarly to generic black-box testing threat models, may necessitate the use of more potent post-processing techniques in combination with some of the aforementioned potential improvements to the setup. Code monitoring through instruction disassembly presents a non-invasive pathway to detect intrusions, and therefore evaluate embedded hardware security.

Footnote

Please note that the acquisition cost here only quantifies storage requirements and not acquisition time. Acquisition time is related to several setup-dependent factors including oscilloscope features, DUT parameters, averaging method, etc., some of which are not always available in the literature.

References

[1] Yannan Liu, Lingxiao Wei, Zhe Zhou, Kehuan Zhang, Wenyuan Xu, and Qiang Xu. 2016. On code execution tracking via power side-channel. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS '16). Association for Computing Machinery, New York, NY, 1019–1031.
