SJB Institute of Technology: CO & ARM Microcontrollers (21EC52)
Dr. Supreeth H S G
Associate Professor
Dept. of ECE SJBIT
VTU syllabus
Module – 05
Textbook 2: Chapter 4, 5
Textbook:
These instructions use two pre-indexed addressing modes:
offset by register and offset by immediate.
Single-Register Load-Store Instructions
The offset-by-register mode uses a base register Rn
plus the register offset Rm.
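ADR r4,b ; r4 = address of b
LDR r0,[r4] ; r0 = b
ADR r4,c ; r4 = address of c
LDR r1,[r4] ; r1 = c
ADD r2,r0,r1 ; r2 = b + c
ADR r4,a ; r4 = address of a
LDR r0,[r4] ; r0 = a
MUL r2,r0,r2 ; r2 = a * (b + c); Rd and Rm differ, as cores before ARMv6 require
ADR r4,y ; r4 = address of y
STR r2,[r4] ; y = a * (b + c)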
Programs
To evaluate (A + 8B + 7C - 27)/4, where A = 25, B = 19,
and C = 99.
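MOV r0,#25 ; r0 = A
MOV r1,#19 ; r1 = B
ADD r0,r0,r1,LSL #3 ; r0 = A + 8B
MOV r1,#99 ; r1 = C
MOV r2,#7 ; r2 = 7
MLA r0,r1,r2,r0 ; r0 = 7C + (A + 8B)
SUB r0,r0,#27 ; r0 = A + 8B + 7C - 27
MOV r0,r0,ASR #2 ; r0 = r0 / 4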
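The same calculation can be sketched in C to check the result by hand. This is only an illustration: the function name evaluate is an assumption, and the final right shift matches a divide by 4 only while the intermediate value is non-negative, as it is here.

```c
#include <stdint.h>

/* Mirrors the instruction sequence above: a shift replaces the
   multiply by 8 and the divide by 4. The name is hypothetical. */
int32_t evaluate(int32_t a, int32_t b, int32_t c)
{
    int32_t r0 = a + (b << 3);  /* ADD r0,r0,r1,LSL #3 : A + 8B  */
    r0 = c * 7 + r0;            /* MLA r0,r1,r2,r0     : + 7C    */
    r0 = r0 - 27;               /* SUB r0,r0,#27                 */
    return r0 >> 2;             /* MOV r0,r0,ASR #2    : / 4     */
}
```

With A = 25, B = 19 and C = 99 the intermediate value is 843, so the shift leaves 210 (the fractional part of 210.75 is discarded).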
Programs
if (i == 0)
{
i = i +10;
}
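SUBS R1, R1, #0 ; set flags from i (Z = 1 when i == 0)
ADDEQ R1, R1, #10 ; if (i == 0) i = i + 10
Equivalent version using CMP:
CMP R1,#0 ; compare i with 0
ADDEQ R1, R1, #10 ; if (i == 0) i = i + 10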
Programs
• Compiler Optimization
• Hand optimization is time-consuming and makes the code harder to read
• Make the most common case FAST
• Find the hot path with a profiler (in the ARM or GCC compiler/debugger tools)
Example: Use C compiler
• ARM compiler (armcc)
• ARM Developer Suite (ADS) v1.1
• armcc -Otime -c -o test.o test.c
• fromelf -text/c test.o > test.txt
• For example
• Let’s start by looking at how the ARM compiler (armcc) handles the basic C
data types.
• We will see that some types are more efficient for local variables than
others. There are also differences between the addressing modes available
when loading and storing data of each type.
C Data Types
• C data type
• char: unsigned byte (ARM C compiler)
• short: signed 16-bit
• int: signed 32-bit
• long: signed 32-bit
• long long: signed 64-bit
• For example
• char i; // i is unsigned
• while (i>=0) … // i is always >= 0, so the while loop never exits
• In ARM C
• char: unsigned 8-bit byte
• short: signed 16-bit half word
• int, long: signed 32-bit word
• long long: signed 64-bit double word
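The unsigned-char pitfall can be sketched in portable C by declaring the counter as unsigned char explicitly, which models what armcc does with a plain char. The function names below are assumptions made for this illustration.

```c
/* Decrementing an unsigned 8-bit counter at zero wraps to 255,
   so a test such as "i >= 0" can never become false. */
unsigned char wrap_decrement(unsigned char i)
{
    return (unsigned char)(i - 1);
}

/* A loop that terminates for an unsigned counter: test against
   zero with != instead of relying on the value going negative. */
unsigned int count_down(unsigned char n)
{
    unsigned int iterations = 0;
    unsigned char i;

    for (i = n; i != 0; i--) {
        iterations++;
    }
    return iterations;
}
```

Calling wrap_decrement(0) gives 255 rather than -1, which is why the while loop above never exits.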
Use the Type explicitly
• Assigning an Explicit Size to Data Items
Sometimes kernel code requires data items of a specific size, either to match predefined binary structures (this happens
when reading partition tables, when executing a binary file, or when decoding a network packet) or to
align data within structures by inserting "filler" fields (but please refer to "Data Alignment" later in this chapter for
information about alignment issues).
The kernel offers the following data types to use whenever you need to know the size of your data. All the types are
declared in <asm/types.h>, which in turn is included by <linux/types.h>:
u8; /* unsigned byte (8 bits) */
u16; /* unsigned word (16 bits) */
u32; /* unsigned 32-bit value */
u64; /* unsigned 64-bit value */
These data types are accessible only from kernel code (i.e., __KERNEL__ must be defined before including
<linux/types.h>). The corresponding signed types exist, but are rarely needed; just replace u with s in the name if you
need them.
If a user-space program needs to use these types, it can prefix the names with a double underscore: __u8 and the
other types are defined independent of __KERNEL__. If, for example, a driver needs to exchange binary structures
with a program running in user space by means of ioctl, the header files should declare 32-bit fields in the structures
as __u32.
It's important to remember that these types are Linux specific, and using them hinders porting software to other Unix
flavors. Systems with recent compilers will support the C99-standard types, such as uint8_t and uint32_t; when
possible, those types should be used in favor of the Linux-specific variety. If your code must work with 2.0 kernels,
however, use of these types will not be possible (since only older compilers work with 2.0).
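As a minimal sketch of the C99 alternative mentioned above, the fixed-width types from <stdint.h> can describe a binary record layout. The structure and its field names are hypothetical, and the 8-byte total assumes the usual natural alignment with no padding between these fields.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical binary record header, for illustration only.
   Each field has an exact width regardless of the platform's
   sizes for int and long. */
struct record_header {
    uint8_t  type;      /* 8 bits  */
    uint8_t  flags;     /* 8 bits  */
    uint16_t length;    /* 16 bits */
    uint32_t checksum;  /* 32 bits */
};

/* Reports the size the compiler actually gave the structure. */
size_t record_header_size(void)
{
    return sizeof(struct record_header);
}
```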
Typedef in Linux
• Linux:
• asm/types.h (included by linux/types.h)
#if PLATFORM_INT_SIZE != 32
  #if PLATFORM_LONG_SIZE == 32
    typedef unsigned long u32;
  #elif PLATFORM_LONG_LONG_SIZE == 32
    typedef unsigned long long u32;
  #endif
#else
  typedef unsigned int u32;
#endif
int a = sizeof(u32);
Local variable
• Example:
• int checksum(int *data)
• {
• char i;
• int sum = 0;
• for(i=0;i<64;i++)
• sum += data[i];
• return sum;
• }
Local Variable: as “char”
• checksum_v1_s
• MOV r2,r0 ; r2 = data
• MOV r0,#0 ; sum = 0
• MOV r1,#0 ;i=0
• checksum_v1_loop
• LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
• ADD r1,r1,#1 ; r1 = i+1
• AND r1,r1,#0xff ; i = (char)r1
• CMP r1,#0x40 ; compare i, 64
• ADD r0,r3,r0 ; sum += r3
• BCC checksum_v1_loop ; if (i<64) loop
• MOV pc,r14 ; return sum
Local Variable: as “unsigned int”
• With “unsigned int i”:
• the loop body is one instruction shorter (the AND operation is no longer needed)
• checksum_v2_s
• MOV r2,r0 ; r2 = data
• MOV r0,#0 ; sum = 0
• MOV r1,#0 ;i=0
• checksum_v2_loop
• LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
• ADD r1,r1,#1 ; r1++
• CMP r1,#0x40 ; compare i, 64
• ADD r0,r3,r0 ; sum += r3
• BCC checksum_v2_loop ; if (i<64) goto loop
• MOV pc,r14 ; return sum
Local Variable: as “short”
• Suppose the data packet contains 16-bit values. It’s tempting to declare
both the data and the running sum as “short”.
• The loop is now three instructions longer than the loop in checksum_v2
earlier!
• checksum_v3_s
• MOV r2,r0 ; r2 = data
• MOV r0,#0 ; sum = 0
• MOV r1,#0 ;i=0
• checksum_v3_loop
• ADD r3,r2,r1,LSL #1 ; r3 = &data[i] // (1) Shifting
• LDRH r3,[r3,#0] ; r3 = data[i] // LDRH
• ADD r1,r1,#1 ; i++
• CMP r1,#0x40 ; compare i, 64
• ADD r0,r3,r0 ; r0 = sum + r3
• MOV r0,r0,LSL #16 ;
• MOV r0,r0,ASR #16 ; sum = (short)r0 // (2) Casting
• BCC checksum_v3_loop ; if (i<64) goto loop
• MOV pc,r14 ; return sum
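The slide does not show the C source for this listing. A plausible sketch, assuming data and sum are both declared short as the comments in the assembly suggest, is given below; the explicit cast corresponds to the LSL #16 / ASR #16 pair inside the loop.

```c
/* Assumed C source for checksum_v3: a 16-bit running sum forces
   the compiler to narrow the result on every iteration. */
short checksum_v3(short *data)
{
    unsigned int i;
    short sum = 0;

    for (i = 0; i < 64; i++) {
        sum = (short)(sum + data[i]);  /* cast to short each pass */
    }
    return sum;
}
```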
checksum_v3: Question & Solution
• Q1: LDRH does not allow a shifted address offset, as the
LDR instruction did in checksum_v2.
• A: This is a new issue. We can solve it by accessing the array through
an incrementing pointer “data” rather than an index as in
“data[i]”. All ARM load/store instructions have a post-increment
addressing mode.
• Post-increment
• The *(data++) operation translates to a single ARM instruction that loads the data and
increments the data pointer.
• Note:
• LDRSH with post-increment: three instructions have been removed from the loop.
• MOV shifts (the cast to short): still present, but now outside the loop body
• checksum_v4_s
• MOV r2,#0 ; sum = 0
• MOV r1,#0 ;i=0
• checksum_v4_loop
• LDRSH r3,[r0],#2 ; r3 = *(data++)
• ADD r1,r1,#1 ; i++
• CMP r1,#0x40 ; compare i, 64
• ADD r2,r3,r2 ; sum += r3
• BCC checksum_v4_loop ; if (i<64) goto loop
• MOV r0,r2,LSL #16
• MOV r0,r0,ASR #16 ; r0 = (short)sum
• MOV pc,r14 ; return r0
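A C version consistent with this listing might look like the sketch below. It is an assumption rather than the slide's own source: the sum is kept in an int and the pointer is post-incremented, so the only cast to short is the one at the return.

```c
/* Assumed C source for checksum_v4: int accumulator plus
   pointer post-increment, narrowing once on return. */
short checksum_v4(short *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += *(data++);   /* LDRSH r3,[r0],#2 */
    }
    return (short)sum;      /* LSL #16 / ASR #16 outside the loop */
}
```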
Function arguments
• Lessons
• We saw in “Local Variable” that converting local variables from type “char” or
“short” to type “int” increases performance and reduces code size. The same
holds for function arguments.
• For example
• Consider a function add_v1 that takes two “short” arguments a and b and returns
a + (b>>1). It is a little artificial, but it helps to illustrate the problems faced by the
compiler.
• The input values ‘a’ and ‘b’ are passed in 32-bit ARM
registers.
• Should the compiler assume that these 32-bit values are in the range of a
“short” type (that is, -32768 to +32767)? Or should the compiler force the values
into this range by sign-extending the lowest 16 bits?
• The compiler must make compatible decisions for the function caller and
callee.
• Either the caller or the callee must perform the cast to a short type.
• armcc: “a + (b>>1)”
• add_v1_s
• ADD r0,r0,r1,ASR #1 ; r0 = (int)a + ((int)b>>1)
• MOV r0,r0,LSL #16 ; Narrow the return value
• MOV r0,r0,ASR #16 ; r0 = (short)r0
• MOV pc,r14 ; return r0
• So,
• armcc assumes that the caller has already ensured that r0 and r1 are in the range
of short.
• Caller: narrows the inputs r0 and r1
• Callee: narrows the return value r0
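The function behind the add_v1 listing is not shown on the slide. The sketch below is an assumed reconstruction based on the comments in the assembly, together with an int variant (the name add_v1_int is hypothetical) that needs no narrowing. Right-shifting a negative value is implementation-defined in C, so the examples stick to non-negative inputs.

```c
/* Assumed source for add_v1: short arguments and a short result,
   so the result must be narrowed before returning. */
short add_v1(short a, short b)
{
    return (short)(a + (b >> 1));
}

/* Hypothetical int variant: arguments and result already fill a
   32-bit register, so no narrowing instructions are required. */
int add_v1_int(int a, int b)
{
    return a + (b >> 1);
}
```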
gcc’s add_v1
• gcc (wide)
• add_v1_gcc
• MOV r0, r0, LSL #16 ; Narrow by callee
• MOV r1, r1, LSL #16
• MOV r1, r1, ASR #17 ; r1 = (int)b>>1
• ADD r1, r1, r0, ASR #16 ; r1 += (int)a
• MOV r1, r1, LSL #16
• MOV r0, r1, ASR #16 ; r0 = (short)r1
• MOV pc, lr ; return r0
• gcc is more cautious and makes no assumptions about the
range of the argument values.
• Callee: narrows both the input arguments and the return value.
Lessons on Function arguments
• Example
• int average_v1(int a, int b)
• {
• return (a+b)/2;
• }
• This compiles to
• average_v1_s
• ADD r0,r0,r1 ; r0 = a+b
• ADD r0,r0,r0,LSR #31 ; if (r0<0) r0++ (one more instruction)
• MOV r0,r0,ASR #1 ; r0 = r0>>1
• MOV pc,r14 ; return r0
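The extra ADD exists because C division rounds toward zero, while an arithmetic shift right rounds toward minus infinity. The following sketch models the three instructions in portable C, assuming 32-bit values and inputs small enough that a + b does not overflow. The helper asr1 and the function names are illustrative, not part of the slide.

```c
#include <stdint.h>

/* Portable arithmetic shift right by one bit (models ASR #1). */
static int32_t asr1(int32_t x)
{
    return x < 0 ? ~(~x >> 1) : x >> 1;
}

/* Models the compiled average_v1 instruction by instruction. */
int32_t average_model(int32_t a, int32_t b)
{
    int32_t r0 = a + b;                    /* ADD r0,r0,r1         */
    r0 = r0 + (int32_t)((uint32_t)r0 >> 31); /* ADD r0,r0,r0,LSR #31 */
    return asr1(r0);                       /* MOV r0,r0,ASR #1     */
}

/* With unsigned operands no correction is needed: a plain shift
   already matches division by 2. */
uint32_t average_unsigned(uint32_t a, uint32_t b)
{
    return (a + b) >> 1;
}
```

For a = -1 and b = -2 the model returns -1, the same as (-3)/2 in C, whereas a bare arithmetic shift of -3 would give -2.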
--------------TOPICS-----------------
1. fixed number of iterations;
2. variable number of iterations;
3. loop unrolling
Loop with fixed number of iterations
• checksum_v5_s
• MOV r2,r0 ; r2 = data
• MOV r0,#0 ; sum = 0
• MOV r1,#0 ;i=0
• checksum_v5_loop
• LDR r3,[r2],#4 ; r3 = *(data++)
• ADD r1,r1,#1 ; i++
• CMP r1,#0x40 ; compare i, 64
• ADD r0,r3,r0 ; sum += r3
• BCC checksum_v5_loop ; if (i<64) goto loop
• MOV pc,r14 ; return sum
• The key point is that the loop counter should count down to zero
rather than counting up to some arbitrary limit.
• int checksum_v6(int *data) {
• unsigned int i;
• int sum=0;
•
• for (i=64; i!=0; i--) {
• sum += *(data++);
• }
• return sum;
• }
Using Decrementing loop
• checksum_v6_s
• MOV r2,r0 ; r2 = data
• MOV r0,#0 ; sum = 0
• MOV r1,#0x40 ; i = 64
• checksum_v6_loop
• LDR r3,[r2],#4 ; r3 = *(data++)
• SUBS r1,r1,#1 ; i-- and set flags
• ADD r0,r3,r0 ; sum += r3
• BNE checksum_v6_loop
• MOV pc,r14 ; return sum
Q: “i != 0 or i>0” when i is signed
• For an unsigned counter (i), we can use either of the loop continuation conditions
“i != 0” or “i > 0”. (As i can’t be negative, they are the same condition.) For a signed counter
(i), it’s tempting to use “i > 0”.
• You may expect the compiler to generate following two instruction to implement the
loop
• SUBS r1, r1, #1 ;compare i with 1, i=i-1
• BGT loop ;if(i+1>1) goto loop
• In fact, it is
• SUB r1, r1, #1 ;i--
• CMP r1, #0 ;compare i with 0,
• BGT loop ;if(i>0) goto loop
• The compiler is not being inefficient. It must be careful about the case when
“i = -0x80000000”, because the two sections of code generate different answers in this case.
• For SUBS, since “-0x80000000 < 1”, the loop terminates.
• For SUB, modulo arithmetic means that i now has the value +0x7fffffff, which is greater
than zero, so the loop continues.
• So “i != 0” always wins! (It saves one instruction over the signed “i > 0”.)
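The difference between the two sequences can be modelled in C. This is a sketch under the assumption of 32-bit two's-complement registers: wrap_dec is a hypothetical helper that reproduces the wraparound of SUB without relying on signed overflow, which is undefined in C.

```c
#include <stdint.h>

/* i - 1 with 32-bit wraparound, written without signed overflow. */
static int32_t wrap_dec(int32_t i)
{
    return i == INT32_MIN ? INT32_MAX : i - 1;
}

/* SUBS r1,r1,#1 ; BGT loop
   The flags come from comparing the old i with 1, so the branch
   is taken when old i > 1 as a signed comparison. */
int continues_subs_bgt(int32_t i)
{
    return i > 1;
}

/* SUB r1,r1,#1 ; CMP r1,#0 ; BGT loop
   The branch is taken when the new, possibly wrapped, i > 0. */
int continues_sub_cmp_bgt(int32_t i)
{
    return wrap_dec(i) > 0;
}
```

The two functions agree for every starting value except i = -0x80000000, where the first reports that the loop ends and the second reports that it continues.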
Counter is a variable
• checksum_v7_s
• MOV r2,#0 ; sum = 0
• CMP r1,#0 ; compare N, 0
• BEQ checksum_v7_end ; if (N==0) goto end
• checksum_v7_loop
• LDR r3,[r0],#4 ; r3 = *(data++)
• SUBS r1,r1,#1 ; N-- and set flags
• ADD r2,r3,r2 ; sum += r3
• BNE checksum_v7_loop ; if (N!=0) goto loop
• checksum_v7_end
• MOV r0,r2 ; r0 = sum
• MOV pc,r14 ; return r0
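The C source for this listing is not on the slide. A sketch that would plausibly compile to it, assuming the count N arrives as an unsigned int in r1, is shown below. The initial CMP/BEQ pair in the assembly corresponds to the loop test being evaluated once before the first iteration.

```c
/* Assumed C source for checksum_v7: the element count N is a
   run-time argument and is counted down to zero. */
int checksum_v7(int *data, unsigned int N)
{
    int sum = 0;

    for (; N != 0; N--) {
        sum += *(data++);
    }
    return sum;
}
```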
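int checksum_v9(int *data, unsigned int N)
{
int sum = 0;
/* assumes N is a nonzero multiple of 4 */
do
{
sum += *(data++);
sum += *(data++);
sum += *(data++);
sum += *(data++);
N -= 4;
} while (N != 0);
return sum;
}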
Result of unrolling
checksum_v9_s
MOV r2,#0 ; sum = 0
checksum_v9_loop
LDR r3,[r0],#4 ; r3 = *(data++)
SUBS r1,r1,#4 ; N-=4 & set flags
ADD r2,r3,r2 ; sum += r3
LDR r3,[r0],#4 ; r3 = *(data++)
ADD r2,r3,r2 ; sum += r3
LDR r3,[r0],#4 ; r3 = *(data++)
ADD r2,r3,r2 ; sum += r3
LDR r3,[r0],#4 ; r3 = *(data++)
ADD r2,r3,r2 ; sum += r3
BNE checksum_v9_loop ; if (N!=0) goto loop
MOV r0,r2 ; r0 = sum
MOV pc,r14 ; return r0
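The unrolled loop above is only correct when N is a nonzero multiple of four. One common way to lift that restriction, sketched here as an assumption rather than as the slide's own code, is to run the unrolled body N/4 times and then handle the remaining N mod 4 elements one at a time. The function name is hypothetical.

```c
/* Unrolled by four, with a short tail loop for the leftovers. */
int checksum_unrolled(const int *data, unsigned int N)
{
    int sum = 0;
    unsigned int i;

    /* Main loop: four elements per iteration. */
    for (i = N / 4; i != 0; i--) {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
    }

    /* Tail loop: the remaining 0 to 3 elements. */
    for (i = N & 3; i != 0; i--) {
        sum += *(data++);
    }
    return sum;
}
```

The tail loop costs a few extra instructions of code size, so it pays off mainly when N is large or not known to be a multiple of four.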
Speedup of Unrolling