
SJB Institute of Technology: CO & ARM Microcontrollers (21EC52)


║JAI SRI GURUDEV║

Sri Adichunchanagiri Shikshana Trust (R)

SJB INSTITUTE OF TECHNOLOGY


BGS Health & Education City, Kengeri, Bangalore – 60.

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING

CO & ARM Microcontrollers


[21EC52]
Module-5
Introduction to the THUMB instruction set & Efficient C
Programming

Dr. Supreeth H S G
Associate Professor
Dept. of ECE SJBIT
VTU syllabus
Module – 05

• Introduction to the THUMB instruction set: Introduction, THUMB register
usage, ARM – THUMB interworking, Other branch instructions, Data processing
instructions, Stack instructions, Software interrupt instructions.
• Efficient C Programming: Overview of C Compilers and optimization, Basic C
Data types, C looping structures.

Textbook 2: Chapter 4, 5

Textbook:

Andrew N. Sloss, Dominic Symes and Chris Wright, “ARM System
Developer’s Guide”, Elsevier, Morgan Kaufmann publisher, 1st Edition, 2008.
Contents
• Thumb Instructions
• THUMB register usage
• ARM – THUMB interworking
• Other branch instructions
• Data processing instructions
• Stack instructions
• Software interrupt instructions

• Overview of C Compilers and optimization


• Basic C Data types
• C looping structures
THUMB
INSTRUCTION
Thumb encodes a subset of the 32-bit ARM
instructions into a 16-bit instruction set space.

Thumb has higher performance than ARM on a
processor with a 16-bit data bus, but
lower performance than ARM on a 32-bit data bus.

Thumb is preferred for memory-constrained
systems.
THUMB
INSTRUCTION
Thumb has higher code density (the space taken up in
memory by an executable program) than ARM.

For memory-constrained embedded
systems, for example, mobile phones
and PDAs, code density is very important.

On average, a Thumb implementation of the same
code takes up around 30% less memory than the
equivalent ARM implementation.
THUMB
INSTRUCTION
Only the branch relative instruction can be conditionally
executed.
The limited space available in 16 bits causes the barrel shift
operations ASR, LSL, LSR, and ROR to be separate
instructions in the Thumb ISA.
Thumb Register Usage
In Thumb state, you do not have direct access
to all registers.

Only the low registers r0 to r7 are fully
accessible.

The higher registers r8 to r12 are only accessible
with MOV, ADD, or CMP instructions.

CMP and all the data processing
instructions that operate on low registers
update the condition flags in the cpsr.
Thumb Register Usage
There are no MSR- and MRS-equivalent Thumb
instructions. To alter the cpsr or spsr, you must switch into
ARM state to use MSR and MRS.

Similarly, there are no coprocessor instructions
in Thumb state. You need to be in ARM state to
access the coprocessor for configuring cache
and memory management.

ARM-Thumb Interworking
ARM-Thumb interworking is the name given to the
method of linking ARM and Thumb code together for
both assembly and C/C++.

It handles the transition between the two
states.

Extra code, called a veneer, is sometimes
needed to carry out the transition.

ATPCS defines the ARM and Thumb procedure call
standards.
ARM-Thumb Interworking
To call a Thumb routine from an ARM
routine, the core has to change state.

This state change is shown in the T bit of
the cpsr.

The BX and BLX branch instructions cause a
switch between ARM and Thumb
state while branching to a routine.
ARM-Thumb Interworking
The BLX instruction was introduced in ARMv5T.

On ARMv4T cores the linker uses a veneer to
switch state on a subroutine call.

Instead of calling the routine directly, the
linker calls the veneer, which switches to Thumb
state using the BX instruction.
ARM-Thumb Interworking
There are two versions of the BX or BLX instructions:
an ARM instruction and a Thumb equivalent.

The ARM BX instruction enters Thumb state only if
bit 0 of the address in Rn is set to binary 1;
otherwise, it enters ARM state.

The Thumb BX instruction does the same.

Unlike the ARM version, the Thumb BX instruction
cannot be conditionally executed.
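The bit 0 rule can be sketched in C as a small model of what BX does with the value in Rn. This is only an illustration: the type and function names below are hypothetical, and the sketch assumes the branch target placed in the pc is Rn with bit 0 cleared.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
typedef enum { STATE_ARM = 0, STATE_THUMB = 1 } core_state_t;

/* Bit 0 of the value in Rn selects the state entered by BX. */
static core_state_t bx_target_state(uint32_t rn)
{
    return (rn & 1u) ? STATE_THUMB : STATE_ARM;
}

/* Assumed model: the address loaded into pc is Rn with bit 0 cleared. */
static uint32_t bx_target_address(uint32_t rn)
{
    return rn & ~(uint32_t)1u;
}
```

For example, a branch target of 0x8001 would select Thumb state and continue execution at 0x8000.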
Branch Instructions
• There are two variations of the standard
branch instruction, or B.

1. The first is similar to the ARM version
and is conditionally executed.
• The branch range is limited to a signed 8-bit
immediate, or −256 to +254 bytes.

2. The second version removes the
conditional part of the instruction.
• This expands the effective branch range to a
signed 11-bit
immediate, or −2048 to +2046 bytes.
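The two ranges follow from the immediate width, assuming the immediate counts 2-byte Thumb instructions (halfwords) rather than bytes. The helper names below are hypothetical; this is a sketch of the arithmetic, not of the instruction encoding.

```c
/* Byte offsets reachable with a signed `bits`-bit immediate that is
   scaled by 2 (one unit per 16-bit Thumb instruction). Hypothetical helpers. */
static int branch_min_offset(int bits)
{
    return -(1 << (bits - 1)) * 2;       /* most negative immediate, in bytes */
}

static int branch_max_offset(int bits)
{
    return ((1 << (bits - 1)) - 1) * 2;  /* most positive immediate, in bytes */
}
```

With bits = 8 this gives the conditional branch range, and with bits = 11 the unconditional one.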
Branch Instructions

• The BL instruction is not
conditionally executed.
Data Processing Instructions
The data processing instructions manipulate data within
registers.

They include move instructions, arithmetic instructions,
shifts, logical instructions, comparison instructions,
and multiply instructions.

The Thumb data processing instructions are a
subset of the ARM data processing instructions.
Data Processing Instructions
These instructions follow the same style as
the equivalent ARM instructions.

Most Thumb data processing instructions operate on
low registers and update the cpsr.

• The exceptions are certain forms of MOV, ADD, SUB, and CMP,
which can operate on the higher registers r8–r14 and
the pc.
Single-Register Load-Store Instructions
The Thumb instruction set supports loading and
storing registers, or LDR and STR.

These instructions use two pre-indexed addressing
modes: offset by register and offset by
immediate.
Single-Register Load-Store Instructions
The offset by register uses a base register Rn
plus the register offset Rm.

The second uses the same base register Rn plus
a 5-bit immediate or a value dependent on the
data size.
Single-Register Load-Store Instructions

This example shows two Thumb instructions that use a pre-indexed
addressing mode. Both use the same preconditions.
Multiple-Register Load-Store Instructions

• The Thumb load-store multiple instructions, LDMIA and STMIA, only support the increment after (IA) addressing mode.

• Here N is the number of registers in the list of registers.

• These instructions always update the base
register Rn after execution.

• The base register and list of registers are
limited to the low registers r0 to r7.
Multiple-Register Load-Store Instructions

This example saves registers r1 to r3 to memory addresses 0x9000 to
0x900c. It also updates base register r4.
STACK INSTRUCTIONS
The Thumb stack operations are different from
the equivalent ARM instructions because they use
the more traditional POP and PUSH concept.
STACK INSTRUCTIONS
• The interesting point to note is that there is no stack
pointer in the instruction.
• This is because the stack pointer is fixed as register
r13 in Thumb operations and sp is automatically
updated.
• The list of registers is limited to the low registers
r0 to r7.
• The PUSH register list can also include the link
register lr; similarly, the POP register list can include the
pc.
• The stack instructions only support full descending
stack operations.
STACK INSTRUCTIONS

• The link register lr is pushed onto the stack with
register r1.

• Upon return, register r1 is popped off the stack, and
the return address is loaded into the pc. This
returns from the subroutine.
SOFTWARE INTERRUPT INSTRUCTION
• Similar to the ARM equivalent, the Thumb
software interrupt (SWI) instruction causes a
software interrupt exception.

• If any interrupt or exception flag is raised in
Thumb state, the processor automatically reverts
to ARM state to handle the exception.
SOFTWARE INTERRUPT INSTRUCTION
• The Thumb SWI instruction has the same effect and
nearly the same syntax as the ARM equivalent.

• It differs in that the SWI number is limited to the range 0
to 255 and it is not conditionally executed.
Programs
x = (a + b) - c;
ADR r4,a
LDR r0,[r4]
ADR r4,b
LDR r1,[r4]
ADD r3,r0,r1
ADR r4,c
LDR r2,[r4]
SUB r3,r3,r2
ADR r4,x
STR r3,[r4]
y = a*(b+c);

ADR r4,b
LDR r0,[r4]
ADR r4,c
LDR r1,[r4]
ADD r2,r0,r1
ADR r4,a
LDR r0,[r4]
MUL r2,r0,r2
ADR r4,y
STR r2,[r4]
Programs
To evaluate (A + 8B + 7C - 27)/4, where A = 25, B = 19,
and C = 99.

MOV r0,#25
MOV r1,#19
ADD r0,r0,r1,LSL #3
MOV r1,#99
MOV r2,#7
MLA r0,r1,r2,r0
SUB r0,r0,#27
MOV r0,r0,ASR #2
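The same computation can be mirrored step by step in C to check the result. The function name is hypothetical, and the sketch relies on the intermediate value being non-negative, so the final right shift behaves like the ASR. The exact quotient is 210.75; the shift discards the fraction.

```c
/* Mirrors the instruction sequence for (A + 8B + 7C - 27)/4.
   Hypothetical helper; assumes the intermediate result is non-negative. */
static int evaluate(int a, int b, int c)
{
    int r = a + (b << 3);   /* ADD r0,r0,r1,LSL #3 : A + 8B       */
    r = c * 7 + r;          /* MLA r0,r1,r2,r0     : 7C + (A + 8B) */
    r = r - 27;             /* SUB r0,r0,#27                       */
    return r >> 2;          /* MOV r0,r0,ASR #2    : divide by 4   */
}
```

With A = 25, B = 19, and C = 99, the intermediate value is 843 and the result is 210.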
Programs

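if (i == 0)
{
i = i + 10;
}

SUBS R1, R1, #0     ; set flags from i
ADDEQ R1, R1, #10   ; if (i == 0) i = i + 10

or, equivalently:

CMP R1, #0          ; compare i with 0
ADDEQ R1, R1, #10   ; if (i == 0) i = i + 10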
Programs

for (i = 0; i < 15; i++)
{
j = j + j;
}

SUB R0, R0, R0      ; i = 0
start CMP R0, #15   ; compare i, 15
ADDLT R1, R1, R1    ; if (i < 15) j = j + j
ADDLT R0, R0, #1    ; if (i < 15) i++
BLT start           ; if (i < 15) goto start
Overview of C Compiler and Optimization

• Optimizing code takes time and reduces source code
readability.

• So only frequently executed functions are worth
optimizing, and non-obvious optimizations should be documented with
source code comments to aid maintainability.
Optimization via C Compiler

• Compiler’s code optimization
• Time-sensitive
• Size-sensitive

• Compiling optimization
• Time-consuming, and makes the code harder to read
• Make the most common case fast
• Find the hot path with a profiler (in the ARM or GCC compiler/debugger)
Example: Use C compiler
• ARM compiler (armcc)
• ARM Developer Suite (ADS) v1.1
• armcc -Otime -c -o test.o test.c
• fromelf -text/c test.o > test.txt

• GNU compiler (arm-elf-gcc v2.95.2)
• arm-elf-gcc -O2 -fomit-frame-pointer -c -o test.o test.c
• -fomit-frame-pointer: don’t save the frame pointer (improves performance)
• Frame pointer: used for dumping local variables (debugging)
• arm-elf-objdump -d test.o > test.txt


IQ of a Compiler

• The C compiler has to translate your C function literally into
assembler, so that it works for all possible inputs.
• In practice, many of the input combinations are not possible or won’t
occur.

• Let’s start by looking at an example of the problems
the compiler faces.
• See the next slide: memclr()
Q: memclr: INPUT sentinel

• For Example

• void memclr(char *data, int N)


•{
• for(;N>0;N--) {
• *data = 0;
• data++;
• }
•}

• Q: No matter how advanced the compiler is, it does not know
whether N can be 0 on input, or whether the data pointer
is four-byte aligned.
Basic C Data Types

• Let’s start by looking at how the ARM compiler (armcc) handles the basic C
data types.
• We will see that some types are more efficient for local variables than
others. There are also differences between the addressing modes available
when loading and storing data of each type.
C Data Types

• C data type
• char: unsigned byte (ARM C compiler)
• short: signed 16-bit
• int: signed 32-bit
• long: signed 32-bit
• long long: signed 64-bit

• ARM’s load/store support


• Pre-ARMv4: LDRB/STRB, LDR/STR
• ARMv4: LDRSB/LDRH/LDRSH,STRH (H:half)
• ARMv5: LDRD/STRD (D:double)
char: unsigned vs. signed ?

• In the ARM C compiler, “char” is unsigned.
• Prior to ARMv4, ARM processors were not good at handling signed 8-bit
or 16-bit values. Therefore ARM C compilers define char to be an unsigned
8-bit value.

• For example
• char i; // i: unsigned
• while (i>=0) … // i is always >= 0, so the loop never exits

• armcc warns: “unsigned comparison with 0”.

• Compilers also provide an override switch to make char signed.
• GCC: “-fsigned-char”, armcc: “-zc”
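One way to avoid depending on the signedness of plain char is to state it explicitly. The sketch below uses hypothetical helper names: the first reports how plain char is defined for the current compiler and flags, and the second is a countdown that terminates either way because it uses unsigned char and an i != 0 test.

```c
#include <limits.h>

/* Returns 1 if plain char is unsigned for this compiler and these flags. */
static int plain_char_is_unsigned(void)
{
    return CHAR_MIN == 0;
}

/* Countdown with an explicitly unsigned counter (hypothetical helper).
   The i != 0 test terminates, unlike "i >= 0" on an unsigned type. */
static unsigned int count_down(unsigned char start)
{
    unsigned int iterations = 0;
    unsigned char i;

    for (i = start; i != 0; i--) {
        iterations++;
    }
    return iterations;
}
```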
Data Type Mapping

• In ARM C
• char: unsigned 8-bit byte
• short: signed 16-bit half word
• int, long: signed 32-bit word
• long long: signed 64-bit double word
Use the Type explicitly
• Assigning an Explicit Size to Data Items
Sometimes kernel code requires data items of a specific size, either to match predefined binary structures [39] or to
align data within structures by inserting "filler" fields (but please refer to "Data Alignment" later in this chapter for
information about alignment issues).
[39] This happens when reading partition tables, when executing a binary file, or when decoding a network packet.
The kernel offers the following data types to use whenever you need to know the size of your data. All the types are
declared in <asm/types.h>, which in turn is included by <linux/types.h>:
u8; /* unsigned byte (8 bits) */
u16; /* unsigned word (16 bits) */
u32; /* unsigned 32-bit value */
u64; /* unsigned 64-bit value */
These data types are accessible only from kernel code (i.e., __KERNEL__ must be defined before including
<linux/types.h>). The corresponding signed types exist, but are rarely needed; just replace u with s in the name if you
need them.
If a user-space program needs to use these types, it can prefix the names with a double underscore: __u8 and the
other types are defined independent of __KERNEL__. If, for example, a driver needs to exchange binary structures
with a program running in user space by means of ioctl, the header files should declare 32-bit fields in the structures
as __u32.

It's important to remember that these types are Linux specific, and using them hinders porting software to other Unix
flavors. Systems with recent compilers will support the C99-standard types, such as uint8_t and uint32_t; when
possible, those types should be used in favor of the Linux-specific variety. If your code must work with 2.0 kernels,
however, use of these types will not be possible (since only older compilers work with 2.0).
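As a portable illustration of the C99 fixed-width types mentioned above, the sketch below declares a record header with explicitly sized fields. The structure and its field names are hypothetical, chosen so that each field is naturally aligned and no padding is expected on common ABIs.

```c
#include <stdint.h>

/* Hypothetical 8-byte record header with explicitly sized fields. */
struct record_header {
    uint8_t  version;   /* 1 byte  */
    uint8_t  flags;     /* 1 byte  */
    uint16_t length;    /* 2 bytes: payload length in bytes */
    uint32_t sequence;  /* 4 bytes */
};

/* Total size of a record: header plus payload (hypothetical helper). */
static uint32_t record_total_bytes(const struct record_header *h)
{
    return (uint32_t)sizeof *h + h->length;
}
```

Because every field has a fixed width, the layout does not change when int or long differ in size between platforms.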
Typedef in Linux

• Linux:
• asm/types.h (included by linux/types.h)
• #if PLATFORM_INT_SIZE!=32
#if PLATFORM_LONG_SIZE==32
typedef long u32;
#elif PLATFORM_LONG_LONG_SIZE==32
typedef long long u32;
#endif
#else
typedef int u32;
#endif

int a=sizeof(u32);
Local variable

• Prefer “int” to “char”.
• For a local variable i, make it an “int” rather than a “char” (unless you want wrap-around
to occur, e.g., “255 + 1 = 0”).
• Q: Does a char take less register space or less space on the ARM stack?
• A: No. All ARM registers are 32-bit and all stack entries are at least 32-bit.

• Example:
• int checksum(int *data)
• {
• char i;
• int sum = 0;
• for(i=0;i<64;i++)
• sum += data[i];
• return sum;
• }
Local Variable: as “char”

• To implement i++ exactly, the compiler must account for
the case when i=255.
• For “char”, “255 + 1” MUST be “0”.

• checksum_v1_s
• MOV r2,r0 ; r2 = data
• MOV r0,#0 ; sum = 0
• MOV r1,#0 ;i=0
• checksum_v1_loop
• LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
• ADD r1,r1,#1 ; r1 = i+1
• AND r1,r1,#0xff ; i = (char)r1
• CMP r1,#0x40 ; compare i, 64
• ADD r0,r3,r0 ; sum += r3
• BCC checksum_v1_loop ; if (i<64) loop
• MOV pc,r14 ; return sum
Local Variable: as “unsigned int”

• With “unsigned int i”
• One instruction fewer in the loop body (the AND is no longer needed)

• checksum_v2_s
• MOV r2,r0 ; r2 = data
• MOV r0,#0 ; sum = 0
• MOV r1,#0 ;i=0
• checksum_v2_loop
• LDR r3,[r2,r1,LSL #2] ; r3 = data[i]
• ADD r1,r1,#1 ; r1++
• CMP r1,#0x40 ; compare i, 64
• ADD r0,r3,r0 ; sum += r3
• BCC checksum_v2_loop ; if (i<64) goto loop
• MOV pc,r14 ; return sum
Local Variable: as “short”

• Suppose the data packet contains 16-bit values. It’s tempting to write
the following C code.

• short checksum_v3(short *data)


• {
• unsigned int i;
• short sum=0;

• for (i=0; i<64; i++) {


• sum = (short)(sum + data[i]);
• }
• return sum;
• }

Q: Why not “sum += data[i]”


Narrowing Cast Warning

• Q: You may wonder why not “sum += data[i]”?
• With armcc, this code will produce a warning if you enable “implicit
narrowing cast warning” using compiler switch “-W+n”.
• A: “sum+data[i]” is an integer and so can only be assigned to
a short using an (implicit or explicit) narrowing cast.
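The promotion to int can be observed directly in C. This is a minimal sketch with a hypothetical helper name: both short operands are promoted to int before the addition, so the result has to be narrowed to fit back into a short.

```c
/* Adds two shorts. The addition is carried out in int because of the
   integer promotions; the cast narrows the result back to short.
   Hypothetical helper, for illustration only. */
static short sum_two(short a, short b)
{
    int wide = a + b;      /* a and b are promoted to int here */
    return (short)wide;    /* explicit narrowing cast          */
}
```

The expression `(short)1 + (short)1` has the size of an int, not of a short, which is why the compiler can warn about the implicit narrowing in `sum += data[i]`.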
Local Variable: as “short”

• The loop is now three instructions longer than the loop for checksum_v2
earlier!
• checksum_v3_s
• MOV r2,r0 ; r2 = data
• MOV r0,#0 ; sum = 0
• MOV r1,#0 ;i=0
• checksum_v3_loop
• ADD r3,r2,r1,LSL #1 ; r3 = &data[i] // (1) Shifting
• LDRH r3,[r3,#0] ; r3 = data[i] // LDRH
• ADD r1,r1,#1 ; i++
• CMP r1,#0x40 ; compare i, 64
• ADD r0,r3,r0 ; r0 = sum + r3
• MOV r0,r0,LSL #16 ;
• MOV r0,r0,ASR #16 ; sum = (short)r0 // (2) Casting
• BCC checksum_v3_loop ; if (i<64) goto loop
• MOV pc,r14 ; return sum
checksum_v3: Question & Solution

• Q1: LDRH does not allow a shifted address offset as the
LDR instruction did in checksum_v2.
• A: We can solve this by accessing the array by
incrementing the pointer “data” rather than using an index as in
“data[i]”. All ARM load/store instructions have a post-increment
addressing mode.

• Q2: The cast reducing “sum+data[i]” to a short requires two
MOV instructions.
• A: Use an “int” variable to hold the partial sum, and reduce the
sum to a “short” only at the function exit.
Post-Increment – *(p++) in C

• Post Increment
• The *(data++) operation translates to a single ARM instruction that loads the data and
increments the data pointer.

• checksum_v4 fixes both problems

• short checksum_v4(short *data)


• {
• unsigned int i;
• int sum=0; // Solution 2

• for (i=0; i<64; i++) {
• sum += *(data++); // Solution 1: post-increment
• }
• return (short)sum; // Solution 2
• }
Local variable: with “pointer”

• Note:
• LDRSH: Three instructions have been removed from loop.
• MOV-SHIFTs (casting): still here, but outside loop body

• checksum_v4_s
• MOV r2,#0 ; sum = 0
• MOV r1,#0 ;i=0
• checksum_v4_loop
• LDRSH r3,[r0],#2 ; r3 = *(data++)
• ADD r1,r1,#1 ; i++
• CMP r1,#0x40 ; compare i, 64
• ADD r2,r3,r2 ; sum += r3
• BCC checksum_v4_loop ; if (i<64) goto loop
• MOV r0,r2,LSL #16
• MOV r0,r0,ASR #16 ; r0 = (short)sum
• MOV pc,r14 ; return r0
Function arguments

• Lessons
• We saw in “Local Variable” that converting local variables from type “char” or
“short” to type “int” increases performance and reduces code size. The same
holds for function arguments.

• For example
• This function is a little artificial, but it helps to illustrate the problems faced by the
compiler.

• short add_v1(short a, short b)


• {
• return (short)(a + (b>>1));
• }
Problems to Compiler

• The input values ‘a’ and ‘b’ will be passed in 32-bit ARM
registers.
• Should the compiler assume that these 32-bit values are in the range of a
“short” type (that is, -32768 to +32767)? Or should the compiler force values
to be in this range by sign-extending?
• The compiler must make compatible decisions for the function caller and
callee.
• Either the caller or the callee must perform the cast to a short type.

• Calling convention: Narrow or Wide?
• We say that function arguments are passed “wide” if they are not
reduced to the range of the type and “narrow” if they are. (For armcc,
arguments are passed narrow and values are returned narrow.)
add_v1: assembly output in armcc

• armcc: “a + (b>>1)”
• add_v1_s
• ADD r0,r0,r1,ASR #1 ; r0 = (int)a + ((int)b>>1)
• MOV r0,r0,LSL #16 ; Narrow the return value
• MOV r0,r0,ASR #16 ; r0 = (short)r0
• MOV pc,r14 ; return r0
• So,
• It assumes that the caller has already ensured r0 and r1 are in the range
of short.
• Caller: narrows the inputs r0 and r1;
• Callee: narrows the return value r0;
gcc’s add_v1

• gcc (wide)
• add_v1_gcc
• MOV r0, r0, LSL #16 ; Narrow by callee
• MOV r1, r1, LSL #16
• MOV r1, r1, ASR #17 ; r1 = (int)b >> 1
• ADD r1, r1, r0, ASR #16 ; r1 += (int)a
• MOV r1, r1, LSL #16
• MOV r0, r1, ASR #16 ; r0 = (short)r1
• MOV pc, lr ; return r0
• gcc is more cautious and makes no assumptions about the
range of argument values.
• Callee: narrows both input arguments and the return value.
Lessons on Function arguments

• Whatever the merits of different narrow or wide calling protocols, you
can see that char or short type function arguments and return values
introduce extra casts.
• It’s more efficient to use the int type for function arguments and return
values, even if you are only passing an 8-bit value.
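A sketch of the same computation with int arguments and an int return value is shown below. The name add_v2 is an assumption made here for illustration. With 32-bit registers and int-sized values, no narrowing casts are needed on either side of the call.

```c
/* int version of the short example: no casts on entry or exit.
   The name add_v2 is hypothetical. Assumes b is non-negative or that
   right shift of a negative int is arithmetic, as it is with armcc and gcc. */
static int add_v2(int a, int b)
{
    return a + (b >> 1);
}
```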
Signed vs. Unsigned Types

• The previous sections demonstrate the advantage of using
int rather than a char or short type for local variables.
• This section compares the efficiencies of “signed int” and
“unsigned int”.
• If your code uses only ‘+’, ‘-’, ‘*’, there is no performance difference
between signed and unsigned operations.
• However, there is a difference when it comes to division (/).
Cost in signed division

• Example
• int average_v1(int a, int b)
• {
• return (a+b)/2;
• }

• This compiles to
• average_v1_s
• ADD r0,r0,r1 ; r0 = a+b
• ADD r0,r0,r0,LSR #31 ; if (r0<0) r0++ (one more instruction)
• MOV r0,r0,ASR #1 ; r0 = r0>>1
• MOV pc,r14 ; return r0

• Notice: one more ADD-Shift


Negative Division

• In C on an ARM target, a divide by two is not a right shift if x
is negative.
• For example, -3>>1 = -2, but -3/2 = -1.
• Division rounds towards zero, but arithmetic right shift rounds
towards -∞.

• It’s more efficient to use unsigned types for divisions.
• The compiler converts unsigned power of two divisions directly to
right shifts.
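The rounding difference, and the fix-up the compiler applies for signed division, can be sketched in C. The helper names are hypothetical. The third function mirrors the ADD ... LSR #31 / ASR #1 sequence and assumes a 32-bit int and an arithmetic right shift for negative values, which the C standard leaves implementation-defined.

```c
/* Signed division by two: rounds toward zero. */
static int halve_signed(int x)
{
    return x / 2;
}

/* Unsigned division by two: identical to a single right shift. */
static unsigned int halve_unsigned(unsigned int x)
{
    return x / 2u;
}

/* Mirrors the compiler's sequence: add 1 to negative values, then shift.
   Assumes 32-bit int and arithmetic right shift (implementation-defined). */
static int halve_signed_by_shift(int x)
{
    int fixup = (int)((unsigned int)x >> 31);  /* 1 if x < 0, else 0 */
    return (x + fixup) >> 1;
}
```

For x = -3 the fix-up makes the shifted result -1, matching -3/2, whereas a bare shift would give -2 under the same assumption.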
Loop Structures
This section looks at the most efficient ways to code “for” and
“while” loops on the ARM.

--------------TOPICS-----------------
1. fixed number of iterations;
2. variable number of iterations;
3. loop unrolling
Loop with fixed number of iterations

• Here is a 64-word packet checksum routine. It shows how the
compiler treats a loop with an incrementing count (i++).

• int checksum_v5(int *data)


• {
• unsigned int i;
• int sum=0;

• for (i=0; i<64; i++) {
• sum += *(data++);
• }
• return sum;
• }
Overhead in a FOR loop

• checksum_v5_s
• MOV r2,r0 ; r2 = data
• MOV r0,#0 ; sum = 0
• MOV r1,#0 ;i=0
• checksum_v5_loop
• LDR r3,[r2],#4 ; r3 = *(data++)
• ADD r1,r1,#1 ; i++
• CMP r1,#0x40 ; compare i, 64
• ADD r0,r3,r0 ; sum += r3
• BCC checksum_v5_loop ; if (i<64) goto loop
• MOV pc,r14 ; return sum

• It takes three instructions to implement the for loop
structure.
• Counter increment: an ADD to increment i
• Comparing: a compare to check if i is less than 64
• Branch: a conditional branch
Reduce the Overhead

• This is NOT efficient. On ARM, a loop should only use two
instructions.
• A subtract to decrement the loop counter
• A conditional branch instruction

• The key point is that the loop counter should count down to zero
rather than counting up to some arbitrary limit.
• int checksum_v6(int *data) {
• unsigned int i;
• int sum=0;

• for (i=64; i!=0; i--) {
• sum += *(data++);
• }
• return sum;
• }
Using Decrementing loop

• checksum_v6_s
• MOV r2,r0 ; r2 = data
• MOV r0,#0 ; sum = 0
• MOV r1,#0x40 ; i = 64
• checksum_v6_loop
• LDR r3,[r2],#4 ; r3 = *(data++)
• SUBS r1,r1,#1 ; i-- and set flags
• ADD r0,r3,r0 ; sum += r3
• BNE checksum_v6_loop
• MOV pc,r14 ; return sum
Q: “i != 0” or “i > 0” when i is signed
• For an unsigned counter i, we can use either of the loop continuation conditions “i!=0”
or “i>0”. (As i can’t be negative, they are the same condition.) For a signed counter
i, it’s tempting to use “i>0”.

• You may expect the compiler to generate the following two instructions to implement the
loop
• SUBS r1, r1, #1 ; compare i with 1, i=i-1
• BGT loop ; if (i+1>1) goto loop
• In fact, it generates
• SUB r1, r1, #1 ; i--
• CMP r1, #0 ; compare i with 0
• BGT loop ; if (i>0) goto loop
• The compiler is not being inefficient. It must be careful about the case when “i=-0x80000000”,
because the two sections of code generate different answers in this case.
• For SUBS: since “-0x80000000<1”, the loop terminates.
• For SUB/CMP: modulo arithmetic means that i now has the value +0x7fffffff (which is >0), so the loop
continues.

• So “i!=0” always wins! (It saves one instruction over “i>0” with a signed i.)
Counter is a variable

• Using the lessons from the last section, we count
down until N=0 and don’t require an extra loop
counter.

• int checksum_v7(int *data, unsigned int N)


• {
• int sum=0;

• for ( ; N!=0; N--) {
• sum += *(data++);
• }
• return sum;
• }
Checking N == 0, Why ?

• checksum_v7_s
• MOV r2,#0 ; sum = 0
• CMP r1,#0 ; compare N, 0
• BEQ checksum_v7_end ; if (N==0) goto end
• checksum_v7_loop
• LDR r3,[r0],#4 ; r3 = *(data++)
• SUBS r1,r1,#1 ; N-- and set flags
• ADD r2,r3,r2 ; sum += r3
• BNE checksum_v7_loop ; if (N!=0) goto loop
• checksum_v7_end
• MOV r0,r2 ; r0 = sum
• MOV pc,r14 ; return r0

• The compiler checks that N is nonzero on entry to the function.
• Often this check is unnecessary, since you know that the array won’t be
empty. In this case, a “do-while” loop gives better performance and code
density than a “for” loop.
Using do-while

• A do-while loop removes the test for N being zero. The caller must guarantee that N is nonzero.

• int checksum_v8(int *data, unsigned int N)


• {
• int sum=0;

• do {
• sum += *(data++);
• } while ( --N!=0);
• return sum;
• }
do-while: Saving Two Cycles

• No need to check “N==0”:
• checksum_v8_s
• MOV r2,#0 ; sum = 0
• checksum_v8_loop
• LDR r3,[r0],#4 ; r3 = *(data++)
• SUBS r1,r1,#1 ; N-- and set flags
• ADD r2,r3,r2 ; sum += r3
• BNE checksum_v8_loop ; if (N!=0) goto loop
• MOV r0,r2 ; r0 = sum
• MOV pc,r14 ; return r0
Loop Unrolling

int checksum_v9(int *data, unsigned int N)


{
int sum=0;

do
{
sum += *(data++);
sum += *(data++);
sum += *(data++);
sum += *(data++);
N-=4;
} while ( N!=0);
return sum;
}
Result of unrolling

checksum_v9_s
MOV r2,#0 ; sum = 0
checksum_v9_loop
LDR r3,[r0],#4 ; r3 = *(data++)
SUBS r1,r1,#4 ; N-=4 & set flags
ADD r2,r3,r2 ; sum += r3
LDR r3,[r0],#4 ; r3 = *(data++)
ADD r2,r3,r2 ; sum += r3
LDR r3,[r0],#4 ; r3 = *(data++)
ADD r2,r3,r2 ; sum += r3
LDR r3,[r0],#4 ; r3 = *(data++)
ADD r2,r3,r2 ; sum += r3
BNE checksum_v9_loop ; if (N!=0) goto loop
MOV r0,r2 ; r0 = sum
MOV pc,r14 ; return r0
Speedup of Unrolling

• Loop overhead in cycles (ARM7TDMI)
• SUB(1), BRANCH(3)
• LOAD(3), ADD(1)
• Cycles per accumulated data value
• Old = 3(load)+1(add)+1(sub)+3(branch) = 8
• New = [(3+1)*4+1+3]/4 = 20/4 = 5
• Speedup
• Old/New = 8/5 = 1.6 (not quite double)
• * ARM9TDMI (faster LOAD) gives a larger speedup.
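The cycle counts generalize to any unroll factor. The sketch below is a simple cost model using the per-instruction cycle costs from this slide; the function and constant names are hypothetical, and the model ignores cache and pipeline effects.

```c
/* Cycle costs taken from the slide (load 3, add 1, sub 1, branch 3). */
enum { LOAD_CYCLES = 3, ADD_CYCLES = 1, SUB_CYCLES = 1, BRANCH_CYCLES = 3 };

/* Cycles per accumulated data value when the loop body is repeated
   `unroll` times per iteration. Hypothetical helper; unroll must be >= 1. */
static double cycles_per_element(unsigned int unroll)
{
    unsigned int body = (LOAD_CYCLES + ADD_CYCLES) * unroll;
    unsigned int overhead = SUB_CYCLES + BRANCH_CYCLES;

    return (double)(body + overhead) / (double)unroll;
}
```

An unroll factor of 1 gives 8 cycles per value and a factor of 4 gives 5. The cost approaches 4 cycles as the factor grows, so the gains diminish quickly.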
Questions in Unrolling (1)

• Q1: How many times (K) should I unroll the loop?
• Suppose the loop is very important, for example, 30% of the entire
application.
• Suppose you unroll the loop until it is 0.5KB in code size (128 instructions).
• Then the loop overhead is at most 4 cycles compared to a loop body of around
128 cycles.
• The loop overhead cost is 4/128, roughly 3% of the loop, and 1% (3% x 30%) of the
overall application.
• So unrolling the code further gains little extra performance,
but has a significant impact on the cache contents.
Questions in Unrolling (2)

• Q2: What if N is not a multiple of K ?


• An easy question

• for (i=N/4; i!=0; i--) {
• sum += *(data++);
• …………………
• sum += *(data++);
• }

• for (i=N&3; i!=0; i--){


• sum += *(data++);
• }
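Putting the two loops together gives a complete routine. This is a sketch under the slide's assumptions (unroll factor 4, unsigned count); the function name is hypothetical. The first loop handles N/4 groups of four values and the second handles the N&3 values left over.

```c
/* Checksum with the main loop unrolled four times, followed by a
   remainder loop for the leftover 0 to 3 values. Hypothetical name. */
static int checksum_unrolled(const int *data, unsigned int N)
{
    unsigned int i;
    int sum = 0;

    for (i = N / 4; i != 0; i--) {      /* groups of four */
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
    }
    for (i = N & 3; i != 0; i--) {      /* remaining N mod 4 values */
        sum += *(data++);
    }
    return sum;
}
```

Both loops count down to zero with an unsigned counter, and the routine also handles N = 0 because both loop conditions are checked before the first iteration.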
Summary: Writing Loops Efficiently

• Use loops that count down to zero;
• Use unsigned loop counters, and
• i!=0 rather than i>0;
• Use a do-while loop rather than a for loop when the loop is known to run at least once
• This saves the compiler checking whether the counter is 0;
• Unroll important loops to reduce overhead
• But do not over-unroll (which hurts cache performance).
ANY
QUESTIONS??
