Compiler Optimization
In computing, optimization is the process of modifying a system to make some aspect of it work more
efficiently or use fewer resources. For instance, a computer program may be optimized so that it executes
more rapidly, is capable of operating within a reduced amount of memory, or draws less battery
power (e.g., on a portable computer). Optimization can occur at a number of levels. At the highest level, the design
may be optimized to make best use of the available resources. The implementation of this design will benefit
from the use of efficient algorithms and the coding of these algorithms will benefit from the writing of good
quality code. Use of an optimizing compiler can help ensure that the executable program is optimized. At the
lowest level, it is possible to bypass the compiler completely and write assembly code by hand. With modern
optimizing compilers and the greater complexity of recent CPUs, it takes great skill to write code that is better
than the compiler can generate and few projects ever have to resort to this ultimate optimization step.
Optimization will generally focus on one or two of execution time, memory usage, disk space, bandwidth, or
some other resource. This usually requires a trade-off, where one resource is optimized at the expense of others. For
example, increasing the size of a cache improves run-time performance but also increases memory consumption.
Other common trade-offs involve code clarity and conciseness.
“The order in which the operations shall be performed in every particular case is a very interesting and
curious question, on which our space does not permit us fully to enter. In almost every computation a great
variety of arrangements for the succession of the processes is possible, and various considerations must
influence the selection amongst them for the purposes of a Calculating Engine. One essential object is to
choose that arrangement which shall tend to reduce to a minimum the time necessary for completing the
calculation.” - Ada Byron’s notes on the analytical engine 1842.
“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root
of all evil. Yet we should not pass up our opportunities in that critical 3%.” - Knuth
The optimizer (the program or compiler component that performs optimization) may itself have to be optimized. Compilation with
the optimizer turned on usually takes more time, though this is only a problem when the program is
significantly large. In particular, for just-in-time compilers the performance of the optimizer is a key factor in improving
execution speed. Spending more time usually yields better code, but compile time is itself precious computer time
that we want to save; thus, in practice, tuning requires a trade-off between the time taken for
optimization and the reduction in execution time gained by the optimized code.
Compiler optimization is the process of tuning the output of a compiler to minimize some attribute (or
maximize the efficiency) of an executable program. The most common requirement is to minimize the time taken
to execute a program; a less common one is to minimize the amount of memory occupied, and the growth of
portable computers has also created interest in minimizing the power consumed by a program. It has been shown that some code
optimization problems are NP-complete. In this chapter we discuss the various types of optimization and the targets of
optimization. We are now at the fifth phase of the compiler; the input to this phase is the intermediate code produced by the intermediate code generator:
[Figure: the intermediate code generator feeds intermediate code into the code optimization phase, whose output goes to code generation.]
Techniques in optimization can be broken up among various scopes which affect anything from a single
statement to an entire program. Some examples of scopes include:
[Figure: classification of optimization techniques into loop optimizations (loop inversion, loop interchange, loop nest optimization, loop reversal, loop unrolling, loop splitting, loop unswitching, code motion), data-flow optimizations (copy propagation, common subexpression elimination), SSA-based optimizations, functional language optimizations, and other optimizations (rematerialization, function chunking, code hoisting).]
Peephole optimization is usually performed late in the compilation process, after machine code has been generated. This form of optimization
examines a few adjacent instructions (like looking through a peephole at the code) to see whether they
can be replaced by a single instruction or a shorter sequence of instructions. Some peephole optimizations, such as
replacing a multiplication by a power of two with a shift, are also instances of strength reduction.
Example 7.1:
1. a = b + c;
2. d = a + e;
becomes
MOV b, R0
ADD c, R0
MOV R0, a
MOV a, R0 # redundant load, can be removed
ADD e, R0
MOV R0, d
After the peephole optimizer removes the redundant load, the sequence is:
MOV b, R0
ADD c, R0
MOV R0, a
ADD e, R0
MOV R0, d
[Figure: types of optimizations: peephole optimizations; local or intraprocedural optimizations; interprocedural or whole-program optimization; loop optimizations; machine independent vs. machine dependent optimizations; programming language-independent vs. language-dependent optimizations.]
Local or intraprocedural optimizations only consider information local to a single function definition. This reduces the amount of analysis that needs
to be performed, saving time and reducing storage requirements.
Interprocedural or whole-program optimizations analyse all of a program's source code. The greater quantity of information extracted means that optimizations
can be more effective than when they only have access to local information (i.e., within a single
function). This kind of analysis also enables new techniques, for instance function
inlining, where a call to a function is replaced by a copy of the function body.
Loop optimizations act on the statements which make up a loop, such as a for loop. They can have a significant
impact because many programs spend a large percentage of their time inside loops.
In addition to scoped optimizations there are two further general categories of optimization:
Programming language-independent vs. language-dependent: most high-level languages share common programming constructs and abstractions, such as decision (if, switch,
case), looping (for, while, repeat...until, do...while), and encapsulation (structures, objects). Thus similar optimization
techniques can be used across languages. However, certain language features make some kinds of optimization
possible or difficult. For instance, the existence of pointers in C and C++ makes certain optimizations of
array accesses difficult. Conversely, in some languages functions are not permitted to have "side effects";
therefore, if repeated calls to the same function with the same arguments are made, the compiler can immediately
infer that the result need only be computed once and then reused.
Machine independent vs. machine dependent: many optimizations that operate on abstract programming concepts (loops, objects, structures) are independent
of the machine targeted by the compiler, but many of the most effective optimizations are those that best
exploit special features of the target platform.
1. Avoid redundancy: If something has already been computed, it is generally better to store it and reuse
it later, instead of recomputing it.
2. Less code: There is less work for the CPU, cache, and memory. So, likely to be faster.
3. Straight line code/ fewer jumps: Less complicated code. Jumps interfere with the prefetching of
instructions, thus slowing down code.
4. Code locality: Pieces of code executed close together in time should be placed close together in
memory, which increases spatial locality of reference.
5. Extract more information from code: The more information the compiler has, the better it can
optimize.
6. Avoid memory accesses: Accessing memory, particularly if there is a cache miss, is much more
expensive than accessing registers.
7. Speed: Improving the runtime performance of the generated object code. This is the most common
optimisation.
8. Space: Reducing the size of the generated object code.
9. Safety: Reducing the possibility of data structures becoming corrupted (for example, ensuring that
an illegal array element is not written to).
“Speed” optimizations make the code larger, and many “Space” optimizations make the code slower — this
is known as the space-time tradeoff.
Many of the choices about which optimizations can and should be done depend on the characteristics of the
target machine. Some compilers can be parametrized with a machine description so that a single code base can optimize for different machines; GCC is a compiler which exemplifies this approach.
• Number of CPU registers: To a certain extent, the more registers, the easier it is to optimize for
performance. Local variables can be allocated in the registers and not on the stack. Temporary/
intermediate results can be left in registers without writing to and reading back from memory.
• RISC vs CISC: CISC instruction sets often have variable instruction lengths, often have a larger
number of possible instructions that can be used, and each instruction could take differing amounts
of time. RISC instruction sets attempt to limit the variability in each of these: instruction sets are
usually constant length, with few exceptions, there are usually fewer combinations of registers and
memory operations, and the instruction issue rate is usually constant in cases where memory latency
is not a factor. There may be several ways of carrying out a certain task, with CISC usually offering
more alternatives than RISC.
• Pipelines: A pipeline is essentially an ALU broken up into an assembly line. It allows use of parts of
the ALU for different instructions by breaking up the execution of instructions into various stages:
instruction decode, address decode, memory fetch, register fetch, compute, register store, etc. One
instruction could be in the register store stage, while another could be in the register fetch stage.
Pipeline conflicts occur when an instruction in one stage of the pipeline depends on the result of
another instruction ahead of it in the pipeline but not yet completed. Pipeline conflicts can lead to
pipeline stalls: where the CPU wastes cycles waiting for a conflict to resolve.
• Number of functional units: Some CPUs have several ALUs and FPUs (Floating Point Units). This
allows them to execute multiple instructions simultaneously. There may be restrictions on which
instructions can pair with which other instructions and which functional unit can execute which
instruction. They also have issues similar to pipeline conflicts.
• Cache Size (256 KB–4 MB) & type (direct mapped, 2-/4-/8-/16-way associative, fully associative):
Techniques like inline expansion may increase the size of the generated code and reduce code
locality. The program may slow down drastically if an oft-run piece of code suddenly cannot fit in the
cache. Also, caches which are not fully associative have higher chances of cache collisions even in
an unfilled cache.
• Cache/memory transfer rates: These give the compiler an indication of the penalty for cache
misses. This is used mainly in specialized applications.
A basic block is a straight-line piece of code without jump targets in the middle; jump targets, if any, start a
block, and jumps end a block.
or
A sequence of instructions forms a basic block if the instruction in each position dominates, or always
executes before, all those in later positions, and no other instruction executes between two instructions in the
sequence.
Basic blocks are usually the basic unit to which compiler optimizations are applied. Basic
blocks form the vertices or nodes of a control flow graph. The
blocks to which control may transfer after reaching the end of a block are called that block's successors, while
the blocks from which control may have come when entering a block are called that block's predecessors.
Instructions which begin a new basic block include
• Procedure and function entry points.
• Targets of jumps or branches.
• Instructions following some conditional branches.
• Instructions following ones that throw exceptions.
• Exception handlers.
Instructions that end a basic block include
• Unconditional and conditional branches, both direct and indirect.
• Returns to a calling procedure.
• Instructions which may throw an exception.
• Function calls can be at the end of a basic block if they may not return, such as functions which
throw exceptions or special calls.
A control flow graph (CFG) is a representation, using graph notation, of all paths that might be traversed
through a program during its execution. Each node in the graph represents a basic block and directed edges are
used to represent jumps in the control flow. There are two specially designated blocks: the entry block, through
which control enters into the flow graph, and the exit block, through which all control flow leaves.
[Figure: an example control flow graph whose nodes are basic blocks numbered 1 to 10.]
Terminology: These terms are commonly used when discussing control flow graphs
Entry block: The block through which all control flow enters the graph. (In the figure, 4 is the entry block for 5, 6, 7, 8, 9 and 10.)
Exit block: The block through which all control flow leaves the graph. (8 is the exit block.)
Back edge: An edge that points to an ancestor in a depth-first (DFS) traversal of the graph. (the edge from 10 to 3)
Critical edge: An edge which is neither the only edge leaving its source block, nor the only edge entering
its destination block. These edges must be split (a new block must be created in the middle of the edge) in order
to insert computations on the edge.
Abnormal edge: An edge whose destination is unknown. These edges tend to inhibit optimization. Excep-
tion handling constructs can produce them.
Impossible edge / Fake edge: An edge which has been added to the graph solely to preserve the property
that the exit block postdominates all blocks. It can never be traversed.
Dominator: Block M dominates block N (written M dom N) if every path from the entry node to block N
has to pass through block M, i.e., every possible execution path from entry to N includes M. The entry block
dominates all blocks. The dominance relation satisfies three properties:
(i) Reflexive: every node dominates itself.
(ii) Transitive: if A dom B and B dom C, then A dom C.
(iii) Antisymmetric: if A dom B and B dom A, then A = B.
In the example graph, 4 dominates 7, 8, 9 and 10; 5 and 6 do not dominate 7 because there is an edge from 4 to 7 that bypasses them, so 5 dominates only 5 and, similarly, 6 dominates only 6.
Post-dominator: Block M postdominates block N if every path from N to the exit has to pass through block
M. The exit block postdominates all blocks. ( 7 is Postdominator for 8 and 9 )
Immediate dominator: Block M immediately dominates block N (written M idom N) if M dominates N
and there is no intervening block P such that M dominates P and P dominates N. In other words, M is the last
dominator on any path from entry to N. Each block has a unique immediate dominator, if it has one at all. (7
is the immediate dominator of 9, but not of 10.)
Immediate postdominator: Defined analogously to the immediate dominator, using postdominance.
Dominator tree: An ancillary data structure depicting the dominator relationships: there is an edge from
block M to block N if M is the immediate dominator of N.
Postdominator tree: Similar to the dominator tree, but rooted at the exit block.
Loop header: A loop header dominates all blocks in the loop body; it is sometimes called the entry point
of the loop.
(In the example, 1, 2 and 3 are loop headers.)
[Figure: dominator tree for the example control flow graph, with nodes 1-10.]
Common subexpression elimination (CSE) searches for instances of identical expressions and replaces the recomputation by a previously computed value. For example:
Original Code:
B1: t1 = a
    t2 = b
    t3 = a + b
    t4 = t3 * a
    t6 = a * b
    t7 = a + b
    a = t7
    if a > b goto B2
B2: t7 = a + b
B3: t8 = a * b
B4: t9 = a - b
    t10 = a * t9
B5: t11 = a - b
    t12 = a * b
    t13 = t11 * t12
After local common sub-expression elimination (within B1, the second computation of a + b reuses t3):
B1: t1 = a
    t2 = b
    t3 = a + b
    t4 = t3 * a
    t6 = a * b
    t7 = t3
    a = t7
    if a > b goto B2
(the other blocks are unchanged)
After global common sub-expression elimination (redundant computations are also removed across blocks):
B2: t7 = t3
B5: t11 = a - b
    t12 = t6
    t13 = t11 * t12
(the remaining blocks are as after local elimination)
The common subexpression can be avoided by storing its value in a temporary variable which caches
its result. After applying this common subexpression elimination technique the program becomes:
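The original program for this passage is not reproduced here; a minimal sketch of the transformation it describes, with the variables a, b, c, d and tmp assumed purely for illustration, is:

/* Before: b + c is computed twice. */
a = b + c;
d = 10 * (b + c);

/* After common subexpression elimination: the value is cached in a temporary. */
tmp = b + c;
a = tmp;
d = 10 * tmp;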
Thus in the last statement the recomputation of the expression b + c is avoided. Compiler writers distin-
guish two kinds of CSE:
• Local Common Subexpression Elimination works within a single basic block and is thus a simple
optimization to implement.
• Global Common Subexpression Elimination works on an entire procedure, and relies on dataflow
analysis to determine which expressions are available at which points in a procedure.
Copy propagation
Copy propagation is the process of replacing the occurrences of targets of direct assignments with their
values. A direct assignment is an instruction of the form x = y, which simply assigns the value of y to x.
From the following code:
y=x
z=3+y
Copy propagation would yield:
z=3+x
Copy propagation often makes use of reaching definitions, use-def chains and def-use chains when
computing which occurrences of the target may be safely replaced. If all upward-exposed uses of the target
may be safely modified, the assignment operation may be eliminated. Copy propagation is a useful "clean up"
optimization frequently used after other optimizations have already been run; some optimizations require that
copy propagation be run afterward in order to achieve an increase in efficiency. Copy propagation is also
classified as local and global copy propagation:
• Local copy propagation is applied within an individual basic block.
• Global copy propagation is applied across all of the code's basic blocks.
Dead code elimination
Dead code is code that does not affect the behaviour of the program, for example definitions which
have no uses, or code which can never execute regardless of the input. Removing such instructions is known
as dead code elimination, a size optimization (although it also produces some speed improvement) that
aims to remove logically impossible statements from the generated object code. A technique commonly used in
debugging is to optionally activate blocks of code; using an optimizer with dead code elimination eliminates the
need for a preprocessor to perform the same task.
Example 7.4: Consider the following program:
Original code:
B1: t1 = x
    t2 = y + t1
    t3 = t2
    t4 = z * t3
    if z > y goto B2
B2: t5 = z
    t6 = y
    t7 = t5 + t6
    t8 = x
    t9 = y + t8
    t10 = t9
B3: t11 = t3 + 5
    t12 = t1 * 9
    t13 = t12
After local copy propagation (within each block, copies such as t1, t3, t5 and t6 are substituted into later uses):
B1: t1 = x
    t2 = y + x
    t3 = t2
    t4 = z * t2
    if z > y goto B2
B2: t5 = z
    t6 = y
    t7 = z + y
    t8 = x
    t9 = y + t8
    t10 = t9
B3: t11 = t3 + 5
    t12 = t1 * 9
    t13 = t12
After global copy propagation the same substitutions are made across basic block boundaries as well; copies whose values are then never used become candidates for dead code elimination.
Example 7.5:
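The code for this example is not reproduced above; the following is a minimal C sketch consistent with the discussion below (the function name foo and the constants are assumptions):

int foo(void)
{
    int a = 24;
    int b = 25;   /* b is assigned but its value is never used */
    int c;

    c = a * 4;
    return c;

    b = 24;       /* unreachable: this assignment follows the return */
}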
The variable b is assigned a value after a return statement, which makes it impossible to get to. That is,
since code execution is linear, and there is no conditional expression wrapping the return statement, any code
after the return statement cannot possibly be executed. Furthermore, if we eliminate that assignment, then we
can see that the variable b is never used at all, except for its declaration and initial assignment. Depending on
the aggressiveness of the optimizer, the variable b might be eliminated entirely from the generated code.
Example 7.6:
Original code:
i = 1
j = 2
k = 3
l = 4
if i < j goto B2
After dead code elimination (the value assigned to k is never used):
i = 1
j = 2
l = 4
if i < j goto B2
Code motion moves a computation that produces the same result on every iteration out of the loop. In the code referred to above, a = a + c can be moved out of the for loop.
Likewise, the calculation of maximum - 1 and (4 + array[k]) * pi + 5 can be moved outside the loop and precalculated,
resulting in something similar to:
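The loops themselves are not shown above; a sketch consistent with the expressions mentioned, assuming j, k, maximum, array and pi are defined elsewhere and are not modified inside the loop, is:

/* Before: both maximum - 1 and (4 + array[k]) * pi + 5 are recomputed
   on every iteration even though they never change inside the loop. */
while (j < maximum - 1) {
    j = j + (4 + array[k]) * pi + 5;
}

/* After code motion: the loop-invariant expressions are computed once. */
int maxval  = maximum - 1;
int calcval = (4 + array[k]) * pi + 5;
while (j < maxval) {
    j = j + calcval;
}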
Induction variable elimination: An induction variable is a variable that gets increased or decreased by a fixed
amount on every iteration of a loop. A common compiler optimization is to recognize the existence of induction
variables and replace them with simpler computations.
Example 7.9:
(i) In the following loop (see the sketch below), i and j are induction variables.
(ii) In some cases it is possible to reverse this optimization and remove an induction variable from
the code entirely: the loop in the second function below has two induction variables, i and j, and either one
can be rewritten as a linear function of the other.
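The loops for (i) and (ii) are not reproduced above; a minimal C sketch under those assumptions (all names are hypothetical) is:

/* (i) Both i and j are induction variables: i increases by 1 and
   j by 17 on each iteration. */
for (i = 0; i < 10; i++) {
    j = 17 * i;          /* can be strength-reduced to j = j + 17 */
}

/* (ii) Here j is a linear function of the induction variable i
   (j == 2 * i), so j can be removed entirely. */
int f(int n)
{
    int j = 0;
    for (int i = 0; i < n; i++) {
        j += 2;
    }
    return j;            /* for n >= 0 this is equivalent to: return 2 * n; */
}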
Strength reduction is a compiler optimization in which complex or expensive operations are replaced
with simpler ones. In a procedural programming language this typically applies to an expression involving a loop
variable, and in a declarative language to the argument of a recursive function. One of the most
important uses of strength reduction is computing memory addresses inside a loop. Several peephole
optimizations also fall into this category, such as replacing division by a constant with multiplication by its
reciprocal, converting multiplications into a series of bit-shifts and adds, and replacing large instructions with
equivalent smaller ones that load more quickly.
Example 7.10: Multiplication can be replaced by addition.
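A minimal sketch of this replacement (the array a, its size n and the scale factor 8 are assumptions):

/* Before: the index is recomputed by multiplication on every iteration. */
for (i = 0; i < n; i++) {
    y = i * 8;
    a[y] = 0;
}

/* After strength reduction: the multiplication is replaced by an addition
   to a value carried over from the previous iteration. */
for (i = 0, y = 0; i < n; i++, y += 8) {
    a[y] = 0;
}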
Function chunking: Function chunking is a compiler optimization for improving code locality. Profiling infor-
mation is used to move rarely executed code outside of the main function body. This allows for memory pages
with rarely executed code to be swapped out.
As compiler technologies have improved, good compilers can often generate better code than human programmers,
and good post-pass optimizers can improve highly hand-optimized code even further. Compiler optimization
is the key to obtaining efficient code, because instruction sets are so compact that it is hard for a human
to manually schedule or combine small instructions to get efficient results. However, optimizing compilers are
by no means perfect. There is no way that a compiler can guarantee that, for all program source code, the fastest
(or smallest) possible equivalent compiled program is output. Additionally, there are a number of other, more
practical issues with optimizing compiler technology:
• Usually, an optimizing compiler only performs low-level, localized changes to small sets of opera-
tions. In other words, high-level inefficiency in the source program (such as an inefficient algorithm)
remains unchanged.
• Modern third-party compilers usually have to support several objectives. In so doing, these compil-
ers are a ‘jack of all trades’ yet master of none.
• A compiler typically only deals with a small part of an entire program at a time, at most a module at a
time and usually only a procedure; the result is that it is unable to consider at least some important
contextual information.
• The overhead of compiler optimization: any extra work takes time, and whole-program optimization
(interprocedural optimization) is very costly.
• The interaction of compiler optimization phases: what combination of optimization phases are opti-
mal, in what order and how many times?
Work to improve optimization technology continues. One approach is the use of so-called “post pass”
optimizers. These tools take the executable output by an “optimizing” compiler and optimize it even further. As
opposed to compilers which optimize intermediate representations of programs, post pass optimizers work on
the assembly language level.
Data flow analysis is a technique for gathering information about the possible set of values calculated at
various points in a computer program. A program’s control flow graph is used to determine those parts of
a program to which a particular value assigned to a variable might propagate. The information gathered is
often used by compilers when optimizing a program. A canonical example of a data flow analysis is
reaching definitions.
A simple way to perform data flow analysis of programs is to set up data flow equations for each node of the
control flow graph and solve them by repeatedly calculating the output from the input locally at each node until
the whole system stabilizes, i.e., it reaches a fixpoint. This general approach was developed by Gary Kildall
while teaching at the Naval Postgraduate School. Data flow analysis can be partitioned into two parts:
• Local data flow analysis: data flow analysis applied to only one basic block.
• Global data flow analysis: data flow analysis applied to an entire function at a time.
Data flow analysis is inherently flow-sensitive and typically path-insensitive. The relevant notions of sensitivity are:
• A flow-sensitive analysis takes into account the order of statements in a program. For example, a
flow-insensitive pointer alias analysis may determine “variables x and y may refer to the same
location”, while a flow-sensitive analysis may determine “after statement 20, variables x and y may
refer to the same location”.
• A path-sensitive analysis only considers valid paths through the program. For example, if two
operations at different parts of a function are guarded by equivalent predicates, the analysis must
only consider paths where both operations execute or neither operation executes. Path-sensitive
analyses are necessarily flow-sensitive.
• A context-sensitive analysis is an interprocedural analysis that takes the calling context into account
when analyzing the target of a function call. For example, consider a function that accepts a file
handle and a boolean parameter that determines whether the file handle should be closed before the
function returns. A context-sensitive analysis of any callers of the function should take into account
the value of the boolean parameter to determine whether the file handle will be closed when the
function returns.
Data flow analysis of programs is to set up data flow equations for each node of the control flow graph. General
form of data flow equation is :
OUT [ S ] = GEN [ S ] U ( IN [ S ] – KILL [ S ] )
We define the GEN and KILL sets as follows:
GEN[d : y = f(x1, ..., xn)] = {d}
KILL[d : y = f(x1, ..., xn)] = DEFS[y] – {d}
Where DEFS[y] is the set of all definitions that assign to the variable y. Here d is a unique label attached to
the assigning instruction.
The efficiency of iteratively solving data flow equations is influenced by the order in which the nodes are
visited, and by whether the data flow equations are used for forward or backward data flow analysis over the CFG.
In the following, a few iteration orders for solving data flow equations are discussed.
• Random order: This iteration order is not aware whether the data flow equations solve a forward or
backward data-flow problem. Therefore, the performance is relatively poor compared to specialized
iteration orders.
• Post order: This is a typical iteration order for backward data flow problems. In postorder iteration a
node is visited after all its successor nodes have been visited. Typically, the postorder iteration is
implemented with the depth-first strategy.
• Reverse post order: This is a typical iteration order for forward data flow problems. In reverse-
postorder iteration a node is visited before all its successor nodes have been visited, except when
the successor is reached by a back edge.
A definition of a variable 'x' is a statement that assigns, or may assign, a value to 'x'. The most common forms of
definition are assignments to 'x' and statements that read a value from an input device and store it in 'x'. These
statements certainly define a value for 'x' and are referred to as unambiguous definitions of 'x'. A definition 'd' of 'x'
reaches a point 'p' if there is a path from the point immediately following 'd' to 'p' such that 'd' is not killed along that path.
The most common forms of ambiguous definition of 'x' are:
• A call of a procedure with 'x' as a parameter, or of a procedure that can access 'x'.
• An assignment through a pointer that could refer to 'x'.
A definition of a variable is said to reach a given point in a function if there is an execution path from the
definition to that point along which the variable is not redefined. Reaching definitions can be computed in classic form as an 'iterative forward bit vector'
problem: 'iterative' because we construct a collection of data flow equations to represent the information flow
and solve them by iteration from an appropriate set of initial values; 'forward' because information flows in the
direction of execution along the control flow edges of the program; 'bit vector' because each
definition can be represented by a 1 or a 0.
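To make the iteration concrete, here is a small, self-contained C sketch of the iterative bit-vector algorithm on a hand-coded four-block CFG; the block structure and the GEN/KILL bit patterns are invented for illustration only, and a real compiler would derive them from its intermediate code.

#include <stdio.h>

#define NBLOCKS 4

/* preds[b] lists the predecessors of block b; -1 terminates each list. */
static const int preds[NBLOCKS][NBLOCKS] = {
    { -1 },          /* B0: entry                  */
    { 0, 3, -1 },    /* B1: loop header            */
    { 1, -1 },       /* B2: loop body              */
    { 2, -1 },       /* B3: latch, branches to B1  */
};

/* GEN and KILL as bit vectors over eight hypothetical definitions d0..d7. */
static const unsigned gen[NBLOCKS]  = { 0x03, 0x04, 0x18, 0x20 };
static const unsigned kill[NBLOCKS] = { 0x18, 0x20, 0x03, 0x04 };

int main(void)
{
    unsigned in[NBLOCKS] = { 0 }, out[NBLOCKS] = { 0 };
    int changed = 1;

    /* Iterate OUT[b] = GEN[b] U (IN[b] - KILL[b]) until a fixpoint is reached. */
    while (changed) {
        changed = 0;
        for (int b = 0; b < NBLOCKS; b++) {
            unsigned newin = 0;
            for (int i = 0; preds[b][i] != -1; i++)
                newin |= out[preds[b][i]];           /* IN[b] = union of OUT[p] */

            unsigned newout = gen[b] | (newin & ~kill[b]);
            if (newin != in[b] || newout != out[b]) {
                in[b] = newin;
                out[b] = newout;
                changed = 1;
            }
        }
    }

    for (int b = 0; b < NBLOCKS; b++)
        printf("B%d: IN = %#04x  OUT = %#04x\n", b, in[b], out[b]);
    return 0;
}

Visiting the blocks in reverse postorder, as discussed above, makes this forward problem converge in fewer passes.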
For this purpose we use a small grammar for structured programs, together with the control flow graphs of its
constructs, from which the data flow equations are calculated:
1. S → id := E | S ; S | if E then S else S | do S while E
2. E → id + id | id
Some optimization techniques primarily designed to operate on loops include:
• Code Motion
• Induction variable analysis
• Loop fission or loop distribution
• Loop fusion or loop combining
• Loop inversion
• Loop interchange
• Loop nest optimization
• Loop unrolling
• Loop splitting
• Loop unswitching
Loop fission (or loop distribution) is a technique that attempts to break a loop into multiple loops over the same index range, each
taking only a part of the original loop's body. The goal is to break a large loop body into smaller ones to achieve
better data locality. It is the reverse of loop fusion.
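A minimal sketch of loop fission (the arrays a and b and the bound n are assumptions):

/* Before fission: one loop walks two unrelated arrays. */
for (i = 0; i < n; i++) {
    a[i] = a[i] + 1;
    b[i] = b[i] * 2;
}

/* After fission: each loop streams through a single array, which can
   improve data locality. */
for (i = 0; i < n; i++)
    a[i] = a[i] + 1;
for (i = 0; i < n; i++)
    b[i] = b[i] * 2;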
Loop fusion (or loop combining) is a loop transformation which replaces multiple loops with a single one. It does not always improve
run-time performance: on some architectures two loops may actually perform better than one, for example because of
increased data locality within each loop. In those cases, a single loop may instead be transformed into two.
Loop inversion is a loop transformation which replaces a while loop by an if block containing a do...while loop.
At first glance, this seems like a bad idea: there is more code, so it probably takes longer to execute.
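The loop being transformed is not shown above; a minimal C sketch, assuming a simple counted loop over an array a of 100 elements, is:

/* Before: a while loop, with the test at the top. */
i = 0;
while (i < 100) {
    a[i] = 0;
    i++;
}

/* After loop inversion: an if block containing a do...while loop.
   The body is unchanged, but the test now sits at the bottom. */
i = 0;
if (i < 100) {
    do {
        a[i] = 0;
        i++;
    } while (i < 100);
}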
However, most modern CPUs use a pipeline for executing instructions, and by its nature any jump in the code can cause
a pipeline stall. Let us watch what happens in an assembly-like three-address code version of the above code:
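The three-address listing itself is not reproduced here; the following sketch, written as C with explicit gotos, shows the jump structure the two forms lower to (the array a and the bound 100 are carried over from the sketch above):

/* while-loop version, lowered to explicit tests and jumps */
void zero_while(int a[100])
{
    int i = 0;
L1: if (i >= 100) goto done;   /* test at the top of every iteration        */
    a[i] = 0;
    i = i + 1;
    goto L1;                   /* unconditional jump back on every iteration */
done: ;
}

/* inverted (if + do...while) version */
void zero_inverted(int a[100])
{
    int i = 0;
    if (i >= 100) goto done;   /* guard executed only once                   */
L2: a[i] = 0;
    i = i + 1;
    if (i < 100) goto L2;      /* a single conditional jump per iteration    */
done: ;
}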
Consider first the while-loop version. On the iteration in which i is assigned the value 99, the body executes, the
unconditional jump back to the test is taken, and the test is evaluated once more for i equal to 100 before the loop
is finally exited. In the inverted version no cycles are wasted for i equal to 100: the bottom test simply fails and
execution falls out of the loop, and on the iteration in which i was assigned 99 only that single conditional jump is
executed. As you can see, two gotos (and thus two potential pipeline stalls) have been eliminated from the execution.
Loop interchange is the process of exchanging the order of two iteration variables. When the loop variables
index into an array then loop interchange can improve locality of reference, depending on the array’s layout.
One major purpose of loop interchange is to improve the cache performance for accessing array elements.
Cache misses occur if the contiguously accessed array elements within the loop come from a different cache
line. Loop interchange can help prevent this. The effectiveness of loop interchange depends on and must be
considered in light of the cache model used by the underlying hardware and the array model used by the
compiler. It is not always safe to exchange the iteration variables due to dependencies between statements for
the order in which they must execute. In order to determine whether a compiler can safely interchange loops,
dependence analysis is required.
Example 7.11:
do i = 1, 10000
do j = 1, 1000
a(i) = a(i) + b(i,j) * c(i)
end do
end do
Loop interchange on this example can improve the cache performance of accessing b(i,j), but it will ruin the
reuse of a(i) and c(i) in the inner loop, as it introduces two extra loads (for a(i) and for c(i)) and one extra store
(for a(i)) during each iteration. As a result, the overall performance may be degraded after loop interchange.
Loop nest optimization: ever-shrinking fabrication processes have made it possible to put very fast, fully pipelined floating-point units onto commodity
CPUs. But delivering that performance is also crucially dependent on compiler transformations that reduce
the need for a high-bandwidth memory system.
Example 7.12: Matrix Multiply
Many large mathematical operations on computers end up spending much of their time doing matrix
multiplication. Examining this loop nest can be quite instructive. The operation is:
C = A*B
where A, B, and C are NxN arrays. Subscripts, for the following description, are in the form C[row][column].
The basic loop is:
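The loop nest referred to is not printed above; the usual starting point, assuming row-major NxN arrays A, B and C declared elsewhere, is the naive triple loop:

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        C[i][j] = 0;
        for (k = 0; k < N; k++)
            C[i][j] = C[i][j] + A[i][k] * B[k][j];
    }

Loop nest optimization then reorders and tiles (blocks) these loops so that the sub-blocks of A, B and C currently being worked on fit in the cache.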
Loop unwinding, also known as loop unrolling, is an optimization technique whose idea is to save time by
reducing the number of overhead instructions that the computer has to execute in a loop, thus improving the
cache hit rate and reducing branching. To achieve this, the instructions that are executed in several consecutive iterations of
the loop are combined into a single iteration. This will speed up the program if the overhead instructions of the
loop impair performance significantly.
The major side effects of loop unrolling are:
(a) the increased register usage in a single iteration to store temporary variables, which may hurt performance.
(b) the code size expansion after the unrolling, which is undesirable for embedded applications.
Example 7.13:
A procedure in a computer program needs to delete 100 items from a collection. This is accomplished by
means of a for-loop which calls the function. If this part of the program is to be optimized, and the overhead of
the loop requires significant resources, loop unwinding can be used to speed it up.
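The loop in question is not shown above; a sketch assuming a hypothetical function delete_item() and an index x is:

/* Original loop: 100 iterations, one test and branch per item. */
for (x = 0; x < 100; x++)
    delete_item(x);

/* Unrolled by a factor of 5: only 20 iterations remain, so there are
   1/5 as many compares and branches, at the cost of a larger body. */
for (x = 0; x < 100; x += 5) {
    delete_item(x);
    delete_item(x + 1);
    delete_item(x + 2);
    delete_item(x + 3);
    delete_item(x + 4);
}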
As a result of this optimization, the new program has to make only 20 loops, instead of 100. There are now
1/5 as many jumps and conditional branches that need to be taken, which over many iterations would be a great
improvement in the loop administration time while the loop unrolling makes the code size grow from 3 lines to
7 lines and the compiler has to allocate more registers to store variables in the expanded loop iteration.
Loop splitting attempts to simplify a loop, or eliminate dependencies, by breaking it into
multiple loops which have the same bodies but iterate over different contiguous portions of the index range. A
useful special case is loop peeling, which simplifies a loop with a problematic first iteration by performing
that iteration separately before entering the loop.
Example 7.14:
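The code for this example is not reproduced above; a minimal C sketch of loop peeling (the arrays x and y and the bounds are assumptions) is:

/* Before: the first iteration is a special case, because p starts at 10
   and thereafter always holds the previous value of i. */
p = 10;
for (i = 0; i < 10; ++i) {
    y[i] = x[i] + x[p];
    p = i;
}

/* After peeling the first iteration: in the remaining loop p is simply
   i - 1, so the dependence on the previous iteration disappears. */
y[0] = x[0] + x[10];
for (i = 1; i < 10; ++i)
    y[i] = x[i] + x[i - 1];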
Loop unswitching moves a conditional inside a loop outside of it by duplicating the loop’s body, and placing
a version of it inside each of the if and else clauses of the conditional. This can improve the parallelization of the
loop. Since modern processors can operate fast on vectors this increases the speed.
Example 7.15:
Suppose we want to add the two arrays x and y (vectors) and also do something depending on the variable
w. The conditional inside this loop makes it hard to safely parallelize this loop. After unswitching this becomes:
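The loop is not printed above; a sketch consistent with the description (arrays x and y, the bound 1000 and the flag w are assumptions) is:

/* Before: the test on w sits inside the loop. */
for (i = 0; i < 1000; i++) {
    x[i] = x[i] + y[i];
    if (w)
        y[i] = 0;
}

/* After unswitching: the loop body is duplicated and the test on w is
   performed once, outside; each copy is a simple, easily vectorized loop. */
if (w) {
    for (i = 0; i < 1000; i++) {
        x[i] = x[i] + y[i];
        y[i] = 0;
    }
} else {
    for (i = 0; i < 1000; i++)
        x[i] = x[i] + y[i];
}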
Each of these new loops can be separately optimized. Note that loop unswitching will double the amount
of code generated.
Data flow optimizations, based on Data flow analysis, primarily depend on how certain properties of data are
propagated by control edges in the control flow graph. Some of these include:
• Common Subexpression elimination
• Constant folding and propagation
• Aliasing.
Constant folding and constant propagation are related optimization techniques used by many modern compil-
ers. A more advanced form of constant propagation known as sparse conditional constant propagation may be
utilized to simultaneously remove dead code and more accurately propagate constants.
Constant folding is the process of simplifying constant expressions at compile time. Terms in constant expressions
are typically simple literals (such as integer values), variables whose values are never modified, or variables
explicitly marked as constant. Constant folding can be done in a compiler's front end on the intermediate representation
that represents the high-level source language, before it is translated into three-address code, or in the back
end. Consider the statement:
i = 320 * 200 * 32;
Most modern compilers would not actually generate two multiply instructions and a store for this statement.
Instead, they identify constructs such as these, and substitute the computed values at compile time (in this
case, 2,048,000), usually in the intermediate representation.
Constant propagation is the process of substituting the values of known constants in expressions at
compile time. A typical compiler might apply constant folding now, to simplify the resulting expressions, before
attempting further propagation.
Example 7.16:
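The starting code for this example is not shown above; a plausible sketch, chosen so that it simplifies to the fragment given below (all names and constants are assumptions), is:

/* Original code */
int a = 30;
int b = 9 - (a / 5);
int c;

c = b * 4;
if (c > 10) {
    c = c - 10;
}
return c * (60 / a);

/* After constant propagation and constant folding
   (a -> 30, b -> 3, b * 4 -> 12, c - 10 -> 2, 60 / a -> 2): */
int a = 30;
int b = 3;
int c;

c = 12;
if (12 > 10) {
    c = 2;
}
return c * 2;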
As a and b have been simplified to constants and their values substituted everywhere they occurred, the
compiler now applies dead code elimination to discard them, reducing the code further:
int c;
c = 12;
if (12 > 10) {
c = 2;
}
return c * 2;
Aliasing is a term that generally means that one variable or memory reference, when changed, has an indirect
effect on some other data. For example, in the presence of pointers it is difficult to make many optimizations at all,
since potentially any variable may have been changed when a memory location is assigned to.
Example 7.17:
(i) Array bounds checking: C programming language does not perform array bounds checking. If an
array is created on the stack, with a variable laid out in memory directly beside that array, one could
index outside that array and then directly change that variable by changing the relevant array
element. For example, if we have an int array of size ten (for this example's sake, calling it vector), next
to another int variable (call it i), vector[10] would be aliased to i if they are adjacent in memory.
This is possible in some implementations of C because an array is in reality a pointer to some location
in memory, and array elements are merely offsets off that memory location. Since C has no bounds
checking, indexing and addressing outside of the array is possible. Note that the aforementioned
aliasing behaviour is implementation specific. Some implementations may leave space between ar-
rays and variables on the stack, for instance, to minimize possible aliasing effects. C programming
language specifications do not specify how data is to be laid out in memory.
(ii) Aliased pointers: Another variety of aliasing can occur in any language that can refer to one location
in memory with more than one name. Consider, for example, the XOR swap algorithm written as a C function:
it assumes the two pointers passed to it are distinct, but if they are in fact equal (or aliases of each
other), the function fails. This is a common problem with functions that accept pointer arguments,
other), the function fails. This is a common problem with functions that accept pointer arguments,
and their tolerance (or the lack thereof) for aliasing must be carefully documented, particularly for
functions that perform complex manipulations on memory areas passed to them.
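A minimal C sketch of the XOR swap problem mentioned in (ii):

/* xor_swap assumes ia and ib point to distinct ints.  If the two pointers
   alias (ia == ib), the first statement XORs the value with itself and
   stores 0, so the final result is 0 rather than a swap. */
void xor_swap(int *ia, int *ib)
{
    *ia ^= *ib;
    *ib ^= *ia;
    *ia ^= *ib;
}

/* xor_swap(&x, &y) swaps x and y, but xor_swap(&x, &x) zeroes x. */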
Specified Aliasing
Controlled aliasing behaviour may be desirable in some cases. It is common practice in FORTRAN. The
Perl programming language specifies aliasing behaviour in some constructs, such as in foreach loops, which
allows certain data structures to be modified directly with less code. For example, a foreach loop over the list
(1, 2, 3) that increments its loop variable modifies the list in place and will print out "2 3 4" as a result. If one
wants to bypass the aliasing effect, one can copy the contents of the index variable into another variable and change the copy.
Conflicts With Optimization
(i) Many times optimizers have to make conservative assumptions about variables in the presence of
pointers. For example, a constant propagation process which knows that the value of variable x is 5
would not be able to keep using this information after an assignment to another variable (for example,
*y = 10) because it could be that *y is an alias of x. This could be the case after an assignment like y
= &x. As an effect of the assignment to *y, the value of x would be changed as well, so propagating
the information that x is 5 to the statements following *y = 10 would be potentially wrong. However,
if we have information about pointers, the constant propagation process could make a query like:
can x be an alias of *y? If the answer is no, x = 5 can be propagated safely (see the sketch after this list).
(ii) Another optimisation that is impacted by aliasing is code reordering; if the compiler decides that x is
not an alias of *y, then code that uses or changes the value of x can be moved before the assignment
*y = 10, if this would improve scheduling or enable more loop optimizations to be carried out. In order
to enable such optimisations to be carried out in a predictable manner, the C99 edition of the C
programming language specifies that it is illegal (with some exceptions) for pointers of different
types to reference the same memory location. This rule, known as “strict aliasing”, allows impressive
increases in performance, but has been known to break some legacy code.
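A minimal sketch of point (i) above (the function and variable names are hypothetical):

/* Whether the compiler may keep using the fact that x is 5 across the
   store *y = 10 depends entirely on whether y can point to x. */
int example(int *y)
{
    int x = 5;
    *y = 10;        /* if y == &x, then x is now 10, not 5                    */
    return x + 1;   /* may be folded to 6 only if y provably does not alias x */
}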
These optimizations are intended to be done after transforming the program into a special form called static
single assignment (SSA). Although some of them work without SSA, they are most effective with it. Compiler optimization
algorithms which are either enabled or strongly enhanced by the use of SSA include:
• Constant propagation
• Sparse conditional constant propagation
• Dead code elimination
• Global value numbering
• Partial redundancy elimination
• Strength reduction
SSA based optimization includes :
• Global value numbering
• Sparse conditional constant propagation
Global value numbering (GVN) is based on the SSA intermediate representation so that false variable-name to
value-number mappings are not created. It sometimes helps eliminate redundant code that common subexpression
elimination (CSE) does not, and it is often found in modern compilers. Global value numbering is distinct from
local value numbering in that the value-number mappings hold across basic block boundaries as well, and
different algorithms are used to compute the mappings.
Global value numbering works by assigning a value number to variables and expressions. To those
variables and expressions which are provably equivalent, the same value number is assigned.
Example 7.18:
The reason that GVN is sometimes more powerful than CSE comes from the fact that CSE matches lexically
identical expressions whereas the GVN tries to determine an underlying equivalence. For instance, in the code:
a := c × d
e := c
f := e × d
CSE would not eliminate the recomputation assigned to f, but even a poor GVN algorithm should discover
and eliminate this redundancy.
Sparse conditional constant propagation is an optimization frequently applied after conversion to static single
assignment form (SSA). It simultaneously removes dead code and propagates constants throughout a program,
and it is strictly more powerful than applying dead code elimination and constant propagation separately, in any order and any number of times.
Recursion is often expensive, as a function call consumes stack space and involves some overhead related to
parameter passing and flushing the instruction cache. Tail recursive algorithms can be converted to iteration,
which does not have call overhead and uses a constant amount of stack space, through a process called tail
recursion elimination.
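A minimal C sketch of tail recursion elimination (the function names are assumptions):

/* Tail-recursive form: the recursive call is the last thing the function does. */
int fact_acc(int n, int acc)
{
    if (n <= 1)
        return acc;
    return fact_acc(n - 1, n * acc);   /* tail call */
}

/* After tail recursion elimination the call becomes a jump, i.e. a loop
   that uses a constant amount of stack space. */
int fact_iter(int n)
{
    int acc = 1;
    while (n > 1) {
        acc = acc * n;
        n = n - 1;
    }
    return acc;
}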
Because of the high level nature by which data structures are specified in functional languages such as Haskell,
it is possible to combine several recursive functions which produce and consume some temporary data struc-
ture so that the data is passed directly without wasting time constructing the data structure.
Partial redundancy elimination (PRE) eliminates expressions that are redundant on some but not necessarily all
paths through a program. PRE is a form of common subexpression elimination.
An expression is called partially redundant if the value computed by the expression is already available on
some but not all paths through a program to that expression. An expression is fully redundant if the value
computed by the expression is available on all paths through the program to that expression. PRE can eliminate
partially redundant expressions by inserting the partially redundant expression on the paths that do not
already compute it, thereby making the partially redundant expression fully redundant.
Example 7.19:
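The code for this example is not reproduced above; a sketch consistent with the description that follows (x, y, z, t and some_condition are assumptions) is:

/* Before: x + 4 is computed on the some_condition path and again after
   the if, so the second computation is partially redundant. */
if (some_condition) {
    /* ... */
    y = x + 4;
}
z = x + 4;

/* After PRE: the expression is inserted on the path that did not compute
   it, making it fully redundant, and both uses read the temporary t. */
if (some_condition) {
    /* ... */
    t = x + 4;
    y = t;
} else {
    t = x + 4;
}
z = t;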
The expression x + 4 assigned to z is partially redundant because it is computed twice if some_condition is
true. PRE performs code motion on the expression to yield the optimized code shown above.
Rematerialization saves time by recomputing a value instead of loading it from memory. It is typically tightly
integrated with register allocation, where it is used as an alternative to spilling registers to memory. It was
conceived by Preston Briggs, Keith D. Cooper, and Linda Torczon in 1992. Rematerialization decreases register
pressure by increasing the amount of CPU computation. To avoid adding more computation time than necessary,
rematerialization is performed only when the compiler can be confident that it will be of benefit, i.e., when a register
spill to memory would otherwise occur.
Rematerialization works by keeping track of the expression used to compute each variable, using the
concept of available expressions. Sometimes the variables used to compute a value are modified, and so can no
longer be used to rematerialize that value. The expression is then said to no longer be available. Other criteria
must also be fulfilled, for example a maximum complexity on the expression used to rematerialize the value; it
would do no good to rematerialize a value using a complex computation that takes more time than a load.
Code hoisting finds expressions that are evaluated on every execution path following some point in a program,
regardless of which path is taken, and moves them up so that they are computed only once.
This primarily reduces the space occupied by the program.
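A minimal sketch of code hoisting (names are assumptions): the expression x * y is evaluated on every path through the if, so it can be hoisted above it and computed once, shrinking the code.

/* Before: x * y appears in both branches. */
if (a > 0)
    r = x * y + 1;
else
    r = x * y - 1;

/* After hoisting the expression that is evaluated on every path: */
t = x * y;
if (a > 0)
    r = t + 1;
else
    r = t - 1;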
EX.: Write three address code for the following code and then perform optimization technique
for(i=0;i<=n;i++)
{
for(j=0;j<n;j++)
{
c[i,j]=0 ;
}
}
for(i=0;i<=n;i++)
{
for(j=0;j<=n;j++)
{
for(k=0;k<=n;k++)
{
c[i,j]=c[i,j]+a[i,k]*b[k,j] ;
}
}
}
SOL.: First we give the three-address code for the program above; then we apply optimization.
Three Address Code
(1) i=0
(2) if i<n goto 4 (L)
(3) goto 18 (L)
(4) j=0 (L)
(5) if j<n goto 7 (L)
(6) goto 15 (L)
(7) t1=i*k1 (L)
(8) t1=t1+j
(9) t2=Q
(10) t3=4*t1
(11) t4=t2[t3]
(12) t5=j+1
(13) j=t5
(14) goto 5
(15) t6=i+1 (L)
(16) i=t6
(17) goto 2
(18) i=0 (L)
(19) if i<=n goto 21 (L)
(20) goto next (L)
(21) j=0
(22) if j<=n goto 24 (L)
(23) goto 52 (L)
(24) k=0 (L)
(25) if k<=n goto 27 (L)
(26) goto 49 (L)
(27) t7 =i*k1 (L)
(28) t7=t7+j
(29) t8=Q
(30) t9=4*t7
(31) t10=t8[t9]
(32) t11=i*k1
(33) t11=t11+k
(34) t12=Q
(35) t13=t11*4
(36) t14=t12[t13]
(37) t15=k+k1
(38) t15=t15+j
(39) t16=Q
(40) t17=4*t15
(41) t18=t16[t17]
(42) t19=t14*t18
(43) t20=t10+t19
(44) t10=t20
(45) t21=k+1
(46) k=t21
(47) goto 25
(48) t22=j+1 (L)
(49) j=t22
(50) goto 22
(51) t23=i+1 (L)
(52) i=t23
(53) goto 19
(next) ——————
Now we construct basic blocks for the given code:
[Figure: control-flow graph of the three-address code above, partitioned into basic blocks B1-B15. Blocks B1-B6 form the first loop nest (the initialization of c[i,j]) and blocks B7-B15 form the second loop nest (the multiplication), with the innermost k loop at the bottom of the graph.]
Now we apply optimization. Only blocks B8, B10 and B13 change: the first statement of B13 (t7 = i * k1) is
loop-invariant in the innermost loop and is moved up into block B8, the second statement (t7 = t7 + j) is moved into
block B10, and one redundant statement is deleted. After that the blocks look as follows:
[Figure: the basic-block graph after optimization. The loop-invariant computations t7 = i * k1 and t7 = t7 + j have been hoisted out of the innermost loop into B8 and B10 respectively, the redundant statement has been removed, and the remaining blocks are unchanged.]