Getting Familiar With GCC Parameters
Getting Familiar With GCC Parameters
com/lpt/a/6968
gcc
(GNU C Compiler) is actually a collection of frontend tools that does compilation, assembly, and linking. The
goal is to produce a ready-to-run executable in a format acceptable to the OS. For Linux, this is ELF
(Executable and Linking Format) on x86 (32-bit and 64-bit). But do you know what some of the gcc
parameters can do for you? If you're looking for ways to optimize the resulted binary, prepare for a debugging
session, or simply observe the steps gcc
takes to turn your source code into an executable, getting familiar with these parameters is a must. So, please
read on.
Recall that gcc does multiple steps, not just one. Here is a brief explanation of their meaning:
Preprocessing: Producing code that no longer contains directives. Things like "#if" cannot be
understood directly by the compiler, so this must be translated into real code. Macros are also expanded
at this stage, making the resulting code larger than the original. [1]
Compilation: This takes the post-processed code, does lexical and syntax analysis, and then generates
the assembly code. gcc
outputs warning or error messages during this stage as the analyzer parses and finds any mistakes in
your code. If any optimization is requested, gcc will analyze your code further to find the possible spots
and manipulate them further. This work is done in multipass style, which demonstrates that it sometimes
takes more than one scan through the source code to optimize. [2]
Assembly: Accepts the assembler mnemonics and produces object codes containing opcodes. This is a
common misunderstanding: compilation stage doesn't produce opcodes, assembly stage does. The result
is one or more object files containing opcodes, which are really machine-dependent. [3]
Linking: Transforms object files into final executables. Opcodes alone are not enough to make the
operating system recognize and execute them. They must be wrapped into a more complete form. This
form, known as binary format, dictates how the OS loads the binary, arranges the relocation, and does
other necessary work. ELF is the default format for x86-based Linux. [4]
The gcc
parameters described here directly and indirectly touch all those four stages, so for clarity, this article is
organized as follows:
Before that, let's meet the accompanying tools that will help us to sneak inside the resulting code:
The ELF tool collection, which includes such programs as objdump and readelf. They parse ELF
information for us.
Oprofile, one of the de facto ways to collect hardware performance counters. We need this tool to see
certain aspects of the code's performance.
The following explanations can be applied to gcc version 3.x and 4.x, so it's pretty generic. Let's start digging,
shall we?
gcc offers a "no-brainer" way to do a bunch of optimizations for you: the -O option. It either makes your code
run faster or squeezes the final code's size. There are five variants of it:
-O0 (O zero) to -O3. "0" means no optimization, while "3" is highest optimization level. "1" and "2" are
between these extreme edges. If you just use -O without specifying any number, implicitly it means -O1.
-Os. Tells gcc to optimize for size. Basically, it is similar to -O2, but skips some steps that might enlarge
the size.
How much speedup do we really get from them? Well, suppose we have this code:
#include<stdio.h>
int main(int argc, char *argv[])
{
int i,j,k
unsigned long acc=0;
for(i=0;i<10000;i++)
for(j=0;j<5000;j++)
for(k=0;k<4;k++)
acc+=k;
printf("acc = %lu\n",acc);
return 0;
}
Using gcc, four different binaries are created, using each -O variants (except -Os). The time tool records their
execution time, e.g.:
$ time ./non-optimized
As you can see, between the non-optimized case and the one compiled using -O1, the elapsed time is seven
times shorter. Notice that -O1, -O2, and -O3
don't really give too much additional benefit--in fact, they are pretty much the same. So what's the magic of
-O1 here?
After a quick sightseeing trip to the source code, you would expect that such code is a dead-end for
optimization. First, let's look at a brief comparison between the disassembled versions of non-optimized and
optimizedO1.
$ objdump -D non-optimized
$ objdump -D optimizedO1
(Note: You might get different results, so use this as general guideline.)
Non-optimized OptimizedO1
mov 0xfffffff4(%ebp),%eax add $0x6,%edx
add %eax,0xfffffff8(%ebp) add $0x1,%eax
addl $0x1,0xfffffff4(%ebp) cmp $0x1388,%eax
cmpl $0x3,0xfffffff4(%ebp)
The above snippets implement the innermost loop (for (k=0;k<4;k++)). Notice the clear difference: the
non-optimized code directly loads and stores from a memory address, while optimizedO1 uses CPU registers
as the accumulator and loop counter. As you may already be aware, registers can be accessed hundreds or
thousands times faster than RAM cells.
Not satisfied to just exploit CPU registers for temporary storage, gcc uses another optimization trick. Let's see
the disassembled code of optimizedO1 again, specifically in the main() function:
......
08048390 <main>:
...
80483a1: b9 00 00 00 00 mov $0x0,%ecx
80483a6: eb 1f jmp 80483c7 <main+0x37>
80483a8: 81 c1 30 75 00 00 add $0x7530,%ecx
0x7530 is 30,000 in decimal form, so we can quickly guess the loop is simplified. This code represents the
innermost loop and the outermost loop ("for(j=0;j<5000;j++) ... for(k=0;k<4;k++)") because that is
literally a request to do 30,000 loops. Note that you just need to loop three times inside. When k=0, acc is still
the same, so the first loop can be discarded.
80483ae: 81 f9 00 a3 e1 11 cmp $0x11e1a300,%ecx
80483b4: 74 1a je 80483d0 <main+0x40>
80483b6: eb 0f jmp 80483c7 <main+0x37>
Hmm, now it is compared to 300,000,000 (10,000*5,000*6). This represents all three loops. After we reach
that number of loops, we jump straight to the printf() to print the sum (address 0x80483d0 - 0x80483db)
80483b8: 83 c2 06 add $0x6,%edx
80483bb: 83 c0 01 add $0x1,%eax
80483be: 3d 88 13 00 00 cmp $0x1388,%eax
80483c3: 74 e3 je 80483a8 <main+0x18>
80483c5: eb f1 jmp 80483b8 <main+0x28>
Six is added in the accumulator in each iteration. %edx will eventually hold the total sum after completion of all
the loops. The third and fourth lines shows us that after this is done 5,000 times, it should return to address
0x80483a8 (as mentioned previously).
Instead of jle, now jne is used. Assuming any integer could be compared in the previous cmp, you could
easily conclude that it boosts the chance of taking the branch to almost 100 percent. It is a small but useful hint
for the processor to indicate which code should be executed. However, for most modern processors, this kind
of transformation is not terribly necessary because the branch predictor is smart enough on its own.
To prove how much this transformation can help, OProfile comes to the rescue. Oprofile is executed to record
the number of retired branches and retired mispredicted branches. Retired here means it is executed inside the
CPU pipeline.
$ opcontrol --event=RETIRED_BRANCHES_MISPREDICTED:1000 --event=RETIRED_BRANCHES:1000;
We run non-optimized and optimizedO2 five times each. Then we take out the maximum and minimum of the
samples. We calculate the misprediction rate using this formula (adapted from this):
The average misprediction ratio is now calculated for each binary. Non-optimized gets 0.5117 percent, while
optimizedO2 gets 0.4323 percent--a very small gain for our case. The actual gain may vary in a real-world
case, because gcc itself can't do much without external hints. Please read about __builtin_expect() in the
gcc online documentation for further information.
gcc
basically offers you several ways to manage how a function is called. Let's take a look at inlining first. By
inlining, you reduce the cost of a function call because the body of the function is directly substituted into the
caller. Please note that this is not done by default, only when you use -O3 or at least -finline-functions.
How does the finished binary look when gcc does inlining? Observe Listing 2:
#include<stdio.h>
-S makes gcc stop right after compilation stage (we'll cover it later in this article). The results are as follows:
....
test:
pushl %ebp
movl %esp, %ebp
pushl %ebx
....
main:
leal 4(%esp), %ecx
andl $-16, %esp
pushl -4(%ecx)
...
movl $6, 16(%esp)
movl $3, 12(%esp)
movl $2, 8(%esp)
movl $1, 4(%esp)
movl $.LC0, (%esp)
call printf
...
movl $15, 16(%esp)
movl $6, 12(%esp)
movl $5, 8(%esp)
movl $4, 4(%esp)
movl $.LC1, (%esp)
call printf
...
Both test() and test2() are indeed inlined, but you also see test(), which stays outside main(). This is
where the static keyword plays a role. By saying a function is static, you tell gcc that this function won't be
called by any outside object file, so there is no need to emit the codes on its own. Thus, it is a space saver if
you can mark them as static whenever possible. On the other hand, be wise when deciding which function
should be inlined. Increasing size for a small speedup isn't always worthwhile.
Now, on to parameter passing. In x86 architectures, parameters are pushed to the stack and later popped inside
the function for further processing. But gcc
gives you a chance to change this behavior and instead use registers. Functions with up to three parameters
could use this feature by passing -mregparm=<n>, where <n> is the number of registers we want to use. If we
apply this parameter (n=3) to Listing 2, take out the inline attribute, and use no optimization, we get this:
...
test:
pushl %ebp
movl %esp, %ebp
subl $56, %esp
movl %eax, -20(%ebp)
movl %edx, -24(%ebp)
movl %ecx, -28(%ebp)
...
main:
...
movl $3, %ecx
movl $2, %edx
movl $1, %eax
call test
Instead of stack, it uses EAX, EDX, and ECX to hold the first, second, and third parameter. Because register
access time is faster than RAM, it is one way to reduce runtime. However, you must pay attention to these
issues:
You MUST compile all your code with the same -mregparm register number. Otherwise, you will have
trouble calling functions on another object file since they assume different calling conventions.
By using -mregparm, you basically break the Intel x86-compatible Application Binary Interface (ABI).
Therefore, you should mention it when you distribute your software in binary only form.
You probably notice this kind of sequence at the beginning of every function:
push %ebp
mov %esp,%ebp
sub $0x28,%esp
This sequence, also known as the function prologue, is written to set up the frame pointer (EBP). It is useful to
help the debugger do a stack trace. The structure below helps you visualize this [6]:
Can we omit it? Yes, with -fomit-frame-pointer, the prologue will be shortened so the function just begins
with a stack reservation (if there are local variables):
sub $0x28,%esp
If the function gets called very frequently, cutting out the prologue saves your program several CPU cycles.
But be careful: by doing this, you also make it hard for the debugger to investigate the stack. For example, let's
add test(7,7,7) at the end of test2() and recompile with -fomit-frame-pointer and no optimization. Now
fire up gdb to inspect the binary:
$ gdb inline
(gdb) break test
(gdb) r
Breakpoint 1, 0x08048384 in test ()
(gdb) cont
Breakpoint 1, 0x08048384 in test ()
(gdb) bt
#0 0x08048384 in test ()
#1 0x08048424 in test2 ()
#2 0x00000007 in ?? ()
#3 0x00000007 in ?? ()
#4 0x00000007 in ?? ()
#5 0x00000006 in ?? ()
#6 0x0000000f in ?? ()
#7 0x00000000 in ?? ()
On the second call of test, the program is stopped and gdb prints the stack trace. Normally, main() should
come up in Frame #2, but we only see question marks. Recall what I said about the stack layout: the absence of
a frame pointer prevents gdb from finding the location of the saved return address in Frame #2.
Everybody needs to debug his or her code sometimes. When that time comes, usually you fire up gdb, put
breakpoints here and there, analyze backtraces, and so on, to pinpoint the exact location of the offending
code(s). And what exactly do you get? Assuming you haven't used any debugging option, you likely just get
an address pointed to by an EIP register.
The problem is, you don't really want an address. You want gdb or another debugger to simply show the
related lines. But gdb
can't do that without some kind of hint. This hint, specifically called Debugging With Attributed Record
Formats (DWARF), helps you do source-level debugging.
How do you do it? Use -g when you compile into object code, e.g.:
gcc -o -g test test.c
What does gcc actually add so that the debugger can correlate an address with source code? You need
dwarfdump
[7] to find out. This tool is packaged inside the "libdwarf" tarball or RPM, so you won't find it as a standalone
package. Compile it on your own, or simply install from your distro's online repository; both should work. In
this section, I use the 20060614 RPM version.
Using readelf, you notice that there are 28 sections inside the non-debug version of Listing 1:
$ readelf -S ./non-optimized
But the debug version has 36 sections. The new sections added are:
debug_arranges
debug_pubnames
debug_info
debug_abbrev
debug_line
debug_frame
debug_str
debug_loc
You don't need to dig into all of the above sections; taking a look into .debug_line is enough for quick
observation. The command you need is:
$ /path/to/dwarfdump -l <object file>
The interpretation of the above messages is quite straightforward. Take the first entry (below the <source>
string) as an example:
line number 3 in file non-optimized.c is located in address 0x8048384.
Both readelf and dwarfdump can analyze debug information, so you're free to choose.
What you should realize is that the source code itself isn't embedded into the object file. In fact, the debugger
must check a separate source code file. The entry in the <source> column helps determine where to load the
source file. Notice that it contains a full path--meaning if the file is moved somewhere or renamed, gdb can't
locate it.
gcc itself has the ability to produce many debugging information. Besides DWARF, there are:
Stabs: -gstabs produce native stabs format, while -gstabs+ includes specific
GNU extensions.
Common Object File Format (COFF): Created with -gcoff.
XCOFF: Created with -gxcoff. If you prefer to include GNU extensions,
use -gxcoff+.
Virtual Memory System (VMS): Produced with -gvms.
Each of the above formats is described in the footnotes ([8], [9], and [10]), but in x86-compatible architecture,
without a doubt, you would use DWARF format. The latest DWARF specification is DWARF version 3, and
gdb can produce it via -gdwarf-2. This could be misleading for first-time users, because you might think it
creates DWARF 2-based information. In fact, it is DWARF 2 coupled with some DWARF 3 features. Not
every debugger supports version 3, so use it with caution.
However, things do not always go smoothly. When you combine -O and -g, it is necessary for line information
to relate to the actual code in the mentioned address offset. An example can clarify this--take a file (I use
Listing 1 again) and compile it:
$ gcc -O2 -ggdb -o debug-optimized listing-one.c
$ readelf --debug-dump=line debug-optimized
..
Special opcode 107: advance Address by 7 to 0x80483aa and Line by 4 to 11
...
There you see a complete disagreement. By inspecting debug information alone, you would expect the related
address to contain something like a CALL instruction. But in reality, you get ADD and CMP instructions, more
likely a loop construction. This is the side effect of the jobs done by optimization--instruction reordering in
this case. So as a rule of thumb, don't mix -g (or its variants) with -O.
For learning purposes, sometimes you want to know how your source code is transformed into an executable.
Fortunately, gcc provides you options to stop at any processing stage. Recall that gcc has several stages to be
accomplished--for example, linking. The options are:
-c stops at assembly phase but skips linking. The result is an object code.
-E stops after preprocessing stage. All preprocessing directives are expanded so you only see plain code.
-S stops after compilation. It leaves you with assembler code
-c
is mostly used when you have multiple source files and combine them to create the final executable. So,
instead of:
$ gcc -o final-binary test1.c test2.c
and then:
$ gcc -o final-binary ./test1.o ./test1.o
You probably notice the same sequence if you build the program using a Makefile. The advantage of using -c
is clear: you only need to recompile the changed source files. The only phase that has to be redone is linking
all the object files, and that greatly saves time, especially in large projects. An obvious example of this is the
Linux kernel.
-E
is useful if you want to see how your code really looks after macros, definitions, and such are expanded. Take
Listing 3 as an example.
#include<stdio.h>
#define A 2
#define B 4
#define calculate(a,b) a*a + b*b
void plain_dummy()
{
printf("Just a dummy\n");
}
Notice that we don't pass any -D parameters, which means TEST is undefined. So what do we have in the
preprocessed file?
void plain_dummy()
{
printf("Just a dummy\n");
}
Where is the call to justtest() inside main()? Nowhere. TEST is undefined--that's why the code is
-S gives you assembly code, much like what you see with objdump -d/-D. However, with -S, you still see
directives and symbol names, which makes it easier to study the code. For example, a call like
printf("%d\n", 20) could be transformed into:
.section .rodata.str1.1,"aMS",@progbits,1
.LC0:
.string "%d\n"
...
movl $20, 4(%esp)
movl $.LC0, (%esp)
call printf
You can see that format string %d is placed in a read-only data section ( .rodata). Also, you can confirm
that arguments are pushed to the stack from right to left, with the format string at the top of the stack.
Conclusion
gcc
gives us many useful options to make our code into whatever we like. By understanding what these options
really do, we can make the program faster and slimmer. However, do not depend entirely on them: you should
pay more attention to writing efficient and well-structured code.
Acknowledgments
I would like to thank the communities in the OFTC chat room (#kernelnewbies and #gcc) and #osdev
(Freenode) for their valuable ideas.
References
Return to ONLamp.com.
Copyright © 2007 O'Reilly Media, Inc.