Create A Working Compiler With The LLVM Framework, Part 1
Build a custom compiler with LLVM and its intermediate
representation
LLVM (formerly the Low Level Virtual Machine) is an extremely powerful
compiler infrastructure framework designed for compile-time, link-time, and run-time
optimization of programs written in your favorite programming language. LLVM
works on several different platforms, and its primary claim to fame is generating code
that runs fast.
llvm-gcc
llvm-gcc is a modified version of the GNU Compiler Collection (gcc) that can
generate LLVM assembly, the human-readable form of LLVM byte code, when run with
the -S -emit-llvm options. You can then use lli to execute the generated code.
For more information about llvm-gcc, see Resources. If you don't have llvm-gcc
preinstalled on your system, you should be able to build it from sources; see
Resources for a link to the step-by-step guide.
After compilation, llvm-gcc generates the file helloworld.s, which you can execute
using lli to print the message to the console. The lli usage is:
Tintin.local# lli helloworld.s
Hello, World
Now, take a first look at the LLVM assembly. Listing 2 shows the code.
• Comments in LLVM assembly begin with a semicolon (;) and continue to the
end of the line.
• Global identifiers begin with the at (@) character. All function names and global
variables must begin with @, as well.
• Local identifiers in the LLVM begin with a percent symbol (%). The typical regular
expression for identifiers is [%@][a-zA-Z$._][a-zA-Z$._0-9]*.
• LLVM has a strong type system, which is counted among its most
important features. LLVM defines an integer type as iN, where N is the
number of bits the integer occupies. You can specify any bit width between 1
and 2^23 - 1.
• You declare an array type as [number of elements x type of each
element]. For the string "Hello World!" this makes the type [13 x i8], assuming
that each character is 1 byte and factoring in 1 extra byte for the NULL
character. (Vector types use angle brackets instead, as in <4 x i32>.)
• You declare a global string constant for the hello-world string as follows: @hello
= constant [13 x i8] c"Hello World!\00". Use the constant keyword to
declare a constant followed by the type and the value. The type has already
been discussed, so let's look at the value: You begin by using c followed by
the entire string in double quotation marks, ending with \00. The c prefix
marks the value as a character array written in string form, and \00 is a
backslash followed by two hexadecimal digits, here the escape for the NULL
character. See Resources for a link to the grammar file, if
you're interested in exploring more LLVM quirks.
• LLVM lets you declare and define functions. Instead of going through the
entire feature list of an LLVM function, I concentrate on the bare bones. Begin
with the define keyword followed by the return type, and then the function
name. A simple definition of main that returns a 32-bit integer looks similar to: define
i32 @main() { ; some LLVM assembly code that returns i32 }.
• Function declarations, like definitions, have a lot of meat to them. Here's the
simplest declaration of the puts method, the C library routine used here in place of printf:
declare i32 @puts(i8*). You begin the declaration with the declare keyword
followed by the return type, the function name, and an optional list of arguments
to the function. The declaration must be in the global scope.
• Each function ends with a return statement. There are two forms of return
statement: ret <type> <value> or ret void. For your simple main routine, ret
i32 0 suffices.
• Use call <function return type> <function name> <optional function
arguments> to call a function. Note that each function argument must be
preceded by its type. A function test that returns an integer of 6 bits and accepts
an integer of 36 bits has the syntax: call i6 @test( i36 %arg1 ).
That's it for a start. You need to define a main routine, a constant to hold the string,
and a declaration of the puts method that handles the actual printing. Listing 3 shows
the first attempt.
Oops, that didn't work as expected. What just happened? LLVM, as mentioned
earlier, has a powerful type system. Because puts was expecting a pointer to i8 and
you passed an array of i8, lli was quick to point out the error. The obvious fix to this
problem, coming from a C programming background, is typecasting. And that brings
you to the LLVM instruction getelementptr. Note that you must modify the puts call
in Listing 3 to something like call i32 @puts(i8* %t), where %t is of type i8* and
is the result of the typecast from [13 x i8] to i8*. (See Resources for a link to a
The first argument to getelementptr is the pointer to the global string variable. The
first index, i64 0, is required to step over the pointer to the global variable. Because
the first argument to the getelementptr instruction must always be a value of type
pointer, the first index steps through that pointer. A value of 0 means 0 elements
offset from that pointer. My development computer is running 64-bit Linux®, so the
pointer is 8 bytes. The second index, i64 0, is used to select the 0th element of the
string, which is supplied as the argument to puts.
Now, let's create a program that generates LLVM IR for the Hello World program
discussed earlier. The program won't deal with the entire LLVM API here, but the
code samples that follow should prove that a fair bit of the LLVM API is intuitive and
easy to use.
You must begin your program by creating an LLVM module. The first argument is
the name of the module and can be any dummy string. The second argument is
something called LLVMContext. The LLVMContext class is somewhat opaque, but it's
enough to understand that it provides a context in which variables and so on are
created. This class becomes important in the context of multiple threads, where you
might want to create a local context per thread, and each thread runs completely
independently of any other's context. For now, use the default global context handle
that the LLVM provides. Here's the code to create a module:
llvm::LLVMContext& context = llvm::getGlobalContext();
llvm::Module* module = new llvm::Module("top", context);
The next important class to learn is the one that actually provides the API to create
LLVM instructions and insert them into basic blocks: the IRBuilder class. IRBuilder
comes with a lot of bells and whistles, but I chose the simplest possible way to
construct one—by passing the global context to it with the code:
llvm::IRBuilder<> builder(context);
When the LLVM object model is ready, you can dump its contents by calling the
module's dump method. Listing 6 shows the code.
int main()
{
    llvm::LLVMContext& context = llvm::getGlobalContext();
    llvm::Module* module = new llvm::Module("top", context);
    llvm::IRBuilder<> builder(context);
    module->dump( );
}
You need to create the main method next. LLVM provides the classes llvm::Function
to create a function and llvm::FunctionType to associate a return type for the
function. Also, remember that the main method must be a part of the module. Listing
7 shows the code.
int main()
{
    llvm::LLVMContext& context = llvm::getGlobalContext();
    llvm::Module *module = new llvm::Module("top", context);
    llvm::IRBuilder<> builder(context);
    llvm::FunctionType *funcType =
        llvm::FunctionType::get(builder.getVoidTy(), false);
    llvm::Function *mainFunc =
        llvm::Function::Create(funcType, llvm::Function::ExternalLinkage, "main", module);
    module->dump( );
}
Note that you wanted main to return void, which is why you called
builder.getVoidTy(); if main returned i32, the call would be builder.getInt32Ty().
After compiling and running the code in Listing 7, the result is:
; ModuleID = 'top'
declare void @main()
You have not yet defined the set of instructions that main is supposed to execute. For
that, you must define a basic block and associate it with the main method. A basic
block is a collection of instructions in the LLVM IR that has the option of defining a
label (akin to C labels) as part of its constructor. The builder.SetInsertPoint call tells
the LLVM engine where to insert the instructions next. Listing 8 shows the code.
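int main()
{
    llvm::LLVMContext& context = llvm::getGlobalContext();
    llvm::Module *module = new llvm::Module("top", context);
    llvm::IRBuilder<> builder(context);
    llvm::FunctionType *funcType =
        llvm::FunctionType::get(builder.getVoidTy(), false);
    llvm::Function *mainFunc =
        llvm::Function::Create(funcType, llvm::Function::ExternalLinkage, "main", module);
    // Create the labeled basic block inside main and insert new instructions there.
    llvm::BasicBlock *entry = llvm::BasicBlock::Create(context, "entrypoint", mainFunc);
    builder.SetInsertPoint(entry);
    module->dump( );
}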
Here's the output of Listing 8. Note that because the basic block for main is now
defined, the LLVM dump now treats main as a method definition, not a declaration.
Cool stuff!
; ModuleID = 'top'
define void @main() {
entrypoint:
}
Now, add the global hello-world string to the code. Listing 9 shows the code.
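int main()
{
    llvm::LLVMContext& context = llvm::getGlobalContext();
    llvm::Module *module = new llvm::Module("top", context);
    llvm::IRBuilder<> builder(context);
    llvm::FunctionType *funcType =
        llvm::FunctionType::get(builder.getVoidTy(), false);
    llvm::Function *mainFunc =
        llvm::Function::Create(funcType, llvm::Function::ExternalLinkage, "main", module);
    llvm::BasicBlock *entry = llvm::BasicBlock::Create(context, "entrypoint", mainFunc);
    builder.SetInsertPoint(entry);
    // Adds the string as a global constant and yields an i8* to its first character.
    llvm::Value *helloWorld = builder.CreateGlobalStringPtr("hello world!\n");
    module->dump( );
}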
In this output of Listing 9, note how the LLVM engine dumps the string:
; ModuleID = 'top'
@0 = internal unnamed_addr constant [14 x i8] c"hello world!\0A\00"
define void @main() {
entrypoint:
}
All you need now is to declare the puts method and make a call to it. To declare the
puts method, you must create the appropriate FunctionType*. From your original
Hello World code, you know that puts returns i32 and accepts i8* as the input
argument. Listing 10 shows the code to create the right type for puts.
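std::vector<llvm::Type *> putsArgs;
putsArgs.push_back(builder.getInt8Ty()->getPointerTo());
llvm::ArrayRef<llvm::Type*> argsRef(putsArgs);
llvm::FunctionType *putsType =
    llvm::FunctionType::get(builder.getInt32Ty(), argsRef, false);
llvm::Constant *putsFunc = module->getOrInsertFunction("puts", putsType);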
The first argument to FunctionType::get is the return type; the second argument is
an llvm::ArrayRef structure, and the last false indicates that no variable number of
arguments follows. The ArrayRef structure is similar to a vector, except that it does
not contain any underlying data and is primarily used to wrap data blocks like arrays
and vectors. With this change, the output appears in Listing 11.
All that remains is to call the puts method inside main and return from main. The
LLVM API takes care of the casting and all the rest: All you need to call puts
is to invoke builder.CreateCall. Finally, to create the return statement, call
builder.CreateRetVoid. Listing 12 provides the complete working code.
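#include "llvm/LLVMContext.h"
#include "llvm/Module.h"
#include "llvm/Function.h"
#include "llvm/BasicBlock.h"
#include "llvm/Support/IRBuilder.h"
#include <vector>
#include <string>
int main()
{
    llvm::LLVMContext & context = llvm::getGlobalContext();
    llvm::Module *module = new llvm::Module("asdf", context);
    llvm::IRBuilder<> builder(context);
    // void main(), with an entry block to hold its instructions
    llvm::FunctionType *funcType =
        llvm::FunctionType::get(builder.getVoidTy(), false);
    llvm::Function *mainFunc =
        llvm::Function::Create(funcType, llvm::Function::ExternalLinkage, "main", module);
    llvm::BasicBlock *entry = llvm::BasicBlock::Create(context, "entrypoint", mainFunc);
    builder.SetInsertPoint(entry);
    llvm::Value *helloWorld = builder.CreateGlobalStringPtr("hello world!\n");
    // i32 puts(i8*)
    std::vector<llvm::Type *> putsArgs;
    putsArgs.push_back(builder.getInt8Ty()->getPointerTo());
    llvm::ArrayRef<llvm::Type*> argsRef(putsArgs);
    llvm::FunctionType *putsType =
        llvm::FunctionType::get(builder.getInt32Ty(), argsRef, false);
    llvm::Constant *putsFunc = module->getOrInsertFunction("puts", putsType);
    builder.CreateCall(putsFunc, helloWorld);
    builder.CreateRetVoid();
    module->dump();
}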
Conclusion
In this initial study of LLVM, you learned about LLVM tools like lli and llvm-config,
dug into LLVM intermediate code, and used the LLVM API to generate the
intermediate code for you. The second and final part of this series will explore yet
another task you can use the LLVM for—adding an extra compilation pass with
minimum effort.
Resources
Learn
• Move beyond the basics of the LLVM in Create a working compiler with the
LLVM framework, Part 2: Use clang to preprocess C/C++ code (Arpan Sen,
developerWorks, June 2012). Put your compiler to work as you use the clang
API to preprocess C/C++ code as the LLVM compiler series continues.
• Take the official LLVM Tutorial for a great introduction to LLVM.
• See Chris Lattner's chapter in The Architecture of Open Source Applications for
more information on the development of LLVM.
• Learn more about two important LLVM tools: llc and lli.
• Find more information about the llvm-gcc tool, and learn how to build it from
source with the step-by-step guide, Building llvm-gcc from Source.
• Read more about the LLVM assembly language in the LLVM Language
Reference Manual.
• Check out its grammar file, Log of /llvm/trunk/utils/llvm.grm, for more information
about the global string constant in LLVM.
• Learn more about the getelementptr instruction in "The Often Misunderstood
GEP Instruction" document.
• Dig into the LLVM Programmer's Manual, an indispensable resource for the
LLVM API.
• Read about the llvm-config tool for printing LLVM compilation options.