C++ Optimization Strategies and Techniques
Pete Isensee
World Opponent Network
3380 146th Place SE, Suite 110
Bellevue, WA 98007
Pete.Isensee@WON.net
Introduction
"More computing sins are committed in the name of efficiency (without necessarily
achieving it) than for any other single reason – including blind stupidity."
– W.A. Wulf
Preliminaries
All of the examples are in C++. The code is designed to compile with any ANSI C++-compliant
compiler. Some of the more complex techniques involve templates and the Standard
Template Library. I used Microsoft Visual C++ 6.0 for the example programs, targeting PCs
running Microsoft Windows 95/98 or NT.
Except where noted, all benchmarks and profiling were done on a Pentium II 400MHz Dell
Dimension XPS400 running NT 4.00.1381. Most profiling runs were done with compiler
optimizations disabled to prevent compiler-specific options from influencing the results.
All performance graphs show relative performance. If the unoptimized run takes 200 ms and
the optimized run takes 100 ms, the optimized run will be shown as twice as tall as the
unoptimized run (i.e. twice as fast). In other words, taller is better.
Most code examples use the following C++ objects for comparison:
• int
• string (standard C++ basic_string<char> class with an average of 32 characters per string)
• complex (standard C++ complex<double> class containing two double values)
• bitmap (bitmap class with expensive default and copy ctor; average of 10000 pixels)
1: General Strategies
• Assuming some operations are faster than others. When it comes to optimizing, never ever
assume anything. Benchmark everything. Use a profiler. Even while I was doing examples
for this paper, some of my “optimizations” turned out to be major duds.
• Reducing code to improve performance. Reducing code might improve performance; it
might not. Increasing the amount of code will often improve performance. Loop unrolling is
a prime example.
• Optimizing as you go. Big mistake. As computer scientist Donald Knuth said, “premature
optimization is the root of all evil.” Optimization is one of the last steps of a project. Plan for
it, but don’t optimize too soon. If you do, you’ll end up optimizing code that you either don’t
use or that doesn’t need to be streamlined in the first place. However, there are some
efficiency techniques you can use throughout your project from day one. These tips can
make your code more readable and concise. I’ve listed them below in section 3.
• Worrying about performance before concentrating on code correctness. Never! Write the
code without optimizations first. Use the profiler to determine if it needs to be revised. Don’t
ignore performance issues; let performance issues guide your design, data structures, and
algorithms, but don’t let performance affect every aspect of your code. In a typical game,
only a small percentage of the code requires optimization. Usually it’s the inner loops of the
blitting, AI or physics routines.
Swapping is so simple that we really only need a single function to handle it, right? Not
necessarily. Often an object can provide its own swapping method that is considerably faster
than calling the object’s constructor, assignment operator (twice), and destructor. In fact, with
STL, there are many specialized swap routines, including string::swap, list::swap, and so forth.
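The generic algorithm referenced below (MySwap, reconstructed here as a sketch) is the
classic three-step swap:
template <class T> void MySwap(T& a, T& b)
{
    T t(a);  // copy ctor
    a = b;   // assignment
    b = t;   // assignment; t is destroyed on exit
}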
[Figure: Swap Algorithms - relative performance for int, complex, string, and bitmap]
As you can see, calling STL swap performs the same as the MySwap algorithm above.
However, for specialized classes, like strings, swapping has been optimized to be 6-7 times
faster. For the bitmap object, which has extremely expensive constructors and assignment
operators, the bitmap::swap routine is over 8000 times faster!
(see Swap project for benchmark code)
When you start working on your next game and begin to think about coding conventions,
compilers, libraries, and general C++ issues, there are many factors to consider. In the
following section I weigh some performance issues involved with C++ design considerations.
2.1: STL Containers
Take advantage of STL containers. (See Appendix A for an STL container efficiency table).
Not only is performance good today, it’s only going to get better as STL vendors focus their
efforts on optimization and compiler vendors improve template compilation. There are a
number of other advantages to using STL:
1) It’s a standard. The programmers modifying your code in the future won’t have to decipher
the semantics of yet another linked list class.
2) It’s thin. Some have claimed that template-based containers cause code bloat, but I believe
template-based containers will actually make your code smaller. You won’t have to have an
entirely different set of code for different types of lists. If you’re still concerned about code
bloat, use the STL containers to store pointers instead of object copies.
3) It’s flexible. All containers follow the same conventions, so if you decide that maybe a
deque will give better performance than a list, it’s easy to switch. If you use typedefs, it can
be as easy as changing one line of code (see the sketch after this list). You also get the
advantage of dozens of predefined algorithms for searching and sorting that work for any
STL container.
4) It’s already written. And debugged, and tested. No guarantees, but better than starting from
scratch.
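As promised above, hiding the container choice behind a typedef makes the experiment
cheap. A sketch (GameObject is a hypothetical type):
typedef list<GameObject*> ObjectList; // change list to deque and re-profile
ObjectList objects;                   // all code using ObjectList is unchanged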
The STL is not the be-all end-all library of containers and algorithms. You can get better
performance by writing your own containers. For instance, by definition, the STL list object
must be a doubly-linked list. In cases where a singly-linked list would be fine, you pay a
penalty for using the list object. This table shows the difference between Microsoft’s (actually
Dinkumware’s) implementation of lists and SGI’s implementation of an STL-compatible singly-
linked list called slist.
[Figure: List Insertion - relative performance of list vs. slist for int, complex, string, and bitmap]
Inserting at the beginning of a singly-linked list (slist) is around 30% faster for most objects
than inserting into the standard list. If you need to insert items at the end of a list, slist is a poor
choice, since reaching the last node requires walking the entire list.
One other drawback of the STL is that it only provides a limited set of container objects. It does
not provide hash tables, for instance. However, there are a number of good extension STL
libraries available. SGI distributes an excellent STL implementation with a number of useful
containers not defined in the standard.
(see SinglyLinkedList project for benchmark code)
2.2: References Instead of Pointers
As a basic design premise, consider using references instead of pointers. A quick example for
comparison:
int x;
void Ptr(const int* p) { x += *p; }
void Ref(const int& p) { x += p; }
The Ptr function and the Ref function generate exactly the same machine language. The
advantages of the Ref function:
• There’s no need to check that the reference is not NULL. References are never NULL.
(Creating a NULL reference is possible, but difficult).
• References don’t require the * dereference operator. Less typing. Cleaner code.
• There’s the opportunity for greater efficiency with references. A major challenge for
compiler writers is producing high-performance code with pointers. Pointers make it
extremely difficult for a compiler to know when different variables refer to the same location,
which prevents the compiler from generating the fastest possible code. Since a variable
reference points to the same location during its entire life, a C++ compiler can do a better
job of optimization than it can with pointer-based code. There’s no guarantee that your
compiler will do a better job at optimization, but it might.
What kind of savings can you expect? It depends. If you copy many objects, especially “empty”
objects, the savings can be significant. If you don’t do a lot of copying, two-phase construction
can have a negative impact, because it adds a new level of indirection.
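For reference, two-phase construction splits object creation into a nearly-free default
constructor and an explicit initialization call. A minimal sketch (names hypothetical):
class Texture
{
public:
    Texture();                          // phase one: construct an empty shell
    bool Init(const string& sFileName); // phase two: the expensive work
};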
(see TwoPhase project for benchmark code)
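The next comparison pits C++ stream IO against C stdio. The iostream version of the test,
reconstructed as a sketch (assuming an int i, as in the printf call below):
// iostream
cout << 'a' << ' ' << 1234 << ' ' << 1234.5678 << ' ' << &i << ' ' << "abcd" << '\n';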
// stdio
printf("%c %d %f %p %s\n", 'a', 1234, 1234.5678, &i, "abcd");
[Figure: Stream IO vs Printf - relative performance of cout vs. printf]
Both examples display the same results, but printf does it more efficiently and more readably.
Use the <cstdio> family of functions instead of the <iostream> family when output speed is
critical.
(see StreamIOvsPrintf project for benchmark code)
Defy the software engineering mantra of “optimization procrastination.” These techniques can
be added to your code today! In general, these methods not only make your code more
efficient, but increase readability and maintainability, too.
3.1: Pass Class Parameters by Reference
Passing an object by value requires that the entire object be copied (copy ctor), whereas
passing by reference does not invoke a copy constructor, though you pay a “dereference”
penalty when the object is used within the function. This is an easy tip to forget, especially for
small objects. As you’ll see, even for relatively small objects, the penalty of passing by value
can be stiff. I compared the speed of the following functions:
template <class T> void ByValue(T t) { }
template <class T> void ByReference(const T& t) { }
template <class T> void ByPointer(const T* t) { }
[Figure: Pass by Reference - relative performance for int, complex, string, and bitmap]
For strings, passing by reference is almost 30 times faster! For the bitmap class, it’s thousands
of times faster. What is surprising is that passing a complex object by reference is almost 40%
faster than passing by value. Only ints and smaller objects should be passed by value,
because it’s cheaper to copy them than to take the dereferencing hit within the function.
(see PassByReference project for benchmark code)
[Figure: Postpone Declaration - relative performance for int, complex, string, and bitmap]
Without exception, it’s as fast or faster to declare the objects within the scope of the if
statement. The only time where it may make sense to declare an object outside of the scope
where it’s used is in the case of loops. An object declared at the top of a loop is constructed
each time through the loop. If the object naturally changes every time through the loop, declare
it within the loop. If the object is constant throughout the loop, declare it outside the loop.
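For example (Compute and nCount are hypothetical):
string sLabel("fixed");                // constant across iterations: declare outside
for (int i = 0; i < nCount; ++i)
{
    string sValue = Compute(i);        // changes each iteration: declare inside
    // use sLabel and sValue . . .
}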
(see PostponeDeclaration project for benchmark code)
[Figure: Prefer Initialization - relative performance for int, complex, string, and bitmap]
Initializing a complex value is over four times faster than declaring and assigning. Even for
strings, the gain is 6%. Surprisingly, it makes little difference for the bitmap object. That’s
because the time to default construct a bitmap is minuscule in comparison to the time required
to copy one bitmap to another.
Here’s a real world case from WON, the company where I work. This is code that’s running
today – slightly modified to protect the guilty. It probably looks similar to code in your own
projects. The input strings are copied to slightly different string objects.
void SNCommGPSendNewUser(const SNstring& sUser, const SNstring& sPass,
/* 9 more SNstring params ... */ )
{
string User;
string Pass;
User = sUser; // Convert to our format
Pass = sPass;
// etc . . .
}
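Rewritten to prefer initialization, the same function might look like this (a sketch of the
revised version):
void SNCommGPSendNewUser(const SNstring& sUser, const SNstring& sPass,
/* 9 more SNstring params ... */ )
{
    string User(sUser); // initialize: copy ctor only
    string Pass(sPass); // no default ctor plus assignment
    // etc . . .
}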
Readability improvement: 100%. Lines of code: 50% of original. Speed improvement: just over
3%. Not huge, but certainly nothing to complain about. Triple win.
(see PreferInitialization project for benchmark code)
3.4: Use Constructor Initialization Lists
In any constructor that initializes member objects, it can pay big dividends to set the objects
using an initialization list rather than within the constructor itself. Why? Class member
variables are automatically constructed using their default constructor prior to entry within the
class constructor itself. You can override this behavior by specifying a different member
constructor (usually a copy constructor) in the initialization list. Multiple initializations are
separated with commas (not shown here).
template <class T> class CtorInit
{
T m_Value;
public:
// no list (note: only one of these two ctors can be defined at a time)
CtorInit(const T& t) // m_Value default ctor called here automatically
{
m_Value = t; // m_Value assignment operator called
}
// with list
CtorInit(const T& t) : m_Value(t) { } // m_Value copy ctor called
};
The drawback to using initialization lists is that there’s no way to do error checking on incoming
values. In the “no list” example we could do some validation on t within the CtorInit function. In
the “with list” example, we can’t do any error checking until we’ve actually entered the CtorInit
code, by which time t has already been assigned to m_Value. There’s also a readability
drawback, especially if you’re not used to initialization lists.
[Figure: Initialization Lists - relative performance for int, complex, string, and bitmap]
Nevertheless, these are good performance gains, particularly for the complex object. This type
of performance can outweigh the drawbacks.
(see InitializationLists project for benchmark code)
It’s typically more efficient to use += instead of + alone, because we avoid generating a
temporary object. Consider the following functions. They give the same result, but one uses +
alone and the other uses +=:
template <class T> T OperatorAlone(const T& a, const T& b)
{
T c(a + b);
return (c);
}
template <class T> T OperatorEquals(const T& a, const T& b)
{
T c(a);
c += b;
return (c);
}
[Figure: Operator += - relative performance of + alone vs. += for int, complex, string, and bitmap]
For intrinsic types, + alone gives better results, but for non-trivial classes, especially classes
with costly construction time, += is the better choice.
(see OperatorEquals project for benchmark code)
[Figure: Prefer Prefix - relative performance of prefix vs. postfix increment for int, complex, and bitmap]
Strings aren’t included in the results because increment doesn’t make sense for strings. (It
doesn’t really make sense for bitmaps, either, but I defined increment to increase the width and
height by one, forcing a reallocation.) Where this recommendation really shines is for
mathematical objects like complex. The prefix operator is almost 50% faster for complex objects.
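The reason shows up in the canonical operator forms. A sketch for a hypothetical wrapper
class:
template <class T> class Wrap
{
    T m_Val;
public:
    Wrap<T>& operator ++ ()   // prefix: modify in place, return *this
    {
        ++m_Val;
        return (*this);
    }
    Wrap<T> operator ++ (int) // postfix: copy, modify, return the copy
    {
        Wrap<T> old(*this);   // the extra copy ctor and dtor are the cost
        ++m_Val;
        return (old);
    }
};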
(see PreferPrefix project for benchmark code)
Your compiler is pretty smart. It knows how to compare two pairs because you told it how in
the pair class. It also knows how to create a pair given a string, so it can easily evaluate
(p == s). The drawback is that we’ve hidden the second pair constructor – it’s implicit. If that
constructor is expensive, it’s difficult to see that it’s being invoked. Worse, if we made a
mistake and we didn’t really want to compare a pair with a string, the compiler won’t tell us.
My advice: make all single-argument constructors (except the copy constructor) explicit.
explicit pair(const string& s) { . . . }
Now the (p == s) line will give a compiler error. If you really want to compare these guys, you
must explicitly call the constructor:
if (p == pair(s)) { . . . }
Using explicit will protect you from stupid mistakes and make it easier for you to pinpoint
potential bottlenecks.
Your game is up and running. The data structures are ideal, the algorithms sublime, the code
elegant, but the game – well, it’s not quite living up to its potential. Time to get drastic, and with
drastic measures, there are tradeoffs to consider. These optimizations are going to make your
code less modular, harder to understand, and more difficult to maintain. They may cause
unexpected side effects like code bloat. Your compiler may not even be able to handle some of
the more advanced template-based techniques. Proceed with caution. Arm yourself with a
good profiler.
[Figure: Inline Functions - relative performance of inlining a(), b(), c(), GetInt(), GetCp(), GetStr(), and GetBmp()]
The biggest gain came from inlining the function that returns the complex value: the inlined
version was more than twice as fast as the non-inlined version. Inlining the larger functions or
the functions that returned non-trivial objects (strings, bitmaps) did not improve performance at
all.
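For reference, the benchmark functions have roughly this shape (a sketch; the real
definitions are in the Inline project):
class InlineTest
{
    complex<double> m_cp;
    string m_str;
public:
    complex<double> GetCp() const { return (m_cp); } // inlining here paid off
    string GetStr() const { return (m_str); }        // inlining here did not
};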
Note that using Microsoft’s inline “any suitable” function option did not inline any functions
beyond those already marked inline, not even the world’s simplest function, a()!
Clearly, there’s room for improvement in the Visual C++ compiler.
(see Inline project for benchmark code)
The code is correct, but it could be more efficient. We already know we can improve the
efficiency by initializing the complex value when it’s constructed:
complex<double> Mult(const complex<double>& a, const complex<double>& b)
{
complex<double> c((a.real() * b.real()) - (a.imag() * b.imag()),
(a.real() * b.imag()) + (a.imag() * b.real()));
return (c);
}
At this point, the compiler can work a little magic. It can omit creating the temporary object that
holds the function return value, because that object is unnamed. It can construct the object
defined by the return expression inside the memory of the object that is receiving the result.
Return value optimization is another way of saying return constructor arguments instead of
named objects. What kind of gains can you expect from this optimization? Here are some
trivial example functions I used to evaluate potential performance improvements.
template <class T> T Original(const T& tValue)
{
T tResult; // named object; probably can’t be optimized away
tResult = tValue;
return (tResult);
}
template <class T> T Optimized(const T& tValue)
{
return (T(tValue)); // unnamed object; optimization potential high
}
[Figure: Return Value Optimization - relative performance for int, complex, string, bitmap, and cMult]
The results are favorable. String performance doubled and bitmap performance more than
tripled. Before you go using this trick willy-nilly, be aware of the drawbacks. It’s hard to check
for errors in intermediate results. In the first version of Mult above, we could easily add error
checking on the real and imaginary values. In the final version, all we can do is hope there are
no overflow or underflow errors. The first version is also much easier to read. In the final version,
there could be a non-obvious error. For instance, is the real part the first parameter to the
complex constructor, or is the imaginary part?
One more note. The final version of the C++ standard has made it easier for compilers to
optimize away even named objects. The standard says that for functions with a class return type, if
the return statement is the name of a local object, and the object is the same type as the return
type, the compiler can omit creating a temporary object to hold the function return value. That
means that in some rosy future years away, return value optimization will be in the hands of
compiler writers where it should be, and we won’t have to change our code. And pigs could fly.
In the meantime, this tip is worth considering.
(see ReturnValueOpt project for benchmark code)
[Figure: Virtual Functions - relative performance of ctor/dtor and function call (foo)]
Construction/destruction time shows the performance penalty of initializing the virtual function
table. (See the Microsoft-specific tip below for reducing this penalty). Notice that the function
call overhead is very minimal, even though the function itself hardly does anything. For larger
functions, the overhead becomes even less significant.
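The benchmark compares classes along these lines (a sketch; names hypothetical):
class Plain
{
    int m_n;
public:
    Plain() : m_n(0) {}           // no vtable to initialize
    void Foo() { ++m_n; }         // direct call
};
class Virt
{
    int m_n;
public:
    Virt() : m_n(0) {}            // ctor must also set the vtable pointer
    virtual void Foo() { ++m_n; } // call goes through the vtable
};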
(see VirtualFunctions project for benchmark code)
When you’re writing a function, the above method is preferable about 99.9% of the time. You
should almost never return an object by reference. That’s why this tip is entitled return objects
via reference parameters, not return object by reference.
template <class T> T& ByRef() // RED ALERT!
{
T t; // get t from file, computation, whatever
return (t); // DON’T DO THIS!
}
When t goes out of scope, it’s destroyed and its memory is returned to the stack. Any
reference to the object is invalid after the function has returned! A good compiler will warn you
about this type of mistake. You can, however, pass your return value as a non-const reference
function parameter, and in some cases see improved performance.
template <class T> void ByReference(T& byRef)
{
T t; // get t from file, computation, whatever
byRef = t;
}
You’ll only see a performance improvement if the object you’re passing in as the desired return
value (byRef) is being reused, rather than simply declared immediately before calling the
function. In other words,
T t;
for( . . . )
ByReference(t);
may be faster, because t is being reused every time through the loop, and doesn’t have to be
reconstructed and redestroyed at each iteration, while
T t;
ByReference(t);
is exactly the same as returning the object by value, and even worse, doesn’t lend itself to
possible return value optimizations. There’s another good reason to avoid this suggestion – it
can make your code very hard to read.
[Figure: Return by Reference - relative performance for int, complex, string, and bitmap]
The results above show the times spent within ByValue and ByReference. These numbers are
slightly misleading, because they don’t show any construction time for T in the ByReference
case, and T certainly must be constructed somewhere prior to calling ByReference.
Nevertheless, the results show that performance gains may be significant in limited cases.
(see ReturnByValOrRef project for benchmark code)
4.5: Per-class Allocation
One feature of C++ that shows the true power and flexibility of the language is the ability to
overload new and delete at the class level. For some objects, this power can give you
incredible speed improvements. To see what kind of performance improvements I could get, I
derived my own string class from the standard string class and tried various allocation
schemes. I wasn’t too successful, mainly because standard new and delete are already pretty
dang fast, and because the string class wasn’t a good candidate. The objects that will see the
most improvement are the objects you use in a specific way that can truly benefit from a
custom allocation scheme.
I used an approach called the “memory pool” method. Using individual pools for memory
allocation is beneficial because it improves locality and you can optimize knowing that all
objects in the pool are the same size.
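A minimal fixed-size-block pool with a free list might look like this (a sketch; the actual
MemPool, StackPool, and HeapPool classes are in the accompanying project):
class FixedPool
{
    enum { BLOCK_SIZE = 64, NUM_BLOCKS = 1024 };
    char m_Pool[BLOCK_SIZE * NUM_BLOCKS]; // one big block
    void* m_pFree;                        // head of the free list
public:
    FixedPool() : m_pFree(NULL)
    {
        for (int i = 0; i < NUM_BLOCKS; ++i) // thread the free list
            Deallocate(m_Pool + (i * BLOCK_SIZE), BLOCK_SIZE);
    }
    void* Allocate(size_t nBytes)
    {
        assert(nBytes <= BLOCK_SIZE && m_pFree != NULL); // requires <cassert>
        void* p = m_pFree;
        m_pFree = *((void**) m_pFree); // pop the head block
        return (p);
    }
    void Deallocate(void* pFree, size_t)
    {
        *((void**) pFree) = m_pFree;   // push the block back
        m_pFree = pFree;
    }
};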
My new string class looked like this:
template <class Pool> class MString : public string
{
public :
MString() {}
virtual ~MString() {}
void* operator new(size_t nBytes)
{
return (GetPool().Allocate(nBytes));
}
void operator delete(void* pFree)
{
GetPool().Deallocate(pFree, sizeof(MString));
}
private:
static Pool& GetPool() { static Pool p; return (p); }
};
I tried some different types of pools. MemPool uses a heap block and a free list. StackPool
uses a chunk of the stack for the pool. HeapPool uses one heap block and no free list.
typedef MString<MemPool> FLStr; // heap block w/ free-list
typedef MString<StackPool<POOL_SIZE> > SStr; // stack-based
typedef MString<HeapPool> HStr; // single heap block
[Figure: Per-class Allocation - relative performance of string, Str, FLStr, SStr, and HStr]
The stack pool gave the best performance. It was faster than the default new and delete
implementations by 20%. It was sorely limited, though. After a certain number of allocations, it
would crash because there was no more room in the pool. StackPool isn’t complex enough to
say “OK everybody – out of the pool.” Profile your code. If you find that new or delete is a
bottleneck for certain specific objects, consider overloading new and delete. However, you
might find your money is better invested in a memory management library that will improve
your performance across the board.
(see OverloadNewDelete project for benchmark code)
// Pool-based STL allocator (the class opening was lost; reconstructed here as a sketch)
template <class T, class Pool> class MyAlloc : public allocator<T>
{
public:
// Our specializations
pointer allocate(size_type nCount, const void* /*pHint*/)
{
return ((pointer) GetPool().Allocate(nCount * sizeof(T)));
}
void deallocate(void _FARQ* pFree, size_type nBytes)
{
GetPool().Deallocate(pFree, nBytes);
}
private:
static Pool& GetPool() { static Pool p; return (p); }
};
The key functions are allocate and deallocate, which perform the bulk of the work. If you
compare these functions to the overloaded new and delete operators in the previous example,
you’ll see that they’re very similar. Here’s an example of using MyAlloc as the allocator for a
list container. I use the HeapPool object mentioned above as the pool for this allocator.
typedef MyAlloc<int, HeapPool> NHeapPoolAlloc;
list<int, NHeapPoolAlloc> IntList;
To compare efficiency, I evaluated the speed of inserting items into a list using the default
allocator and NHeapPoolAlloc.
[Figure: STL Allocators - relative performance of the default allocator vs. NHeapPoolAlloc for int, complex, string, and bitmap]
The results are all over the place. Strangely enough, there was significant improvement for
ints. On the other hand, string performance plummeted. Just goes to show how important it is
to evaluate your “optimization” once it’s in place.
(see STLAllocators project for benchmark code)
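The discussion below refers to a stateless allocator class along these lines (the original
listing is not shown; this is a reconstruction):
template <class T> class Allocator // empty: no data members
{
public:
    T* allocate(size_t nCount)
    { return ((T*) ::operator new(nCount * sizeof(T))); }
    void deallocate(T* pFree)
    { ::operator delete(pFree); }
};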
If we declare this object, it requires one byte of memory (sizeof(Allocator) == 1), because the
C++ standard requires that we be able to address its location in memory. If we declare this
object within another class, the compiler byte-alignment settings come into play. The class
below requires 8 bytes of memory if we’re 4-byte aligned.
template <class T, class Alloc = Allocator<T> >
class ListWithAllocMember // (sizeof(ListWithAllocMember) == 8)
{
private :
Alloc m_heap;
Node* m_head;
};
This storage requirement has serious ramifications. This list object requires twice the size it
really needs. Fortunately, the C++ standard provides a workaround. It says that “a base class
subobject of an empty class type may have zero size.” In other words, if we derive our class
from the empty class, the empty class overhead disappears. This is the empty member
optimization. The class below is 4 bytes.
template <class T, class Alloc = Allocator<T> >
class ListWithAllocBase : private Alloc // (sizeof(ListWithAllocBase) == 4)
{
private :
Node* m_head;
};
Deriving from the empty class is not really an ideal solution. There are some cases when it’s
no solution at all. Here’s a better one. We can declare an internal data member derived from
the empty class.
template <class T, class Alloc = Allocator<T> >
class ListWithEmptyMemberAlloc // (sizeof(ListWithEmptyMemberAlloc) == 4)
{
private :
struct P : public Alloc
{
Node* m_head;
};
P m_heap;
};
Now there’s an additional level of indirection within the class itself (i.e. we have to use
m_heap.allocate() notation instead of allocate()), but our list is still only 4 bytes large and we
have all the advantages of the allocator object. A Watcom engineer reported that STL
benchmarks ran 30% faster after their compiler team implemented the empty-base optimization.
(see EmptyMember project for benchmark code)
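The next technique is template metaprogramming. First, the conventional run-time factorial,
reconstructed as a sketch (named FactorialRT here so it can coexist with the class template
below):
int FactorialRT(int n)
{
    return ((n <= 1) ? 1 : n * FactorialRT(n - 1));
}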
Now look at a template-based version that does the same thing where n is passed as the
template parameter.
template <int N> class Factorial
{
public :
// Recursive definition
enum { GetValue = N * Factorial<N - 1>::GetValue };
};
// Specialization for base case
template <> class Factorial<1>
{
public :
enum { GetValue = 1 };
};
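Using it is a single compile-time evaluation:
int f = Factorial<5>::GetValue; // folded by the compiler to the constant 120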
For this particular case, the template version is 130 times as fast as the non-template version.
The Microsoft compiler optimizes away all of the template recursion, implementing the call as a
single move instruction. Now that’s the kind of optimization I like to see.
[Figure: Factorial - relative performance of recursive, non-recursive, and template versions]
Nice results, but not a terribly useful function. The template parameter must be known at
compile time, so it’s not very flexible. For some things, though, that’s not really a limitation.
Suppose you wanted to have an inline sine table. You could use template metaprogramming to
create a class that computed the sine of a number using a series expansion. I did just that. The
code is complex, so I leave it as an exercise for the reader. (Actually, it’s in the accompanying
TemplateMetaprogramming source file). I compared my template-based sine to the C runtime
version, a table-based function, and a non-template function that used series expansion.
[Figure: Sine - relative performance of CRT, table-based, series-expansion, and template versions]
We can replace this function with a template-based version that unrolls the loops, completely
eliminating the loop overhead:
template <int I=0, int J=0, int K=0, int Cnt=0> class MatMult
{
private :
enum
{
NextCnt = Cnt + 1, // advance the flattened loop counter (may not reuse the name Cnt)
Nextk = NextCnt % 4,
Nextj = (NextCnt / 4) % 4,
Nexti = (NextCnt / 16) % 4
};
public :
static inline void GetValue
(D3DMATRIX& ret, const D3DMATRIX& a, const D3DMATRIX& b)
{
ret(I, J) += a(K, J) * b(I, K);
MatMult<Nexti, Nextj, Nextk, NextCnt>::GetValue(ret, a, b);
}
};
// specialization to terminate the loop
template <> class MatMult<0, 0, 0, 64>
{
public :
static inline void GetValue
(D3DMATRIX& ret, const D3DMATRIX& a, const D3DMATRIX& b) { }
};
The template could be more flexible by providing a dimension parameter, but I left that out for
simplicity’s sake. The template function calculates the next values of i, j, and k and recursively
calls itself with a new count, which goes from 0 to 64. To terminate the loop, there’s a template
specialization that just returns. With a good compiler, the code generated should be as efficient
as writing MatrixMult like this:
D3DMATRIX MatrixMultUnrolled(const D3DMATRIX& a, const D3DMATRIX& b)
{
D3DMATRIX ret = ZeroMatrix();
ret(0,0) = a(0,0)*b(0,0) + a(1,0)*b(0,1) + a(2,0)*b(0,2) + a(3,0)*b(0,3);
ret(0,1) = a(0,1)*b(0,0) + a(1,1)*b(0,1) + a(2,1)*b(0,2) + a(3,1)*b(0,3);
. . .
ret(3,3) = a(0,3)*b(3,0) + a(1,3)*b(3,1) + a(2,3)*b(3,2) + a(3,3)*b(3,3);
return ret;
}
[Figure: Matrix Multiplication - relative performance of MatMult, template, and unrolled versions]
Unfortunately, Microsoft’s compiler wasn’t completely up to the task, although we did see a
minor improvement over the existing version. Currently, the best performance gain comes from
rewriting the entire function with all loops unrolled.
Template metaprogramming has its advantages. It can be extremely effective for mathematical
and scientific libraries, and there are definite possibilities for streamlining 3D math. One public
scientific computing library, called Blitz++, is based on the template-metaprogramming
concept. The performance of the library is on par with the best Fortran libraries at maximum
optimization. In other cases, performance increases of 3-20 times that of a commercial C++
linear algebra library have been achieved. However, compiler support for templates is still
immature. Microsoft in particular was slow to implement template functionality, and even VC 6.0
doesn’t support the full C++ standard for templates. As compilers advance, template
metaprogramming may take a larger place in optimization technology.
(see TemplateMeta project for benchmark code)
4.9: Copy-On-Write
One general method of increasing efficiency is called lazy evaluation. With lazy evaluation,
nothing is precomputed; you put off all your processing until the result is really needed. A
complementary method is called “copy-on-write.” With copy-on-write, two or more objects can
share the same data until the moment when one of those objects is changed, at which point
the data is physically copied and changed in one of the objects. C++ lends itself nicely to copy-
on-write, since it can be added without affecting a class interface. Microsoft was able to
change the internals of its own CString class to add copy-on-write functionality. Most
programmers never noticed the difference because the class interface was unchanged.
Copy-on-write requires two things: reference counting and smart pointers. A reference count
indicates the number of objects referring to the same piece of data. A smart pointer points to
an object with a reference count. When the reference count goes to zero, the smart pointer is
“smart” enough to automatically delete the object. A simple RefCount and SmartPtr class
follow. Note that DownRefCount returns true when the reference count goes to zero and it’s
safe to delete the object.
class RefCount // a mixin class
{
private:
int m_nRef; // reference count
public:
RefCount() { m_nRef = 0; }
int GetRefCount() const { return (m_nRef); }
void UpRefCount() { ++m_nRef; }
bool DownRefCount()
{
if (m_nRef > 0 && --m_nRef == 0)
return (true); // safe to remove object
return (false);
}
};
A SmartPtr acts just like a regular pointer except in two cases: when it’s copied, the reference
count is incremented, and when it’s destroyed the reference count is decremented. If the
reference count goes to zero, the object pointed to is destroyed as well.
template <class T> class SmartPtr // T must be derived from RefCount
{
private:
T* m_pCountedObj;
public:
SmartPtr() { m_pCountedObj = NULL; }
SmartPtr(const SmartPtr<T>& spCopy)
{
m_pCountedObj = NULL;
SmartCopy(spCopy);
}
SmartPtr(T* pCopy) { m_pCountedObj = NULL; SmartCopy(pCopy); }
~SmartPtr() { Destroy(); }
SmartPtr<T>& operator = (const SmartPtr<T>& spCopy)
{
if (&spCopy == this)
return (*this);
return (SmartCopy(spCopy));
}
T& operator * () const { return (*m_pCountedObj); }
T* operator -> () const { return (m_pCountedObj); }
operator T* () const { return (m_pCountedObj); }
SmartPtr<T>& SmartCopy(T* pCopy)
{
Destroy();
m_pCountedObj = pCopy;
if (pCopy != NULL)
m_pCountedObj->UpRefCount();
return (*this);
}
SmartPtr<T>& SmartCopy(const SmartPtr<T>& spCopy)
{ return (SmartCopy(spCopy.m_pCountedObj)); }
private:
void Destroy()
{
if (m_pCountedObj != NULL && m_pCountedObj->DownRefCount())
{
delete m_pCountedObj;
m_pCountedObj = NULL;
}
}
};
We create a new reference-counted bitmap class by inheriting from Bitmap and using the
RefCount mixin class.
class CountedBitmap : public Bitmap, public RefCount { };
Now we can create a “smart” bitmap class. It contains a smart pointer to a reference-counted
bitmap. Whereas copying a regular bitmap requires deleting and reallocating memory,
copying a SmartBitmap is as simple and efficient as copying a pointer. We only need to do
“expensive” operations when the bitmap is actually changed (e.g. Create’d or Destroy’ed).
class SmartBitmap
{
private :
SmartPtr<CountedBitmap> m_pBitmap;
public :
SmartBitmap() { m_pBitmap = new CountedBitmap; }
SmartBitmap(int nWidth, int nHeight, Bitmap::eDepth nDepth)
{ m_pBitmap = new CountedBitmap(nWidth, nHeight, nDepth); }
virtual ~SmartBitmap() {}
virtual bool Create(int nWidth, int nHeight, Bitmap::eDepth nDepth)
{
// if creating a multiply-referred object, time to copy
if (m_pBitmap->GetRefCount() > 1)
m_pBitmap = new CountedBitmap;
return (m_pBitmap->Create(nWidth, nHeight, nDepth));
}
virtual void Destroy()
{
// if nuking a multiply-referred object, time to copy
if (m_pBitmap->GetRefCount() > 1)
m_pBitmap = new CountedBitmap;
m_pBitmap->Destroy();
}
virtual int GetWidth() const { return (m_pBitmap->GetWidth()); }
virtual int GetHeight() const { return (m_pBitmap->GetHeight()); }
// etc. . . .
};
The disadvantage is that we’ve added another level of indirection through the smart pointer.
We’ve also required that the object constructor allocate the reference-counted object so that
the smart pointer can properly delete it. Another important note: the destructor doesn’t do
anything! That’s because the smart pointer will automatically delete the object when nobody
else is referring to it. Cool.
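Usage reads like ordinary value semantics (a sketch; e8Bit is a hypothetical depth
enumerator):
SmartBitmap bmp1(640, 480, Bitmap::e8Bit);
SmartBitmap bmp2(bmp1);               // cheap: copies a pointer, bumps the ref count
bmp2.Create(320, 240, Bitmap::e8Bit); // only now does bmp2 get its own data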
To evaluate performance, I compared a large set of typical bitmap operations, including
copying. SmartBitmap was six times as fast as the original “dumb” bitmap. Your results will
vary depending on how much objects are copied (slow for dumb objects, fast for smart objects)
and how much copied objects are changed (fast for dumb objects, slow for smart objects).
[Figure: Copy On Write - relative performance of dumb vs. smart bitmap]
The nice thing about smart objects is that they can be returned by value with no performance
penalty. In the case above, it’s now feasible to use SmartBitmap as a return value.
(see SmartPointer project for benchmark code)
5: Compiler Optimization
A good compiler can have a huge effect on code performance. Most PC compilers are good,
but not great, at optimization. Be aware that sometimes the compiler won’t perform
optimizations even though it can. The compiler assigns a higher priority to producing consistent and
correct code than optimizing performance. Be thankful for small favors.
Be aware that some of these options can cause your program to fail. See the section below on
unsafe optimizations. There are also some optimizations that you might not choose to use for
your specific game. For instance, if you’re using RTTI or exception handling, don’t turn those
options off.
Optimizing for space can actually be faster than optimizing for speed because programs
optimized for speed are almost always larger, and therefore more likely to cause additional
paging than programs optimized for space. In fact, all Microsoft device drivers and Windows
NT itself are built to minimize space. Try both ways and see which is faster for your game.
The FrameNV constructor has two fewer instructions, namely the instructions that initialize the
virtual function table. The optimized constructor is 30% faster. Microsoft’s own ATL class
library uses this compiler option extensively.
[Figure: Constructor performance with and without vtable initialization - Frame ctor vs. FrameNV ctor]
// Any Std C++ compiler assumes function cannot throw any exceptions
int NoThrowStdC(int i) throw() { return (i + 1); }
[Figure: Disable Exception Throwing - relative performance of MayThrow, NoThrow MS, and NoThrow StdC]
[Figure: Calling Convention - relative performance of CDecl, StdCall, and FastCall]
The following table was calculated on a PII-400 running NT (a 32-bit OS). It shows
performance relative to a standard 32-bit assignment operation (value 1.00). Larger numbers
indicate better performance. For integer and floating-point operations, the fastest relative time
for each operation is highlighted.
Relative Performance of Common Operations (PII-400)
8-bit (char) 16-bit (short) 32-bit (long) 64-bit (__int64) floating-point
Operation signed unsigned signed unsigned signed unsigned signed unsigned float double ldouble
a=b 1.00 1.00 0.64 1.00 1.00 0.87 0.58 0.58 0.87 0.64 0.58
a+b 1.17 0.88 1.17 1.17 1.05 1.00 0.70 0.78 0.88 0.47 0.44
a-b 1.17 0.87 1.17 0.88 1.17 0.87 0.70 0.70 0.91 0.54 0.47
-a 1.00 1.16 1.00 1.10 1.17 1.00 0.87 1.00 1.17 0.51 0.51
++a 0.78 0.88 0.88 0.59 0.82 0.88 0.64 0.63 0.77 0.34 0.30
--a 0.82 1.00 0.58 0.54 0.87 1.00 0.64 0.58 0.82 0.34 0.31
a*b 0.88 1.00 1.04 1.00 1.00 1.00 0.38 0.37 1.00 0.29 0.47
a/b 0.18 0.18 0.18 0.18 0.18 0.18 0.07 0.08 0.22 0.16 0.14
a%b 0.18 0.18 0.18 0.18 0.18 0.18 0.07 0.08 n/a n/a n/a
a == b 1.00 0.78 1.00 0.78 1.00 0.87 0.64 0.64 0.58 0.36 0.26
a != b 1.00 0.87 1.00 0.87 0.88 1.00 0.58 0.50 0.58 0.33 0.30
a>b 1.00 0.78 1.00 0.78 0.87 0.87 0.58 0.54 0.58 0.25 0.34
a<b 1.00 0.88 0.54 0.88 0.88 0.88 0.47 0.50 0.64 0.25 0.25
a >= b 1.00 0.87 1.00 0.87 1.00 0.78 0.54 0.58 0.54 0.37 0.25
a <= b 1.00 0.78 1.00 0.78 1.00 0.87 0.44 0.47 0.54 0.35 0.30
a && b 0.70 0.54 0.70 0.54 0.70 0.70 0.50 0.47 n/a n/a n/a
a || b 0.88 0.87 0.70 0.70 0.78 0.78 0.70 0.70 n/a n/a n/a
!a 0.87 0.88 0.87 0.99 0.78 0.78 0.58 0.58 0.50 0.35 0.35
a >> b 1.00 0.87 1.00 0.88 1.17 1.17 0.54 0.54 n/a n/a n/a
a << b 1.17 0.88 1.00 1.00 1.15 1.17 0.50 0.50 n/a n/a n/a
a&b 0.94 0.87 1.00 1.17 1.00 1.00 0.78 0.70 n/a n/a n/a
a|b 1.17 0.87 1.17 0.88 0.87 1.00 0.70 0.78 n/a n/a n/a
a^b 1.00 1.00 1.00 1.00 0.88 0.87 0.70 0.70 n/a n/a n/a
~a 1.00 1.16 1.00 1.16 1.40 1.00 0.88 0.88 n/a n/a n/a
• There’s very little difference between 8-, 16- and 32-bit operations. In general, signed 32-bit
values give the fastest times. That’s why it makes sense to use int or long as your standard
integer type.
• Operations on unsigned types tend to be slower than the same operation on signed types.
• The slowest operations are division and modulus. In fact, floating-point division is as fast or
faster than integer division.
• Float operations are typically 1.5 to 2 times faster than double operations. If you can afford
the loss of precision, float is the best floating-point type.
Look at the same table for a lower-end machine, a P-133 running Windows 95.
Relative Performance of Common Operations (P-133)
8-bit (char) 16-bit (short) 32-bit (long) 64-bit (__int64) floating-point
Operation signed unsigned signed unsigned signed unsigned signed unsigned float double ldouble
a=b 1.00 0.31 0.63 0.77 1.00 1.00 0.91 0.91 1.00 0.55 0.83
a+b 0.71 0.91 0.71 0.83 1.00 1.00 0.63 0.24 0.71 0.43 0.33
a-b 0.71 0.90 0.71 0.83 1.00 0.71 0.83 0.77 0.71 0.43 0.44
-a 0.91 1.00 0.91 0.91 0.71 1.11 0.77 0.77 0.63 0.50 0.50
++a 0.62 0.90 0.67 0.67 0.91 0.91 0.83 0.83 0.67 0.45 0.45
--a 0.91 0.91 0.67 0.56 0.91 0.91 0.77 0.83 0.67 0.45 0.45
a*b 0.45 0.50 0.45 0.50 0.59 0.59 0.34 0.34 0.71 0.44 0.44
a/b 0.16 0.17 0.16 0.17 0.13 0.19 0.09 0.09 0.23 0.17 0.19
a%b 0.16 0.17 0.16 0.17 0.17 0.20 0.09 0.09 n/a n/a n/a
a == b 0.67 0.55 0.63 0.71 0.91 0.91 0.66 0.67 0.36 0.30 0.25
a != b 0.31 0.77 0.62 0.71 0.91 0.91 0.62 0.63 0.40 0.29 0.29
a>b 0.67 0.77 0.62 0.71 0.91 0.83 0.52 0.53 0.40 0.29 0.30
a<b 0.67 0.77 0.62 0.55 0.91 0.83 0.41 0.36 0.40 0.25 0.30
a >= b 0.67 0.77 0.50 0.71 0.91 0.83 0.50 0.53 0.40 0.29 0.25
a <= b 0.67 0.77 0.62 0.71 0.91 0.83 0.52 0.53 0.42 0.29 0.30
a && b 0.55 0.62 0.50 0.56 0.62 0.50 0.48 0.47 n/a n/a n/a
a || b 0.77 0.83 0.67 0.71 0.62 0.83 0.50 0.38 n/a n/a n/a
!a 0.77 0.63 0.71 0.77 0.91 0.91 0.62 0.62 0.42 0.33 0.33
a >> b 0.59 0.71 0.59 0.71 0.83 0.83 0.67 0.71 n/a n/a n/a
a << b 0.59 0.71 0.54 0.56 0.83 0.83 0.67 0.71 n/a n/a n/a
a&b 0.71 0.91 0.55 0.83 1.00 1.00 0.83 0.83 n/a n/a n/a
a|b 0.71 0.91 0.71 0.83 1.00 1.00 0.62 0.45 n/a n/a n/a
a^b 0.71 0.91 0.71 0.83 1.00 0.71 0.83 0.83 n/a n/a n/a
~a 0.91 1.00 0.91 0.91 0.71 1.11 0.83 0.83 n/a n/a n/a
Most conversions are reasonable. But beware the conversion of floating-point to integer! It’s
five to ten times slower than the base case. Interestingly enough, it’s also significantly slower
on the Pentium-II compared to the Pentium.
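The offending conversion looks innocuous (on x86, C’s truncation semantics force the
compiler to change and restore the FPU rounding mode around the store):
double d = 1234.5678;
int n = (int) d; // five to ten times slower than an integer assignment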
(see CommonOps project for benchmark code)
C.2: Books
Effective C++ and More Effective C++, Scott Meyers (www.aristeia.com) Superb tips from a
C++ guru, from basic techniques to copy-on-write and multiple dispatch. The chapter on
efficiency in More Effective C++ is a gem.
Code Complete, Steve McConnell (www.construx.com) General purpose code-tuning
strategies, with examples and results.
C++ Gems, Stan Lippman (people.we.mediaone.net/stanlipp) Excellent template essays.
Writing Efficient Programs, Jon Bentley (www.engr.orst.edu) A general set of rules and a
standard methodology for optimizing any program. Very concise. Includes examples.
Optimizing C++, Steve Heller (www.koyote.com/users/stheller) Algorithmic optimizations.
Graphics Programming Black Book Special Edition, Michael Abrash (www.amazon.com)
The bible on x86 assembly optimization, especially for 3D graphics.
Graphics Gems I – V, Glassner, et al. (www.acm.org/tog/GraphicsGems) Efficient algorithms
designed by graphics programmers and researchers.
Inner Loops, Rick Booth (ourworld.compuserve.com/homepages/rbooth) More x86 assembly
optimization.
High Performance Computing, 2nd Edition, Kevin Dowd (www.oreilly.com/catalog/hpc2) A
high-level look at optimization.
C.3: Websites
SGI STL extensions (www.sgi.com/Technology/STL)
SGI Singly-linked lists A non-standard list class. The SGI implementation has half the
overhead and is considerably faster than the standard list class.
SGI Rope strings A highly efficient implementation of long strings. Useful for e-mail
text, word processing, etc.
SGI Hashed containers Hashed containers are not included in the STL. These SGI
hashed containers can be just what the doctor ordered.
Todd Veldhuizen articles (extreme.indiana.edu/~tveldhui)
Todd has been on the leading edge of the use of C++ templates for the sake of
efficiency, especially Template Metaprogramming. He’s also the lead programmer for
Blitz++, a computing library that makes use of many of his techniques.
Nathan Myers articles (www.cantrip.org)
Nathan is a prolific writer on many C++ topics. Be sure to see his article on the Empty
Member optimization.
Guru of the Week (www.cntc.com/resources)
Problems and solutions to many common C++ issues, as presented by one of the lead
programmers of PeerDirect, Inc, including Avoiding Temporaries, Inline, Reference
Counting and Copy On Write.
High Performance Game Programming in C++ (www.ccnet.com/~paulp/HPGP/HPGP.html)
Paul Pedriana’s talk from the 1998 CGDC. A great discussion of C vs. C++ and C++
performance issues.
C++: Efficiency, Alan Clarke (www.ses.com/~clarke/Efficiency.html) A collection of C++ tips.
Object-Oriented System Development (gee.cs.oswego.edu/dl/oosdw3/ch25.html) The
Performance Optimization chapter from Dennis de Champeaux’s book.
High Performance C++ (oscinfo.osc.edu/software/KAI/doc/UserGuide/chapter_3.html) A
discussion of C++ and compiler optimizations from KAI, the leading vendor of optimizing C++
compilers for high-end systems (Cray, SPARC, SGI, etc.)
NULLSTONE Compiler Optimization Categories (www.nullstone.com/htmls/category.htm) A
good list and description of C/C++ compiler optimization techniques.
Optimization of Computer Programs in C (www.ontek.com/mikey/Optimization.html) A
dated but useful paper by Michael Lee on C-specific optimizations.
Code Optimization – The Why’s and How’s (seka.nacs.net/~heller/optimize) A collection of
pages by Jettero Heller discussing code optimization.
Performance Engineering: A Practical Approach to Performance Improvement
(www.rational.com/sitewide/support/whitepapers/dynamic.jtmpl?doc_key=307) A discussion of
profiling and bottlenecks by Rational Software.
Maui High Performance Computing Center: Performance Tuning
(www.mhpcc.edu/training/workshop/html/performance) Performance tuning on IBM UNIX
systems. Some details aren’t relevant, but most of the examples are platform independent.
C.4: Author
Email the author at: Pete.Isensee@WON.net or PKIsensee@msn.com
Author’s homepage: www.tantalon.com
Appendix D: C++ Optimization Summary
Acknowledgments
Special thanks to Brian Fiete, Melanie McClaire, Brian Ruud, the WON Viper and Titan teams,
the HyperBole X-Files engineering team, my favorite gurus Steve McConnell and Scott
Meyers, and my favorite girls Kristi and Ali.