
RETROSPECTIVE:

gprof: a Call Graph Execution Profiler


Susan L. Graham
University of California, Berkeley
Computer Science Division - EECS
Berkeley, CA 94720 USA
graham@cs.berkeley.edu

Peter B. Kessler
Sun Microsystems, Inc.
4150 Network Circle
Santa Clara, CA 95054 USA
Peter.Kessler@ACM.ORG

Marshall K. McKusick
1614 Oxford Street
Berkeley, CA 94709 USA
McKusick@McKusick.COM

ABSTRACT

We extended the UNIX system's profiler by gathering arcs in the call graph of a program. Here it is, 20 years later, and this profiler is still in daily use. Why is that? It's not because there aren't well-known areas for improvement.

20 Years of the ACM/SIGPLAN Conference on Programming Languages Design and Implementation (1979-1999): A Selection, 2003. Copyright 2003 ACM 1-58113-623-4 $5.00

RETROSPECTIVE

In the early 1980s, a group of us at the University of California at Berkeley were involved in a project to build compiler construction tools [1]. We were, more or less simultaneously, rewriting pieces of the UNIX operating system [2]. For many of us, these were the largest, and most complex, programs on which we had ever worked. Of course we were interested in squeezing the last bits of performance out of these programs.

The UNIX system comes with a profiling tool, prof [3], which we had found adequate up until then. The profiler consists of three parts: a kernel module that maintains a histogram of the program counter as it is observed at every clock tick; a runtime routine, a call to which is inserted by the compilers at the head of every function compiled with a profiling option; and a post-processing program that aggregates and presents the data. The program counter histogram provides statistical sampling of where time is spent during execution. The runtime routine gathers precise call counts. These two sources of information are combined by post-processing to produce a table of each function listing the number of times it was called, the time spent in it, and the average time per call.

As our programs became more complex, and as we became better at structuring them into shared, reusable pieces, we noticed that the profiles were becoming more diffuse and less useful. We observed two sources of confusion: as we partitioned operations across several functions to make them more general, the time for an operation spread across the several functions; and as the functions became more useful, they were used from many places, so it wasn't always clear why a function was being called as many times as it was. The difficulty we were having was that we wanted to understand the abstractions used in our system, but the function boundaries did not correspond to abstraction boundaries.

Not being afraid to hack on the kernel and the runtime libraries, we set about building a better profiler [4]. Our ground rules were to change only what we needed and to make sure we preserved the efficiency of the tool.

In fact, except for fixing a few bugs, the program counter histogram part of the profiler worked fine. Incrementing the appropriate bucket of the program counter histogram had an almost negligible overhead, which allowed us to profile production systems. The space for the histogram could be controlled by getting a finer or coarser histogram. (Another thing that was happening around this time was that we were moving from 16-bit address spaces to 32-bit address spaces and felt quite expansive in the amount of memory we were willing to use.) One of us remembers an epiphany of being able to use a histogram array that was four times the size of the text segment of the program, getting a full 32-bit count for each possible program counter value!

But it was in the runtime routine called from the top of each profiled function that we made the most difference. The standard routine uses a per-function data structure to count the number of times each function is called. In its place, we wrote a routine that uses the per-function data structure, and the return address of the call, to record the callers of the function and the number of times each had called this function. That is, we recorded incoming call graph arcs with counts. (We were surprised at how easily, and how dramatically, we could change the profiler with a single late-bound function call.) We wrote a new post-processing program, i.e., gprof, to combine the call graph arcs with the program counter histogram data to show not only the time spent in each function but also the time spent in other functions called from each function.

Our techniques are not without their pitfalls. For example, we have a statistical sample of the time spent in a function from the program counter histogram, and the count of the number of calls to that function. From those we derive an average time per call that need not reflect reality, e.g., if some calls take longer than others. Further, when attributing time spent in called functions to their callers, we have only single arcs in the call graph, and so distribute the average time to callers in proportion to how many times they called the function.

Another difficulty we had was when we encountered cycles in the call graph: e.g., mutually recursive functions. We could not accumulate time from called functions into a cycle and then propagate that time towards the roots of the graph, because we would go around the cycle endlessly. First we had to identify the cycles and treat them specially. We had good graduate computer science educations, and knew of Tarjan's strongly-connected-components algorithm [5]. That was fun to implement.

Modern profilers solve both these problems by periodically gathering not just isolated program counter samples and isolated call graph arcs, but complete call stacks [6]. The additional overhead of gathering the call stack can be hidden by backing off the frequency with which the call stacks are sampled. Gathering complete call stacks depends on being able to find the return addresses all the way up the stack, a convention imposed in order to debug programs.
Another difficulty was presenting the data. Fundamentally we had a graph with a lot of data on the arcs and summary information at the nodes. We were limited by the output devices of the time to character-based formatting. We ended up with a rather dense display of the information at each node, and a view of the arcs into and out of that node. All we can say for our layout is that after a while we got used to it. We did add notations to help us navigate the output in the visual editors becoming popular at that time.
After using the profiles for a while we discovered the need to
filter the data, i.e., to show only hot functions, or only parts of the
graph containing certain methods. We also added a facility to
crawl over the executable image of the program and add arcs to
the call graph that were apparent in the code even if they hadn't
been traversed during a particular execution. We would add
these arcs so that we could better understand the shape of the call
graph. We also added the ability to sum the data over several
profiled runs, to accumulate enough time in short-running
methods to get an idea of their performance.
We had great success applying our new profiler to the program
for which we wrote it. Then we set about profiling lots of other
programs. Of course, among the programs on which we used the
new profiler was the profiler itself.
The next challenge was to adapt the profiler to profile the
Berkeley Unix kernel on which we were working. That required
adding a programmer's interface to control the profiler, and a tool
to communicate through that interface. Unlike user programs that
could be run to completion, dump their profiling data to a file,
and exit, we had to be able to profile events of interest in the
kernel without taking the kernel down. (Remember, this was a
time-sharing system with lots of users.) The programmer's
interface allowed us to turn the profiler on and off, extract the
profiling data, and reset the data.
Because of the interactions of the kernel's major subsystems,
there were several large cycles in the profiles. The effect of these
cycles was that it was impossible to get useful timing results for
modules like the networking stack. When we looked at the
profiles there were just a few arcs -- with low traversal counts -- that closed the cycles. We added an option to specify a set of arcs to be removed from the analysis. Using this option was a matter of trial and error (or intimate knowledge of the profiled program), but effective when used properly. To aid users unable or unwilling to find an arc set for themselves, we added a heuristic to help choose arcs to remove. The underlying problem is NP-complete, so we added a bound on the number of arcs the tool would attempt to remove. In practice, we found that the information lost by omitting these arcs was far less than the information gained by separating the abstractions formerly contained in the cycle.
After going out with the Berkeley Software Distributions,
gprof has been ported to all the major variants of Unix. Its widespread distribution was assured when it was adopted (and
extended) by the GNU project [7].
What is amazing to us is that gprof has survived as long as it
has, in spite of its well-known flaws. While we are happy to
have contributed such a useful tool to the community, we are
happy to see that gprof is gradually being replaced by more
accurate and more usable tools.

REFERENCES
[1] S. L. Graham, R. R. Henry, and R. A. Schulman, An
Experiment in Table Driven Code Generation, SIGPLAN '82
Symposium on Compiler Construction, June, 1982.
[2] M. K. McKusick, Twenty Years of Berkeley Unix: From
AT&T-Owned to Freely Redistributable, in Open Sources:
Voices from the Open Source Revolution, O'Reilly, January,
1999.
http://www.oreilly.com/catalog/opensources/book/
kirkmck.html
[3] prof, Unix Programmer's Manual, Section 1, Bell
Laboratories, Murray Hill, NJ, January 1979.
[4] S. L. Graham, P. B. Kessler, and M. K. McKusick, An
execution profiler for modular programs, Software - Practice
& Experience, 13(8), pp. 671 - 685, August 1983.
[5] R. E. Tarjan, Depth-first search and linear graph algorithms,
SIAM Journal on Computing, Volume 1, Number 2, pp.
146-160, 1972.
[6] Sun Microsystems, Inc. "Program Performance Analysis
Tools", in Forte Developer 7 Manual, Part number
816-2458-10, May 2002 Revision A.
http://docs.sun.com/source/816-2458/index.html.
[7] GNU gprof,
http://www.gnu.org/manual/gprof-2.9.1/gprof.html, 1998.
