Call Graph Prefetching for Database Applications
MURALI ANNAVARAM
Intel Corporation
and
JIGNESH M. PATEL and EDWARD S. DAVIDSON
The University of Michigan, Ann Arbor
With the continuing technological trend of ever cheaper and larger memory, most data sets in
database servers will soon be able to reside in main memory. In this configuration, the perfor-
mance bottleneck is likely to be the gap between the processing speed of the CPU and the memory
access latency. Previous work has shown that database applications have large instruction and
data footprints and hence do not use processor caches effectively. In this paper, we propose Call
Graph Prefetching (CGP), an instruction prefetching technique that analyzes the call graph of
a database system and prefetches instructions from the function that is deemed likely to be called
next. CGP capitalizes on the highly predictable function call sequences that are typical of database
systems. CGP can be implemented either in software or in hardware. The software-based CGP
(CGP-S) uses profile information to build a call graph, and uses the predictable call sequences in
the call graph to determine which function to prefetch next. The hardware-based CGP (CGP-H) uses
a hardware table, called the Call Graph History Cache (CGHC), to dynamically store sequences
of functions invoked during program execution, and uses that stored history when choosing which
functions to prefetch.
We evaluate the performance of CGP on sets of Wisconsin and TPC-H queries, as well as on
CPU-2000 benchmarks. For most CPU-2000 applications, I-cache (instruction cache)
misses are very few even without any prefetching, obviating the need for CGP. On the other hand,
the database workloads do suffer a significant number of I-cache misses; CGP-S improves their
performance by 23% and CGP-H by 26% over a baseline system that has already been highly tuned
for efficient I-cache usage by using the OM tool. CGP, with or without OM, reduces the I-cache miss
stall time by about 50% relative to O5+OM, taking us about half way from an already highly tuned
baseline system toward perfect I-cache performance.
This work was done while M. Annavaram was at the University of Michigan. This material is based
upon work supported by the National Science Foundation under Grant IIS-0093059.
Authors’ addresses: Murali Annavaram, Intel Corporation, 220 Mission College Blvd., Santa Clara,
CA 95052-8119; email: murali.m.annavaram@intel.com; Jignesh M. Patel, University of Michigan,
2239 EECS, Ann Arbor, MI 48109-2122; email: jignesh@eecs.umich.edu; Edward S. Davidson, 1100
Chestnut Rd., Ann Arbor, MI 48104; email: davidson@eecs.umich.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is
granted without fee provided that copies are not made or distributed for profit or direct commercial
advantage and that copies show this notice on the first page or initial screen of a display along
with the full citation. Copyrights for components of this work owned by others than ACM must be
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers,
to redistribute to lists, or to use any component of this work in other works requires prior specific
permission and/or a fee. Permissions may be requested from Publications Dept., ACM Inc., 1515
Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or permissions@acm.org.
© 2003 ACM 0734-2071/03/1100-0412 $5.00
ACM Transactions on Computer Systems, Vol. 21, No. 4, November 2003, Pages 412–444.
Categories and Subject Descriptors: C.4 [Performance of Systems]: Design studies; C.1.0
[Processor Architectures]: General
General Terms: Performance, Design, Experimentation
Additional Key Words and Phrases: Instruction cache prefetching, call graph, database
For each bar, the upper component, labeled CallTarget-Miss, shows the
I-cache misses that occur immediately following a function call—upon accessing
the target address of the function call. The bottom component, IntraFunc-Miss,
shows all the remaining I-cache misses (incurred while executing instructions
within a function boundary).
Four key observations may be made from the graph:
(1) Although OM does reduce the I-cache misses, there is still room for signifi-
cant improvement.
(2) Both OM optimizations and NL-4 prefetching significantly reduce the
IntraFunc-Misses; however, they are almost totally ineffective in reducing
the CallTarget-Misses.
(3) CGP is very effective in reducing the CallTarget-Misses, as intended; more-
over, the CGP-H optimized binary also achieves a considerable further re-
duction in IntraFunc-Misses.
(4) Finally, comparing CGP without OM and CGP with OM, it is apparent that
CGP makes OM optimizations almost unnecessary.
Previous research [Franklin et al. 1994] has shown that DBMSs have a large
number of small functions due to the modular software architecture used in
their design. Furthermore, after applying existing compiler techniques and sim-
ple prefetch schemes (cf. O5+OM+NL-4 in Figure 1), I-cache misses at function
call boundaries constitute a significant portion of all I-cache misses. In order
to recover the performance lost due to call target misses without sacrificing
the advantages of modular design, we have developed Call Graph Prefetching
(CGP), an instruction prefetching technique that analyzes the call graph of an
application and prefetches instructions from a function that is likely to be called
next. Although CGP is a generic instruction prefetching scheme, it is particu-
larly effective for large software systems such as DBMS because of the layered
software design approach used by these systems. Section 2 argues intuitively
why CGP can be effective in prefetching for database applications.
Section 3 then describes an algorithm (CGP-S) that implements CGP in
software. This algorithm uses a profile run to build a call graph of a database
system, and exploits the predictable call sequences in the call graph to deter-
mine which function to prefetch next.
Section 4 reviews the hardware implementation of CGP introduced in
Annavaram et al. [2001a]. This implementation (CGP-H) uses a hardware
table, called the Call Graph History Cache (CGHC), to dynamically store se-
quences of functions invoked during program execution, and uses that stored
history when choosing which functions to prefetch.
In this paper we show that CGP-S performs nearly as well as CGP-H, but
without the need for additional hardware. However, the hardware approach
may well be preferred by those who have an aversion to or distrust of
profiling. In Annavaram et al. [2001a] we used two well-known metrics, cover-
age and accuracy, to determine the effectiveness of CGP-H. But the coverage
and accuracy metrics ignore the effects of a prefetch of line X on the cache line,
Y, that it replaces, for example, in the case that Y will be needed before the
next reference to X. This lack of information about the line that is replaced
to accommodate the prefetched line makes the coverage and accuracy metrics
insufficient to evaluate the effectiveness of the prefetches issued by a prefetch
scheme. Hence, to measure the effectiveness of CGP, we now use a more re-
fined prefetch classification, the Prefetch Traffic and Miss Taxonomy (PTMT),
developed by Srinivasan et al. [2003].
Section 5 describes the simulation environment and performance analysis
tools that we used to assess the effectiveness of CGP-S and CGP-H.
Section 6 describes previous related work and Section 7 presents conclusions
and suggests future directions.
query scheduler, the query optimizer and the query parser are then built on top
of the operator layer. Each layer in this modular architecture provides a set of
well-defined entry points and hides its internal implementation details so as
to improve the portability and maintainability of the software. The sequence
of function calls within each of these entry points is transparent to the layers
above. Although such layered code typically exhibits poor spatial and temporal
locality, the function call sequences can often be predicted with great accuracy.
CGP exploits this predictability to prefetch instructions from the function that
is deemed most likely to be executed next.
Fig. 4. Directed call graph with the edge labels from the profile execution.
Fig. 5. Create_rec function after applying the CGP algorithm (new prefetch instructions are indicated by *).
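Since the figure itself is not reproduced here, the following C sketch suggests what such a transformed function might look like. It is a hypothetical rendering, not the actual Shore code: the callee names follow the captions here, and the pf() macro merely stands in for an I-cache prefetch of the first few lines of a function (GCC's __builtin_prefetch actually targets the data cache, so a real implementation would emit the ISA's instruction-prefetch operation instead).

/* Hypothetical result of CGP-S prefetch insertion (not the actual Shore
 * code). The profile-built call graph says create_rec calls
 * find_page_in_buffer_pool, then lock_page, then update_page_header;
 * CGP-S prefetches the first callee at function entry and, after each
 * call site, the next predicted callee (the inserted prefetches play the
 * role of the lines marked * in Figure 5). */
void find_page_in_buffer_pool(void);
void lock_page(void);
void update_page_header(void);

/* Stand-in for an instruction prefetch of a function's first few lines. */
#define pf(f) __builtin_prefetch((const void *)(f))

void create_rec(void)
{
    pf(find_page_in_buffer_pool);   /* (*) prefetch first callee at entry  */
    /* ... set up the new record ... */
    find_page_in_buffer_pool();
    pf(lock_page);                  /* (*) prefetch next predicted callee  */
    /* ... */
    lock_page();
    pf(update_page_header);         /* (*) prefetch next predicted callee  */
    /* ... */
    update_page_header();
}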
Fig. 6. Call graph history cache (state shown in CGHC occurs as Lock_page is being prefetched from Find_page_in_buffer_pool).
of 1. The corresponding data array entry is marked “invalid,” unless the CGHC
miss occurs on the second (update) access for a call (say P calls F), in which
case the first slot of the data array entry for P is set to F.
In general, the index value in the tag array entry for a function F points to
one of the functions in the data array entry for F. An index value of 1 selects
the first (leftmost) function in the data array entry. Note that the index value
is initialized to 1 whenever a new entry is created for F, and the index value
is reset to 1 whenever F returns.
When the branch predictor predicts that P is calling F, the first (call
prefetch) access to the direct-mapped CGHC tag array is made by using the
lower order bits of the predicted target address, F, of the function call. If the
address stored in the tag entry matches F, given that the index value of a func-
tion being called should be 1, a prefetch is issued to the first function address
that is stored in the corresponding data array entry. The second function will be
prefetched when the first function returns, the third when the second returns,
and so on. The prefetcher thus predicts that the sequence of calls to be invoked
by F will be the same as the last time F was executed. We chose to implement
this prediction scheme because of the simplicity of the resulting prefetch logic
and the accuracy of this predictor for stable call sequences.
For the same call instruction (P calls F), the second (call update) access
to the CGHC tag array is made using the lower order bits of the starting address
of the current function, P. If the address stored in the tag entry matches P, then
the index of that entry is used to select one of the 8 slots of the corresponding
data array entry, and the predicted call target, F, is stored in that slot. Finally,
the index is incremented by 1 on each call update, up to a maximum value of 8.
On a return instruction, when the function F returns to function P, the lower
order bits of the starting address of P are used for the first (return prefetch)
access to the CGHC. On a tag hit, the index value in the tag array entry is used
to select a slot in the corresponding data array entry, and the function in that
slot is prefetched.
Note that on a return instruction, a conventional branch predictor only pre-
dicts the return address in P to which F returns; in particular it does not
provide the starting address of P. Since the entries in the tag array store only
starting addresses of functions, the target address of a return instruction can-
not be directly used for a tag match in CGHC. To overcome this problem, the
processor always keeps track of the starting address of the function currently
being executed. When a call instruction is encountered, the starting address of
the caller function is pushed onto the branch predictor’s return address stack
structure along with the return address. On a return instruction, the mod-
ified branch predictor retrieves the return address as usual, and also gets
the caller function’s starting address which is used to access the CGHC tag
array.
On the same return instruction, the second (return update) access to CGHC is
made using the lower order bits of the starting address of the current returning
function, F. On a tag hit, the index value in the tag array entry is reset to 1.
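To summarize the four CGHC accesses just described, here is a behavioral sketch in C. The table size, the hash on the function start address, and the allocation-on-update details are our assumptions; the real CGHC is a direct-mapped hardware table, so this model is illustrative only.

/*
 * Behavioral sketch of the CGHC bookkeeping described above. Geometry and
 * hashing are assumptions; the 8-slot data array entries and the 1-based
 * per-entry index follow the text.
 */
#include <stdint.h>

#define CGHC_ENTRIES 1024                    /* assumed table size         */
#define SLOTS        8                       /* slots per data array entry */

typedef struct {
    uint32_t func_start;                     /* tag: function start address */
    int      index;                          /* 1-based slot pointer        */
    uint32_t slot[SLOTS];                    /* last-observed callee starts */
    int      valid;
} cghc_entry;

static cghc_entry cghc[CGHC_ENTRIES];

static cghc_entry *cghc_lookup(uint32_t func_start)
{
    return &cghc[(func_start >> 2) & (CGHC_ENTRIES - 1)];
}

extern void issue_prefetch(uint32_t func_start); /* first lines of function */

/* Predicted call, P calls F (target F comes from the branch predictor). */
void cghc_on_call(uint32_t P, uint32_t F)
{
    cghc_entry *e = cghc_lookup(F);              /* 1. call prefetch        */
    if (e->valid && e->func_start == F && e->slot[0])
        issue_prefetch(e->slot[0]);              /* F's first callee        */

    e = cghc_lookup(P);                          /* 2. call update          */
    if (e->valid && e->func_start == P) {
        e->slot[e->index - 1] = F;               /* record F in P's sequence */
        if (e->index < SLOTS) e->index++;        /* saturate at 8           */
    } else {                                     /* miss: allocate entry    */
        *e = (cghc_entry){ .func_start = P, .valid = 1, .index = 2 };
        e->slot[0] = F;                          /* slot 1 filled, index++  */
    }
}

/* Return, F returns to P (P's start address comes from the modified
 * return address stack described above). */
void cghc_on_return(uint32_t P, uint32_t F)
{
    cghc_entry *e = cghc_lookup(P);              /* 3. return prefetch      */
    if (e->valid && e->func_start == P && e->slot[e->index - 1])
        issue_prefetch(e->slot[e->index - 1]);   /* P's next predicted callee */

    e = cghc_lookup(F);                          /* 4. return update        */
    if (e->valid && e->func_start == F)
        e->index = 1;                            /* reset for F's next call */
}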
Since CGP-H predicts that the sequence of function calls made by a caller
will be the same as the last time that caller was executed, prefetching an entire
5. SIMULATION RESULTS
In this section we first describe how we generated the database workloads that
we used to evaluate the effectiveness of CGP. We then present the experimental
results.
—L1 cache designs are constrained by fast access time requirements, typically
single cycle access latency. It is difficult to design highly associative L1 caches
that operate at a high frequency and also meet the single cycle access time
requirement.
(1) Wisc-prof, a set of three queries from the Wisconsin benchmark: query 1
(sequential scan), query 5 (non-clustered index select), and query 9 (two-
way join). These queries were chosen since they include operations that are
frequently used by the other Wisconsin benchmark queries. These selected
queries were run on a dataset of 2,100 tuples (1,000 tuples in each of the
first two relations, and 100 tuples in the third relation).
(2) Wisc-large-1 consists of the same three queries used in the Wisc-prof work-
load, except that the queries were run on a full 21,000 tuple Wisconsin
dataset (10,000 tuples in each of the first two relations, and 1,000 tuples
in the third relation). The total size of the dataset including the indices is
10 MB. This workload was selected to see how CGP performance differs
when running the same queries on a different size dataset.
(3) Wisc-large-2 consists of all eight Wisconsin queries running on a 10 MB
dataset.
(4) Wisc+tpch consists of all eight Wisconsin queries and the five TPC-H
queries running concurrently on a total dataset of size 40 MB. In this work-
load the size of the TPC-H dataset is 30 MB.
The queries in each workload were executed concurrently, each query run-
ning as a separate thread in the database server. Keeping the dataset sizes
relatively small (40 MB or less) allows the SimpleScalar simulation to complete
in a reasonable time. Even with this small dataset, the total number of instruc-
tions simulated in wisc+tpch was about 3 billion and required about 20 hours
per simulation run. Our results on wisc-prof and wisc-large-1 show that in-
creasing the size of the dataset for the same queries increases the number of
instructions executed, but does not significantly alter the types and sequences
of function calls that are made; CGP performance is in fact fairly independent
of the dataset size that is used. We also ran a few CGP simulations on the wisc-
large-2 queries with a 100 MB dataset and saw improvements that are quite
similar to those for the 10 MB dataset.
Fig. 7. Performance of OM and CGP relative to O5 (execution cycles of O5 optimized binary,
×10^9: wisc-prof = 0.38, wisc-large-1 = 2.83, wisc-large-2 = 2.86, wisc+tpch = 5.36).
workload was run separately and the profile information of both runs was
merged to generate the feedback file required by OM. The OM optimizations
were applied to an O5 optimized binary. OM’s ability to perform traditional
compiler optimizations reduced the dynamic instruction count of the O5 code
by 12%.
Fig. 9. Performance of OM, NL, stream buffers and CGP relative to O5.
Table II. Prefetch Traffic and Miss Taxonomy [Srinivasan et al. 2003]

       Prefetch-cache outcomes        Conventional-cache outcomes      Extra    Extra
Case   x (prefetched)  y (replaced)   x (prefetched)  y (replaced)   traffic   misses
  1    hit             miss           hit             hit                2        1
  2    hit             prefetched     hit             hit                1        0
  3    hit             don't care     hit             replaced           1        0
  4    hit             miss           miss            hit                1        0
  5    hit             prefetched     miss            hit                0       −1
  6    hit             don't care     miss            replaced           0       −1
  7    replaced        miss           don't care      hit                2        1
  8    replaced        prefetched     don't care      hit                1        0
  9    replaced        don't care     don't care      replaced           1        0
number of times that the next reference to a prefetched cache line is a hit (i.e.,
the prefetched cache line was not replaced before its next reference) relative to
the total number of misses in a cache without prefetching, and Accuracy, which
is the ratio of the number of times that the next reference to a prefetched cache
line is a hit relative to the total number of prefetches issued. Coverage and accu-
racy metrics, however, are not completely accurate because they do not account
for the effects of a prefetch that are due to the cache line that the prefetched
line replaces. For instance, these two metrics are not sufficient to infer whether
a prefetched line (X) has replaced another line (Y) that will be needed before
the next reference to X. Hence, to measure the effectiveness of CGP, we use a
more refined prefetch classification, the Prefetch Traffic and Miss Taxonomy
(PTMT), developed by Srinivasan et al. [2003].
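Written out as ratios, the two conventional metrics defined above are

\[
\text{Coverage} = \frac{P_{\text{hit}}}{M_{\text{base}}},
\qquad
\text{Accuracy} = \frac{P_{\text{hit}}}{P_{\text{issued}}},
\]

where P_hit is the number of prefetched lines whose next reference is a hit, M_base is the total number of misses in the same cache without prefetching, and P_issued is the total number of prefetches issued.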
PTMT requires the simultaneous simulation of a cache with prefetching
(prefetch cache), and a cache without prefetching (conventional cache). By com-
paring the next events for X and Y in the conventional cache and in the prefetch
cache, PTMT identifies 9 possible outcomes, as shown in Table II. Of all the
prefetches issued, only those that fall under cases 5 and 6 are useful prefetches
because only these result in a net reduction in cache misses; furthermore only
cases 5 and 6 generate no extra traffic relative to the conventional cache without
prefetching.
In case 6 when a prefetched line X replaces line Y , Y is also replaced in
the conventional cache sometime before its next reference; hence the replaced
line Y does not contribute to extra misses in the prefetch cache relative to the
conventional cache. On the other hand in case 5, the next reference to Y is a
hit (i.e. Y was not replaced) in the conventional cache and case 5 is only useful
because it relies on a subsequent prefetch of Y back into the prefetch cache
before its next reference. This subsequent prefetch of Y may in turn be useful
or useless depending on what happens in the conventional cache to the line
that it replaces in the prefetch cache. Hence although both cases 5 and 6 are
useful, case 6 prefetches are always useful, whereas a case 5 prefetch, although
it appears to be useful in isolation, begins a chain of related prefetches whose
total cost may or may not be beneficial.
Case 1 and case 7 prefetches are polluting prefetches because they gener-
ate an extra miss by replacing a useful line, and also increase the bus traffic.
Prefetches in the remaining five cases are called useless; they generate one
extra line of traffic for each issued prefetch without reducing the cache misses.
Table II does not account for one side effect caused by prefetching into a set-
associative cache that uses LRU replacement. In associative caches, a prefetch
has the side effect of inducing a re-ordering of the LRU stack of the set in
which the prefetch occurs, and this reordering may affect subsequent traffic
and misses. The following example, found in Srinivasan et al. [2003], illustrates
an occurrence of this side effect. X is prefetched, replacing the LRU line Y; an
existing line W in that set becomes the LRU line. The next cache access to that
set results in a miss in both caches; W is replaced in the prefetch cache while
Y is replaced in the conventional cache. If the next access to W follows soon
enough, it will be a hit in the conventional cache, but a miss in the prefetch
cache. Thus, although W is not replaced directly by prefetching X, the W miss
in the prefetch cache is a side effect of prefetching. This prefetch side effect is
referred to as case 10. The cost of case 10 is 1 line of extra traffic and 1 extra
miss.
An occurrence of case 10 can be detected when the following two conditions
hold:
(1) There is a demand fetch into the L1 cache due to a miss in both the conven-
tional cache and the prefetch cache, and different lines are replaced in the two
caches.
(2) The line replaced in the prefetch cache is subsequently referenced result-
ing in a hit in the conventional cache and a miss in the prefetch cache.
Srinivasan et al. [2003] showed that these 10 cases of PTMT completely and disjointly ac-
count for all the extra traffic (always non-negative) and extra misses (hopefully
negative) of a prefetch algorithm.
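Table II is, in effect, a pure cost table over the next-event outcomes for x and y in the two caches. The C sketch below transcribes cases 1 through 9 into that form (case 10, the LRU side effect, is detected separately using the two conditions above); the outcome encoding is an assumed labeling that a dual-cache simulation might produce.

enum outcome { HIT, MISS, PREFETCHED, REPLACED, DONT_CARE };

typedef struct { int traffic; int misses; } ptmt_cost;

/* px, py: next events for the prefetched line x and the replaced line y in
 * the prefetch cache; cx, cy: the corresponding events in the conventional
 * cache. The DONT_CARE entries of Table II never reach the comparisons. */
ptmt_cost ptmt_classify(enum outcome px, enum outcome py,
                        enum outcome cx, enum outcome cy)
{
    if (px == HIT) {
        if (cx == HIT) {                       /* x would have hit anyway   */
            if (cy == REPLACED) return (ptmt_cost){1, 0};    /* case 3     */
            return (py == MISS) ? (ptmt_cost){2, 1}          /* case 1     */
                                : (ptmt_cost){1, 0};         /* case 2     */
        } else {                               /* cx == MISS: a miss saved  */
            if (cy == REPLACED) return (ptmt_cost){0, -1};   /* case 6     */
            return (py == MISS) ? (ptmt_cost){1, 0}          /* case 4     */
                                : (ptmt_cost){0, -1};        /* case 5     */
        }
    } else {                                   /* px == REPLACED            */
        if (cy == REPLACED) return (ptmt_cost){1, 0};        /* case 9     */
        return (py == MISS) ? (ptmt_cost){2, 1}              /* case 7     */
                            : (ptmt_cost){1, 0};             /* case 8     */
    }
}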
Fig. 11. CGP-H-4 prefetches due to NL (left bar) and CGHC (right bar).

Figure 10 shows the classification of prefetches issued by the NL-4, CGP-S-4
and CGP-H-4 schemes applied to O5+OM. With NL-4, 4% of the prefetches
generated are polluting prefetches, but less than 2% are polluting in CGP-S-4
and CGP-H-4. With NL-4, 39% of the prefetches are useful prefetches, while
in CGP-S-4 and CGP-H-4, although they issued more prefetches, the use-
ful prefetches increase to 44% and 46%, respectively. As there are very few
case 5 prefetches, nearly all the useful prefetches are the more desirable case 6
prefetches where the prefetched line replaces a line that the conventional cache
also replaces before its next reference.
To understand why CGP generates about as many useless prefetches as NL
(mostly in case 9, with a substantial number in cases 3 and 8 as well) we
split the CGP prefetches into those that are issued by its NL prefetcher and
those that are issued by its CGHC. Figure 11 shows the results of this split for
CGP-H-4. While only 34% of the prefetches issued by the NL component are
useful prefetches (cases 5 and 6), 58% of the prefetches issued by the CGHC
component are useful. Hence the prefetches in the CGHC component are much
more accurate than those in the NL component.
Since CGP uses CGHC only to prefetch across function boundaries and uses
NL to prefetch within a function, we might expect that CGHC and NL prefetch
disjoint sets of instructions. However, we see that the useful prefetches of
the NL portion of Figure 11 (2.6 × 10^7 on average) are fewer than those for
NL-4 in Figure 10 (5.6 × 10^7 on average). This decrease implies that some of
the useful prefetches issued by the NL-4 scheme when acting alone are is-
sued by the CGHC component, not the NL component, of the CGP-4 scheme.
Such a shift from NL to CGHC could occur, for example, if a callee func-
tion is laid out close to its caller and NL-4 prefetches past the end of the
caller to the beginning lines of the callee function due to the sequentiality
of the code layout, whereas under CGP-4 such callee prefetches would tend
to occur earlier during caller execution and fall within the CGHC portion
of CGP-4.
Thus CGHC allows the CGP scheme to issue some of the prefetches earlier
(i.e. at a more timely point) than those same prefetches would be issued by NL.
The NL prefetch of such cache lines in CGP will be squashed since the prefetch
was already issued by CGHC. The timely nature of CGHC prefetches can be
inferred from Figure 12, which shows the timeliness of the prefetches issued by
NL-4, CGP-S-4 and CGP-H-4 by categorizing the total prefetch hits (sum of
categories 1 through 6) into two categories. The bottom component, Pref Hits,
shows the number of times that the next reference to a prefetched cache line
found the referenced instruction already in the L1 cache. The upper component,
Delayed Hits, shows the number of times that the next reference to a prefetched
cache line found that the referenced instruction was still en route to the cache
from the lower levels of memory. The total delayed hits of CGP-4 are fewer than
the delayed hits of NL-4, which is one measure of the increased timeliness of
CGP prefetches relative to NL. The total number of delayed hits of NL-4 is 36%
of the total prefetch hits, while in CGP-S-4 and CGP-H-4 they are reduced to
25% of the total prefetch hits, despite the increased total and the use of NL-4
within CGP to prefetch lines from within a function.
access bottleneck. We claim that CGP will continue to be useful for database
systems on such future processors.
On the more aggressive future processor model defined in Table III, CGP
with OM improves the performance of our database workloads by 43% (CGP-S)
or 45% (CGP-H) over O5, and 23% (CGP-S) or 25% (CGP-H) over O5+OM.
The L2 cache size in the future configuration is slightly smaller than what we
expect to see in the future. As stated earlier, to get the simulation results within a
reasonable time, the size of the dataset was scaled down, and hence the size
of the L2 was also scaled down in appropriate proportion to provide realistic
results.
We simulated this very aggressive out-of-order processor model, future,
which can execute up to 8 instructions every cycle. Comparing this configu-
ration with the configuration shown in Table I, the I-cache size is now doubled,
which should reduce the number of I-cache misses. Note, however, that in the
future configuration, the Level 2 cache hit latency and the memory access la-
tency are also greater, as might be expected due to the widening gap between
processor and memory system speeds. Consequently, even though such a fu-
ture processor may suffer fewer I-cache misses, the penalty for each miss will
be higher.
Figure 13 shows the run time required to complete the four workloads on the
future configuration relative to the run time of the O5 optimized binary. CGP
still outperforms both OM and NL by about the same margin on the future
configuration as on the original 4-wide machine configuration.
CGP maintains its performance advantage despite the fact that in our bench-
marks the I-cache miss rates on the future configuration with a 64 KB I-cache
are reduced to less than 1% without any prefetching, and less than 0.1% with
CGP. Thus the working sets of our benchmarks are well accommodated by
the larger caches in the future configuration. These larger caches decrease the
number of misses sufficiently to gain in performance despite the increased miss
penalty. Consequently the percentage gains in performance of CGP relative to
OM and NL are slightly less when calculated on the future configuration, rather
than on the original 4-wide machine configuration. However, it is important to
note that CGP performance remains about half way between O5+OM and per-
fect cache.
Fig. 13. Performance of OM, NL and CGP relative to O5 on the future configuration (execution
cycles of O5 optimized binary, ×10^9: wisc-prof = 0.28, wisc-large-1 = 2.27, wisc-large-2 = 2.29,
wisc+tpch = 4.21).
Furthermore, from current trends we expect that the working sets in fu-
ture databases will continue to increase and will be much larger than those
used in this study. In addition, database systems will continue to use a layered
software design approach so as to ensure the maintainability and portability
of the software. With larger working sets, cache misses will continue to be
a significant performance bottleneck, and consequently CGP will continue to
be a useful technique for reducing the number of I-cache misses of database
systems.
the required profile information for OM. The train input set was then run for
two billion instructions to generate the results presented in this section.
In Figure 14, the rightmost bar for each benchmark shows the execution
cycles required with a perfect I-cache, where each access to the I-cache is com-
pleted in 1 cycle. Without prefetching (O5+OM), the performance gap due to
using the 32 KB I-cache, rather than a perfect I-cache, is 17% in gcc, 9% in crafty,
2% in gap, and less than 1% for each of the other benchmarks. In fact, with a
32 KB I-cache, the SPEC CPU2000 I-cache miss ratios are nearly 0%, except
for gcc and crafty, which have 0.5% and 0.3% I-cache miss ratios, respectively.
The I-cache is thus not a significant performance bottleneck in any of these
SPEC CPU2000 applications, in which case it is unnecessary to use prefetch-
ing techniques such as CGP and NL. For those applications that do suffer from
I-cache misses, namely gcc and crafty, NL prefetching alone achieves perfor-
mance gains similar to those of CGP. NL-4 and CGP-H-4 each speed up the
execution of gcc by 7% and crafty by 4% relative to O5+OM alone. These results
show that CGP is not needed for workloads with small I-cache footprints and/or
infrequent function calls. However, once again CGP performance is about half
way between no instruction prefetching and perfect I-cache performance.
6. RELATED WORK
Researchers have proposed several techniques to alleviate the I/O bottleneck of
database systems. Nyberg et al. [1994] suggested that if data intensive applica-
tions use software assisted disk striping, the performance bottleneck shifts from
I/O response time to the memory access time. Boncz et al. [1998] showed that
the query execution time of data mining workloads with a large main memory
buffer pool is memory bound rather than I/O bound. Shatdal et al. [1994] pro-
posed cache-conscious performance tuning techniques that improve the locality
of the data accesses for join and aggregation algorithms. These techniques re-
duce data cache misses, which is orthogonal to CGP’s goal of reducing I-cache
misses. CGP may be implemented on top of these cache-conscious algorithms.
It is only recently that researchers have examined the performance impact of
architectural features on DBMS [Ailamaki et al. 1999; Lo et al. 1998; Trancoso
et al. 1997; Eickemeyer et al. 1996; Cvetanovic and Bhandarkar 1994; Franklin
et al. 1994; Maynard et al. 1994]. Their results show that database appli-
cations have much larger instruction and data footprints and exhibit more
unpredictable branch behavior than benchmarks that are commonly used in
architectural studies (e.g. SPEC). Database applications have fewer loops and
suffer from frequent context switches, causing significant increases in the I-
cache miss rates [Franklin et al. 1994]. Lo et al. [1998] showed that in OLTP
workloads, the I-cache miss rate is nearly three times the data cache miss rate.
Ailamaki et al. [1999] analyzed three commercial DBMS on a Xeon processor
and showed that TPC-D queries spend about 20% of their execution time on
branch misprediction stalls and 20% on L1 I-cache miss stalls (even though the
Xeon processor uses special instruction prefetching hardware). Their results
also showed that L1 data cache misses that hit in L2 were not a significant
bottleneck, but L2 misses reduced the performance by 20%.
Researchers have proposed several schemes to improve I-cache performance.
Pettis and Hansen [1990] proposed a code layout algorithm that uses profile
guided feedback information to contiguously lay out the sequence of basic blocks
that lie on the most commonly occurring control flow path. Romer et al. [1997]
implemented the Pettis and Hansen code layout algorithm using the Etch tool
and showed performance improvements for Win32 binaries. Hashemi et al.
[1997] used a cache line coloring scheme to remap procedures so as to reduce
conflict misses. Similarly Kalamatianos and Kaeli [1998] exploited the temporal
locality of procedure invocations to remap procedures in a binary. They used
a structure called a Conflict Miss Graph (CMG), where every edge weight in
CMG is an approximation of the worst-case number of misses two procedures
can inflict upon one another. The ordering implied by the edge weights is used
to apply color-based procedure mapping to eliminate conflict misses. Gloy et al.
[1997] compared several of these recent code placement techniques to improve
I-cache performance. In this paper we used OM [Srivastava and Wall 1992],
which implements a modified Pettis and Hansen algorithm to do feedback-
directed code layout. Our database workload results showed that OM improves
performance by 15% over O5, and CGP with OM achieves a 41% (CGP-S) or
45% (CGP-H) performance improvement over O5. CGP alone, without OM, does
not need recompilation of the source code and still achieves a 35% (CGP-S) or
39% (CGP-H) performance improvement over O5. Since CGP can effectively
prefetch functions from non-contiguous locations, OM's effort to lay out the code
contiguously provides only about a 4% additional performance benefit for CGP
with OM over CGP without OM.
Tagged Next-N-line prefetching (NL) [Smith 1978] is a sequential prefetching
technique that is often used. In this technique the next N sequential lines are
prefetched on a cache miss, as well as on the first hit to a cache line that was
prefetched. Tagged NL prefetching works well in programs that execute long
sequences of straight line code. CGP uses tagged NL prefetching for prefetching
code within a function, and profile-guided prefetching (in CGP-S) or the CGHC
(in CGP-H) for prefetching across function calls. Our results show that CGP takes
good advantage of the tagged NL prefetching scheme and that OM+CGP-S or
OM+CGP-H outperforms OM+NL alone by 7% or 10%, respectively.
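For reference, a minimal C sketch of tagged next-N-line prefetching with N = 4, as described above; the line size and the helper hooks for the per-line tag bit are assumptions.

#define N          4
#define LINE_SIZE  32u                         /* bytes; assumed I-cache line */

extern int  icache_lookup(unsigned line);      /* 1 = line present            */
extern int  was_prefetched(unsigned line);     /* per-line tag bit            */
extern void clear_prefetch_tag(unsigned line);
extern void demand_fetch(unsigned line);
extern void prefetch_line(unsigned line);      /* fetches line, sets tag bit  */

static void nl_prefetch(unsigned line)
{
    for (unsigned i = 1; i <= N; i++)          /* next N sequential lines     */
        prefetch_line(line + i * LINE_SIZE);
}

void icache_access(unsigned addr)
{
    unsigned line = addr & ~(LINE_SIZE - 1u);

    if (!icache_lookup(line)) {                /* trigger on a demand miss,   */
        demand_fetch(line);
        nl_prefetch(line);
    } else if (was_prefetched(line)) {         /* or on the first hit to a    */
        clear_prefetch_tag(line);              /* line brought in by prefetch */
        nl_prefetch(line);
    }
}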
the prefetch address, Reinman et al. [1999] use a fetch target queue to enqueue multiple
prefetch addresses. The accuracy of their prefetches is determined by the ac-
curacy of the run-ahead predictor. CGP uses history information rather than
a run-ahead engine, and does not employ a branch predictor to determine its
prefetch addresses.
Pierce and Mudge [1996] proposed wrong-path prefetching, which combines
next-line prefetching with the prefetching of all control instruction targets re-
gardless of the predicted directions of conditional branches. However, they also
showed that prefetching all branch targets aggravates bus congestion.
Joseph and Grunwald [1997] proposed Markov prefetching, which capitalizes
on the correlations in the cache miss stream to issue a prefetch for the next
predicted miss address. They use part of the L2 cache as a history buffer to
store a miss address, M , and a sequence of miss addresses that follow M . When
address M misses in the cache again, their scheme uses M to index the history
buffer and issues prefetches to a subset of the miss addresses that followed M
the last time. This scheme focuses primarily on data prefetching. In particular,
in Joseph and Grunwald [1997] there are no results on the effectiveness of this
scheme for instruction prefetching. For data prefetching their results showed
that Markov prefetching generates a significant number of extra prefetches and
requires a large amount of space to store the miss correlations.
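A rough behavioral sketch of such a Markov prefetcher follows; the table geometry and the two-successor limit are assumptions (Joseph and Grunwald store the correlations in part of the L2, not in a separate table).

#define MARKOV_ENTRIES 4096
#define SUCCESSORS     2

typedef struct {
    unsigned miss_addr;                  /* the miss address M              */
    unsigned next[SUCCESSORS];           /* miss addresses that followed M  */
    int      fill;                       /* round-robin fill pointer        */
} markov_entry;

static markov_entry markov_table[MARKOV_ENTRIES];
static unsigned prev_miss;               /* most recent miss address        */

extern void prefetch_line(unsigned addr);

void markov_on_miss(unsigned m)
{
    /* Learn: record m as a successor of the previous miss address. */
    markov_entry *p = &markov_table[prev_miss % MARKOV_ENTRIES];
    if (p->miss_addr == prev_miss) {
        p->next[p->fill] = m;
        p->fill = (p->fill + 1) % SUCCESSORS;
    } else {
        *p = (markov_entry){ .miss_addr = prev_miss, .fill = 1 };
        p->next[0] = m;
    }
    prev_miss = m;

    /* Predict: prefetch the successors recorded for m last time. */
    markov_entry *e = &markov_table[m % MARKOV_ENTRIES];
    if (e->miss_addr == m)
        for (int i = 0; i < SUCCESSORS; i++)
            if (e->next[i])
                prefetch_line(e->next[i]);
}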
Although it would be interesting to quantitatively compare the performance
of CGP with previous instruction prefetching schemes, due to time and resource
constraints we only present a qualitative discussion of the related work.
Fig. 15. Average performance improvements of OM, NL and CGP relative to O5 on the original
4-wide configuration (average execution cycles of O5 optimized binary = 2.86 × 10^9 cycles).
Fig. 16. Average performance improvements of OM, NL and CGP relative to O5 on the future
configuration (average execution cycles of O5 optimized binary = 2.26 × 10^9 cycles).
those used in this study. Cache misses will no doubt continue to be a significant
performance bottleneck, and consequently techniques like CGP that reduce
the I-cache misses will remain critical to the performance of future database
systems on future processors.
As the complexity of software systems continues to grow, the instruction
footprint sizes are also increasing, thereby putting tremendous pressure on the
I-cache. As the complexity of the software grows, the behavior of the system
typically becomes more unpredictable. Research in memory system design can
gain significantly by analyzing the behavior of specific types of software sys-
tems at a higher level of granularity, rather than by trying to capitalize only
on low-level generic program behavior. The prevailing programming style for
today’s large and complex software systems favors modular software where the
flow of control at the function level is exposed while the implementation de-
tails within the functions are abstracted away. CGP exploits the regularity of
DBMS function call sequences, and avoids dealing with low-level details within
functions by simply prefetching the first few cache lines of a function, which
often constitute the entire function, and using tagged next-N-line prefetching
to bring in successive lines of longer functions.
Although CGP does eliminate about half the I-cache miss penalty, there is
still room for further improvement. The cache misses that remain after applying
CGP are mostly either cold start misses or misses to infrequently executed
functions. As we have shown in the CGP performance results section, simply
using a bigger Call Graph History Cache to store more history information is not
the solution. History-based schemes, such as CGP, typically require a learning
period during which they acquire program knowledge before they can exploit
that knowledge to improve performance. Thus reducing cold start misses and
misses to infrequently executed functions by using history-based schemes is
difficult if not impossible. A simpler way to reduce these remaining misses might
be to give the DBMS more direct control of cache memory management. Today’s
DBMS already use application-specific main memory management routines.
They control page allocation and replacement policies in a more flexible manner
than the rigid “universal” policies provided by the operating system. In a similar
way the cache hierarchy could be placed under some degree of DBMS control. To
do more effective prefetching, database developers can provide hints to cache
management hardware regarding calling sequences to infrequently executed
functions and other code segments that cannot be captured by CGP.
REFERENCES
AILAMAKI, A., DEWITT, D., HILL, M., AND WOOD, D. 1999. DBMSs on a Modern Processor: Where
Does Time Go? In Proceedings of the 25th International Conference on Very Large Data Bases.
266–277.
ANNAVARAM, M. 2001. Prefetch Mechanisms that Acquire and Exploit Application Specific Knowl-
edge. Ph.D. thesis, University of Michigan, EECS Department.
ANNAVARAM, M., PATEL, J., AND DAVIDSON, E. 2001a. Call Graph Prefetching for Database Applica-
tions. In Proceedings of the 7th International Symposium on High Performance Computer Archi-
tecture. 281–290.
ANNAVARAM, M., PATEL, J., AND DAVIDSON, E. 2001b. Data Prefetching by Dependence Graph Pre-
computation. In Proceedings of the 28th International Symposium on Computer Architecture.
52–61.
BERNSTEIN, P., BRODIE, M., CERI, S., DEWITT, D., FRANKLIN, M., GARCIA-MOLINA, H., GRAY, J., HELD,
G., HELLERSTEIN, J., JAGADISH, H., LESK, M., MAIER, D., NAUGHTON, J., PIRAHESH, H., STONEBRAKER,
M., AND ULLMAN, J. 1998. The Asilomar Report on Database Research. SIGMOD Record 27, 4
(December), 74–80.
BITTON, D., DEWITT, D. J., AND TURBYFILL, C. 1983. Benchmarking database systems: a systematic
approach. In Proceedings of the 9th International Conference on Very Large Data Bases. 8–19.
BONCZ, P., RÜHL, T., AND KWAKKEL, F. 1998. The Drill Down Benchmark. In Proceedings of the 24th
International Conference on Very Large Data Bases. 628–632.
BURGER, D. AND AUSTIN, T. 1997. The SimpleScalar Tool Set. Tech. Rep. 1342, University of
Wisconsin-Madison, Computer Science Department. June.
CAREY, M., DEWITT, D., FRANKLIN, M., HALL, N., MCAULIFFE, M., NAUGHTON, J., SCHUH, D., SOLOMON, M.,
TAN, C., TSATALOS, O., WHITE, S., AND ZWILLING, M. 1994. Shoring Up Persistent Applications.
In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data.
383–394.
CHEN, I.-C. K., LEE, C.-C., AND MUDGE, T. 1997. Instruction Prefetching Using Branch Prediction
Information. In Proceedings of the International Conference on Computer Design. 593–601.
CVETANOVIC, Z. AND BHANDARKAR, D. 1994. Characterization of Alpha AXP Performance Using
TP and SPEC Workloads. In Proceedings of the 21st International Symposium on Computer
Architecture. 60–70.
EICKEMEYER, R., JOHNSON, R., KUNKEL, S., SQUILLANTE, M., AND LIU, S. 1996. Evaluation of Multi-
threaded Uniprocessors for Commercial Application Environments. In Proceedings of the 23rd
International Symposium on Computer Architecture. 203–212.
FRANKLIN, M., ALEXANDER, W., JAUHARI, R., MAYNARD, A., AND OLSZEWSKI, B. 1994. Commercial Work-
load Performance in the IBM POWER2 RISC System/6000 Processor. IBM J. Res. Dev. 38, 5
(April), 555–561.
GLOY, N., BLACKWELL, T., SMITH, M., AND CALDER, B. 1997. Procedure Placement Using Temporal
Ordering Information. In Proceedings of the 30th International Symposium on Microarchitecture.
303–313.
HASHEMI, A., KAELI, D., AND CALDER, B. 1997. Efficient Procedure Mapping Using Cache Line
Coloring. In Proceedings of the SIGPLAN ’97 Conference on Programming Language Design and
Implementation. 171–182.
HSU, W.-C. AND SMITH, J. 1998. A Performance Study of Instruction Cache Prefetching Methods.
IEEE Trans. Comput. 47, 5 (May), 497–508.
INTEL. Survey of Pentium Processor Performance Monitoring Capabilities & Tools. Intel Web site:
http://www.developer.intel.com/drg/mmx/appnotes/perfmon.htm.
JOSEPH, D. AND GRUNWALD, D. 1997. Prefetching Using Markov Predictors. In Proceedings of the
24th International Symposium on Computer Architecture. 252–263.
JOUPPI, N. 1990. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-
Associative Cache and Prefetch Buffers. In Proceedings of the 17th International Symposium on
Computer Architecture. 364–373.
KALAMATIANOS, J. AND KAELI, D. 1998. Temporal-Based Procedure Reordering for Improved In-
struction Cache Performance. In Proceedings of the 4th International Symposium on High Per-
formance Computer Architecture. 244–253.
LO, J., BARROSO, L. A., EGGERS, S. J., GHARACHORLOO, K., LEVY, H. M., AND PAREKH, S. S. 1998. An
Analysis of Database Workload Performance on Simultaneous Multithreaded Processors. In Pro-
ceedings of the 25th International Symposium on Computer Architecture. 39–50.
LUK, C. AND MOWRY, T. 2001. Architectural and compiler support for effective instruction prefetch-
ing: a cooperative approach. ACM Trans. Comput. Syst. 19, 1 (Feb.), 71–109.
MAYNARD, A., DONNELLY, C., AND OLSZEWSKI, B. R. 1994. Contrasting characteristics and cache per-
formance of technical and multi-user commercial workloads. In Proceedings of the 6th Interna-
tional Conference on Architectural Support for Programming Languages and Operating Systems.
145–156.
NYBERG, C., BARCLAY, T., CVETANOVIC, Z., GRAY, J., AND LOMET, D. 1994. AlphaSort: a RISC machine
sort. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data.
233–242.
PETTIS, K. AND HANSEN, R. 1990. Profile Guided Code Positioning. In SIGPLAN ’90 Conference on
Programming Language Design and Implementation. 16–27.
PIERCE, J. AND MUDGE, T. 1996. Wrong-path Prefetching. In Proceedings of the 29th International
Symposium on Microarchitecture. 264–273.
REINMAN, G., CALDER, B., AND AUSTIN, T. 1999. Fetch Directed Instruction Prefetching. In Proceed-
ings of the 32nd International Symposium on Microarchitecture. 16–27.
ROMER, T., VOELKER, G., LEE, D., WOLMAN, A., WONG, W., LEVY, H., BERSHAD, B., AND CHEN, B. 1997.
Instrumentation and Optimization of Win32/Intel Executables Using Etch. In USENIX Windows
NT Workshop. 1–7.
RUPLEY, J., ANNAVARAM, M., DEVALE, J., DIEP, T., AND BLACK, B. 2002. Comparing and Contrast-
ing a Commercial OLTP Workload with CPU2000 on IPF. In the 5th Workshop on Workload
Characterization.
SHATDAL, A., KANT, C., AND NAUGHTON, J. 1994. Cache Conscious Algorithms for Relational Query
Processing. In Proceedings of the 20th International Conference on Very Large Data Bases. 510–
521.
SMITH, A. 1978. Sequential Program Prefetching in Memory Hierarchies. IEEE Comput. 11, 12
(December), 7–21.
SRINIVASAN, V., DAVIDSON, E., AND TYSON, G. 2003. A Prefetch Taxonomy. IEEE Trans. Comput.
SRINIVASAN, V., DAVIDSON, E., TYSON, G., CHARNEY, M., AND PUZAK, T. 2001. Branch History Guided
Instruction Prefetching. In Proceedings of the 7th International Symposium on High Performance
Computer Architecture. 291–300.
SRIVASTAVA, A. AND EUSTACE, A. 1994. ATOM: A System for Building Customized Program Analysis
Tools. Tech. Rep. 94/2, Digital Western Research Laboratory. March.
SRIVASTAVA, A. AND WALL, D. 1992. A Practical System for Intermodule Code Optimization at
Link-Time. Tech. Rep. 92/6, Digital Western Research Laboratory. June.
TRANCOSO, P., LARRIBA-PEY, J., ZHANG, Z., AND TORRELLAS, J. 1997. The Memory Performance of DSS
Commercial Workloads in Shared-Memory Multiprocessors. In Proceedings of the 3rd Interna-
tional Symposium on High Performance Computer Architecture. 211–220.
TPC. 1999. TPC Benchmark H Standard Specification (Decision Support), Revision 1.1.0.
Received June 2001; revised July 2002, February 2003; accepted May 2003