Call Graph Prefetching for Database Applications
MURALI ANNAVARAM
Intel Corporation
and
JIGNESH M. PATEL and EDWARD S. DAVIDSON
The University of Michigan, Ann Arbor
With the continuing technological trend of ever cheaper and larger memory, most data sets in
database servers will soon be able to reside in main memory. In this configuration, the perfor-
mance bottleneck is likely to be the gap between the processing speed of the CPU and the memory
access latency. Previous work has shown that database applications have large instruction and
data footprints and hence do not use processor caches effectively. In this paper, we propose Call
Graph Prefetching (CGP), an instruction prefetching technique that analyzes the call graph of
a database system and prefetches instructions from the function that is deemed likely to be called
next. CGP capitalizes on the highly predictable function call sequences that are typical of database
systems. CGP can be implemented either in software or in hardware. The software-based CGP
(CGP-S) uses profile information to build a call graph, and uses the predictable call sequences in
the call graph to determine which function to prefetch next. The hardware-based CGP (CGP-H) uses
a hardware table, called the Call Graph History Cache (CGHC), to dynamically store sequences
of functions invoked during program execution, and uses that stored history when choosing which
functions to prefetch.
We evaluate the performance of CGP on sets of Wisconsin and TPC-H queries, as well as on
CPU-2000 benchmarks. For most CPU-2000 applications, I-cache (instruction cache)
misses are very few even without any prefetching, obviating the need for CGP. On the other hand,
the database workloads do suffer a significant number of I-cache misses; CGP-S improves their
performance by 23% and CGP-H by 26% over a baseline system that has already been highly tuned
for efficient I-cache usage by using the OM tool. CGP, with or without OM, reduces the I-cache miss
stall time by about 50% relative to O5+OM, taking us about half way from an already highly tuned
baseline system toward perfect I-cache performance.
This work was done while M. Annavaram was at the University of Michigan. This material is based
upon work supported by the National Science Foundation under Grant IIS-0093059.
Authors’ addresses: Murali Annavaram, Intel Corporation, 220 Mission College Blvd., Santa Clara,
CA 95052-8119; email: murali.m.annavaram@intel.com; Jignesh M. Patel, University of Michigan,
2239 EECS, Ann Arbor, MI 48109-2122; email: jignesh@eecs.umich.edu; Edward S. Davidson, 1100
Chestnut Rd., Ann Arbor, MI 48104; email: davidson@eecs.umich.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is
granted without fee provided that copies are not made or distributed for profit or direct commercial
advantage and that copies show this notice on the first page or initial screen of a display along
with the full citation. Copyrights for components of this work owned by others than ACM must be
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers,
to redistribute to lists, or to use any component of this work in other works requires prior specific
permission and/or a fee. Permissions may be requested from Publications Dept., ACM Inc., 1515
Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or permissions@acm.org.
© 2003 ACM 0734-2071/03/1100-0412 $5.00
ACM Transactions on Computer Systems, Vol. 21, No. 4, November 2003, Pages 412–444.
Categories and Subject Descriptors: C.4 [Performance of Systems]: Design studies; C.1.0
[Processor Architectures]: General
General Terms: Performance, Design, Experimentation
Additional Key Words and Phrases: Instruction cache prefetching, call graph, database
For each bar, the upper component, labeled CallTarget-Miss, shows the
I-cache misses that occur immediately following a function call—upon accessing
the target address of the function call. The bottom component, IntraFunc-Miss,
shows all the remaining I-cache misses (incurred while executing instructions
within a function boundary).
Four key observations may be made from the graph:
(1) Although OM does reduce the I-cache misses, there is still room for signifi-
cant improvement.
(2) Both OM optimizations and NL-4 prefetching significantly reduce the
IntraFunc-Misses; however, they are almost totally ineffective in reducing
the CallTarget-Misses.
(3) CGP is very effective in reducing the CallTarget-Misses, as intended; more-
over, the CGP-H optimized binary also achieves a considerable further re-
duction in IntraFunc-Misses.
(4) Finally, comparing CGP without OM and CGP with OM, it is apparent that
CGP makes OM optimizations almost unnecessary.
Previous research [Franklin et al. 1994] has shown that DBMSs have a large
number of small functions due to the modular software architecture used in
their design. Furthermore, after applying existing compiler techniques and sim-
ple prefetch schemes (cf. O5+OM+NL-4 in Figure 1), I-cache misses at function
call boundaries constitute a significant portion of all I-cache misses. In order
to recover the performance lost due to call target misses without sacrificing
the advantages of modular design, we have developed Call Graph Prefetching
(CGP), an instruction prefetching technique that analyzes the call graph of an
application and prefetches instructions from a function that is likely to be called
next. Although CGP is a generic instruction prefetching scheme, it is particu-
larly effective for large software systems such as DBMS because of the layered
software design approach used by these systems. Section 2 argues intuitively
why CGP can be effective in prefetching for database applications.
Section 3 then describes an algorithm (CGP-S) that implements CGP in
software. This algorithm uses a profile run to build a call graph of a database
system, and exploits the predictable call sequences in the call graph to deter-
mine which function to prefetch next.
Section 4 reviews the hardware implementation of CGP introduced in
Annavaram et al. [2001a]. This implementation (CGP-H) uses a hardware
table, called the Call Graph History Cache (CGHC), to dynamically store se-
quences of functions invoked during program execution, and uses that stored
history when choosing which functions to prefetch.
In this paper we show that CGP-S performs nearly as well as CGP-H, but
without the need for additional hardware. However, the hardware approach
may well be preferred by those who have an aversion to or distrust of
profiling. In Annavaram et al. [2001a] we used two well-known metrics, cover-
age and accuracy, to determine the effectiveness of CGP-H. But the coverage
and accuracy metrics ignore the effects of a prefetch of line X on the cache line,
Y, that it replaces, for example, in the case that Y will be needed before the
next reference to X. This lack of information about the line that is replaced
to accommodate the prefetched line makes the coverage and accuracy metrics
insufficient to evaluate the effectiveness of the prefetches issued by a prefetch
scheme. Hence, to measure the effectiveness of CGP, we now use a more re-
fined prefetch classification, the Prefetch Traffic and Miss Taxonomy (PTMT),
developed by Srinivasan et al. [2003].
Section 5 describes the simulation environment and performance analysis
tools that we used to assess the effectiveness of CGP-S and CGP-H.
Section 6 describes previous related work and Section 7 presents conclusions
and suggests future directions.
query scheduler, the query optimizer and the query parser are then built on top
of the operator layer. Each layer in this modular architecture provides a set of
well-defined entry points and hides its internal implementation details so as
to improve the portability and maintainability of the software. The sequence
of function calls within each of these entry points is transparent to the layers
above. Although such layered code typically exhibits poor spatial and temporal
locality, the function call sequences can often be predicted with great accuracy.
CGP exploits this predictability to prefetch instructions from the function that
is deemed most likely to be executed next.
Fig. 4. Directed call graph with the edge labels from the profile execution.
Fig. 5. Create_rec function after applying the CGP algorithm (new prefetch instructions are indicated by *).
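Since the figure itself is not reproduced here, the following C sketch suggests what such a transformed function might look like. It is a hypothetical rendering, not the actual Shore code: the callee names follow the captions here, and the pf() macro merely stands in for an I-cache prefetch of the first few lines of a function (GCC's __builtin_prefetch actually targets the data cache, so a real implementation would emit the ISA's instruction-prefetch operation instead).

/* Hypothetical result of CGP-S prefetch insertion (not the actual Shore
 * code). The profile-built call graph says create_rec calls
 * find_page_in_buffer_pool, then lock_page, then update_page_header;
 * CGP-S prefetches the first callee at function entry and, after each
 * call site, the next predicted callee (the inserted prefetches play the
 * role of the lines marked * in Figure 5). */
void find_page_in_buffer_pool(void);
void lock_page(void);
void update_page_header(void);

/* Stand-in for an instruction prefetch of a function's first few lines. */
#define pf(f) __builtin_prefetch((const void *)(f))

void create_rec(void)
{
    pf(find_page_in_buffer_pool);   /* (*) prefetch first callee at entry  */
    /* ... set up the new record ... */
    find_page_in_buffer_pool();
    pf(lock_page);                  /* (*) prefetch next predicted callee  */
    /* ... */
    lock_page();
    pf(update_page_header);         /* (*) prefetch next predicted callee  */
    /* ... */
    update_page_header();
}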
Fig. 6. Call graph history cache (state shown in CGHC occurs as Lock_page is being prefetched from Find_page_in_buffer_pool).
of 1. The corresponding data array entry is marked “invalid,” unless the CGHC
miss occurs on the second (update) access for a call (say P calls F), in which
case the first slot of the data array entry for P is set to F.
In general, the index value in the tag array entry for a function F points to
one of the functions in the data array entry for F. An index value of 1 selects
the first (leftmost) function in the data array entry. Note that the index value
is initialized to 1 whenever a new entry is created for F, and the index value
is reset to 1 whenever F returns.
When the branch predictor predicts that P is calling F, the first (call
prefetch) access to the direct-mapped CGHC tag array is made by using the
lower order bits of the predicted target address, F, of the function call. If the
address stored in the tag entry matches F, given that the index value of a func-
tion being called should be 1, a prefetch is issued to the first function address
that is stored in the corresponding data array entry. The second function will be
prefetched when the first function returns, the third when the second returns,
and so on. The prefetcher thus predicts that the sequence of calls to be invoked
by F will be the same as the last time F was executed. We chose to implement
this prediction scheme because of the simplicity of the resulting prefetch logic
and the accuracy of this predictor for stable call sequences.
For the same call instruction (P calls F), the second (call update) access
to the CGHC tag array is made using the lower order bits of the starting address
of the current function, P. If the address stored in the tag entry matches P, then
the index of that entry is used to select one of the 8 slots of the corresponding
data array entry, and the predicted call target, F, is stored in that slot. Finally,
the index is incremented by 1 on each call update, up to a maximum value of 8.
On a return instruction, when the function F returns to function P, the lower
order bits of the starting address of P are used for the first (return prefetch)
access to the CGHC. On a tag hit, the index value in the tag array entry is used
to select a slot in the corresponding data array entry, and the function in that
slot is prefetched.
Note that on a return instruction, a conventional branch predictor only pre-
dicts the return address in P to which F returns; in particular it does not
provide the starting address of P. Since the entries in the tag array store only
starting addresses of functions, the target address of a return instruction can-
not be directly used for a tag match in CGHC. To overcome this problem, the
processor always keeps track of the starting address of the function currently
being executed. When a call instruction is encountered, the starting address of
the caller function is pushed onto the branch predictor’s return address stack
structure along with the return address. On a return instruction, the mod-
ified branch predictor retrieves the return address as usual, and also gets
the caller function’s starting address which is used to access the CGHC tag
array.
On the same return instruction, the second (return update) access to CGHC is
made using the lower order bits of the starting address of the current returning
function, F. On a tag hit, the index value in the tag array entry is reset to 1.
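To summarize the four CGHC accesses just described, here is a behavioral sketch in C. The table size, the hash on the function start address, and the allocation-on-update details are our assumptions; the real CGHC is a direct-mapped hardware table, so this model is illustrative only.

/*
 * Behavioral sketch of the CGHC bookkeeping described above. Geometry and
 * hashing are assumptions; the 8-slot data array entries and the 1-based
 * per-entry index follow the text.
 */
#include <stdint.h>

#define CGHC_ENTRIES 1024                    /* assumed table size         */
#define SLOTS        8                       /* slots per data array entry */

typedef struct {
    uint32_t func_start;                     /* tag: function start address */
    int      index;                          /* 1-based slot pointer        */
    uint32_t slot[SLOTS];                    /* last-observed callee starts */
    int      valid;
} cghc_entry;

static cghc_entry cghc[CGHC_ENTRIES];

static cghc_entry *cghc_lookup(uint32_t func_start)
{
    return &cghc[(func_start >> 2) & (CGHC_ENTRIES - 1)];
}

extern void issue_prefetch(uint32_t func_start); /* first lines of function */

/* Predicted call, P calls F (target F comes from the branch predictor). */
void cghc_on_call(uint32_t P, uint32_t F)
{
    cghc_entry *e = cghc_lookup(F);              /* 1. call prefetch        */
    if (e->valid && e->func_start == F && e->slot[0])
        issue_prefetch(e->slot[0]);              /* F's first callee        */

    e = cghc_lookup(P);                          /* 2. call update          */
    if (e->valid && e->func_start == P) {
        e->slot[e->index - 1] = F;               /* record F in P's sequence */
        if (e->index < SLOTS) e->index++;        /* saturate at 8           */
    } else {                                     /* miss: allocate entry    */
        *e = (cghc_entry){ .func_start = P, .valid = 1, .index = 2 };
        e->slot[0] = F;                          /* slot 1 filled, index++  */
    }
}

/* Return, F returns to P (P's start address comes from the modified
 * return address stack described above). */
void cghc_on_return(uint32_t P, uint32_t F)
{
    cghc_entry *e = cghc_lookup(P);              /* 3. return prefetch      */
    if (e->valid && e->func_start == P && e->slot[e->index - 1])
        issue_prefetch(e->slot[e->index - 1]);   /* P's next predicted callee */

    e = cghc_lookup(F);                          /* 4. return update        */
    if (e->valid && e->func_start == F)
        e->index = 1;                            /* reset for F's next call */
}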
Since CGP-H predicts that the sequence of function calls made by a caller
will be the same as the last time that caller was executed, prefetching an entire
5. SIMULATION RESULTS
In this section we first describe how we generated the database workloads that
we used to evaluate the effectiveness of CGP. We then present the experimental
results.
—L1 cache designs are constrained by fast access time requirements, typically
single cycle access latency. It is difficult to design highly associative L1 caches
that operate at a high frequency and also meet the single cycle access time
requirement.
(1) Wisc-prof, a set of three queries from the Wisconsin benchmark: query 1
(sequential scan), query 5 (non-clustered index select), and query 9 (two-
way join). These queries were chosen since they include operations that are
frequently used by the other Wisconsin benchmark queries. These selected
queries were run on a dataset of 2,100 tuples (1,000 tuples in each of the
first two relations, and 100 tuples in the third relation).
(2) Wisc-large-1 consists of the same three queries used in the Wisc-prof work-
load, except that the queries were run on a full 21,000 tuple Wisconsin
dataset (10,000 tuples in each of the first two relations, and 1,000 tuples
in the third relation). The total size of the dataset including the indices is
10 MB. This workload was selected to see how CGP performance differs
when running the same queries on a different size dataset.
(3) Wisc-large-2 consists of all eight Wisconsin queries running on a 10 MB
dataset.
(4) Wisc+tpch consists of all eight Wisconsin queries and the five TPC-H
queries running concurrently on a total dataset of size 40 MB. In this work-
load the size of the TPC-H dataset is 30 MB.
The queries in each workload were executed concurrently, each query run-
ning as a separate thread in the database server. Keeping the dataset sizes
relatively small (40 MB or less) allows the SimpleScalar simulation to complete
in a reasonable time. Even with this small dataset, the total number of instruc-
tions simulated in wisc+tpch was about 3 billion and required about 20 hours
per simulation run. Our results on wisc-prof and wisc-large-1 show that in-
creasing the size of the dataset for the same queries increases the number of
instructions executed, but does not significantly alter the types and sequences
of function calls that are made; CGP performance is in fact fairly independent
of the dataset size that is used. We also ran a few CGP simulations on the wisc-
large-2 queries with a 100 MB dataset and saw improvements that are quite
similar to those for the 10 MB dataset.
Fig. 7. Performance of OM and CGP relative to O5 (execution cycles of O5 optimized binary,
×10^9: wisc-prof = 0.38, wisc-large-1 = 2.83, wisc-large-2 = 2.86, wisc+tpch = 5.36).
workload was run separately and the profile information of both runs was
merged to generate the feedback file required by OM. The OM optimizations
were applied to an O5 optimized binary. OM’s ability to perform traditional
compiler optimizations reduced the dynamic instruction count of the O5 code
by 12%.
Fig. 9. Performance of OM, NL, stream buffers and CGP relative to O5.
Table II. Prefetch Traffic and Miss Taxonomy [Srinivasan et al. 2003]

       Prefetch-cache outcomes        Conventional-cache outcomes      Extra    Extra
Case   x (prefetched)  y (replaced)   x (prefetched)  y (replaced)   traffic   misses
  1    hit             miss           hit             hit                2        1
  2    hit             prefetched     hit             hit                1        0
  3    hit             don't care     hit             replaced           1        0
  4    hit             miss           miss            hit                1        0
  5    hit             prefetched     miss            hit                0       −1
  6    hit             don't care     miss            replaced           0       −1
  7    replaced        miss           don't care      hit                2        1
  8    replaced        prefetched     don't care      hit                1        0
  9    replaced        don't care     don't care      replaced           1        0
number of times that the next reference to a prefetched cache line is a hit (i.e.,
the prefetched cache line was not replaced before its next reference) relative to
the total number of misses in a cache without prefetching, and Accuracy, which
is the ratio of the number of times that the next reference to a prefetched cache
line is a hit relative to the total number of prefetches issued. Coverage and accu-
racy metrics, however, are not completely accurate because they do not account
for the effects of a prefetch that are due to the cache line that the prefetched
line replaces. For instance, these two metrics are not sufficient to infer whether
a prefetched line (X) has replaced another line (Y) that will be needed before
the next reference to X. Hence, to measure the effectiveness of CGP, we use a
more refined prefetch classification, the Prefetch Traffic and Miss Taxonomy
(PTMT), developed by Srinivasan et al. [2003].
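Written out as ratios, the two conventional metrics defined above are

\[
\text{Coverage} = \frac{P_{\text{hit}}}{M_{\text{base}}},
\qquad
\text{Accuracy} = \frac{P_{\text{hit}}}{P_{\text{issued}}},
\]

where P_hit is the number of prefetched lines whose next reference is a hit, M_base is the total number of misses in the same cache without prefetching, and P_issued is the total number of prefetches issued.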
PTMT requires the simultaneous simulation of a cache with prefetching
(prefetch cache), and a cache without prefetching (conventional cache). By com-
paring the next events for X and Y in the conventional cache and in the prefetch
cache, PTMT identifies 9 possible outcomes, as shown in Table II. Of all the
prefetches issued, only those that fall under cases 5 and 6 are useful prefetches
because only these result in a net reduction in cache misses; furthermore only
cases 5 and 6 generate no extra traffic relative to the conventional cache without
prefetching.
In case 6 when a prefetched line X replaces line Y , Y is also replaced in
the conventional cache sometime before its next reference; hence the replaced
line Y does not contribute to extra misses in the prefetch cache relative to the
conventional cache. On the other hand in case 5, the next reference to Y is a
hit (i.e. Y was not replaced) in the conventional cache and case 5 is only useful
because it relies on a subsequent prefetch of Y back into the prefetch cache
before its next reference. This subsequent prefetch of Y may in turn be useful
or useless depending on what happens in the conventional cache to the line
that it replaces in the prefetch cache. Hence although both cases 5 and 6 are
useful, case 6 prefetches are always useful, whereas a case 5 prefetch, although
it appears to be useful in isolation, begins a chain of related prefetches whose
total cost may or may not be beneficial.
Case 1 and case 7 prefetches are polluting prefetches because they gener-
ate an extra miss by replacing a useful line, and also increase the bus traffic.
Prefetches in the remaining five cases are called useless; they generate one
extra line of traffic for each issued prefetch without reducing the cache misses.
Table II does not account for one side effect caused by prefetching into a set-
associative cache that uses LRU replacement. In associative caches, a prefetch
has the side effect of inducing a re-ordering of the LRU stack of the set in
which the prefetch occurs, and this reordering may affect subsequent traffic
and misses. The following example, found in Srinivasan et al. [2003], illustrates
an occurrence of this side effect. X is prefetched, replacing the LRU line Y; an
existing line W in that set becomes the LRU line. The next cache access to that
set results in a miss in both caches; W is replaced in the prefetch cache while
Y is replaced in the conventional cache. If the next access to W follows soon
enough, it will be a hit in the conventional cache, but a miss in the prefetch
cache. Thus, although W is not replaced directly by prefetching X, the W miss
in the prefetch cache is a side effect of prefetching. This prefetch side effect is
referred to as case 10. The cost of case 10 is 1 line of extra traffic and 1 extra
miss.
An occurrence of case 10 can be detected when the following two conditions
hold:
(1) There is a demand fetch into the L1 cache due to a miss in both the conven-
tional cache and the prefetch cache, and different lines are replaced in the two
caches.
(2) The line replaced in the prefetch cache is subsequently referenced result-
ing in a hit in the conventional cache and a miss in the prefetch cache.
Srinivasan et al. [2003] showed that these 10 cases of PTMT completely and disjointly ac-
count for all the extra traffic (always non-negative) and extra misses (hopefully
negative) of a prefetch algorithm.
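Table II is, in effect, a pure cost table over the next-event outcomes for x and y in the two caches. The C sketch below transcribes cases 1 through 9 into that form (case 10, the LRU side effect, is detected separately using the two conditions above); the outcome encoding is an assumed labeling that a dual-cache simulation might produce.

enum outcome { HIT, MISS, PREFETCHED, REPLACED, DONT_CARE };

typedef struct { int traffic; int misses; } ptmt_cost;

/* px, py: next events for the prefetched line x and the replaced line y in
 * the prefetch cache; cx, cy: the corresponding events in the conventional
 * cache. The DONT_CARE entries of Table II never reach the comparisons. */
ptmt_cost ptmt_classify(enum outcome px, enum outcome py,
                        enum outcome cx, enum outcome cy)
{
    if (px == HIT) {
        if (cx == HIT) {                       /* x would have hit anyway   */
            if (cy == REPLACED) return (ptmt_cost){1, 0};    /* case 3     */
            return (py == MISS) ? (ptmt_cost){2, 1}          /* case 1     */
                                : (ptmt_cost){1, 0};         /* case 2     */
        } else {                               /* cx == MISS: a miss saved  */
            if (cy == REPLACED) return (ptmt_cost){0, -1};   /* case 6     */
            return (py == MISS) ? (ptmt_cost){1, 0}          /* case 4     */
                                : (ptmt_cost){0, -1};        /* case 5     */
        }
    } else {                                   /* px == REPLACED            */
        if (cy == REPLACED) return (ptmt_cost){1, 0};        /* case 9     */
        return (py == MISS) ? (ptmt_cost){2, 1}              /* case 7     */
                            : (ptmt_cost){1, 0};             /* case 8     */
    }
}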
Fig. 11. CGP-H-4 prefetches due to NL (left bar) and CGHC (right bar).

Figure 10 shows the classification of prefetches issued by the NL-4, CGP-S-4
and CGP-H-4 schemes applied to O5+OM. With NL-4, 4% of the prefetches
generated are polluting prefetches, but less than 2% are polluting in CGP-S-4
and CGP-H-4. With NL-4, 39% of the prefetches are useful prefetches, while
in CGP-S-4 and CGP-H-4, although they issued more prefetches, the use-
ful prefetches increase to 44% and 46%, respectively. As there are very few
case 5 prefetches, nearly all the useful prefetches are the more desirable case 6
prefetches where the prefetched line replaces a line that the conventional cache
also replaces before its next reference.
To understand why CGP generates about as many useless prefetches as NL
(mostly in case 9, with a substantial number in cases 3 and 8 as well) we
split the CGP prefetches into those that are issued by its NL prefetcher and
those that are issued by its CGHC. Figure 11 shows the results of this split for
CGP-H-4. While only 34% of the prefetches issued by the NL component are
useful prefetches (cases 5 and 6), 58% of the prefetches issued by the CGHC
component are useful. Hence the prefetches in the CGHC component are much
more accurate than those in the NL component.
Since CGP uses CGHC only to prefetch across function boundaries and uses
NL to prefetch within a function, we might expect that CGHC and NL prefetch
disjoint sets of instructions. However, we see that the useful prefetches of
the NL portion of Figure 11 (2.6 × 10^7 on average) are fewer than those for
NL-4 in Figure 10 (5.6 × 10^7 on average). This decrease implies that some of
the useful prefetches issued by the NL-4 scheme when acting alone are is-
sued by the CGHC component, not the NL component, of the CGP-4 scheme.
Such a shift from NL to CGHC could occur, for example, if a callee func-
tion is laid out close to its caller and NL-4 prefetches past the end of the
caller to the beginning lines of the callee function due to the sequentiality
of the code layout, whereas under CGP-4 such callee prefetches would tend
to occur earlier during caller execution and fall within the CGHC portion
of CGP-4.
Thus CGHC allows the CGP scheme to issue some of the prefetches earlier
(i.e. at a more timely point) than those same prefetches would be issued by NL.
The NL prefetch of such cache lines in CGP will be squashed since the prefetch
was already issued by CGHC. The timely nature of CGHC prefetches can be
inferred from Figure 12, which shows the timeliness of the prefetches issued by
NL-4, CGP-S-4 and CGP-H-4 by categorizing the total prefetch hits (sum of
categories 1 through 6) into two categories. The bottom component, Pref Hits,
shows the number of times that the next reference to a prefetched cache line
found the referenced instruction already in the L1 cache. The upper component,
Delayed Hits, shows the number of times that the next reference to a prefetched
cache line found that the referenced instruction was still en route to the cache
from the lower levels of memory. The total delayed hits of CGP-4 are fewer than
the delayed hits of NL-4, which is one measure of the increased timeliness of
CGP prefetches relative to NL. The total number of delayed hits of NL-4 is 36%
of the total prefetch hits, while in CGP-S-4 and CGP-H-4 they are reduced to
25% of the total prefetch hits, despite the increased total and the use of NL-4
within CGP to prefetch lines from within a function.
access bottleneck. We claim that CGP will continue to be useful for database
systems on such future processors.
On the more aggressive future processor model defined in Table III, CGP
with OM improves the performance of our database workloads by 43% (CGP-S)
or 45% (CGP-H) over O5, and 23% (CGP-S) or 25% (CGP-H) over O5+OM.
The L2 cache size in the future configuration is slightly smaller than what we
expect to see in the future. As stated earlier, to get the simulation results within a
reasonable time, the size of the dataset was scaled down, and hence the size
of the L2 was also scaled down in appropriate proportion to provide realistic
results.
We simulated this very aggressive out-of-order processor model, future,
which can execute up to 8 instructions every cycle. Comparing this configu-
ration with the configuration shown in Table I, the I-cache size is now doubled,
which should reduce the number of I-cache misses. Note, however, that in the
future configuration, the Level 2 cache hit latency and the memory access la-
tency are also greater, as might be expected due to the widening gap between
processor and memory system speeds. Consequently, even though such a fu-
ture processor may suffer fewer I-cache misses, the penalty for each miss will
be higher.
Figure 13 shows the run time required to complete the four workloads on the
future configuration relative to the run time of the O5 optimized binary. CGP
still outperforms both OM and NL by about the same margin on the future
configuration as on the original 4-wide machine configuration.
CGP maintains its performance advantage despite the fact that in our bench-
marks the I-cache miss rates on the future configuration with a 64 KB I-cache
are reduced to less than 1% without any prefetching, and less than 0.1% with
CGP. Thus the working sets of our benchmarks are well accommodated by
the larger caches in the future configuration. These larger caches decrease the
number of misses sufficiently to gain in performance despite the increased miss
penalty. Consequently the percentage gains in performance of CGP relative to
OM and NL are slightly less when calculated on the future configuration, rather
than on the original 4-wide machine configuration. However, it is important to
note that CGP performance remains about half way between O5+OM and per-
fect cache.
Fig. 13. Performance of OM, NL and CGP relative to O5 on the future configuration (execution
cycles of O5 optimized binary, ×10^9: wisc-prof = 0.28, wisc-large-1 = 2.27, wisc-large-2 = 2.29,
wisc+tpch = 4.21).
Furthermore, from current trends we expect that the working sets in fu-
ture databases will continue to increase and will be much larger than those
used in this study. In addition, database systems will continue to use a layered
software design approach so as to ensure the maintainability and portability
of the software. With larger working sets, cache misses will continue to be
a significant performance bottleneck, and consequently CGP will continue to
be a useful technique for reducing the number of I-cache misses of database
systems.
the required profile information for OM. The train input set was then run for
two billion instructions to generate the results presented in this section.
In Figure 14, the rightmost bar for each benchmark shows the execution
cycles required with a perfect I-cache, where each access to the I-cache is com-
pleted in 1 cycle. Without prefetching (O5+OM), the performance gap due to
using the 32 KB I-cache, rather than a perfect I-cache, is 17% in gcc, 9% in crafty,
2% in gap, and less than 1% for each of the other benchmarks. In fact, with a
32 KB I-cache, the SPEC CPU2000 I-cache miss ratios are nearly 0%, except
for gcc and crafty, which have 0.5% and 0.3% I-cache miss ratios, respectively.
The I-cache is thus not a significant performance bottleneck in any of these
SPEC CPU2000 applications, in which case it is unnecessary to use prefetch-
ing techniques such as CGP and NL. For those applications that do suffer from
I-cache misses, namely gcc and crafty, NL prefetching alone achieves perfor-
mance gains similar to those of CGP. NL-4 and CGP-H-4 each speed up the
execution of gcc by 7% and crafty by 4% relative to O5+OM alone. These results
show that CGP is not needed for workloads with small I-cache footprints and/or
infrequent function calls. However, once again CGP performance is about half
way between no instruction prefetching and perfect I-cache performance.
6. RELATED WORK
Researchers have proposed several techniques to alleviate the I/O bottleneck of
database systems. Nyberg et al. [1994] suggested that if data intensive applica-
tions use software assisted disk striping, the performance bottleneck shifts from
I/O response time to the memory access time. Boncz et al. [1998] showed that
the query execution time of data mining workloads with a large main memory
buffer pool is memory bound rather than I/O bound. Shatdal et al. [1994] pro-
posed cache-conscious performance tuning techniques that improve the locality
of the data accesses for join and aggregation algorithms. These techniques re-
duce data cache misses, which is orthogonal to CGP’s goal of reducing I-cache
misses. CGP may be implemented on top of these cache-conscious algorithms.
It is only recently that researchers have examined the performance impact of
architectural features on DBMS [Ailamaki et al. 1999; Lo et al. 1998; Trancoso
et al. 1997; Eickemeyer et al. 1996; Cvetanovic and Bhandarkar 1994; Franklin
et al. 1994; Maynard et al. 1994]. Their results show that database appli-
cations have much larger instruction and data footprints and exhibit more
unpredictable branch behavior than benchmarks that are commonly used in
architectural studies (e.g. SPEC). Database applications have fewer loops and
suffer from frequent context switches, causing significant increases in the I-
cache miss rates [Franklin et al. 1994]. Lo et al. [1998] showed that in OLTP
workloads, the I-cache miss rate is nearly three times the data cache miss rate.
Ailamaki et al. [1999] analyzed three commercial DBMS on a Xeon processor
and showed that TPC-D queries spend about 20% of their execution time on
branch misprediction stalls and 20% on L1 I-cache miss stalls (even though the
Xeon processor uses special instruction prefetching hardware). Their results
also showed that L1 data cache misses that hit in L2 were not a significant
bottleneck, but L2 misses reduced the performance by 20%.
Researchers have proposed several schemes to improve I-cache performance.
Pettis and Hansen [1990] proposed a code layout algorithm that uses profile
guided feedback information to contiguously lay out the sequence of basic blocks
that lie on the most commonly occurring control flow path. Romer et al. [1997]
implemented the Pettis and Hansen code layout algorithm using the Etch tool
and showed performance improvements for Win32 binaries. Hashemi et al.
[1997] used a cache line coloring scheme to remap procedures so as to reduce
conflict misses. Similarly Kalamatianos and Kaeli [1998] exploited the temporal
locality of procedure invocations to remap procedures in a binary. They used
a structure called a Conflict Miss Graph (CMG), where every edge weight in
CMG is an approximation of the worst-case number of misses two procedures
can inflict upon one another. The ordering implied by the edge weights is used
to apply color-based procedure mapping to eliminate conflict misses. Gloy et al.
[1997] compared several of these recent code placement techniques to improve
I-cache performance. In this paper we used OM [Srivastava and Wall 1992],
which implements a modified Pettis and Hansen algorithm to do feedback-
directed code layout. Our database workload results showed that OM improves
performance by 15% over O5, and CGP with OM achieves a 41% (CGP-S) or
45% (CGP-H) performance improvement over O5. CGP alone, without OM, does
not need recompilation of the source code and still achieves a 35% (CGP-S) or
39% (CGP-H) performance improvement over O5. Since CGP can effectively
prefetch functions from non-contiguous locations, OM's effort to lay out the code
contiguously provides only about a 4% additional performance benefit for CGP
with OM over CGP without OM.
Tagged Next-N-line prefetching (NL) [Smith 1978] is a sequential prefetching
technique that is often used. In this technique the next N sequential lines are
prefetched on a cache miss, as well as on the first hit to a cache line that was
prefetched. Tagged NL prefetching works well in programs that execute long
sequences of straight line code. CGP uses tagged NL prefetching for prefetching
code within a function, and profile-guided prefetching (in CGP-S) or the CGHC
(in CGP-H) for prefetching across function calls. Our results show that CGP takes
good advantage of the tagged NL prefetching scheme and that OM+CGP-S or
OM+CGP-H outperforms OM+NL alone by 7% or 10%, respectively.
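For reference, a minimal C sketch of tagged next-N-line prefetching with N = 4, as described above; the line size and the helper hooks for the per-line tag bit are assumptions.

#define N          4
#define LINE_SIZE  32u                         /* bytes; assumed I-cache line */

extern int  icache_lookup(unsigned line);      /* 1 = line present            */
extern int  was_prefetched(unsigned line);     /* per-line tag bit            */
extern void clear_prefetch_tag(unsigned line);
extern void demand_fetch(unsigned line);
extern void prefetch_line(unsigned line);      /* fetches line, sets tag bit  */

static void nl_prefetch(unsigned line)
{
    for (unsigned i = 1; i <= N; i++)          /* next N sequential lines     */
        prefetch_line(line + i * LINE_SIZE);
}

void icache_access(unsigned addr)
{
    unsigned line = addr & ~(LINE_SIZE - 1u);

    if (!icache_lookup(line)) {                /* trigger on a demand miss,   */
        demand_fetch(line);
        nl_prefetch(line);
    } else if (was_prefetched(line)) {         /* or on the first hit to a    */
        clear_prefetch_tag(line);              /* line brought in by prefetch */
        nl_prefetch(line);
    }
}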
the prefetch address, Reinman et al. [1999] use a fetch target queue to enqueue multiple
prefetch addresses. The accuracy of their prefetches is determined by the ac-
curacy of the run-ahead predictor. CGP uses history information rather than
a run-ahead engine, and does not employ a branch predictor to determine its
prefetch addresses.
Pierce and Mudge [1996] proposed wrong-path prefetching, which combines
next-line prefetching with the prefetching of all control instruction targets re-
gardless of the predicted directions of conditional branches. However, they also
showed that prefetching all branch targets aggravates bus congestion.
Joseph and Grunwald [1997] proposed Markov prefetching, which capitalizes
on the correlations in the cache miss stream to issue a prefetch for the next
predicted miss address. They use part of the L2 cache as a history buffer to
store a miss address, M , and a sequence of miss addresses that follow M . When
address M misses in the cache again, their scheme uses M to index the history
buffer and issues prefetches to a subset of the miss addresses that followed M
the last time. This scheme focuses primarily on data prefetching. In particular,
in Joseph and Grunwald [1997] there are no results on the effectiveness of this
scheme for instruction prefetching. For data prefetching their results showed
that Markov prefetching generates a significant number of extra prefetches and
requires a large amount of space to store the miss correlations.
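A rough behavioral sketch of such a Markov prefetcher follows; the table geometry and the two-successor limit are assumptions (Joseph and Grunwald store the correlations in part of the L2, not in a separate table).

#define MARKOV_ENTRIES 4096
#define SUCCESSORS     2

typedef struct {
    unsigned miss_addr;                  /* the miss address M              */
    unsigned next[SUCCESSORS];           /* miss addresses that followed M  */
    int      fill;                       /* round-robin fill pointer        */
} markov_entry;

static markov_entry markov_table[MARKOV_ENTRIES];
static unsigned prev_miss;               /* most recent miss address        */

extern void prefetch_line(unsigned addr);

void markov_on_miss(unsigned m)
{
    /* Learn: record m as a successor of the previous miss address. */
    markov_entry *p = &markov_table[prev_miss % MARKOV_ENTRIES];
    if (p->miss_addr == prev_miss) {
        p->next[p->fill] = m;
        p->fill = (p->fill + 1) % SUCCESSORS;
    } else {
        *p = (markov_entry){ .miss_addr = prev_miss, .fill = 1 };
        p->next[0] = m;
    }
    prev_miss = m;

    /* Predict: prefetch the successors recorded for m last time. */
    markov_entry *e = &markov_table[m % MARKOV_ENTRIES];
    if (e->miss_addr == m)
        for (int i = 0; i < SUCCESSORS; i++)
            if (e->next[i])
                prefetch_line(e->next[i]);
}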
Although it would be interesting to quantitatively compare the performance
of CGP with previous instruction prefetching schemes, due to time and resource
constraints we only present a qualitative discussion of the related work.
Fig. 15. Average performance improvements of OM, NL and CGP relative to O5 on the original
4-wide configuration (average execution cycles of O5 optimized binary = 2.86 × 10^9 cycles).
Fig. 16. Average performance improvements of OM, NL and CGP relative to O5 on the future
configuration (average execution cycles of O5 optimized binary = 2.26 × 10^9 cycles).
those used in this study. Cache misses will no doubt continue to be a significant
performance bottleneck, and consequently techniques like CGP that reduce
the I-cache misses will remain critical to the performance of future database
systems on future processors.
As the complexity of software systems continues to grow, the instruction
footprint sizes are also increasing, thereby putting tremendous pressure on the
I-cache. As the complexity of the software grows, the behavior of the system
typically becomes more unpredictable. Research in memory system design can
gain significantly by analyzing the behavior of specific types of software sys-
tems at a higher level of granularity, rather than by trying to capitalize only
on low-level generic program behavior. The prevailing programming style for
today’s large and complex software systems favors modular software where the
flow of control at the function level is exposed while the implementation de-
tails within the functions are abstracted away. CGP exploits the regularity of
DBMS function call sequences, and avoids dealing with low-level details within
functions by simply prefetching the first few cache lines of a function, which
often constitute the entire function, and using tagged next-N-line prefetching
to bring in successive lines of longer functions.
Although CGP does eliminate about half the I-cache miss penalty, there is
still room for further improvement. The cache misses that remain after applying
CGP are mostly either cold start misses or misses to infrequently executed
functions. As we have shown in the CGP performance results section, simply
using a bigger Call Graph History Cache to store more history information is not
the solution. History-based schemes, such as CGP, typically require a learning
period during which they acquire program knowledge before they can exploit
that knowledge to improve performance. Thus reducing cold start misses and
misses to infrequently executed functions by using history-based schemes is
difficult if not impossible. A simpler way to reduce these remaining misses might
be to give the DBMS more direct control of cache memory management. Today’s
DBMS already use application-specific main memory management routines.
They control page allocation and replacement policies in a more flexible manner
than the rigid “universal” policies provided by the operating system. In a similar
way the cache hierarchy could be placed under some degree of DBMS control. To
do more effective prefetching, database developers can provide hints to cache
management hardware regarding calling sequences to infrequently executed
functions and other code segments that cannot be captured by CGP.
REFERENCES
AILAMAKI, A., DEWITT, D., HILL, M., AND WOOD, D. 1999. DBMSs on a Modern Processor: Where
Does Time Go? In Proceedings of the 25th International Conference on Very Large Data Bases.
266–277.
ANNAVARAM, M. 2001. Prefetch Mechanisms that Acquire and Exploit Application Specific Knowl-
edge. Ph.D. thesis, University of Michigan, EECS Department.
ANNAVARAM, M., PATEL, J., AND DAVIDSON, E. 2001a. Call Graph Prefetching for Database Applica-
tions. In Proceedings of the 7th International Symposium on High Performance Computer Archi-
tecture. 281–290.
ANNAVARAM, M., PATEL, J., AND DAVIDSON, E. 2001b. Data Prefetching by Dependence Graph Pre-
computation. In Proceedings of the 28th International Symposium on Computer Architecture.
52–61.
BERNSTEIN, P., BRODIE, M., CERI, S., DEWITT, D., FRANKLIN, M., GARCIA-MOLINA, H., GRAY, J., HELD,
G., HELLERSTEIN, J., JAGADISH, H., LESK, M., MAIER, D., NAUGHTON, J., PIRAHESH, H., STONEBRAKER,
M., AND ULLMAN, J. 1998. The Asilomar Report on Database Research. SIGMOD Record 27, 4
(December), 74–80.
BITTON, D., DEWITT, D. J., AND TURBYFILL, C. 1983. Benchmarking database systems: a systematic
approach. In Proceedings of the 9th International Conference on Very Large Data Bases. 8–19.
BONCZ, P., RÜHL, T., AND KWAKKEL, F. 1998. The Drill Down Benchmark. In Proceedings of the 24th
International Conference on Very Large Data Bases. 628–632.
BURGER, D. AND AUSTIN, T. 1997. The SimpleScalar Tool Set. Tech. Rep. 1342, University of
Wisconsin-Madison, Computer Science Department. June.
CAREY, M., DEWITT, D., FRANKLIN, M., HALL, N., MCAULIFFE, M., NAUGHTON, J., SCHUH, D., SOLOMON, M.,
TAN, C., TSATALOS, O., WHITE, S., AND ZWILLING, M. 1994. Shoring Up Persistent Applications.
In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data.
383–394.
CHEN, I.-C. K., LEE, C.-C., AND MUDGE, T. 1997. Instruction Prefetching Using Branch Prediction
Information. In Proceedings of the International Conference on Computer Design. 593–601.
CVETANOVIC, Z. AND BHANDARKAR, D. 1994. Characterization of Alpha AXP Performance Using
TP and SPEC Workloads. In Proceedings of the 21st International Symposium on Computer
Architecture. 60–70.
EICKEMEYER, R., JOHNSON, R., KUNKEL, S., SQUILLANTE, M., AND LIU, S. 1996. Evaluation of Multi-
threaded Uniprocessors for Commercial Application Environments. In Proceedings of the 23rd
International Symposium on Computer Architecture. 203–212.
FRANKLIN, M., ALEXANDER, W., JAUHARI, R., MAYNARD, A., AND OLSZEWSKI, B. 1994. Commercial Work-
load Performance in the IBM POWER2 RISC System/6000 Processor. IBM J. Res. Dev. 38, 5
(April), 555–561.
GLOY, N., BLACKWELL, T., SMITH, M., AND CALDER, B. 1997. Procedure Placement Using Temporal
Ordering Information. In Proceedings of the 30th International Symposium on Microarchitecture.
303–313.
HASHEMI, A., KAELI, D., AND CALDER, B. 1997. Efficient Procedure Mapping Using Cache Line
Coloring. In Proceedings of the SIGPLAN ’97 Conference on Programming Language Design and
Implementation. 171–182.
HSU, W.-C. AND SMITH, J. 1998. A Performance Study of Instruction Cache Prefetching Methods.
IEEE Trans. Comput. 47, 5 (May), 497–508.
INTEL. Survey of Pentium Processor Performance Monitoring Capabilities & Tools. Intel Web site:
http://www.developer.intel.com/drg/mmx/appnotes/perfmon.htm.
JOSEPH, D. AND GRUNWALD, D. 1997. Prefetching Using Markov Predictors. In Proceedings of the
24th International Symposium on Computer Architecture. 252–263.
JOUPPI, N. 1990. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-
Associative Cache and Prefetch Buffers. In Proceedings of the 17th International Symposium on
Computer Architecture. 364–373.
KALAMATIANOS, J. AND KAELI, D. 1998. Temporal-Based Procedure Reordering for Improved In-
struction Cache Performance. In Proceedings of the 4th International Symposium on High Per-
formance Computer Architecture. 244–253.
LO, J., BARROSO, L. A., EGGERS, S. J., GHARACHORLOO, K., LEVY, H. M., AND PAREKH, S. S. 1998. An
Analysis of Database Workload Performance on Simultaneous Multithreaded Processors. In Pro-
ceedings of the 25th International Symposium on Computer Architecture. 39–50.
LUK, C. AND MOWRY, T. 2001. Architectural and compiler support for effective instruction prefetch-
ing: a cooperative approach. ACM Trans. Comput. Syst. 19, 1 (Feb.), 71–109.
MAYNARD, A., DONNELLY, C., AND OLSZEWSKI, B. R. 1994. Contrasting characteristics and cache per-
formance of technical and multi-user commercial workloads. In Proceedings of the 6th Interna-
tional Conference on Architectural Support for Programming Languages and Operating Systems.
145–156.
NYBERG, C., BARCLAY, T., CVETANOVIC, Z., GRAY, J., AND LOMET, D. 1994. AlphaSort: a RISC machine
sort. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data.
233–242.
PETTIS, K. AND HANSEN, R. 1990. Profile Guided Code Positioning. In SIGPLAN ’90 Conference on
Programming Language Design and Implementation. 16–27.
PIERCE, J. AND MUDGE, T. 1996. Wrong-path Prefetching. In Proceedings of the 29th International
Symposium on Microarchitecture. 264–273.
REINMAN, G., CALDER, B., AND AUSTIN, T. 1999. Fetch Directed Instruction Prefetching. In Proceed-
ings of the 32nd International Symposium on Microarchitecture. 16–27.
ROMER, T., VOELKER, G., LEE, D., WOLMAN, A., WONG, W., LEVY, H., BERSHAD, B., AND CHEN, B. 1997.
Instrumentation and Optimization of Win32/Intel Executables Using Etch. In USENIX Windows
NT Workshop. 1–7.
RUPLEY, J., ANNAVARAM, M., DEVALE, J., DIEP, T., AND BLACK, B. 2002. Comparing and Contrast-
ing a Commercial OLTP Workload with CPU2000 on IPF. In the 5th Workshop on Workload
Characterization.
SHATDAL, A., KANT, C., AND NAUGHTON, J. 1994. Cache Conscious Algorithms for Relational Query
Processing. In Proceedings of the 20th International Conference on Very Large Data Bases. 510–
521.
SMITH, A. 1978. Sequential Program Prefetching in Memory Hierarchies. IEEE Comput. 11, 12
(December), 7–21.
SRINIVASAN, V., DAVIDSON, E., AND TYSON, G. 2003. A Prefetch Taxonomy. IEEE Trans. Comput.
SRINIVASAN, V., DAVIDSON, E., TYSON, G., CHARNEY, M., AND PUZAK, T. 2001. Branch History Guided
Instruction Prefetching. In Proceedings of the 7th International Symposium on High Performance
Computer Architecture. 291–300.
SRIVASTAVA, A. AND EUSTACE, A. 1994. ATOM: A System for Building Customized Program Analysis
Tools. Tech. Rep. 94/2, Digital Western Research Laboratory. March.
SRIVASTAVA, A. AND WALL, D. 1992. A Practical System for Intermodule Code Optimization at
Link-Time. Tech. Rep. 92/6, Digital Western Research Laboratory. June.
TRANCOSO, P., LARRIBA-PEY, J., ZHANG, Z., AND TORRELLAS, J. 1997. The Memory Performance of DSS
Commercial Workloads in Shared-Memory Multiprocessors. In Proceedings of the 3rd Interna-
tional Symposium on High Performance Computer Architecture. 211–220.
TPC. 1999. TPC Benchmark H Standard Specification (Decision Support), Revision 1.1.0.
Received June 2001; revised July 2002, February 2003; accepted May 2003