Figure 3: Sliding window hashing module architecture. The blocks with circular arrows represent the standard hashing kernel. Stage numbers correspond to Table 1.

Each of the hash functions in Figure 3 can be executed in a separate thread, since there are no dependencies between the computations. The challenge in implementing this module lies in the memory management required to extract maximum performance. Note that the input data is not divided into smaller blocks as in the previous case: the input data for each thread may overlap with that of neighboring threads.
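To make the one-thread-per-window mapping concrete, the following is a minimal CUDA sketch in the spirit of this module; the kernel name, the parameter names, and the device-side md5_digest() helper are our own illustrative assumptions, not StoreGPU's actual code.

    // Hypothetical sketch of the one-thread-per-window mapping described
    // above. md5_digest() stands in for a device-side hash routine defined
    // elsewhere; this is not StoreGPU's actual code.
    __global__ void sliding_window_hash(const unsigned char *input,
                                        size_t input_size,
                                        unsigned int window,    // bytes hashed per thread
                                        unsigned int offset,    // window advance in bytes
                                        unsigned char *digests) // 16 bytes per window (MD5)
    {
        unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
        size_t start = (size_t)tid * offset;      // windows overlap when offset < window
        if (start + window > input_size) return;  // threads past the last window idle

        // Each thread hashes its own window; there are no inter-thread
        // dependencies, so all windows proceed in parallel.
        md5_digest(input + start, window, digests + (size_t)tid * 16);
    }

Because the offset can be smaller than the window, neighboring threads read overlapping bytes of the same input buffer, which is why the input cannot simply be partitioned among threads as in the direct hashing module.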
3.4 Optimized Memory Management
Although the design of the two modules presented here is relatively simple, optimizing their performance for GPU deployment is a challenging task. For example, one aspect that induces additional complexity is maximizing the number of threads to extract maximal parallelism (around 100K threads are created for large blocks) while avoiding multiple threads accessing the same shared memory bank and maximizing the use of each processor's registers. To this end, we have implemented our own shared memory management mechanism with two main functions. First, it allocates a fixed space for every thread in a single shared memory bank and avoids assigning workspaces allocated on the same memory bank to concurrent threads in the same multiprocessor. When a thread starts, it copies its data from […]
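As a minimal sketch of one way such a bank-aware layout can be realized, assuming the 16 four-byte-wide shared memory banks of G80-generation hardware (the constants and the helper below are our own illustration, not StoreGPU's actual allocator):

    // Illustrative bank-aware workspace layout, assuming 16 shared memory
    // banks of 4-byte words (G80-era hardware); not StoreGPU's actual code.
    // All words of one thread's workspace fall into a single bank, and the
    // 16 threads of a half-warp each map to a different bank, so their
    // concurrent accesses are conflict-free.
    #define NUM_BANKS        16
    #define WORDS_PER_THREAD 16  // fixed per-thread workspace size, in words

    __device__ unsigned int *workspace_word(unsigned int *shmem, // e.g. extern __shared__
                                            unsigned int tid_in_block,
                                            unsigned int word)
    {
        unsigned int bank = tid_in_block % NUM_BANKS;  // thread's private bank
        unsigned int row  = (tid_in_block / NUM_BANKS) * WORDS_PER_THREAD + word;
        return &shmem[row * NUM_BANKS + bank];         // index mod 16 == bank
    }

With this stride, a half-warp touching the same workspace word hits 16 distinct banks and is served in one cycle rather than being serialized.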
4. EXPERIMENTAL EVALUATION
We evaluate StoreGPU both with synthetic benchmarks (Section 4.1) and with an application-driven benchmark: we estimate the performance gain of an application that uses StoreGPU to compare similarities between multiple versions of the same checkpoint image (Section 4.2).

4.1 Synthetic Benchmarks
This section presents the performance speedup delivered by StoreGPU under a synthetic workload. We first compare GPU-supported performance with the performance of the same workload running on a commodity CPU. Next, we investigate the factors that determine the observed performance.

4.1.1 Experiment Design
The experiments are divided into two parts, each corresponding to the evaluation of one of the two uses of hashing described in Section 3 (i.e., direct hashing and sliding-window hashing). The performance metric of interest is execution speedup.

Table 2 summarizes the factors that influence performance. Currently, StoreGPU provides implementations of two hashing algorithms: MD5 and SHA1. The data size is varied to expose the impact of memory copies between the host and the GPU. The sliding-window hashing module has two additional parameters: window size and offset size.

We also explore the impact of the three performance optimizations presented in Section 3: i) the optimized use of shared memory; ii) memory pinning; and iii) reduced output size.
Table 2: Factors considered in the experiments and their respective levels. Note that the sliding-window hashing module has extra parameters.

Direct and Sliding-Window Hashing
  Factor              Levels
  Algorithm           MD5 & SHA1
  Data Size           8KB to 96MB
  Shared Memory       Enabled or Disabled
  Pinned Memory       Enabled or Disabled

Sliding-Window Hashing only
  Window Size         20 or 52 bytes
  Offset              4, 20 or 52 bytes
  Reduced Hash Size   Enabled or Disabled
The devices used in the performance analysis are an Intel Core 2 Duo 6600 processor (released late 2006) and an NVIDIA GeForce 8600 GTS GPU (released early 2007). We note that, in both cases, our implementation uses out-of-the-box hash function implementations; the CPU implementations are single-threaded and use only one core of the Intel processor. We defer the discussion of the impact of these platform choices to Section 6. For all performance data points, we report averages over multiple experiments; the number of experiments is adjusted to guarantee 90% confidence intervals. We applied a full factorial experimental design to evaluate the impact of each combination of the factors presented in Table 2. The following sections present a summary of these experiments.

4.1.2 Experimental Results
The first question addressed by our experiments is: what is the execution time speedup offered by StoreGPU compared to a CPU implementation? To answer this question, we determine the ratio of the execution time on the CPU to the execution time on the GPU, for both the MD5 and SHA1 hashing algorithms.
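As an illustration of how this metric can be collected, the harness below times the GPU path with CUDA events; cpu_hash() and gpu_hash() are placeholder names of ours for the two implementations, and the GPU timing deliberately includes host-device transfers.

    // Hypothetical timing harness for the speedup metric: a speedup above
    // one means the GPU path (including transfers) beats the CPU path.
    // cpu_hash() and gpu_hash() are placeholders for the two implementations.
    #include <stdio.h>
    #include <time.h>
    #include <cuda_runtime.h>

    extern void cpu_hash(void);   // single-threaded CPU implementation
    extern void gpu_hash(void);   // StoreGPU-style GPU implementation

    int main(void)
    {
        cudaEvent_t start, stop;
        float gpu_ms;

        clock_t t0 = clock();
        cpu_hash();
        clock_t t1 = clock();
        double cpu_ms = 1000.0 * (t1 - t0) / CLOCKS_PER_SEC;

        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        gpu_hash();                   // allocation, copies, kernel, copy-back
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);   // wait until the GPU work completes
        cudaEventElapsedTime(&gpu_ms, start, stop);

        printf("speedup = %.2fx\n", cpu_ms / gpu_ms);
        return 0;
    }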
Figure 4: StoreGPU speedup for MD5 implementations for direct hashing. SHA1 performs similarly.

Figure 4 shows the speedup achieved by StoreGPU for MD5 with the direct hashing module; SHA1 behaves similarly. Values larger than one indicate a performance improvement, while values lower than one indicate a slowdown. The results show that the optimized StoreGPU (pinned and shared memory optimizations enabled) starts to offer speedups for blocks larger than 300KB, and offers up to 4x speedup for large data blocks (>5MB).

Note that as the data size increases, the performance improvement reaches a saturation point. It is also important to observe that non-optimized GPU implementations may perform much worse than their CPU counterparts: when memory accesses are not optimized, performance can decrease by up to 30x for small blocks (8KB, MD5). This fact highlights two aspects. First, efficient memory management is paramount to achieving maximum performance in data-intensive applications running on GPUs. Second, as the data size grows, the overhead of moving the data from the host to the device shrinks relative to the processing cost. We discuss the latter point in more detail in the next section.
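One way to see why the curve saturates is a simplifying cost model of ours (not taken from the paper): let the CPU hashing time be c·N for input size N, and let the GPU time be a fixed setup cost a (allocation, kernel launch) plus transfer and kernel terms b·N and k·N that grow linearly with N:

    speedup(N) = t_CPU(N) / t_GPU(N) ≈ c·N / (a + (b + k)·N)  →  c / (b + k)  as N → ∞

For small N the constant a dominates, pushing the ratio below one, which matches the slowdown observed for small blocks; for large N the ratio levels off at c / (b + k), which matches the saturation visible in Figure 4.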
Figure 5 and Figure 6 present the results of experiments with the sliding-window hashing module. Qualitatively, the observed behavior is similar to that of the direct hashing module; quantitatively, however, the speedup delivered by StoreGPU is much higher. The figures show results for MD5; other parameter choices, as well as SHA1, lead to similar patterns, so we do not include those results here and instead direct the reader to our technical report.

Sliding-window hashing introduces two extra parameters that influence performance: the window size and the offset. The window size determines how much data is hashed, while the offset determines by how many bytes the window is advanced after each hash operation. The experiments explore four combinations of these two factors, with values chosen to match those used by storage systems like LBFS [32], Jumbostore [20], and stdchk [9].

Figure 5 shows the results for a configuration that leads to intense computational overheads: a window size of 20 bytes and an offset of 4 bytes. In this configuration (in fact, the one suggested by LBFS), StoreGPU hashes the input data up to 9x faster for MD5 and up to 5x faster for SHA1. For a slightly larger window (52 bytes), StoreGPU performs slightly worse than in the previous scenario: the speedup is over 7x for MD5 and about 4.8x for SHA1. The same trend is also observed in experiments where the offset is increased to 20 bytes, as shown in Figure 6.

Figure 5: StoreGPU sliding-window hashing module speedup for MD5. Window size=20 bytes, offset=4 bytes.

The sliding-window hashing achieves higher speedups than the direct hashing module for two reasons. First, the CPU implementation of sliding-window hashing pays the additional overhead of a function call to hash each window, while StoreGPU spawns one thread per window and these threads execute in parallel. Second, since the window size is usually smaller than 64 bytes (the input block size for SHA1 or MD5), every window is padded to a full 64 bytes. This translates to hashing considerably larger amounts of data for the same input, making this module more computationally intensive and thus a better fit for GPU processing. This is also the reason we observe larger speedups with smaller window sizes and offsets.
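To put a number on this amplification, here is a back-of-the-envelope calculation of ours (not from the paper) for the LBFS-style configuration:

    // Back-of-the-envelope work amplification for sliding-window hashing:
    // every window is padded to a full 64-byte MD5/SHA1 input block, so the
    // bytes actually hashed can far exceed the bytes read. Our own arithmetic.
    #include <stdio.h>

    int main(void)
    {
        const double input  = 1024.0 * 1024.0; // 1 MB of input data
        const double window = 20.0;            // bytes per window
        const double offset = 4.0;             // window advance
        const double padded = 64.0;            // MD5/SHA1 block size

        double windows = (input - window) / offset + 1.0; // ~262K windows
        double hashed  = windows * padded;                // bytes fed to the hash

        printf("amplification: %.1fx\n", hashed / input); // ~16x for these values
        return 0;
    }

With a 4-byte offset, the 64-byte padded blocks mean roughly 64/4 = 16x more bytes are fed to the hash function than are read from the input, which is what makes this module compute-bound and a good fit for the GPU.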
Finally, we observed that the speedup achieved for MD5 is higher than that for SHA1. Although we do not have a precise understanding of the reasons for this performance disparity, our intuition is that it is due to intrinsic characteristics of the two algorithms.
[Figure 6: StoreGPU sliding-window hashing speedup using MD5; x-axis: Data Size (KB).]
Figure 10: Percentage of total execution time spent on each stage when none of the optimizations are enabled.
[Figure 11: percentage of total execution time spent on each stage with the optimizations enabled; legend: Stage 1 to Stage 5; y-axis: Runtime Percentage; x-axis: Data Size (KB).]

Figure 10 and Figure 11 illustrate the proportion of total execution time that corresponds to each execution stage. These results show the major impact of the pinned and shared memory optimizations on the contribution of each stage to the total runtime. Using pinned memory reduces the impact of the data transfer (compare Stage 2 in Figure 10 to Figure 11), while using shared memory reduces the impact of kernel execution (compare Stage 3 in Figure 10 to Figure 11). Finally, enabling both optimizations increases the relative impact of the allocation stage, since pinning memory demands a higher overhead during allocation (Stage 1 in Figure 11).
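As a generic CUDA illustration of the pinning trade-off discussed above (not StoreGPU's code): page-locked buffers allocated with cudaMallocHost() make transfers faster but are more expensive to allocate, which is why Stage 1 grows when pinning is enabled.

    // Minimal illustration of the pinning trade-off (generic CUDA, not
    // StoreGPU's code): page-locked host memory makes cudaMemcpy faster,
    // but cudaMallocHost itself costs more than malloc, which is why
    // Stage 1 (allocation) grows when pinning is enabled.
    #include <cuda_runtime.h>

    void transfer_example(size_t n)
    {
        unsigned char *host, *dev;
        cudaMallocHost((void **)&host, n);  // pinned: costly allocation,
                                            // fast DMA transfers afterwards
        cudaMalloc((void **)&dev, n);

        cudaMemcpy(dev, host, n, cudaMemcpyHostToDevice); // Stage 2: input copy
        // ... launch hashing kernel (Stage 3), copy results back (Stage 4) ...

        cudaFree(dev);
        cudaFreeHost(host);                 // pinned memory needs cudaFreeHost
    }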
[Table 3/4 excerpt:]
                                          Throughput (MBps)     Similarity
                                          StoreGPU   Standard   ratio detected
  Fixed block size
  (using direct hashing)                     840        193     23%   (speedup: 4.3x)
  Variable block size (LBFS technique
  using sliding-window hashing)              114        13.5    80%   (speedup: 8.4x)

Table 3 and Table 4 compare the throughput of online similarity detection using standard hashing functions running on the CPU against using StoreGPU. These results show a dramatic improvement in the throughput of online similarity detection with both fixed- and variable-size blocks. Despite the fact that we are using a lower-end GPU, the results indicate that fixed-block similarity detection can be used even on 10Gbps systems, while the variable block size technique can be used for systems connected […]