Abstract

Active memory systems improve application cache behavior either by performing data-parallel computation in the memory elements or by supporting address re-mapping in a specialized memory controller. The former approach allows more than one memory element to operate on the same data, while the latter allows the processor to access the same data via more than one address; data coherence is therefore essential for correctness and transparency in active memory systems. In this paper we show that it is possible to extend a conventional DSM coherence protocol to handle this problem efficiently and transparently on uniprocessor as well as multiprocessor active memory systems. With a specialized programmable memory controller we can support several active memory operations with simple coherence protocol code modifications and no hardware changes. This paper presents details of the DSM cache coherence protocol extensions that allow speedups of 1.3 to 7.6 over normal memory systems on a range of simulated uniprocessor and multiprocessor active memory applications.
Keywords: Active memory systems, address re-mapping, cache coherence, distributed shared memory, flexible memory controller.
Introduction
2.1 Active Memory Extensions
2.2
value and the most recent value is w1, residing in P0's cache. On receiving the reply from P0, the protocol updates w1' with the correct value and also writes back C1 to memory. For C2, we find that it is cached by P0 and P1 in the shared state, so the protocol sends invalidations to P0 and P1 for this line. In this case, the protocol can read the correct value of w2' directly from main memory. The case for C14 is similar to C1 except that P1, instead of P0, is caching this line. For the other cache lines that are clean in main memory, the protocol need not do anything. Now that the protocol has evicted all the cached lines re-mapped to C' from the caches of P0 and P1 and updated the data of C' with the most recent values, it is ready to reply to P0 with C'. Finally, the protocol updates the AM bits of all the directory entries of the cache lines re-mapped to C'. Because C' is now cached, the AM bits of C0, C1, ..., C15 are set and that of C' is clear. This guarantees correctness for future accesses to any of these cache lines.
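To make the sequence concrete, here is a compilable C sketch (our illustration, not the paper's actual protocol handler) that walks the directory entries of the sixteen normal lines aliasing a re-mapped line C', handling the dirty, shared, and clean cases from the example above. The dir_entry_t layout, the handle_remapped_read name, and the printf calls standing in for protocol messages are all assumptions.

#include <stdio.h>

#define NLINES 16                   /* normal lines C0..C15 aliasing one C' */

typedef enum { CLEAN, SHARED, DIRTY } dstate_t;

typedef struct {
    dstate_t state;
    unsigned sharers;               /* bit vector of caching processors     */
    int      owner;                 /* valid only when state == DIRTY       */
    int      am_bit;                /* set: the aliased space may be cached */
} dir_entry_t;

static dir_entry_t dir_normal[NLINES];  /* directory entries for C0..C15 */
static dir_entry_t dir_remapped;        /* directory entry for C'        */

/* Serve processor p's read request for the re-mapped line C'. */
static void handle_remapped_read(int p)
{
    for (int i = 0; i < NLINES; i++) {
        dir_entry_t *e = &dir_normal[i];
        switch (e->state) {
        case DIRTY:   /* like C1 or C14: fetch the latest word from the
                         owner, patch it into C', and write back Ci     */
            printf("intervene P%d for C%d; write C%d back\n", e->owner, i, i);
            break;
        case SHARED:  /* like C2: invalidate every sharer; main memory
                         already holds the correct word                 */
            printf("invalidate sharers 0x%x of C%d\n", e->sharers, i);
            break;
        case CLEAN:   /* memory is up to date: nothing to send */
            break;
        }
        e->state   = CLEAN;
        e->sharers = 0;
        e->am_bit  = 1;             /* C' is about to be cached */
    }
    dir_remapped.am_bit = 0;
    printf("reply to P%d with C'\n", p);  /* all aliases are now evicted */
}

int main(void)
{
    /* Recreate the example: C1 dirty in P0, C2 shared by P0 and P1,
       C14 dirty in P1; everything else clean in main memory.        */
    dir_normal[1]  = (dir_entry_t){ DIRTY,  0x1,  0, 0 };
    dir_normal[2]  = (dir_entry_t){ SHARED, 0x3, -1, 0 };
    dir_normal[14] = (dir_entry_t){ DIRTY,  0x2,  1, 0 };
    handle_remapped_read(0);
    return 0;
}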
We have described how our Matrix Transpose protocol enforces mutual exclusion between the caching states of normal and re-mapped lines. However, this is overly strict, since it is legal to cache both normal and re-mapped lines provided both are in the shared state. We find, though, that for Matrix Transpose, enforcing mutual exclusion achieves higher performance, because all transpose applications we have examined share the following pattern: a processor first reads the normal-space cache lines from the portion of the data set assigned to it, eventually writes to them, and then moves on to read and eventually update the re-mapped-space cache lines. Accesses therefore tend to migrate from one space to the other. When the active memory controller sees a read request for normal or re-mapped space, it knows that an upgrade request will eventually follow.
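The policy itself is small enough to sketch. The following hypothetical illustration (evict_aliases, handle_request, and the printf stand-ins are invented names, not the paper's code) treats a read in either space as a signal to flush the other space's aliases, saving the invalidation round that the predicted upgrade would otherwise require:

#include <stdio.h>

typedef enum { NORMAL_SPACE, REMAPPED_SPACE } space_t;

static void evict_aliases(space_t s, unsigned long line)
{
    printf("flush aliases of line %#lx cached in space %d\n", line, (int)s);
}

/* Mutually exclusive caching: any access to one space, read or write,
   evicts the other space's aliases, anticipating the upgrade that the
   migratory sharing pattern makes likely. */
static void handle_request(space_t space, int is_read, unsigned long line, int p)
{
    (void)is_read;   /* under this policy, reads are treated like writes */
    evict_aliases(space == NORMAL_SPACE ? REMAPPED_SPACE : NORMAL_SPACE, line);
    printf("reply to P%d with line %#lx\n", p, line);
}

int main(void)
{
    handle_request(NORMAL_SPACE, 1, 0x1000, 0);   /* a read still evicts */
    return 0;
}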
2.3
Another issue is gathering invalidation acknowledgments on multinode systems. The active memory protocol needs to invalidate cache lines that are mapped to the requested line and cached by one or more processors, but the requested line and the lines to be invalidated have different addresses. Therefore, invalidation-request and invalidation-acknowledgment messages must either carry different addresses, or a mapping procedure must be invoked at acknowledgment-gathering time so that invalidation requests are matched to their corresponding acknowledgments. Finally, we also need to give special consideration to remote interventions. While conventional protocols may have to send at most one intervention per memory request, the active memory protocol may have to send multiple interventions whose addresses differ but map to the requested line. The intervention reply handler must therefore gather all the intervention replies before replying with the requested line.
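A minimal sketch of the bookkeeping this implies, assuming a single outstanding request; remap_to_requested is a stand-in for the real address mapping, and the helpers and counters are our invention rather than the paper's handler code. The point is that both acknowledgments and intervention replies arrive carrying aliased addresses and must be mapped back to C' before the outstanding counters are decremented:

#include <stdio.h>

typedef unsigned long addr_t;

typedef struct {
    addr_t req_line;                /* the re-mapped line C' being built */
    int    pending_inval_acks;
    int    pending_intervention_replies;
    int    requester;
} pending_req_t;

static pending_req_t req;           /* one outstanding request, for brevity */

/* Stand-in for the transpose mapping: translate an aliased address
   back to the requested line C'. */
static addr_t remap_to_requested(addr_t aliased)
{
    (void)aliased;
    return req.req_line;
}

static void maybe_reply(void)
{
    if (req.pending_inval_acks == 0 && req.pending_intervention_replies == 0)
        printf("reply to P%d with C' (%#lx)\n", req.requester, req.req_line);
}

/* Acks carry the *aliased* address, so they are mapped back to C'
   before the outstanding count is decremented. */
static void inval_ack_handler(addr_t acked_line)
{
    addr_t cprime = remap_to_requested(acked_line);
    printf("ack for %#lx matched to C' %#lx\n", acked_line, cprime);
    req.pending_inval_acks--;
    maybe_reply();
}

/* Unlike a conventional protocol (at most one intervention per miss),
   several interventions with different addresses may map to C'; the
   reply must wait for the last of them. */
static void intervention_reply_handler(addr_t replied_line)
{
    addr_t cprime = remap_to_requested(replied_line);
    printf("intervention reply %#lx merged into C' %#lx\n", replied_line, cprime);
    req.pending_intervention_replies--;
    maybe_reply();
}

int main(void)
{
    req = (pending_req_t){ 0x8000, 2, 2, 0 };  /* 2 acks, 2 interventions */
    inval_ack_handler(0x1040);
    inval_ack_handler(0x1080);
    intervention_reply_handler(0x1000);
    intervention_reply_handler(0x10c0);        /* last reply releases C' */
    return 0;
}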
3 Protocol Evaluation

Simulation Results

[Figure: Uniprocessor speedup of the Normal, AM, and AM+Prefetch configurations on FFTW, Conjugate Gradient, and MST]

App.    Normal        AM           Reduction Factor
FFTW    5644644       1421816      3.97
CG      13886869      3628477      3.83
MST     48582789      11829608     4.11
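The Reduction Factor column in these tables is the ratio of the Normal count to the AM count; for example, in the table above, 5644644 / 1421816 ≈ 3.97 for FFTW.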
[Figure: Speedup of Normal and AM on SPLASH2 FFT]
[Figure: Speedup of Normal and AM on the REDUCTION microbenchmark versus number of processors]
App.    Normal       AM          Reduction Factor
FFTW    5369071      1156683     4.64
CG      211323       136731      1.55
MST     12544        8947        1.40
App.    Normal (% of t_exec)    AM (% of t_exec)    Reduction Factor
FFTW    721.51 (15.72%)         13.42 (0.51%)       53.76
CG      26.63 (0.46%)           12.56 (0.48%)       2.12
MST     838.58 (4.47%)          714.03 (14.22%)     1.17
Conclusions
Our protocol extensions make it possible to support new active memory techniques without changes to the memory controller hardware. This paper presents representative results on uniprocessors and single-node multiprocessors confirming that our approach scales and performs well. Further, this protocol extension naturally lends itself to the research and development of multinode active memory systems, which we call Active Memory Clusters [3], and which have the ability to attain hardware DSM performance on commodity clusters.
Acknowledgments
This research was supported by Cornell's Intelligent Information Systems Institute and NSF CAREER Award CCR-9984314.
References
[1] J. B. Carter et al. Impulse: Building a Smarter Memory Controller. In Proceedings of the Fifth International Symposium on High-Performance Computer Architecture, January 1999.
[2] M. Hall et al. Mapping Irregular Applications to DIVA, a PIM-based Data-Intensive Architecture. In Proceedings of Supercomputing, Portland, OR, November 1999.
[3] M. Heinrich, E. Speight, and M. Chaudhuri. Active Memory Clusters: Efficient Multiprocessing on Commodity Clusters. In Proceedings of the Fourth International Symposium on High-Performance Computing, Lecture Notes in Computer Science, Springer-Verlag, May 2002.
[4] Y. Kang et al. FlexRAM: Toward an Advanced Intelligent Memory System. In Proceedings of the International Conference on Computer Design, October 1999.
[5] D. Kim, M. Chaudhuri, and M. Heinrich. Leveraging Cache Coherence in Active Memory Systems. In Proceedings of the 16th ACM International Conference on Supercomputing, New York City, June 2002.
[6] J. Kuskin et al. The Stanford FLASH Multiprocessor. In Proceedings of the 21st International Symposium on Computer Architecture, pages 302-313, April 1994.
[7] M. Oskin, F. T. Chong, and T. Sherwood. Active Pages: A Computation Model for Intelligent Memory. In Proceedings of the 25th International Symposium on Computer Architecture, 1998.