Cache Coherency
Abstract—Current hardware implementations of TLS (thread-level speculation), in both Hydra and Renau's SESC simulator, use a global component, e.g. the L2 cache or a hardware list, to check for data dependence violations. Frequent memory accesses make this global component a bottleneck. In this paper, we propose a cache coherence protocol with a distributed data dependence violation checking mechanism for TLS. The proposed protocol extends the traditional MESI cache coherence protocol with several mechanisms that overcome the limits of centralized violation checking. The protocol adds an invalidation vector to each private L1 cache to record threads that violate RAW data dependences. It also adds a version priority register that compares data versions, and a snooping bit in each private L1 cache block that indicates whether the thread holds the bus snooping right for the block. The proposed protocol is considerably more complicated than the traditional MESI protocol and is hard to verify completely through simulation alone, so we applied formal verification to the proposed cache protocol to confirm its correctness. The verification result shows that the proposed protocol functions correctly in a TLS system.

Keywords—TLS; Cache Coherence Protocol; Snooping Ring; Invalidation Vector; Formal Verification

I. BACKGROUND

As Moore's Law [1] predicts, hardware is becoming progressively smaller and execution times quicker. The current hardware world is dominated by multi-core and many-core processors. While the effort to design and implement a CMP (Chip Multi-Processor) has been alleviated, a heavy burden has been shifted onto the compiler or programmer.

Thread-level speculation (TLS) has been proposed to help programmers make full use of these abundant computing resources. TLS divides sequential code into pieces (threads) and executes them in parallel, optimistically assuming that the sequential semantics will not be violated. The architecture transparently detects any data dependence violations as the threads execute. If an offending thread reads too early, it is squashed and rolled back to a previous non-speculative correct state. Thus, TLS can improve performance by exploiting Thread-Level Parallelism (TLP).

A TLS technique can be implemented in either hardware or software. In software, every memory access is wrapped by a very short piece of code which collects access information. When a memory access is a write, the wrapper detects any violation [5]. In hardware, many architectural features are available to enable TLS [2][4].

To provide the desired memory behavior, the data speculation hardware must [2]:
1. Forward data between parallel threads.
2. Detect when reads occur too early (RAW hazards).
3. Safely discard speculative state after violations.
4. Retire speculative writes in the correct order (WAW hazards).
5. Provide memory renaming (WAR hazards).

Our paper is organized as follows. First, we brief related works in Section 2. We then introduce the design considerations and the principle of the proposed protocol, and discuss implementation details including, but not limited to, hardware alterations, cache block state transitions, and cache miss event handling, in Section 3. Protocol formal verification is reported in Section 4. Finally, we draw conclusions in Section 5.

II. RELATED WORKS

Thread-Level Speculation Systems. Many systems have implemented the TLS mechanism. Some are supported directly with architectural features [2][3][4], while others use software methods [5][6]. The software approaches complement our work.

Speculative Versioning Caches [4]. SVC combines speculative versioning with cache coherence based on a snooping bus. SVC adds several useful bits to each cache line, such as commit, load, store, and stale bits, which is somewhat similar to our work. Our work eliminates the ARB's latency and bandwidth problems and performs more efficiently for distributed caches than all of the above methods. Our work fits SMP systems well, not only current CMPs. The most valuable contribution of this research is the model analyzing the speculative interaction between issued load and store instructions.

The Stanford Hydra CMP [2][7][8]. Hydra provides speculative write buffers for each speculative thread in the on-chip L2 cache, but only the one thread owning the write token is permitted to retire. These embedded buffers have many advantages. First, they can forward data between parallel threads. Second, because of the adjacent buffers,
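The software wrapper approach described in Section I can be illustrated with a small sketch. This toy model is our own illustration (the `ToyTLS` class and its methods are hypothetical, not the mechanism of the cited systems): each speculative thread records the addresses it reads, and a write by a less speculative thread to an address already read by a more speculative thread is flagged as a RAW violation, squashing the reader and discarding its speculative state — requirements 2 and 3 of the list above.

```python
# Toy RAW-violation checker for TLS (illustrative only; names are
# our own, not from the cited hardware or software systems).
class ToyTLS:
    def __init__(self, num_threads):
        # Thread ids follow original program order: 0 is least speculative.
        self.reads = [set() for _ in range(num_threads)]
        self.squashed = set()

    def load(self, tid, addr):
        # Wrapper around a read: record the address that was consumed.
        self.reads[tid].add(addr)

    def store(self, tid, addr):
        # Wrapper around a write: any more speculative thread that has
        # already read addr consumed a stale value (RAW violation).
        for later in range(tid + 1, len(self.reads)):
            if addr in self.reads[later]:
                self.squash(later)

    def squash(self, tid):
        self.squashed.add(tid)
        self.reads[tid].clear()   # safely discard speculative state

tls = ToyTLS(3)
tls.load(2, 0x40)    # thread 2 reads address 0x40 too early
tls.store(0, 0x40)   # thread 0 (less speculative) writes it later
assert tls.squashed == {2}
```

A real implementation must additionally retire writes in program order (WAW) and rename memory (WAR); the sketch covers only violation detection and state discard.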
mode. When a thread is invalidated by other threads, the TLS runtime resets the L1 cache to re-execute. When a thread has executed and is ready to commit, the TLS runtime changes the status of the L1 cache to pre-commit mode.

Figure 1. L1 cache block organization. ER: exposed read; EW: exposed write; R: snooping ring flag; V: valid.

As we noted in the previous section, we have added three bits to each L1 cache block. Figure 1 shows the organization of a private L1 cache block. We have also introduced an invalidation vector and a version priority register to each core.

1. Each cache block maintains an ER bit as shown in Figure 1. The ER bit is set when a thread reads data from main memory or the shared cache while in the speculative execution mode. If a thread reads data before less speculative threads store it, a RAW dependence violation occurs and the thread must be invalidated and re-executed.

2. Each cache block maintains an EW bit as shown in Figure 1. The EW bit is set when a thread modifies data which has been loaded into the L1 cache while in the speculative execution mode. Different threads may modify the same data before they write it back to the shared cache or main memory, leaving different data versions residing in different L1 caches. We discriminate between the "freshest" data version and the non-freshest versions. Only the "freshest" modified data can be forwarded to other threads.

3. Each cache block maintains an R (snooping ring) bit. If a thread writes data to an L1 cache for the first time and that L1 cache is not in the re-execute mode, then the write operation triggers an exposed write miss event and the thread is awarded the snooping right. The R bit cooperates with the EW bit to provide RAW data dependence violation detection and data forwarding.

4. Each core maintains an invalidation vector to provide quick thread invalidation. Each bit is mapped to one core.

In addition to the traditional requests from the local processor or bus, several new requests are introduced that the coherence controller must address. The new requests include:

1. An exposed write miss from the local processor while the cache block is invalid. The cache coherence controller addresses this event by placing an exposed write miss notation on the bus and changing the cache block state to EWR.

2. An exposed read miss from the local processor while the cache block is in the invalid status. The cache coherence controller addresses this event by placing an exposed read miss on the bus and changing the cache block state to ER.

3. An exposed write miss from the local processor while the cache block is in the ER status. The cache coherence controller addresses this event by placing an exposed write miss on the bus and changing the cache block state to EWR.

4. A write commit from the local processor while the cache block is in the EWR or EW state. The cache coherence controller addresses this event by invalidating other threads according to the priority established in the invalidation vector and changing the cache block state to invalid.

5. An exposed write miss from the bus while the cache block is in the EWR status. The cache coherence controller addresses this event by clearing the R bit in the cache block.

6. An exposed read miss from the bus while the cache block is in the EWR status. The cache coherence controller addresses this event by setting the corresponding bit in the invalidation vector according to the core ID.

The state machine for the proposed protocol is shown in Figures 2 and 3.

Figure 2. State transition between speculative states.

Figure 3. State transition between speculative states and non-speculative states.

Exposed Write. When a speculative thread executes for the first time, a write to any address causes an exposed write miss event. The coherence controller detects the write miss and places the exposed write miss event on the bus, informing the other processors to give up their bus snooping rights.
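The six new request types can be condensed into a small executable sketch. This is our own simplified model, not the authors' hardware (the `SpecBlock` class and its encoding are hypothetical, and the real controller also implements the ordinary MESI requests and the bus signaling): the block's speculative state is encoded by the ER, EW and R bits, and the invalidation vector records cores whose reads arrived too early.

```python
# Sketch of one private L1 block's speculative states (INVALID, ER,
# EW, EWR) and the six new coherence events described above.
class SpecBlock:
    def __init__(self, num_cores):
        self.er = self.ew = self.r = False
        self.inv_vec = [False] * num_cores    # per-core invalidation vector

    def state(self):
        if self.ew:
            return "EWR" if self.r else "EW"
        return "ER" if self.er else "INVALID"

    # --- requests from the local processor -----------------------------
    def local_write_miss(self):               # events 1 and 3
        if self.state() in ("INVALID", "ER"):
            self.er, self.ew, self.r = False, True, True   # -> EWR (on bus)

    def local_read_miss(self):                # event 2
        if self.state() == "INVALID":
            self.er = True                    # -> ER (read miss on bus)

    def local_commit(self):                   # event 4
        if self.state() in ("EWR", "EW"):
            victims = [i for i, bit in enumerate(self.inv_vec) if bit]
            self.er = self.ew = self.r = False
            self.inv_vec = [False] * len(self.inv_vec)
            return victims                    # cores that must be invalidated
        return []

    # --- requests snooped from the bus ---------------------------------
    def bus_write_miss(self):                 # event 5
        if self.state() == "EWR":
            self.r = False                    # give up snooping ring -> EW

    def bus_read_miss(self, core_id):         # event 6
        if self.state() == "EWR":
            self.inv_vec[core_id] = True      # record the RAW violator

blk = SpecBlock(4)
blk.local_write_miss()        # event 1: INVALID -> EWR
blk.bus_read_miss(2)          # event 6: core 2 read too early
assert blk.state() == "EWR" and blk.local_commit() == [2]
```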
Simultaneously, the coherence controller sets the EW and R bits to change the corresponding cache block to the EWR state. While in the EWR state, any exposed read or write hits from the local processor will be ignored.

If a cache block is in the EWR state, the coherence controller updates the invalidation vector, setting the bit indicated by the core ID from the bus whenever the address of a read miss from the bus matches the cache block's address field. A write miss from the bus forces the cache coherence controller to clear the R bit in the cache block. When a cache block loses its snooping ring, the coherence controller changes its state to EW. In this state, the cache coherence controller ignores all requests from either the local processor or the bus.

When a speculative thread becomes non-speculative, the cache coherence controller writes cache blocks with the EW bit set back to the shared cache or main memory and changes their states to exclusive. If the invalidation vector is not empty, the cache coherence controller must send an invalidation signal to every core whose bit is set in the invalidation vector. Otherwise the controller simply clears the cache block and changes its status to invalid. After committing, the invalidation vector must be cleared before the next speculative thread executes.

There is a significant difference between the proposed protocol and the traditional MESI protocol in managing exposed read and write misses caused by address conflicts. The traditional MESI protocol only supports non-speculative execution: any modification to data can be written back to the shared cache or main memory, and cache lines can be replaced with new data. TLS cannot write data back until threads become non-speculative. When an address conflict occurs, a cache block with the EW bit set can be neither written back nor discarded. Current cache implementations provide mechanisms to resolve address conflicts, such as locking a cache line or mapping the new data to other cache blocks. We use a victim cache to hold the conflicting blocks. Threads in present TLS implementations are always "while" or "for" loops and do not touch enough data to trigger address conflicts, and other speculative threads are not permitted to be scheduled to the same core before the active thread commits, so the chance of an inter-thread address conflict is negligible.

Exposed Read. Handling an exposed read is much simpler than handling an exposed write. The coherence controller places a read miss, together with the version priority and core ID, on the bus. When data from a remote core is ready, the access to the shared cache or main memory is aborted. After a cache line has been filled with data, the coherence controller sets the ER bit to change the block status to ER.

When a cache block is in the ER status, the coherence controller ignores all requests from the local core except read misses and write misses. The coherence controller handles a read miss from the local processor by replacing the cache block with new data, keeping the cache block in the ER status. Upon receiving a write miss request, the coherence controller not only replaces the cache block with new data but also clears the ER bit and sets the EW and R bits.

Invalidation Optimization. The proposed protocol works well if no thread is invalidated during execution. Adhering to the first thread management policy, if a thread holds the snooping ring, it records the threads that violate the RAW data dependence when relinquishing the snooping ring. The coherence controller does not need to monitor read misses on the bus and record threads a second time when a thread lacking the snooping ring is invalidated and re-executes. However, the proposed protocol cannot tell a write miss from a thread running for the first time apart from a write miss from a re-executing thread. A write miss from a re-executing thread that lacks the snooping ring would cause a newer, more speculative, thread to relinquish the snooping ring, and the cache coherence controller would then record read misses on the bus belonging to that more speculative thread. We name this event "priority reverse"; when a thread attempts to commit, the TLS runtime would invalidate the wrong threads. A re-execute mode was therefore added to the L1 cache in order to distinguish between the two types of write miss.

When a thread is invalidated and re-executes, the TLS runtime changes the L1 cache to the re-execute mode. While the L1 cache is in the re-execute mode, any write miss from the local core that hits a data cache block with the EW bit set will neither be placed on the bus nor obtain the snooping ring.

When a thread has executed and is ready to commit, the cache coherence controller continues to monitor bus activity. If a cache block has the R bit set, the cache coherence controller will still detect RAW data dependence violations and update the invalidation vector. The thread may thus forward data to other threads when it commits without canceling the threads that are executing; needlessly re-launching a thread would degrade system performance.

After a thread has run, the TLS runtime sets the pre-commit bit for the L1 cache. If a read miss request with lower version priority from the bus hits a cache block with the R bit set, the coherence controller will place the data on the bus, provided the L1 cache is in the pre-commit mode. The controller then updates the invalidation vector.

IV. PROTOCOL FORMAL VERIFICATION

In order to apply formal verification to the proposed cache coherence protocol, we introduce the following notation.

TABLE II. NOTATION

S(addr_i) — the state of the cache line in core i whose address field equals addr.
LDEI(addr_i) — the initial exposed load operation launched by core i for data at physical address addr.
STEI(addr_i) — the initial exposed store operation launched by core i for data at physical address addr.
Time(event) — the time at which the event happens.
Spec(addr_i) — the speculation degree of the thread
executed in core i that accesses the address addr.
Ring(addr_i) — core i holds the bus snooping ring for the address addr.
InvVec(i) — the invalidation vector in core i.

S(addr_i) ∈ {Invalid, Exclusive, Shared, EWR, EW, ER}

The state transitions can be expressed by the following equations; we write Ring(addr_i) --e--> Ring(addr_j) to denote that event e transfers the snooping ring for addr from core i to core j. From the above discussion, we know that the proposed TLS mechanism keeps memory access operations in order: if a memory access comes after other memory accesses, it will not be scheduled to execute ahead of them.

Theorem 1. Consider any two exposed store operations to the same address addr in different active speculative threads. Even if there are other exposed store operations between them in the original application, the store executed later in the original sequential program will obtain the bus snooping ring for the address addr from the earlier one.

frm:

∀addr, i ≠ j: Spec(addr_i) < Spec(addr_j) ∧ Ring(addr_i)
⇒ Ring(addr_i) --STEI(addr_j)--> Ring(addr_j)

prof: Take any k, k ≠ i, k ≠ j, such that no thread falls between i and k:

∵ ¬∃h: Spec(addr_i) < Spec(addr_h) < Spec(addr_k)
∴ Spec(addr_h) > Spec(addr_k) ∨ Spec(addr_h) < Spec(addr_i)

Spec(addr_h) > Spec(addr_k) ⇒ Time(STEI(addr_h)) > Time(STEI(addr_k))
Spec(addr_h) < Spec(addr_i) ⇒ Time(STEI(addr_h)) < Time(STEI(addr_i))

∴ ¬∃h: Time(STEI(addr_i)) < Time(STEI(addr_h)) < Time(STEI(addr_k))
∴ Ring(addr_i) --STEI(addr_k)--> Ring(addr_k)

Recursively, Ring(addr_k) --STEI(addr_l)--> Ring(addr_l), …, Ring(addr_p) --STEI(addr_j)--> Ring(addr_j).

Because the L1 cache keeps the snooping ring even when the speculative thread is squashed, Theorem 1 ensures that the proposed protocol preserves the WAW semantics of the original program: the snooping ring is never transferred from a more speculative thread back to a less speculative one.

Theorem 2. Consider an exposed load operation that follows an exposed store operation to the same address addr in a different active speculative thread. If no other exposed store operation falls between them in the original application, core i (the core of the store) will catch the exposed read before its speculative thread gives up the bus snooping ring for the address addr.

frm:

∀addr, i, j, k with STEI(addr_i), LDEI(addr_j), STEI(addr_k), i ≠ j ≠ k:
Spec(addr_i) < Spec(addr_j) ∧ Ring(addr_i)
∧ (Spec(addr_k) < Spec(addr_i) ∨ Spec(addr_k) > Spec(addr_j))
⇒ j ∈ InvVec(i)

prof:

Spec(addr_k) < Spec(addr_i) ⇒ Ring(addr_k) --STEI(addr_i)--> Ring(addr_i)   (1)
Spec(addr_k) > Spec(addr_j) ⇒ Time(STEI(addr_k)) > Time(LDEI(addr_j))   (2)
∴ ¬∃k: Ring(addr_i) --STEI(addr_k)--> Ring(addr_k) before LDEI(addr_j)   (3)

Combining Eq. (1) with (3), we obtain:

Ring(addr_i) --LDEI(addr_j)--> j ∈ InvVec(i)

Because the L1 cache keeps the snooping ring even when the speculative thread is squashed, Theorem 2 ensures that the proposed protocol preserves the RAW semantics of the original program. From the proof of Theorem 2 we also notice the following result: if a speculative store does not fall between another speculative store and a speculative load, it will not try to obtain the snooping ring before the speculative load happens. So the WAR semantics are preserved as well: a speculative thread will not be squashed by an "inappropriate" thread. From the above verification process, we confirm that the proposed cache coherence protocol will not violate the data dependences existing in the original application.

V. CONCLUSION

Memory system design in TLS systems is very important, and tradeoffs between hardware complexity and functionality are essential. In this paper we proposed a cache coherence protocol for TLS with distributed data dependence violation detection. Violation detection is done by each core, not by a global component as in the SESC simulator, nor by altering cache functionality as transactional memory (TM) does. The private L1 cache provides a perfect memory location for data renaming. The greatest advantage of the proposed protocol is that it does not need to broadcast write events while executing store instructions; it uses a snooping ring together with an invalidation vector to address RAW dependence violation detection. The correctness of the proposed protocol is confirmed through formal verification.

Data forwarding is an important feature in TLS systems, but our protocol does not include it because of thread squashes: it is fruitless to forward data between speculative threads when intermediate data is not the final result. Our next target is to provide an effective data forwarding method for TLS systems while keeping the implementation as simple as possible.

REFERENCES

[1] G. E. Moore, "Cramming More Components onto Integrated Circuits," Electronics, vol. 38, April 1965.
[2] L. Hammond, B. Hubbert, M. Siu, M. Prabhu, M. Chen, and K. Olukotun, "The Stanford Hydra CMP," IEEE Micro, March-April 2000.
[3] J. Renau, J. Tuck, W. Liu, L. Ceze, K. Strauss, and J. Torrellas, "Tasking with Out-of-Order Spawn in TLS Chip Multiprocessors: Microarchitecture and Compilation," Proc. International Conference on Supercomputing (ICS), pp. 179-188, June 2005.
[4] S. Gopal, T. Vijaykumar, J. Smith, and G. Sohi, "Speculative Versioning Cache," Proc. Fourth International Symposium on High-Performance Computer Architecture (HPCA-4), February 1998.
[5] P. Rundberg and P. Stenström, "An All-Software Thread-Level Data Dependence Speculation System for Multiprocessors," The Journal of Instruction-Level Parallelism, 1999.
[6] C. J. F. Pickett and C. Verbrugge, "Software Thread Level Speculation for the Java Language and Virtual Machine Environment," Proc. Languages and Compilers for Parallel Computing (LCPC), October 2005.
[7] L. Hammond, M. Willey, and K. Olukotun, "Data Speculation Support for a Chip Multiprocessor," Proc. Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), October 1998.
[8] K. Olukotun, L. Hammond, and M. Willey, "Improving the Performance of Speculatively Parallel Applications on the Hydra CMP," Proc. 1999 International Conference on Supercomputing (ICS), June 1999.
[9] J. Renau, B. Fraguela, J. Tuck, W. Liu, M. Prvulovic, L. Ceze, S. Sarangi, P. Sack, K. Strauss, and P. Montesinos, SESC Simulator, January 2005. http://sesc.sourceforge.net.
[10] J. G. Steffan, C. B. Colohan, and T. C. Mowry, "Architectural Support for Thread-Level Data Speculation," Technical Report CMU-CS-97-188, School of Computer Science, Carnegie Mellon University, 1997.
[11] J. G. Steffan and T. C. Mowry, "The Potential for Thread-Level Data Speculation in Tightly-Coupled Multiprocessors," Technical Report CSRI-TR-350, Computer Science Research Institute, University of Toronto, 1997.
[12] J. G. Steffan, C. B. Colohan, A. Zhai, et al., "A Scalable Approach to Thread-Level Speculation," Proc. 27th Annual International Symposium on Computer Architecture (ISCA), 2000.
[13] J. G. Steffan, C. B. Colohan, A. Zhai, et al., "Improving Value Communication for Thread-Level Speculation," Proc. Eighth International Symposium on High-Performance Computer Architecture (HPCA-8), 2002, pp. 65-75.
[14] J. G. Steffan, C. B. Colohan, A. Zhai, et al., "The STAMPede Approach to Thread-Level Speculation," ACM Transactions on Computer Systems, vol. 23, no. 3, 2005, pp. 253-300.