1 of 19
Speculative Multithreaded Processors
Pedro Marcuello, Antonio González and Jordi Tubella
Departament d’Arquitectura de Computadors
Universitat Politècnica de Catalunya,
Jordi Girona 1-3, Edifici D6, 08034 Barcelona, Spain
e-mail: {pmarcue,antonio,jordit}@ac.upc.es
Abstract
In this paper we present a novel processor microarchitecture that relieves four of the most important
bottlenecks that limit the exploitation of instruction-level parallelism in superscalar processors: the serialization imposed by
true dependences, the instruction window size, the complexity of a wide issue machine and the instruction
fetch bandwidth requirements. The new microarchitecture executes simultaneously multiple threads of
control obtained from a single program by means of control speculation techniques that do not require any
compiler/user support. In this way, it works on a large instruction window composed of multiple
nonadjacent small windows. Multiple simultaneous threads execute different iterations of the same loop,
which requires the same fetch bandwidth as a single thread since they share the same code. Dependences
among different threads as well as the values that flow through them are speculated by means of data
prediction techniques. The novel processor organization does not require any special feature in the
instruction set architecture; its novel features are completely based on hardware mechanisms. The
architecture is scalable in the sense that it consists of a number of processing units with separate hardware
for issuing and executing instructions. The preliminary evaluation results show the potential of the new
architecture to achieve a high IPC rate. For instance, a processor with 4 four-issue processing units achieves
an IPC ranging from 2.2 to 9.9 for the Spec95 benchmarks.
Keywords: data speculation, data dependence speculation, control speculation, dynamically scheduled
processors, multithreaded processors.
1. Introduction
Several studies on the limits of the instruction-level parallelism (ILP) that current superscalar organizations
can attain show that it is rather limited when a realistic configuration is considered (see for instance [36]
among others). Four of the most important bottlenecks that cause this limitation are: the serialization
imposed by data dependences, the instruction window size, the complexity of the logic required by a wide
issue machine and the fetch bandwidth.
Whereas a lot of effort has been devoted to reducing the penalties caused by control and name
dependences, techniques to relieve the serialization caused by data dependences1 have been practically
ignored so far. Data speculation techniques are emerging as a new family of techniques that can provide a
significant boost in ILP [12] [14] [15] [18] [19] [25] [26] [37]. Data speculation is based on predicting
either the source or the destination operands of some instructions in order to speculatively execute the
instructions that depend on them.
1. In this paper, data dependences and true dependences are synonyms [16].
The amount of ILP that a superscalar processor can exploit is highly dependent on the size of the
instruction window. However, increasing the window size poses new problems that limit its feasibility or
its effectiveness. First, branch prediction accuracy limits the average window size. To go beyond a single
basic block, superscalar processors rely on predicting the outcome of unresolved branches. However, this
process is sequential in nature because the instruction window is composed of a contiguous region of the
dynamic instruction sequence, which is called a thread of control (or thread for short) in this paper1. In
consequence, a single mispredicted branch prevents the instruction window from growing further until the
branch is resolved. Second, the complexity and delay of the issue logic grow with the instruction window size
[22]. It has been shown in a recent study [22] that the issue and bypass logic is likely to be one of the most
important hurdles to build a wide issue superscalar processor due to its impact on the clock cycle.
Finally, in addition to the branch prediction accuracy, the two main factors that limit the instruction
fetch bandwidth are: the branch prediction throughput and the potential to fetch noncontiguous
instructions.
In this paper, we propose a novel processor microarchitecture that relieves the four bottlenecks
mentioned above.
First, the processor implements an effective large instruction window that is made up of several
nonadjacent smaller windows. That is, the instructions that are in-flight in the processor at any point in time
consist of several subsequences of the dynamic instruction stream, such that there are instructions between
consecutive subsequences that are not yet known (not yet fetched). The execution inside each small window
uses conventional control speculation techniques, whereas the creation of new small windows is based on
speculating on highly predictable branches (e.g., branches that close loops). Each small window
corresponds to a different thread of control of the same program. These threads, which are not necessarily
independent, are created from a single sequential program completely by the hardware without compiler
intervention and are executed by several thread units with distributed resources. In particular, the issue and
bypass logic is local to each thread unit, which allows the processor to scale to higher issue widths by
adding more thread units, without increasing the complexity of this logic and thus, without compromising
the cycle time.
Second, the execution ordering constraints imposed by dependences among different threads (inter-thread
dependences for short) are avoided through the extensive use of data dependence and data
speculation. The proposed architecture speculates on both inter-thread data dependences and the data that
flows through them. That is, for each new speculative thread it predicts which register and memory
dependences it has with previous threads in the control flow and which values are going to flow through
such dependences. The thread is then executed obeying the predicted dependences and using the predicted
values to avoid waiting for the actual data.
1. Regardless of the particular approach used to obtain it (i.e., the partition of a program into threads of control could
be done by the hardware, as proposed in this work).
Third, since the multiple threads of control are obtained by speculating on loop-closing branches,
simultaneously active threads of control share the same code (the loop body), and thus, a simple fetch
engine can feed all the threads with the same fetch bandwidth as that required by a single thread.
The new processor microarchitecture, which is called Speculative Multithreaded (SM) architecture,
does not require any modification of the instruction-set architecture: ordinary programs compiled for a
superscalar implementation can run on this new processor architecture.
The rest of this paper is organized as follows. The SM processor microarchitecture is presented in
section 2. Performance figures of the SM architecture are analyzed in section 3. Section 4 reviews the
related work. Finally, section 5 summarizes the main contributions of this paper.
2. Speculative multithreaded processor microarchitecture
The microarchitecture of a Speculative Multithreaded Processor (SM) is shown in Figure 1. It consists of
several thread units (TU) that execute concurrently different threads of a sequential program. These threads
are dynamically obtained by a control speculation mechanism based on identifying loops and executing
speculatively different iterations of a loop [32] (not necessarily an innermost loop). These threads do not
need to be independent and therefore, there are plenty of them in any program. Thread units are
interconnected through a ring topology and iterations are allocated to thread units following the execution
order. Each thread unit has its own physical register file (local registers in Figure 1), register map table,
instruction queue, functional units, local memory and reorder buffer in order to execute multiple
instructions out-of-order. In this way, the issue bandwidth of a SM processor is scalable and the main
bottlenecks observed in superscalar processors (wakeup, select and bypass logic) are avoided [22]. High
issue rates can be achieved by increasing the number of thread units, which has no impact on the cycle time.
An important feature of SM processors is the aggressive use of speculation techniques. As pointed out
in the introduction, data dependences are one of the main barriers for the exploitation of instruction level
parallelism. This problem is addressed in SM processors through speculation mechanisms that can be
classified into three categories:
• Control speculation is used at two levels. On the one hand it is used to obtain threads from a
sequential program. On the other hand, it is used to speculate on individual branches inside each
thread like superscalar processors do.
• Data dependence speculation: Dependences among different threads (inter-thread dependences) are
predicted. Inter-thread memory dependences are predicted by means of address prediction. When a
thread executes a memory instruction, it is speculatively disambiguated against other memory
instructions in previous threads by using the predicted address of those instructions if they have not
yet been executed. The memory dependence speculation scheme is implemented by means of the
multi-value cache.
[Figure 1 diagram: three thread units (TU0, TU1, TU2) connected in a ring; each has its own register map table (Rmap), local registers, live-in registers, instruction queue, functional units and local memory, and all thread units share the instruction cache, the control speculation logic and the multi-value cache.]
Figure 1: A speculative multithreaded processor with three thread units.
Inter-thread register dependences are managed by identifying at run time which
registers hold live values at the beginning of an iteration (live-in registers) and the number of writes
that each iteration performs to them. Live-in registers whose value is not predictable are read from
the live-in register file after being produced by the previous thread.
• Data speculation. Inter-thread data dependences, either through registers or memory, do not cause a
serialization between the producer and the consumer in SM processors, provided that the values that
flow through such dependences are predictable. Previous work has shown that many such
values are predictable, including the results of arithmetic instructions, the values read from memory
and the register values used to compute the effective address of memory instructions [12] [14] [15]
[18] [19] [25] [26] [37].
Another important feature of SM processors is that they may exploit a large amount of instruction
level parallelism with a simple instruction fetching mechanism. This feature is based on the observation
that the majority of iterations of the same loop follow the same control flow. In particular, we have
evaluated that the most frequent control flow of each loop represents about 85% of the total number of
iterations for the Spec95. This suggests that if thread units are devoted to execute iterations of the same
loop with the same control flow, the instruction fetching mechanism can be shared by all the threads. A
single fetch engine fetches a single instruction stream from the instruction cache using a conventional
branch predictor. The instructions fetched in each cycle are broadcast to all the thread units, where their
registers are renamed using a different register map table and then dispatched to their
corresponding instruction queue. In this way, the processor peak performance can be equal to the actual
fetch bandwidth multiplied by the number of thread units. This organization overcomes one of the most
important hurdles of multithreaded architectures. In those machines, the processor is required to fetch from
different program counters, simultaneously or alternately, which makes the fetch engine one of the
critical parts of such architectures [31]. In SM processors, instructions are always fetched from a single
program counter. Each thread validates the predicted intra-thread control flow by executing branch
instructions and comparing the result with the prediction. In case of misprediction, this thread and the
following ones are squashed1.
Finally, precise exceptions are supported by SM processors. Instructions are retired in order with
respect to the other instructions of the same thread by means of a local reorder buffer. Besides, memory
values that are produced by each thread are kept in the multi-value cache and do not update the
next level of the memory hierarchy until the thread is committed (i.e., until the thread becomes
non-speculative). Each thread unit has a local memory to store the predicted live memory values at the
beginning of the corresponding iteration (live-in memory values) and also to speed up the access to data
produced by itself or reused several times.
The novel features of the SM microarchitecture are based completely on hardware techniques and do
not require any extension of the instruction set architecture. Below, the main parts of the SM
microarchitecture are described in more detail.
2.1. Inter-thread control speculation
A program is executed out-of-order by means of a large instruction window that consists of several
noncontiguous small windows, each one corresponding to a different thread of control. Such multiple threads
of control are built at run-time through the control speculation mechanism proposed in [32]. In this section
we just outline the main features of this mechanism.
The idea of such a mechanism is to identify loops at run-time and to execute concurrently several
iterations of the same loop, even if they are not independent. Among all the threads that proceed in parallel
at any given time, there is only one that is not control dependent on previous threads, which is
called the non-speculative thread. The remaining ones are called speculative threads.
Initially, there are no speculative threads. When the non-speculative thread starts a new iteration of a
loop, a number of speculative threads are created and allocated to execute the following iterations. When
a speculative thread reaches the closing branch of its iteration, it is suspended and waits to be either
committed or squashed. When the non-speculative thread finishes an iteration of a loop, all the speculative
threads of this loop are squashed if the branch is not taken. Otherwise, the thread allocated to the next
iteration is committed and becomes the new non-speculative thread.
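As a minimal illustration, the creation/commit/squash policy can be condensed into a few lines of simulation code. This is a sketch under our own simplifying assumptions (the function and variable names are ours, not part of the architecture); it only counts retired iterations and squashed speculative threads, ignoring timing and thread-unit reuse:

```python
# Minimal sketch of the inter-thread control speculation policy with
# num_units thread units: while the non-speculative thread runs
# iteration i, speculative threads occupy iterations i+1..i+num_units-1.
# When the loop-closing branch of iteration i resolves, either the
# thread of iteration i+1 becomes the new non-speculative thread, or
# (loop exit) every remaining speculative thread is squashed.

def simulate(trip_count, num_units):
    """Return (committed iterations, squashed speculative threads)."""
    committed, squashed = 0, 0
    i = 0                                   # non-speculative iteration
    while True:
        speculative = [i + k for k in range(1, num_units)]
        committed += 1                      # iteration i retires in order
        if i + 1 < trip_count:              # closing branch taken
            i += 1                          # next thread promoted
        else:                               # branch not taken: loop exit
            squashed += len(speculative)
            return committed, squashed
```

For a 10-iteration loop on 4 thread units, all 10 iterations eventually commit and the 3 speculative threads beyond the last iteration are squashed at loop exit.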
1. Alternative ways of dealing with branch misspeculation, which are based on a more selective squashing, may also
be implemented, but they are not considered in this paper.
[Figure 2 diagram: each entry of the loop iteration table is tagged with the target address of a backward branch and holds a register-dependence field per logical register (live flag, value, val_str, conf, # writes) and a memory-dependence field per store (logical register and offset).]
Figure 2: The loop iteration table
A small table, which is called loop execution table, can be used to predict the number of iterations of
a loop based on previous history. A 16-entry table has a hit ratio of 92% for the Spec95 benchmarks [32].
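A last-value predictor with LRU replacement is one plausible organization for such a table. The sketch below is our own guess at the mechanism of [32]; the class name and interface are hypothetical:

```python
from collections import OrderedDict

class LoopExecutionTable:
    """Last-value predictor for loop trip counts: predict that a loop
    runs the same number of iterations as its previous execution."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = OrderedDict()        # loop_id -> last trip count

    def predict(self, loop_id):
        if loop_id in self.table:
            self.table.move_to_end(loop_id)   # LRU update on a hit
            return self.table[loop_id]
        return None                            # no history yet

    def update(self, loop_id, trip_count):
        """Record the observed trip count when the loop finishes."""
        if loop_id in self.table:
            self.table.move_to_end(loop_id)
        elif len(self.table) >= self.entries:
            self.table.popitem(last=False)     # evict the LRU entry
        self.table[loop_id] = trip_count
```

The processor would call `predict` when a loop is entered (to decide how many speculative threads are worth creating) and `update` when its closing branch finally falls through.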
2.2. Data dependence and data speculation
Inter-thread dependences (which correspond to loop carried dependences) and the values that flow through
them are predicted by means of a history table that is called loop iteration table. Each entry of this table
contains information about the last iteration of a different loop. The loop iteration table (see Figure 2) is
indexed with the loop identifier (the target address of backward branches) and it contains the following
fields:
• Register dependences. This field stores for each logical register the number of writes performed by
the last iteration of the corresponding loop, whether it contained a live value at the beginning of the
iteration, the value at the beginning, the difference between the values of the last two iterations
(val_str) and a field (conf) that indicates whether such value is predictable (for instance a 2-bit
saturating counter can be used to assign confidence to the predictions).
• Memory dependences. For each store this table stores the logical register identifier that was used to
compute the effective address and the offset of the effective address in relation to the initial value of
the register. Besides, information regarding live-in memory values could be added in order to predict
such values. However, this is not considered in this paper.
A new entry in this table is allocated every time that a new loop is started. The entry corresponding to
the loop with the least recently started iteration is chosen for replacement.
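The stride and confidence mechanism for a single register entry can be sketched as follows. This is our own minimal model; the update policy when the stride changes (decrement rather than reset) is an assumption, since the paper only specifies a 2-bit saturating counter:

```python
class RegisterHistory:
    """Per-logical-register entry of the loop iteration table
    (simplified): last live-in value, its stride between the last two
    iterations (val_str), and a 2-bit saturating confidence counter."""

    def __init__(self):
        self.live = False
        self.value = 0        # live-in value at the last iteration start
        self.val_str = 0      # stride between the last two iterations
        self.conf = 0         # 2-bit saturating counter (0..3)

    def observe(self, new_value):
        """Record the live-in value at the start of a new iteration."""
        stride = new_value - self.value
        if self.live and stride == self.val_str:
            self.conf = min(self.conf + 1, 3)   # prediction confirmed
        else:
            self.conf = max(self.conf - 1, 0)   # stride changed
            self.val_str = stride
        self.value = new_value
        self.live = True

    def predictable(self):
        return self.live and self.conf >= 2

    def predict_next(self):
        return self.value + self.val_str
```

After a few iterations with a constant stride the counter saturates and the value is classified as predictable.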
Figure 3: Hit ratio of the loop iteration table for 2, 4, 8 and 16 entries.
Each entry of the iteration table is relatively large. However, very few entries are enough to obtain
significant benefits since programs usually spend large intervals of time in the same few loops. Figure 3
shows the percentage of iterations that find their history in the iteration table for a number of entries ranging
from 2 to 16 averaged for the whole Spec95 benchmark suite. This percentage is 85% for 2 entries and 90%
for 4 entries.
2.2.1. Dependences through registers
When a speculative thread is created, its local register file and its register map table are copied from its
predecessor. Then, for each register Ri that is live and predictable, the instruction queue of the thread unit
is initialized with the instruction add Ri,Ri,stride. These instructions are not in the static code; they
are inserted in the instruction queue by the hardware when a thread is created. In this way, the registers are
initialized with the predicted values. If a register Ri is live but not predictable, then the i-th entry of the
register map table is set to point to the i-th entry of the live-in register file. This implies that the size of the
live-in register file is equal to the number of logical registers. In fact, the number of live-in registers per
iteration is much lower, as shown in section 3.2 (see Figure 10). Thus, a shorter live-in register file could
suffice, but then a map table would be necessary to indicate the physical register in the live-in register file
allocated to each live-in logical register.
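The register initialization performed at thread creation might look as follows. This is a hypothetical sketch: `live_in_info` stands for the register-dependence fields read from the loop iteration table, and the injected-instruction strings are purely illustrative:

```python
def init_thread_registers(live_in_info, reg_map):
    """Sketch of speculative-thread register initialization.

    live_in_info: dict logical reg -> (is_live, is_predictable, stride).
    reg_map: the new thread's register map table (mutated in place).
    Returns the hardware-injected instructions and the registers that
    must wait for the producer thread.
    """
    injected = []    # instructions inserted by hardware, not static code
    remapped = []    # live but unpredictable registers
    for reg, (is_live, is_predictable, stride) in live_in_info.items():
        if not is_live:
            continue
        if is_predictable:
            # Initialize the register with its predicted value.
            injected.append(f"add r{reg}, r{reg}, {stride}")
        else:
            # Point the map entry at the live-in register file; reads
            # of this register stall until the previous thread unit
            # forwards the value.
            reg_map[reg] = ("live_in", reg)
            remapped.append(reg)
    return injected, remapped
```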
Besides, each thread unit has another table, which is called register write table, that contains for each
logical register the number of remaining writes to that register. This table is initialized with the #writes field
of the loop iteration table. When an instruction with destination register Ri is retired from its thread unit,
the corresponding entry of the register write table is decreased and if it becomes zero, the result of this
instruction is also written in the i-th entry of the live-in register file of the succeeding thread unit. At this
point, any instruction of the next thread that was stalled waiting for such a value is awakened. Since all
the simultaneously active speculative threads follow the same control flow, the number of register writes
is known beforehand. If different control flows were allowed, as suggested in the extensions to the
architecture discussed in section 3, the number of writes would not be known, but it could be predicted and
all the actions performed based on this prediction.
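The countdown-and-forward behavior of the register write table can be sketched as below (our own simplified model; the wake-up of stalled consumers is represented only by the return value):

```python
class RegisterWriteTable:
    """Sketch of the per-thread-unit register write table: counts the
    remaining writes to each logical register; when the last write
    retires, the value is forwarded to the successor thread unit's
    live-in register file, waking any stalled consumer."""

    def __init__(self, writes_per_reg):
        # Initialized from the #writes field of the loop iteration table.
        self.remaining = dict(writes_per_reg)

    def retire_write(self, reg, value, successor_live_in):
        """Called when an instruction writing `reg` retires.
        Returns True if this was the last write (value forwarded)."""
        self.remaining[reg] -= 1
        if self.remaining[reg] == 0:
            successor_live_in[reg] = value   # forward to next thread unit
            return True
        return False
```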
[Figure 4 diagram: each multi-value cache line contains an address tag, a presence bit (P) and, for each of the four thread units, a data word (value0..value3) with a valid flag (V0..V3) and an expected-writes counter (NW0..NW3).]
Figure 4: The multi-value cache for a SM processor with four thread units.
2.2.2. Dependences through memory
Inter-thread memory dependences are enforced by means of the multi-value cache. Besides, this structure
provides the required support for speculating on inter-thread memory dependences by means of address
prediction.
Notice that the SM processor stores multiple states of the registers, each one corresponding to a
different point in time of the execution. In the same way, multiple states of the memory are supported by
the multi-value cache (see Figure 4). It stores for each address as many different data words as there are
thread units. For each replicated word the multi-value cache contains two additional fields: the number of
writes that the corresponding thread is expected to perform (NW) and a flag indicating whether the data
corresponding to that thread unit has been produced (V). Finally, each entry contains a single presence bit
(P) that indicates whether those entries with a valid flag actually contain the data or it has to be fetched
from the next memory level. This is intended to implement a “delayed copy” policy to bring data to the
multi-value cache in order to reduce the pressure on the next memory level at initialization time.
This cache is initialized when a speculative thread is created. A new entry is allocated for each store
whose base register is predictable. This is done by inserting in the instruction queue a special instruction
that adds the register and the corresponding offset (provided by the loop iteration table). Such an
instruction, which is not in the static code but is inserted by the hardware, computes the effective address and
initializes the corresponding multi-value cache entry as follows.
For each predicted write address, a line is allocated in the multi-value cache if not already present. If some
thread does not have enough entries in the multi-value cache, it and its successors are not created. When a
new cache line is allocated, the NW field of the corresponding thread is set to 1, the P flag is reset, and the V
flags of the thread and the preceding ones are set whereas the V flags of succeeding threads are reset. If
the line is already in the multi-value cache, the NW field of the thread is increased and the V flags of the
succeeding threads are reset.
Store instructions update both the local memory and the multi-value cache. The local memory is just
updated to store the new value, regardless of whether such address was in the local memory. If the line
corresponding to the written address is in the multi-value cache, its V flag is set and its NW field is
decreased. If it becomes zero, the data is copied to all succeeding threads until the next one that is expected
to produce (but has not yet produced) a different value for the same address. This implies that it is copied
from the next thread up to the first one that has either NW or V different from zero. The data is copied into
this latter thread only if it has the V bit reset. The V bits of the threads where the produced value is copied
are set. If NW becomes negative, a misspeculation checking mechanism is activated.
Misspeculations are handled by broadcasting the store effective address to the succeeding threads.
Each thread checks in its load/store queue for a matching load and if it finds any, it is re-executed as well
as those instructions that depend on it through the selective re-issuing approach proposed elsewhere [23]
[33].
When a thread performs a store and the corresponding line is not in the multi-value cache, the
misspeculation checking mechanism is also activated. Besides, a new line is allocated into the multi-value
cache with all the V bits set and all the NW fields equal to zero.
When a thread executes a load instruction, the local memory is checked first and in case of miss, the
multi-value cache is looked up. If the corresponding data line is in the multi-value cache, it will contain a
different copy for each thread. If the data corresponding to that thread has its V bit set, then this value is
available. The presence bit indicates whether it is present in the multi-value cache or it has to be read from
the next level of the memory hierarchy. In this latter case it is copied into the multi-value cache and the
presence bit is set. If the V flag is not set, the load is cancelled and stored in a load wait queue. Loads from
this queue are tried again in idle cycles of the multi-value cache. If the corresponding cache line is not in
the multi-value cache, the data is read from the next memory level. Optionally, any read data can be copied
into the local memory to speed up further references to the same data.
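The per-line store/load behavior described above can be approximated with the following sketch. It is heavily simplified and reflects our own reading of the protocol: the presence bit, line allocation, the local memory and the exact copy rule for the boundary thread are all omitted, and the names are hypothetical:

```python
class MVLine:
    """One multi-value cache line, simplified. Each thread unit has its
    own copy of the word, a valid flag V and a pending-write counter NW
    (the number of stores the thread is still expected to perform)."""

    def __init__(self, num_units, expected_writes):
        self.value = [None] * num_units
        self.v = [False] * num_units
        self.nw = list(expected_writes)

    def store(self, tid, value):
        self.value[tid] = value
        self.v[tid] = True
        self.nw[tid] -= 1
        if self.nw[tid] < 0:
            return "misspeculation-check"   # more writes than predicted
        if self.nw[tid] == 0:
            # Last expected write: propagate to succeeding threads, up to
            # (not including) the next one expected to produce or that has
            # already produced its own value.
            for t in range(tid + 1, len(self.v)):
                if self.nw[t] > 0 or self.v[t]:
                    break
                self.value[t] = value
                self.v[t] = True
        return "ok"

    def load(self, tid):
        # The value is available only once it has been produced (or
        # propagated) for this thread; otherwise the load waits in the
        # load wait queue.
        return self.value[tid] if self.v[tid] else "wait"
```

For example, with four thread units where threads 0 and 2 are each predicted to store once, thread 0's store becomes visible to thread 1 but not to thread 2, whose own pending store bounds the propagation.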
When a thread finishes (or is squashed), if the corresponding NW field of any line of the multi-value
cache is greater than zero it is reset to zero, and the value is propagated to succeeding threads as in the case
when the counter becomes zero. This occurs when some predicted store did not actually take place. In this
case, all dependences have been obeyed, but there may be loads of succeeding threads waiting for a
nonexistent write that must be awakened.
A line of the multi-value cache can be considered for replacement only if all its NW fields are equal
to zero (this will always be the case when all the speculative threads have finished). If the line is dirty, it is
considered for replacement only if, in addition, there are no speculative threads. This ensures that the next
memory level is only updated with committed values. Deadlock is guaranteed not to happen since new lines
only need to be allocated at speculative thread creation.
Finally, note that the local memory is not necessary for the correct functioning of the system. It is used
for performance reasons, to exploit locality in the values created by the same thread. It could also be used
to store live-in memory values using a value prediction scheme like the one proposed for registers. This
feature will be studied in future extensions of this work.
3. Performance evaluation
In this section we present the results of a preliminary evaluation of the SM microarchitecture. The objective
is to demonstrate the potential of the new architecture to exploit ILP. Evaluation of different configurations
as well as the tuning of critical parameters of the architecture like the prediction scheme, size of the cache,
size of register file, etc., are beyond the scope of this paper.
3.1. Experimental framework
The SM architecture has been evaluated through trace-driven simulation of the Spec95 benchmark suite.
The programs have been compiled with the DEC Fortran and C compilers for a DEC AlphaStation 600
5/266 with full optimization, and instrumented by means of the Atom tool [29]. A cycle-by-cycle simulation
is performed in order to obtain accurate timing results.
We have assumed a SM processor with 4 thread units, an issue bandwidth of 4 instructions per cycle
for each thread unit, 4 entries in the loop iteration table, a multi-value cache with 128 entries (4-KB
capacity) and 4 local memories with 64 entries each (512-byte capacity). The latency of these memories is
2 cycles and the next level of the memory hierarchy is assumed to have a latency of 2 cycles and an infinite
capacity. The number of functional units per thread unit is (latency in brackets): 2 simple integer (1), 1
integer multiplication (2), 2 simple FP (1), 1 FP multiplication (4) and 1 FP division (17). Every thread unit
has a local reorder buffer with 64 entries. The fetch bandwidth of the single fetch engine is just up to 4
consecutive instructions (i.e., no more than one taken branch). Branch prediction is performed through a
branch history table with 2048 2-bit entries.
Figure 5: IPC (instructions per cycle) and TPC (threads per cycle) of the SM processor and
speedup when compared with a superscalar processor.
3.2. Performance figures
Figure 5 shows the average number of committed instructions per cycle of the SM processor for the Spec95
benchmarks. It also depicts the average number of active threads per cycle (TPC) that are correctly
speculated. The TPC is a measure of the thread level parallelism exploited by the SM processor. Figure 5
also includes the speedup of the SM processor over a superscalar processor with the same fetch bandwidth
as the SM processor and the same resources as one of the thread units. It can be observed that the speedup
of the SM processor is quite correlated with the TPC, which confirms that the TPC is a rough estimation
of the additional parallelism exploited by the novel features of the SM microarchitecture. These results
correspond to 100 million instructions for each benchmark after skipping the initial part that corresponds
to the initialization of data structures.
The IPC of FP programs is quite high. Notice that in spite of a very simple instruction fetching scheme
whose bandwidth is bounded by 4 instructions per cycle, the processor can achieve in many cases an IPC
that is about twice the fetch bandwidth. For comparison, the performance of a superscalar processor is
always lower than the fetch bandwidth. Increasing the fetch bandwidth is hard since it involves the
prediction of multiple branches and the fetch of non consecutive code. Moreover, to achieve the same level
of performance, a superscalar processor should also increase the issue bandwidth. However, according to the
results of a study recently published by Palacharla et al. [22], in a 0.18 µm process, which is expected to
be in use in a few years [39], the worst-case delay increases from 578 ps for a four-issue
processor to 1056 ps for an eight-issue processor; that is, the cycle time would increase by 83%. Under
such circumstances, the maximum speedup that may be obtained by moving from a four-issue processor to
an eight-issue processor is bounded by 1.09. In consequence, it could not reach the performance level
attained by the SM processor. In fact, the results in [22] suggest that the route for higher ILP should be
based on microarchitectures that have a scalable design in most of its parts, especially in the issue logic.
This is the approach taken by SM processors.
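The 1.09 bound follows directly from the quoted delays: doubling the issue width can at most double the IPC, while the clock slows by the ratio of the two worst-case delays. A quick check of the arithmetic:

```python
# Worst-case issue/bypass delays reported by Palacharla et al. [22]
# for a 0.18 um process.
delay_4issue = 578    # ps, four-issue processor
delay_8issue = 1056   # ps, eight-issue processor

cycle_increase = delay_8issue / delay_4issue - 1.0
# Even if the eight-issue machine doubled the instructions per cycle,
# the speedup is capped by the ratio of clock frequencies:
max_speedup = 2.0 * delay_4issue / delay_8issue

print(f"cycle time increase: {cycle_increase:.0%}")   # ~83%
print(f"speedup bound: {max_speedup:.2f}")            # ~1.09
```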
Figure 6: Percentage of dynamic instructions in innermost loops.
The performance of the SM processor for integer programs is significantly lower than for FP
programs. It can be observed in Figure 5 that the main reason for such difference is their low TPC. Whereas
the TPC for FP codes is quite high, it is very low for integer codes. There are mainly two reasons for such
low TPC:
• Most of the loops in integer codes have a very low number of iterations. On average, their number of
iterations per loop is 5.56 (geometric mean) [32]. Besides, due to the use of stride predictors, thread
speculation does not take place until the third iteration of the loop. Thus, if SM processors are limited
to speculating on just one loop at a time, as assumed here, the amount of thread level parallelism
is very limited. Besides, when just one loop is speculated at a time, this loop is usually
(although not necessarily) the innermost loop of a nest. Figure 6 shows the percentage of dynamic
instructions in innermost loops. Whereas innermost loop instructions represent 64% of the total
executed instructions in the FP benchmarks, they only account for 30% for integer codes (the results
in these graphs, as well as those in the rest of the figures of the paper, correspond to the first 10^9
instructions of each program).
• Intra-thread control instructions are much more frequent in integer than in FP codes. Moreover, such
control instructions exhibit a more variable behavior in integer codes. As a consequence, the
probability that one iteration follows the same control flow as the previous one is quite high in
FP codes but low in some integer codes (see Figure 7). This directly increases the number of
threads misspeculated because they follow a different control flow from the non-speculative
thread.
We are currently investigating extensions to the SM microarchitecture to deal with these two issues.
Regarding loops with few iterations and nests with few instructions in the innermost loop, the solution
could be to allow speculation on more than one loop simultaneously. This mainly implies allowing the
processor to create speculative threads from other speculative ones. Consider for instance a loop nest of
depth two. If the outer loop is speculated, each thread could be forked, when it reaches the inner loop, into
different threads corresponding to different iterations of the inner loop.
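The multi-loop extension just described can be illustrated with a toy enumeration (assumed semantics and hypothetical function names; this is not a hardware design):

```python
# Toy model: with speculation on both loops of a two-deep nest, each
# speculative outer-iteration thread is itself forked into one thread per
# inner-loop iteration, multiplying the exposed thread-level parallelism.
def spawn_threads(outer_iters, inner_iters, max_outer_speculation):
    threads = []
    for i in range(min(outer_iters, max_outer_speculation)):
        # A thread speculated for outer iteration i reaches the inner loop
        # and forks into threads for its inner iterations.
        for j in range(inner_iters):
            threads.append((i, j))
    return threads

# With 2 speculated outer iterations and 3 inner iterations, 6 threads are
# available instead of at most 3 when only the innermost loop is speculated.
assert len(spawn_threads(outer_iters=4, inner_iters=3, max_outer_speculation=2)) == 6
```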
Figure 7: Percentage of iterations that follow the same control flow as the previous one of the same loop.
The problem of a variable intra-thread control flow could be handled by not restricting all the
concurrent iterations of the same loop to follow the same control flow. Although the required fetch
bandwidth would increase, it could be supported by a special single-ported cache, which we call loop cache
(see Figure 8), that has some similarities with the trace cache [23]. This cache is indexed by a path identifier
and each entry corresponds to the dynamic sequence of instructions executed by a particular loop iteration,
which is called a path. A path identifier consists of a loop identifier plus a particular control flow of the
branches of one iteration. Since a path may consist of many instructions, it is split into several linked lines.
Multiple fetch engines would access the loop cache alternately, in different cycles, to get one line of a
different path per cycle and broadcast it to those thread units that are executing such path. Notice that very
few paths and very few fetch engines are required since a few paths account for most of the executed
instructions. This can be observed in Figure 9, which shows that the 2 most frequent paths of each loop
cover 91% and 96% of the total number of iterations for integer and FP codes respectively.
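As a rough software model of the loop cache just described (the line size and all names are our assumptions, not the paper's design):

```python
# Hypothetical model of the loop cache: each entry is indexed by a path
# identifier (loop id plus the control-flow outcomes of the iteration's
# branches) and stores the iteration's dynamic instruction sequence split
# into several fixed-size linked lines.
LINE_SIZE = 4  # instructions per cache line (assumption)

class LoopCache:
    def __init__(self):
        self.entries = {}  # path id -> list of lines (each a list of instrs)

    def fill(self, loop_id, branch_outcomes, instructions):
        path_id = (loop_id, tuple(branch_outcomes))
        # Split the dynamic instruction sequence into linked lines.
        self.entries[path_id] = [instructions[i:i + LINE_SIZE]
                                 for i in range(0, len(instructions), LINE_SIZE)]

    def fetch_line(self, loop_id, branch_outcomes, line_index):
        # A fetch engine reads one line of a path per cycle and broadcasts it
        # to every thread unit executing that path.
        lines = self.entries.get((loop_id, tuple(branch_outcomes)))
        if lines is None or line_index >= len(lines):
            return None  # miss, or past the end of the path (the "nil" link)
        return lines[line_index]

lc = LoopCache()
lc.fill(loop_id=7, branch_outcomes=[True, False],
        instructions=["i0", "i1", "i2", "i3", "i4", "i5"])
assert lc.fetch_line(7, [True, False], 0) == ["i0", "i1", "i2", "i3"]
assert lc.fetch_line(7, [True, False], 1) == ["i4", "i5"]
assert lc.fetch_line(7, [True, False], 2) is None
```

Because a few paths cover almost all iterations, only a handful of such entries and fetch engines would be needed in practice.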
It should be pointed out that the problem in integer codes is not due to the data speculation technique.
Figure 10 shows the average number of live-in registers and live-in memory values per iteration. On
average, an iteration of the Spec95 suite has 4.3 live-in values in registers and 3.5 live-in values in memory
for integer codes and 10.9 and 12.8 respectively for FP programs. Figure 11 shows the percentage of
live-in registers, live-in memory values and addresses of live-in memory values that are predictable with a
stride-based predictor. It can be seen that in general this percentage is high and that there is no significant
difference between integer and FP codes. Obviously, the performance of SM processors could be improved
with more sophisticated predictors. We believe that this is an interesting topic for further research.

Figure 8: A loop cache. The figure shows a loop that occupies two lines.

Figure 9: Percentage of iterations versus number of different paths.
The main objective of this preliminary evaluation is to confirm the potential benefits of the SM
microarchitecture. More exhaustive studies are required to assess the benefits of different configurations
of branch, value and dependence predictors, and in general to identify critical design parameters and
propose alternative solutions.
4. Related work
Multithreaded architectures have been studied for a long time, but so far the focus has been on improving
throughput by executing several independent threads, or dependent threads with the necessary
synchronization added by the compiler in order to obey all dependences. In this paper, in contrast,
we focus on multithreaded architectures that try to reduce the execution time by dynamically speculating
(on control dependences, data dependences and data values) on multiple threads of control from a single
sequential application.

Figure 10: Average number of live-in registers and live-in memory values per loop iteration.

Figure 11: Percentage of predictable live-in registers, live-in memory values and addresses of live-in memory values.
Control speculation has been extensively researched. Proposed schemes for superscalar processors are
based on predicting branches in the sequential order of the program. This means that a single mispredicted
branch will cause the squash of every instruction fetched after it. In these schemes, branches that are
difficult to predict may prevent the processor from speculating beyond them, even if successive branches
are highly predictable. To obtain multiple threads of control, SM processors speculate only on highly
predictable branches, and therefore they have more potential to build a large instruction window. On the
other hand, speculating on non-contiguous branches results in a non-contiguous instruction window that is
more complex to manage.
Data dependence speculation is used by some processors in the memory disambiguation stage. Both
the address resolution buffer [11] and the time-sequence cache [3] of the Multiscalar, as well as the address
reorder buffer of the HP PA8000 [17] are examples of such mechanisms. However, these techniques assume
that any load is independent of all the previous stores whose addresses are unknown. More sophisticated
approaches have recently been proposed in order to predict more accurately the existence of a data
dependence through memory and avoid some of the expensive recovery actions that are required by
misspeculations [20] [21] [33].
There are few proposals in the literature dealing with the dynamic management of a large window that
consists of several threads of control, not necessarily independent of each other, obtained from a sequential
program. Pioneering work in this area was the Expandable Split Window paradigm [10] and the follow-up
work on Multiscalar processors [28]. Other proposals are the SPSM architecture [7]; the Superthreaded
architecture [30]; the Multithreaded Decoupled architecture [6]; and Trace processors [24]. There are
important differences between the SM microarchitecture and those previous proposals:
• The Multiscalar, SPSM, Superthreaded and Multithreaded Decoupled architectures require some
addition/extension to the ISA. On the other hand, the SM architecture uses speculation techniques to
obtain and manage multiple threads of control dynamically from a sequential conventional object
code without any support from the user/compiler. Moreover, in those architectures data dependences
are always enforced by executing the producer instruction before the consumer one. On the other
hand, the SM architecture uses data speculation to deal with inter-thread dependences, both through
registers and memory. In this way, the producer and consumer instructions can be executed in any
order as though they were independent, provided that the predicted values are correct.
• Data speculation has been used in previous proposals, mainly in the context of a superscalar
processor with a single thread of control [14] [15] [18] [19] [25]. Data speculation is also used by
Trace processors. However, there are significant differences between Trace processors and SM
processors. First, SM processors speculate on inter-thread data dependences by predicting all
memory references at the beginning of a thread, whereas Trace processors execute speculatively
memory instructions based on the predicted value of source operands but these instructions compete
with the other instructions for issue slots. Thus, when a SM processor disambiguates a load
instruction all the addresses of previous stores have been predicted (assuming that they are
predictable) whereas this is not the case of Trace processors. In consequence, Trace processors will
experience a much higher number of memory dependence misspeculations, that is, the data
dependence speculation mechanism of SM processors is more accurate than that of Trace processors.
The second important difference is that Trace processors require a global register file that may
become a bottleneck for the scalability of the system, whereas SM processors have all the register
files completely distributed. Third, unlike SM processors, the multiple instruction windows
simultaneously managed by Trace processors are adjacent. Finally, the approach to build the
instruction window is different: whereas in Trace processors it is based on a trace cache [23], in SM
processors it is based on a loop prediction technique [32].
• Data dependence speculation is also used by the Multiscalar and SPSM architectures. However, they
just implement an “always-independent” prediction scheme, whereas the approach used by the SM
architecture is more powerful since it is based on memory address prediction. Memory addresses
have been shown to be highly predictable. Previous works have used this fact to implement hardware
prefetching ([4] among many others), to reduce the memory latency perceived by the processor [1]
[2] [9] [13], or to implement data speculation [14] [15] [25] in the context of a superscalar processor.
Improving the instruction fetch bandwidth has been the target of some recent work. However, all of
these proposals are oriented towards a processor that supports a single thread of control. Some have focused
on improving the branch prediction throughput [5] [8] [38], whereas others have in addition addressed the
problem of non-contiguous instruction fetching [23]. The SM architecture takes a different route: it reduces
the fetch bandwidth requirements by taking advantage of the fact that simultaneously active threads process
the same code with different data. A similar feature is exploited by the CONDEL architecture [34] and the
dynamic vectorization approach proposed in [35]. However, those approaches are more restrictive than the
one used by SM processors since the former is limited to loops whose static body does not exceed the
implemented instruction window, whereas the latter is feasible only if the dynamic sequence of instructions
executed by the loop is the same for all the iterations and fits into a single instruction cache line (the
instruction cache organization that they use is the trace cache [23]).
5. Conclusions
We have presented a novel processor microarchitecture, which is called Speculative Multithreaded (SM).
A novel feature of this architecture is its ability to dynamically extract and execute multiple threads of
control from a single sequential program written in a conventional ISA without requiring any compiler
support. Multiple concurrent threads execute different iterations of the same loop. These threads are not
necessarily independent (usually they are dependent) but inter-thread data dependences are resolved by
speculation techniques: both dependences and values that flow through them are predicted. In this way,
loops that are not parallelizable by the compiler can be executed in parallel if data dependences and data
values are correctly predicted. The second main feature of the architecture is that the additional instruction
level parallelism due to inter-thread parallelism does not require any additional fetch bandwidth since all
the threads share the same code. Once a new instruction is fetched, it is copied into the instruction register
of every thread. Then, its operands are renamed using a different register map table for each thread and
afterwards, the renamed instructions are dispatched to their respective instruction queues.
A preliminary evaluation of a SM processor has shown that it can achieve a high IPC (instructions
committed per cycle) for FP programs, which can be even much higher than the fetch bandwidth. Besides,
since the architecture is based on a scalable design, it does not suffer from the cycle time penalties that wide
issue superscalar processors are expected to experience. For integer codes the performance is much lower.
The low thread-level parallelism exploited for these programs has been identified as the principal reason.
Several extensions to the architecture have been proposed to alleviate this problem: speculating
on multiple loops simultaneously and speculating on iterations of the same loop with different control flow.
A loop cache has been proposed as an extension to support the increased fetch bandwidth required by these
extensions.
In summary, we have shown that the combination of data speculation, data dependence speculation
and multiple speculative threads of control is a promising alternative to relieve the most critical bottlenecks
of current superscalar microprocessors: data dependences, the instruction window size, the complexity of
a wide issue machine and the limited instruction fetch bandwidth.
Acknowledgments
This work has been supported by the Spanish Ministry of Education under grants CICYT TIC 429/95 and
AP96-52274600, and by the Direcció General de Recerca of the Generalitat de Catalunya under grant
1996FI-03039-APDT. The research described in this paper has been developed using the computing
resources of the CEPBA.
References
[1] T.M. Austin, D.N. Pnevmatikatos and G.S. Sohi, "Streamlining Data Cache Access with Fast Address Calculation", in Proc. of the Int. Symp. on Computer Architecture, pp. 369-380, 1995.
[2] T.M. Austin and G.S. Sohi, "Zero-Cycle Loads: Microarchitecture Support for Reducing Load Latency", in Proc. of the Int. Symp. on Microarchitecture, pp. 82-92, 1995.
[3] S.E. Breach, T.N. Vijaykumar, S. Gopal, J.E. Smith and G.S. Sohi, "Data Memory Alternatives for Multiscalar Processors", Technical Report CS-TR-97-1344, University of Wisconsin, 1997.
[4] T-F. Chen and J-L. Baer, "A Performance Study of Software and Hardware Data Prefetching Schemes", in Proc. of the Int. Symp. on Computer Architecture, pp. 223-232, 1994.
[5] T. Conte, K. Menezes, P. Mills and B. Patel, "Optimization of Instruction Fetch Mechanisms for High Issue Rates", in Proc. of the Int. Symp. on Computer Architecture, pp. 333-344, 1995.
[6] M.N. Dorojevets and V.G. Oklobdzija, "Multithreaded Decoupled Architecture", Int. J. of High Speed Computing, 7(3), pp. 465-480, 1995.
[7] P.K. Dubey, K. O'Brien, K.M. O'Brien and C. Barton, "Single-Program Speculative Multithreading (SPSM) Architecture: Compiler-Assisted Fine-Grained Multithreading", in Proc. of the Int. Conf. on Parallel Architectures and Compilation Techniques, pp. 109-121, 1995.
[8] S. Dutta and M. Franklin, "Control Flow Prediction with Tree-like Graphs for Superscalar Processors", in Proc. of the Int. Symp. on Microarchitecture, pp. 258-263, 1995.
[9] R.J. Eickemeyer and S. Vassiliadis, "A Load Instruction Unit for Pipelined Processors", IBM Journal of Research and Development, 37(4), pp. 547-564, July 1993.
[10] M. Franklin and G.S. Sohi, "The Expandable Split Window Paradigm for Exploiting Fine Grain Parallelism", in Proc. of the Int. Symp. on Computer Architecture, pp. 58-67, 1992.
[11] M. Franklin and G.S. Sohi, "ARB: A Hardware Mechanism for Dynamic Reordering of Memory References", IEEE Transactions on Computers, 45(6), pp. 552-571, May 1996.
[12] F. Gabbay and A. Mendelson, "Can Program Profiling Support Value Prediction?", in Proc. of the 30th Int. Symp. on Microarchitecture, Dec. 1997.
[13] M. Golden and T.N. Mudge, "Hardware Support for Hiding Cache Latency", Technical Report CSE-TR-152-93, University of Michigan, 1993.
[14] J. González and A. González, "Memory Address Prediction for Data Speculation", in Proc. of the EURO-PAR 97 Workshop on ILP, pp. 1084-1091, 1997.
[15] J. González and A. González, "Speculative Execution via Address Prediction and Data Prefetching", in Proc. of the 11th ACM Int. Conf. on Supercomputing, pp. 196-203, 1997.
[16] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Francisco, CA, 1996.
[17] D. Hunt, "Advanced Performance Features of the 64-bit PA-8000", in Proc. of CompCon'95, pp. 123-128, 1995.
[18] M.H. Lipasti and J.P. Shen, "Exceeding the Dataflow Limit via Value Prediction", in Proc. of the Int. Symp. on Microarchitecture, pp. 226-237, 1996.
[19] M.H. Lipasti, C.B. Wilkerson and J.P. Shen, "Value Locality and Load Value Prediction", in Proc. of the 7th Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 138-147, Oct. 1996.
[20] A. Moshovos, S.E. Breach, T.N. Vijaykumar and G.S. Sohi, "Dynamic Speculation and Synchronization of Data Dependences", in Proc. of the Int. Symp. on Computer Architecture, pp. 181-193, 1997.
[21] A. Moshovos and G.S. Sohi, "Streamlining Inter-operation Memory Communication via Data Dependence Prediction", in Proc. of the 30th Int. Symp. on Microarchitecture, Dec. 1997.
[22] S. Palacharla, N.P. Jouppi and J.E. Smith, "Complexity-Effective Superscalar Processors", in Proc. of the Int. Symp. on Computer Architecture, pp. 206-218, 1997.
[23] E. Rotenberg, S. Bennett and J.E. Smith, "Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching", in Proc. of the 29th Int. Symp. on Microarchitecture, Dec. 1996.
[24] E. Rotenberg, Q. Jacobson, Y. Sazeides and J.E. Smith, "Trace Processors", in Proc. of the 30th Int. Symp. on Microarchitecture, Dec. 1997.
[25] Y. Sazeides, S. Vassiliadis and J.E. Smith, "The Performance Potential of Data Dependence Speculation & Collapsing", in Proc. of the 29th Int. Symp. on Microarchitecture, pp. 238-247, Dec. 1996.
[26] Y. Sazeides and J.E. Smith, "The Predictability of Data Values", in Proc. of the 30th Int. Symp. on Microarchitecture, Dec. 1997.
[27] J.E. Smith and A.R. Pleszkun, "Implementing Precise Interrupts in Pipelined Processors", IEEE Transactions on Computers, 37(5), pp. 562-573, May 1988.
[28] G.S. Sohi, S.E. Breach and T.N. Vijaykumar, "Multiscalar Processors", in Proc. of the Int. Symp. on Computer Architecture, pp. 414-425, 1995.
[29] A. Srivastava and A. Eustace, "ATOM: A System for Building Customized Program Analysis Tools", in Proc. of the 1994 Conf. on Programming Language Design and Implementation, 1994.
[30] J-Y. Tsai and P-C. Yew, "The Superthreaded Architecture: Thread Pipelining with Run-Time Data Dependence Checking and Control Speculation", in Proc. of the Int. Conf. on Parallel Architectures and Compilation Techniques, pp. 35-46, 1996.
[31] D.M. Tullsen, S.J. Eggers and H.M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism", in Proc. of the Int. Symp. on Computer Architecture, pp. 392-403, 1995.
[32] Authors and title removed for anonymity, in Proc. of the 4th Int. Conf. on High-Performance Computer Architecture, Feb. 1998.
[33] G.S. Tyson and T.M. Austin, "Improving the Accuracy and Performance of Memory Communication Through Renaming", in Proc. of the 30th Int. Symp. on Microarchitecture, Dec. 1997.
[34] A.K. Uht, "Concurrency Extraction via Hardware Methods Executing the Static Instruction Stream", IEEE Transactions on Computers, 41(7), July 1992.
[35] S. Vajapeyam and T. Mitra, "Improving Superscalar Instruction Dispatch and Issue by Exploiting Dynamic Code Sequences", in Proc. of the Int. Symp. on Computer Architecture, pp. 1-12, 1997.
[36] D.W. Wall, "Limits of Instruction-Level Parallelism", Technical Report WRL 93/6, Digital Western Research Laboratory, 1993.
[37] K. Wang and M. Franklin, "Highly Accurate Data Value Prediction Using Hybrid Predictors", in Proc. of the 30th Int. Symp. on Microarchitecture, Dec. 1997.
[38] T-Y. Yeh, D.T. Marr and Y.N. Patt, "Increasing the Instruction Fetch Rate via Multiple Branch Prediction and a Branch Address Cache", in Proc. of the Int. Conf. on Supercomputing, pp. 67-76, 1993.
[39] A. Yu, "The Future of Microprocessors", IEEE Micro, pp. 46-53, Dec. 1996.