Labrecque Martin 201111 PHD Thesis
Labrecque Martin 201111 PHD Thesis
Labrecque Martin 201111 PHD Thesis
by
Martin Labrecque
Martin Labrecque
Doctor of Philosophy
University of Toronto
2011
Packet processing is the enabling technology of networked information systems such as the Internet
and is usually performed with fixed-function custom-made ASIC chips. As communication protocols
evolve rapidly, there is increasing interest in adapting features of the processing over time and, since
software is the preferred way of expressing complex computation, we are interested in finding a
platform to execute packet processing software with the best possible throughput. Because FPGAs are
widely used in network equipment and they can implement processors, we are motivated to investigate
executing software directly on the FPGAs. Off-the-shelf soft processors on FPGA fabric are currently
geared towards performing embedded sequential tasks and, in contrast, network processing is most often
inherently parallel between packet flows, if not between each individual packet.
Our goal is to allow multiple threads of execution in an FPGA to reach a higher aggregate throughput
than commercially available shared-memory soft multi-processors via improvements to the underlying
soft processor architecture. We study a number of processor pipeline organizations to identify which
ones can scale to a larger number of execution threads and find that tuning multithreaded pipelines can
provide compact cores with high throughput. We then perform a design space exploration of multicore
soft systems, compare single-threaded and multithreaded designs to identify scalability limits and
develop processor architectures allowing threads to execute with as little architectural stalls as possible:
in particular with instruction replay and static hazard detection mechanisms. To further reduce the wait
times, we allow threads to speculatively execute by leveraging transactional memory. Our multithreaded
multiprocessor along with our compilation and simulation framework makes the FPGA easy to use for
an average programmer who can write an application as a single thread of computation with coarse-
grained synchronization around shared data structures. Comparing with multithreaded processors using
ii
lock-based synchronization, we measure up to 57% additional throughput with the use of transactional-
memory-based synchronization. Given our applications, gigabit interfaces and 125 MHz system clock
rate, our results suggest that soft processors can process packets in software at high throughput and low
iii
Acknowledgements
First, I would like to thank my advisor, Gregory Steffan, for his
also thank the UTKC, my sempais and Tominaga Sensei for introducing
me to the ’basics’.
Most of the credit goes to my parents, who are coaches and unconditional
this page for making the sun shine even on rainy days.
iv
Contents
List of Figures ix
List of Tables x
Glossary 1
1 Introduction 1
1.1 FPGAs in Packet Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Soft Processors in Packet Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Research Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 6
2.1 Software Packet Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Application Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Network Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.3 Fast- vs Slow-Path Processing . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 FPGAs and their Programming Model . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Hard Processors on FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Soft Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
v
5 Understanding Scaling Trade-offs in Soft Processor Systems 40
5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2 Experimental Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Integrating Multithreaded Processors with Off-Chip Memory . . . . . . . . . . . . . . 45
5.3.1 Reducing Cache Conflicts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.3.2 Tolerating Miss Latency via Replay . . . . . . . . . . . . . . . . . . . . . . . 46
5.3.3 Cache Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.4 Scaling Multithreaded Processor Caches . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.5 Scaling Multiprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
9 Conclusions 106
9.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
9.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
vi
A Application-Specific Signatures for Transactional Memory 111
A.1 Transactional Memory on FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
A.1.1 Signatures for Conflict Detection . . . . . . . . . . . . . . . . . . . . . . . . . 112
A.1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
A.2 Previous Signature Implementations for HTM . . . . . . . . . . . . . . . . . . . . . . 114
A.3 Application-Specific Signatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
A.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
A.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Bibliography 130
vii
List of Figures
5.1 Cache organizations and the corresponding impact on the execution of a write hit. . . . 47
5.2 CPI versus area for the various processors. . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3 Area efficiency versus total cache capacity per thread. . . . . . . . . . . . . . . . . . . 50
5.4 CPI versus area for multithreaded designs supporting varying numbers of contexts. . . 51
5.5 Diagram showing an arbiter connecting multiple processor cores in a multiprocessor. . 52
5.6 CPI versus area for various multiprocessors. . . . . . . . . . . . . . . . . . . . . . . . 53
5.7 CPI versus area for the shared and partitioned designs. . . . . . . . . . . . . . . . . . 54
5.8 CPI versus total thread contexts across all benchmarks. . . . . . . . . . . . . . . . . . 55
5.9 CPI versus fraction of load misses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.10 CPI versus area for our two best-performing maximal designs. . . . . . . . . . . . . . 58
viii
8.6 Probability of no packet drops for UDHCP. . . . . . . . . . . . . . . . . . . . . . . . 101
8.7 Throughput improvement relative to locks-only (NetThreads). . . . . . . . . . . . . . 102
8.8 Simulated normalized throughput resulting from unlimited mutexes. . . . . . . . . . . 104
ix
List of Tables
x
Glossary
xi
Chapter 1
Introduction
Packet processing is a key enabling technology of the modern information age and is at the foundation
of digital telephony and TV, the web, emails and social networking. Because of the rapidly evolving
standards for hardware (e.g. from 10 to 1000 Mbits/sec over copper wires), protocols (e.g. from
CDMA to LTE wireless protocols) and applications (from static web content to streaming video), the
bandwidth requirements are increasing exponentially and the tasks that define packet processing for a
given application and transport medium must be updated periodically. In the mean time, accommodating
those needs by designing and fabricating processors with an application-specific integrated circuit
(ASIC) approach has progressively become more difficult because of the increasing complexity of
the circuits and processes, the high initial cost and the long time to market. Therefore processing
packets in software at high throughput is increasingly desirable. To execute software, a designer can
choose network processor ASICs which have a fixed organization, or FPGAs (Field-Programmable
Gate Arrays) which are configurable chips that can implement any kind of digital circuit and do not
Other than being able to implement a considerable number of equivalent logic gates, FPGAs present
advantages specifically in the context of packet processing. First, FPGAs can be connected with a very
low latency to a network link and can be reprogrammed easily, thus providing a way for the networking
1
C HAPTER 1. I NTRODUCTION 2
hardware equipment to keep up with the constantly changing demands of the Internet. Also, FPGAs can
partition their resources to exploit the parallelism available in packet streams, rather than processing
packets in sequence as single-core systems do. FPGAs provide several other advantages for computing:
they allow for application-specific hardware acceleration, their code can be reused in several designs,
and they can be integrated with almost any memory or peripheral technology, thus reducing a board’s
FPGAs are already widely adopted by network appliance vendors such as Cisco Systems and
Huawei and there exists a number of commercial reference designs for high-speed edge and metro
network nodes consisting of Altera and Xilinx devices [10, 60]. Overall, sales in the communications
industry represent more than 40%1 of the market for Altera and Xilinx (who, in turn, account for 87% of
the total PLD market [11]). In most of these use-cases however, FPGAs serve as data paths for passing
packets between other chips in charge of the bulk of the packet processing. The challenge to the wide-
spread use of FPGAs for computing is providing a programming model for the application functions
that need to be updated frequently. The current design flow of FPGAs usually involves converting a full
application into a hardware circuit—a process that is too cumbersome for the needs of many applications
and design teams. To make configurable chips easy to program for a software developer, FPGA-based
designs increasingly implement soft processors in a portion of their configurable fabric. While software
packet processing on FPGAs is only in its infancy, the processor-on-FPGA execution paradigm has
gained considerable traction in the rest of the embedded community. There remains several challenges
in making a system of these processors efficient, and this thesis aims at addressing some of them.
Although there exist tools to compile a high-level program down to a logic circuit, the quality of
the circuit often suffers because of the difficulty of analyzing pointer references. Since converting
complex applications into state machines leads to the creation of a very large number of states, current
synthesis tools often produce bulky circuits that are not always desirable from a frequency or area
standpoint. Reprogramming the configurable fabric of an FPGA in the field may also require board-level
1 Telecom and wireless represent 44% of Altera’s sales [11]. Communications account for 47% of Xilinx’s sales [156].
C HAPTER 1. I NTRODUCTION 3
circuitry that adds to the cost and complexity of the final system. Software programmable processor
systems, i.e. programmable cores that potentially have accelerators sharing a bus, are therefore very
desirable, assuming that they can deliver the performance required for given clock frequency and area
specifications.
While data intensive manipulations can be performed in parallel on a wide buffer of data either
with wide-issue or vector-processing elements, control intensive applications are best described in a
software compiled from a high-level programming language, we focus on the architecture of general-
purpose processors on FPGA platforms, which could later be augmented with application-specific
accelerators to execute the data-intensive portions of an application. We next highlight areas where
currently available commercial soft processors need improvement for packet processing, and how we
The goal of this research is to allow multiple threads of execution, such as in packet processing
applications, to reach a higher aggregate throughput via improvements to the underlying soft processor
architecture. We demonstrate that our resulting soft processors can provide an efficient packet
processing platform while being easy to use for software programmers, by setting the following goals.
1. To build soft processors with an improved area efficiency with the intent of instantiating
a plurality of them. To this end, we augment the single-threaded soft processor architectures
that are currently commercially available with support for customized code generation and
2. To explore the memory system and processor organization trade-off space for scaling up
reach this goal, we assemble instances of these processors into a multiprocessor infrastructure
that requires the design of a memory hierarchy. Studying different cache architectures to hide the
latency of off-chip memory accesses and evaluating the scalability limits of such designs gives us
exploration with real applications. For this purpose, we integrate our soft multiprocessor on
a platform with high-speed packet interfaces, a process requiring the support for fast packet input
4. To characterize and tackle the primary bottleneck in the system, which we identify as
introduce hardware thread scheduling which is both a compiler and architectural technique in
our solution. While this technique improves throughput significantly, significant synchronization
stalls remain. To utilize the available hardware thread contexts more efficiently, we introduce
optimistic concurrency through transactional memory. We first study in isolation and on single-
Finally, to measure the benefits of our programming model, we compare our system with
This thesis is organized as follows: in Chapter 2, we provide some background information on FPGAs,
network processing and soft processors. In Chapter 3, we identify representative packet processing
applications and investigate how to execute them efficiently. In particular, we quantify performance
problems in the pipeline programming model for packet processing applications and determine that
the run-to-completion approach (Section 3.1.2) is easier to program and more apt at fully utilizing
multiple concurrent processing threads. In consequence, we determine early on that the synchronization
bottleneck must be addressed for the simpler run-to-completion approach to be competitive in terms
of performance; this objective is the driving force for the rest of this thesis. In Chapter 4, we study
multithreaded soft processor variations: those cores act as a building block for the remainder of the
thesis. Our target for this chapter is a Stratix FPGA with 41,250 logic elements, a mid-range device.
In Chapter 5, we replicate our soft cores and study their scalability when using conventional single-
C HAPTER 1. I NTRODUCTION 5
threaded soft processors as a baseline. As we are interested in maximizing the use of larger FPGAs,
we leverage the Transmogrifier 4 infrastructure with a Stratix FPGA containing 79,040 logic elements
(the largest chip in its family). Encouraged by the results, we select an FPGA board for packet
processing with 4 gigabit Ethernet ports, which represents even today a considerable bandwidth at the
non-ISP level. We migrate our processor infrastructure to this new board and build the rest of our
experimentation on the board’s Virtex-II Pro FPGA with 53,136 logic cells : we dedicate Chapter 6
to describe this platform and our baseline processor organization. As we determine that our packet
processing applications are hindered by the use of mutexes, in Chapter 7, we present a hardware
thread scheduler to address the synchronization bottleneck. In Chapter 8, we present our full system
integration of speculative execution with multithreaded multiprocessors and we are able to validate that
compared to a programming model where threads have a specialized behavior. Finally, in Chapter 9, we
Background
This chapter presents some background information regarding software packet processing and its
implementation in network processor ASICs. We then provide some context information on FPGAs,
on their programming model, and how they can be used to process packets via software-defined
applications. For reading convenience, we place the related work specific to the processor architecture
components that we study in their respective later thesis chapters. We start by giving some high-level
With close to 2 billion users in 2010 [104] and growing needs for interactive multimedia and online
services, the computing needs of the Internet are expanding rapidly. The types of interconnect devices
range from traditional computers, to emerging markets such as sensors, cell phones and miniaturized
computers the size of a wall adapter called “plug computers” [26]. Ericsson predicts that there will be
50 billion network connections between devices by 2020 [40]. As an increasing number of businesses
depend on network services, packet processing has broad financial and legal impacts [45, 46].
Many packet processing applications must process packets at line rate, and to do so, they must
scale to make full use of a system composed of multiple processors and accelerators cores and of the
available bandwidth to and from packet-buffers and memory channels. As network users are moving
towards networks requiring higher-level application processing, flexible software is best suited to adapt
6
C HAPTER 2. BACKGROUND 7
to fast changing requirements. Given the broad and varied use of packet processing, in this section we
clarify the application-types and bandwidths for which software packet processing is suitable, and what
1) Basic Common packet processing tasks performed in small office/home office network equipment
include learning MAC addresses, switching and routing packets, and performing port forwarding, port
and IP filtering, basic QoS, and VLAN tagging. These functions are typically limited to a predefined
number of values (e.g. 10 port forwarding entries) such that they can be implemented in an ASIC switch
routines, apply a regular transformation to most of the bytes of a packet. Because these workloads often
require several iterations of specialized bit-wise operations, they benefit from hardware acceleration
such as the specialized engines present in network processors [56], modern processors (e.g. Intel
AES instruction extensions), and off-the-shelf network cards; they also generally do not require the
programmability of software.
3) Control-Flow Intensive Network packet processing is no longer limited solely to routing, with many
applications that require deep packet inspection becoming increasingly common. Some applications,
such as storage virtualization and server load balancing, are variations on the theme of routing that reach
deeper into the payload of the packets to perform content-based routing, access control, and bandwidth
allocation. Other applications have entirely different computing needs such as the increasingly complex
firewall, intrusion detection and bandwidth management systems that must recognize applications,
scan for known malicious patterns, and recognize new attacks among a sea of innocuous packets.
Furthermore, with the increasing use of application protocols built on HTTP and XML, the distinction
between payload and header processing is slowly disappearing. Hence in this thesis we focus on
shared, persistent data structures are modified during the processing of most packets.
C HAPTER 2. BACKGROUND 8
Until recently, the machines that process packets were exclusively made out of fixed ASICs performing
increasingly complex tasks. To keep up with these changing requirements, computer architects have
devised a new family of chips called network processors. They are software programmable chips
designed to process packets at line speed: because the processing latency usually exceeds the packet
inter-arrival time, multiple packets must be processed concurrently. For this reason, network processors
(NPs) usually consist of multithreaded multiprocessors. Multithreading has been also used extensively
The factors limiting the widespread adoption of network processors are (i) the disagreement on what
is a good architecture for them; and (ii) their programming complexity often related to their complex
ISAs and architectures. Network processor system integrators therefore avoid the risk of being locked-in
a particular ASIC vendor’s solutions. We next give an overview of the literature on network processor
StepNP [114] provides a framework for simulating multiprocessor networks and allows for multi-
threading in the processor cores. It has even been used to explore the effect of multithreading in
the face of increasing latencies for packet forwarding [115]. In our work, we explore similar issues
but do so directly in FPGA hardware, and focus on how performance can be scaled given an FPGA
Ravindran et al. [122] created hand-tuned and automatically generated multiprocessor systems for
packet-forwarding on FPGA hardware. However, they limited the scope of their work to routing tables
which can fit in the on-chip FPGA Block RAMs. Larger networks will demand larger routing tables
hence necessitating the use of off-chip RAM. Our work differs in that we heavily focus on the effect of
the on-board off-chip DDR(2)-SDRAM, and do so over a range of benchmarks rather than for a specific
application.
There exists a large body of work focusing on architecture exploration for NPs [31, 32, 49, 114, 130,
135, 149], however none of them has seriously investigated nor argued against the run-to-completion
programming model on which we focus (see Section 3) and that is becoming increasingly common
in commercial products such as the PowerNP/Hifn 5NP4G [5, 36], Mindspeed M27483 TSP3 [68],
C HAPTER 2. BACKGROUND 9
Broadcom BCM1480 [20], AMCC nP7510 [29] and Vitesse IQ2200 [118].
Network equipment typically connects multiple network ports on links with speeds that span multiple
orders of magnitude: 10Mbps to 10Gbps are common physical layer data rates. While local area
networks can normally achieve a high link utilization, typical transfer speeds to and from the Internet
are on the order of megabits per second [27] as determined by the network utilization and organization
The amount of processing performed on each packet will directly affect the latency introduced on
each packet and the maximum allowable sustained packet rate. The amount of buffering available on
the network node will also help mitigate bursts of traffic and/or variability in the amount of processing.
For current network appliances that process an aggregate multi-gigabit data stream across many ports,
there is typically a division of the processing in data plane (a.k.a. fast path) and control plane (a.k.a.
slow path) operations. The data plane takes care of forwarding packets at full speed based on rules
defined by the control plane which only processes a fraction of the traffic (e.g. routing protocols).
Data plane processing is therefore very regular from packet to packet and deterministic in the number
of cycles per packet. Data plane operations are typically implemented in ASICs on linecards; control
plane operations are typically implemented on a centralized supervisor card. The control plane, often
software programmable and performing complex control-flow intensive tasks, still has to be provisioned
to handle high data rates. For example, the Cisco SPP network processor in the CRS-1 router is designed
to handle 40Gbps [24]. On smaller scale network equipment (eg., a commodity desktop-based router at
the extreme end of the spectrum), the two planes are frequently implemented on a single printed circuit
board either with ASICs or programmable network processors or a combination of both. In that case,
the amount of computation per packet has a high variance, as the boundary between the fast and slow
In this thesis, we focus on complex packet processing tasks that are best suited to a software
applications therefore target the control plane, rather than the data plane of multi-gigabit machines.
We now introduce FPGAs, in contrast with ASIC network processors, and explain how FPGAs can
C HAPTER 2. BACKGROUND 10
An FPGA is a semiconductor device with programmable lookup-tables (LUTs) that are used to
implement truth tables for logic circuits with a small number of inputs (on the order of 4 to 6 typically).
Thousands of these building blocks are connected with a programmable interconnect to implement
larger-scale circuits. FPGAs may also contain memory in the form of flip-flops and block RAMs
(BRAMs), which are small memories (on the order of a few kilobits), that together provide a small
FPGAs have been used extensively for packet processing [15,55,92,98,102,110,129] due to several
advantages that they provide: (i) ease of design and fast time-to-market; (ii) the ability to connect to a
number of memory channels and network interfaces, possibly of varying technologies; (iii) the ability
to fully exploit parallelism and custom accelerators; and (iv) the ability to field-upgrade the hardware
design.
Other than being used in commercial end-products, FPGAs also provide an opportunity for
prototyping and high-speed design space exploration. While the Internet infrastructure is dominated
by vendors with proprietary technologies, there is a push to democratize the hardware, to allow
researchers to revisit some low-level network protocols that have not evolved in more than a decade.
This desire to add programmability in the network is formally embraced by large scale projects such as
CleanSlate [41], RouteBricks [37] and GENI [137], in turn supported by massive infrastructure projects
such as Internet2 [6] and CANARIE [131]. FPGAs are a possible solution to fulfill such a need as
they allow one to rapidly develop low-level packet processing applications. As an example of FPGA-
based board, the NetFPGA development platform [91] (see Chapter 6) allows networking researchers
to create custom hardware designs affordably, and to test new theories, algorithms, and applications at
line-speeds much closer to current state-of-the-art. The challenge is that many networking researchers
are not necessarily trained in hardware design; and even for those that are, composing packet processing
FPGA network processing with software programmable processors has been explored mostly in the
C HAPTER 2. BACKGROUND 11
perspective of one control processor with FPGA acceleration [3, 4, 38, 64, 89, 140]. A pipeline-based
network processor has been proposed where an ILP solver finds the best processor count per pipeline
stage [67, 165]. The most relevant work to our research is found in Kachris and Vassiliadis [65] where
two Microblaze soft-processors with hardware assists are evaluated with a focus on the importance of the
load balance between the processors and the hardware assists. The amount of parallelism they exploit is
limited to two threads programmed by hand and they do not make use of external memory. FPL-3 [30]
is another project suggesting compilation from a high-level language of packet processing; however, we
While the FPGA on the current NetFPGA board on which we perform our measurements has two
embedded PowerPC hard cores, the next generation of NetFPGA will not (assuming a Virtex-5
XCV5TX240T-2 FPGA device [90]), making an alternative way of implementing processors—via the
abandoned hard cores on FPGAs with the Excalibur device family that included an ARM922T processor,
the Xilinx Virtex-5 family of FPGAs provides a limited number of devices with hard PowerPC cores,
and the Virtex-6 family does not offer any hard processor cores. Interestingly, because processors
are common and useful, the latest Xilinx 7-Series introduced ARM-based processors [157]. So
far, hard processors FPGA cannot reach the gigahertz speed of operation of modern processors for
single-threaded workloads: the PowerPC cores in the FXT family of Virtex-5 FPGAs can reach at
most 550MHz [155] and the state-of-the-art Xilinx 7-Series Zynq cores are reported to operate at
800MHz [157]. In retrospect, while the two major FPGA vendors, Xilinx and Altera, formerly had hard
processors in some of their devices, it seems that they have, up until recently, slowed their investment
Improving logic density and maximum clock rates of FPGAs have led to an increasing number of FPGA-
based system-on-chip (i.e. single-chip) designs, which in turn increasingly contain one or more soft
processors—processors composed of programmable logic on the FPGA. Despite the raw performance
C HAPTER 2. BACKGROUND 12
drawbacks, a soft processor has several advantages compared to creating custom logic in a hardware-
description language: it is easier to program (e.g., using C), portable to different FPGAs, flexible
(i.e., can be customized), and can be used to control or communicate with other components/custom
accelerators in the design. Soft processors provide a familiar programming environment allowing
non-hardware experts to target FPGAs and can provide means to identify and accelerate bottleneck
The traditional network packet forwarding and routing are now well understood problems that can be
accomplished at line speed by FPGAs but more complex applications are best described in a high-level
software executing on a processor. Soft processors are very well suited to packet processing applications
that have irregular data access and control flow, and hence unpredictable processing times. As FPGA-
based systems including one or more soft processors become increasingly common, we are motivated
to better understand the architectural trade-offs and improve the efficiency of these systems.
FPGAs are now used in numerous packet processing tasks and even if many research projects have
demonstrated working systems using exclusively soft processors on FPGAs [64, 67, 108], the bulk of
the processing is often however assumed by another on-board ASIC processor. Our proposed soft
processor system improves on commercially available soft processors [16, 152] by: (i) specifically
taking advantage of the features of FPGAs; and (ii) incorporating some benefits from ASIC network
processor architectures such as multithreaded in-order single issue cores, which can be found in the
Intel IXP processor family [57] 1 and the QuantumFlow processor [25]. Our final system targets the
NetFPGA [91] card, which is unique with its four Gigabit Ethernet ports; we envision however that our
design could be adapted for system with a different number of Ethernet ports such as RiceNIC [132].
Because most soft processors perform control-intensive tasks (the bulk of the reconfigurable
fabric being reserved for data intensive tasks), commercial SPs (in particular NIOS-II [16] and
Microblaze [152]) issue instructions one at a time and in order. There exists a large number of
open-source soft processors with instruction sets as varied as ARM, AVR, SPARC, MIPS and full-
custom [1, 117]. Vector soft processors [71, 164] offer instructions for array-based operations, which
relate to applications domains such as graphics and media processing, which are not our focus in
this thesis. The most related soft processors to our investigations are the PicaRISC processor from
Altera [69] which has not been publicly documented yet and the multithreaded UT-II processor [43],
which we describe in Chapter 4. SPREE [161] gives an overview of the performance and area
consumption of soft processors. As a reference point, on a Stratix EP1S40F780C5 device with the
fastest speed grade, a platform that we use in Chapter 4, the NiosII-fast (the variation with the fastest
clock rate [16]) reaches 135 MHz (but instructions are not retired on every cycle). We next review how
packet processing can be expressed, i.e. how will the soft-processors be programmed inside an FPGA.
Chapter 3
While nearly all modern packet processing is done on multicores, the mapping of the application to
those cores is often specific to the underlying processor architecture and is also a trade-off between
performance and ease-of-programming. In this chapter, we first explore the multiple ways of managing
parallelism in packet processing and focus on the most versatile and convenient approach. Then, after
defining a set of representative software packet processing applications, we quantitatively justify our
In this section, we examine the three main programming models illustrated in Figure 3.1.
3.1.1 Pipelining
To program the multiple processing elements (PEs) of a network processor, most research focuses
on breaking the program into one or several parallel pipelines of tasks that map to an equal number
of processor pipelines (as shown in Figure 3.1 (a) and (c)). In this simplest single-pipeline model,
while inter-task communication is allowed, sharing of data between tasks is usually not supported. The
pipeline model rather focuses on exploiting data locality, to limit the instruction storage per PE and to
regroup accesses to shared resources for efficient instruction scheduling. Pipelining is widely used as
the underlying parallelization method [33, 59, 145, 148], most often to avoid the difficulty of managing
14
C HAPTER 3. C HOOSING A P ROGRAMMING M ODEL 15
a) Pipelined PE0 PE 1 PE n
PE0
b) Run−to−completion PE1
PEn
PE 0 PE PEn−m
m+1
PE 1 PEm+2 PE
c) Hybrid n−m+1
PE m PE 2m PE n
locks: there is no need to synchronize if writes to shared data structures are done in a single pipeline
The pipeline programming model is well suited for heterogeneous PEs arranged in a one-
dimensional array. Conversely, programs are best suited for pipelining if they are composed of data-
independent and self-contained kernels executing in a stable computation pattern with communication at
the boundaries [136]. To obtain the best performance on such a pipeline architecture, all the processing
stages have to be of similar latency so that one stage is ready to accept work at the same time as the
previous stage is ready to hand off some work. However, efficiently balancing the pipeline stages and
and often not possible for complex applications. Even if an acceptable result is obtained, this difficult
process must be repeated when the application is updated or the code is ported to a different processor
architecture. Furthermore, handling packets that require varying amounts of computation or breaking
down frequent accesses to large stateful data structures (such as routing tables) to ensure lock-free
operation is impractical in a pipeline with many stages. We quantify those difficulties when pipelining
In earlier work [74], we have demonstrated the difficulties of the pipeline model. We have
processor. Our applications were defined as task graphs that we attempted to transform to execute
efficiently on the underlying hardware. We created a parametric model of a network processor that
allowed us to approximate the performance of different execution schemes. We observed that the
decomposition of the application into a graph of tasks resulted in a significant amount of idle cycles
in processing elements, a result that was also observed in NepSim [93]. The system remained complex
to analyze, debug and to target by a compiler, because of problems such as load imbalance and the
competing effects of task scheduling and mapping. We also identified that not all workloads can easily
be expressed as task graphs: the bulk of numerous network processor applications is contained in a
Earlier publications [85, 97] have asserted that network processing is a form of streaming [138], a
computing model related to pipelining. In stream programming, the programmer uses a custom language
to describe regular computations and their parallelism such that the compiler can schedule precisely
concurrent SIMD operations in automatically balanced pipeline stages. We argue that complex control
flow in advanced network processing is too unpredictable to be best described as a stream program. For
example, a network application that interprets TCP packets to provide data caching for a database server
will have widely varying amounts of processing to apply on packets: describing the work as a pipeline
3.1.2 Run-to-Completion
The model in Figure 3.1(b) refers to the method of writing a program where different processors process
packets from beginning-to-end by executing the same program. Different paths in the program will
exercise different parts of the application on different threads, which do not execute in lock-step. The
programming is therefore intuitive but typically requires the addition of locks to protect shared data
structures and coherence mechanisms when shared data can be held in multiple locations.
Nearly all modern multicore processors are logically organized in a grid to which the programmer
C HAPTER 3. C HOOSING A P ROGRAMMING M ODEL 17
can map an execution model of his choice. The major architectural factors that would push a
programmer away from the intuitive run-to-completion model are: (i) a high cache coherence penalty
across private data caches, or (ii) a reduced instruction and data cache storage compared to a task
decomposition in pipeline(s). Most network processors do not implement cache coherence (except
commodity multicore machines) and for our experiments, our data cache is shared and therefore has
roughly the same hit rate with or without task pipelining. For network processors with no special
communication channel between the cores for pipeline operations, such as the 100Gbps-rated 160-
threads Cisco QuantumFlow Processor [25], run-to-completion is the natural programming model.
The grid of processors or hybrid scheme (see Figure 3.1 (c)) also requires a task decomposition and
presents, to some extent, the same difficulty as the pipelined model. While packets can flow across
different pipelines, the assumption is that a specialized engine at the input would dispatch a packet to a
given pipeline, which would function generally independently. Enforcing this independence to minimize
lock contention across pipelines is actually application specific and can lead to severe load-imbalance.
The number of processors assigned to each pipeline must also be sized according to the number of
network interfaces to provide a uniform response time. In the presence of programs with synchronized
sections, abundant control flow and variable memory access latencies, achieving a good load balance is
often infeasible. The hybrid model is also inflexible: the entire code must be re-organized as a whole
if the architecture or the software is changed. A variation on the hybrid model consists of using the
processors as run-to-completion but delegating atomic operations to specialized processors [143]. That
model also removes the need for locks (assuming point-to-point lock-free communication channels) but
poses the same problems in terms of ease of programming and load imbalance as the pipeline model.
Because the synchronization around shared data structures in stateful applications makes it
impractical to extract parallelism otherwise (e.g., with a pipeline of balanced execution stages), we adopt
the run-to-completion/pool-of-threads model, where each thread performs the processing of a packet
from beginning-to-end, and where all threads essentially execute the same program code. Consequently
focus on a single replicated pipeline stage. We next present our applications and quantify in Section 3.3
C HAPTER 3. C HOOSING A P ROGRAMMING M ODEL 18
To measure packet throughput, we need to define the processing performed on each packet. Network
packet processing is no longer limited solely to routing, with many applications that require deeper
packet inspection becoming increasingly common and desired. Most NP architecture evaluations
to date have been based on typical packet processing tasks taken individually: microbenchmarks.
NetBench [101], NPBench [86] and CommBench [150] provide test programs ranging from MD5
message digest to media transcoding. Stateless kernels that emulate isolated packet processing routines
fall into the first two categories in Section 2.1.1 which are not our focus. For those kernels that are
limited to packet header processing, the amount of instruction-level parallelism (ILP) can exceed several
thousand instructions [86]. Because such tasks are best addressed by SIMD processors or custom
ASICs, in this thesis we instead focus on control-flow intensive applications where the average ILP
is only five (a number in agreement with other studies on control-flow intensive benchmarks [144]).
While microbenchmarks are useful when designing an individual PE or examining memory behavior,
they are not representative of the orchestration of an entire NP application. There is little consensus
in the research community on an appropriate suite of benchmarks for advanced packet processing, at
which current fixed-function ASICs perform poorly and network processors should excel in the future.
To take full advantage of the software programmability of our processors, our focus is on control-
flow intensive applications performing deep packet inspection (i.e., deeper than the IP header). In
addition, and in contrast with prior work [86, 101, 150], we focus on stateful applications—i.e.,
applications in which shared, persistent data structures are modified during the processing of most
packets. Since there is a lack of packet processing benchmark suites representing applications that
are threaded and synchronized, we have developed the four control-flow intensive applications detailed
in Table 3.1. Except for Intruder [22] and its variation Intruder2, the benchmarks are the result of
discussions with experts in the networking community, in particular at the University of Toronto and at
Table 3.1 also describes the nature of the parallelism in each application. Given that the applications
C HAPTER 3. C HOOSING A P ROGRAMMING M ODEL 19
plication recognition. In the absence of a match, 2007) [28] HTTP server replies are
the payloads of packets are reassembled and added to all packets presumably
tested up to 500 bytes before a flow is marked coming from an HTTP server to
as non-matching. As a use case, we configure trigger the classification.
the widely used PCRE matching library [53]
(same library that the popular Snort [124] intru-
sion detection/prevention system uses) with the
HTTP regular expression from the “Linux layer
7 packet classifier” [87].
Exemplifies network address translation by Exhibits short transactions that en- Same packet trace as Classifier.
rewriting packets from one network as if orig- compass most of the processing;
inating from one machine, and appropriately exploits parallelism across flows
NAT
rewriting the packets flowing in the other direc- stored in a global synchronized
tion. As an extension, NAT collects flow statistics hash-table.
and monitors packet rates.
Derived from the widely-used open-source Periodic polling on databases for Packet trace modeling the expected
DHCP server. As in the original code, leases time expired records results in many DHCP message distribution of a
UDHCP
are stored in a linearly traversed array and IP read-dominated transactions as seen network of 20000 hosts [14].
addresses are leased after a ping request for them in Table 6.2. Has high contention on
expires, to ensure that they are unused. shared lease and awaiting-for-ping
array data structures.
Network intrusion detection [22] modified for Packets are stored in a synchro- 256 flows sending random messages
packetized input. Extensive use of queues and nized associative array until com- of at most 128 bytes, broken ran-
Intruder
lists, reducing the effectiveness of signatures due plete messages are fully reassem- domly in at most 4 fragments, con-
to random memory accesses [75]; mostly CPU- bled. They are then checked against taining 10% of ’known attacks’. The
bound with bursts of synchronized computation a dictionary before being removed fragments are shuffled with a sliding
on highly- contended data structures. from the synchronized data struc- window of 16 and encapsulated in IP
tures and sent over the network. packets.
Multiple lists and associative array
make extensive use of the memory
allocation routines.
Network intrusion detection [22] modified for Similar to above, with significantly Same as row above.
packetized input and re-written to have array- lighter data strutures because of stat-
based reassembly buffers to avoid the overhead ically allocated memory. Has two
Intruder2
of queues, lists and maps that also reduced synchronization phases: first a per-
the effectiveness of signatures due to the large flow lock is acquired and released
amount of malloc()/free() calls [75]. to allow processing each packet in-
dividually, then most of the com-
putation is performed on reassem-
bled messages before the per-flow
variables are modified again under
synchronization.
C HAPTER 3. C HOOSING A P ROGRAMMING M ODEL 20
initially exist as sequential programs where an infinite loop processes one packet on each iteration, the
parallelism discussed is across the iterations of the main infinite loop. We only utilize the Intruder
benchmarks starting in Chapter 8 because they present a different behavior with regards to speculative
execution, which is the focus of the last part of this thesis. Since quantitative metrics about our
benchmarks have evolved slightly as we edited the benchmarks, we report updated numbers in the
appropriate sections of the thesis. The baseline measurements in Table 6.2 (co-located with the
experimental results) reports statistics on the dynamic accesses per critical section for each application.
Note that the critical sections comprise significant numbers of loads and stores with a high disparity
between the average and maximum values, showing that our applications are stateful and irregular in
terms of computations per packet. We next analyze the representative traits of each application and
generalize them to other control-flow intensive network applications, particularly with respect to packet
Packet Ordering In a network device, there is typically no requirement to preserve the packet ordering
across flows from the same or different senders: they are interpreted as unrelated. For a given flow, one
source of synchronization is often to preserve packet ordering, which can mean: i) that packets must be
processed in the order that they arrived; and/or ii) that packets must be sent out on the network in the
order that they arrived. The first criteria is often relaxed because it is well known that packets can be
reordered in a network [146], which means that the enforced order is optimistically the order in which
the original sender created the packets. The second criteria can be managed at the output queues and
does not usually affect the core of packet processing. For our benchmark applications in Table 3.1,
while we could enforce ordering in software, we allow packets to be processed out-of-order because our
Data Parallelism Packet processing typically implies tracking flows (or clients for UDHCP) in a
database, commonly implemented as a hash-table or direct-mapped array. The size of the database
is bounded by the size of the main memory—typically larger than what can be contained in any single
data cache—and there is usually little or a very short-term reuse of incoming packets. Because a network
device executes continuously, a mechanism for removing flows from the database after some elapsed
time is also required. In stateful applications, i.e. applications where shared, persistent data structures
C HAPTER 3. C HOOSING A P ROGRAMMING M ODEL 21
are modified during the processing of most packets, there may be variables that do not relate directly
to flows (e.g. a packet counter). Therefore, it is possible that the processing of packets from different
flows access the same shared data and therefore the processing of those packets in parallel may conflict.
Also, for certain applications, it may possible to extract intra-packet parallelism (e.g. parallelization of a
loop), however those cases are rare because they are likely to leave some processors underutilized so we
do not consider them further. Whenever shared data is accessed by concurrent threads, those accesses
feasible since repeatedly entering and exiting critical sections will likely add significant overhead. For
example, NAT and Classifier have a significant fraction of their code synchronized because there is an
interaction between the hash table lock and the per-flow lock (see Table 3.1): a thread cannot release the
lock on the hash table prior to acquiring a lock on a flow descriptor to ensure that the flow is not removed
in the mean time. Mechanisms for allowing coarser-grained sections while preserving performance are
Because a number of network processors implement the pipeline model [88] with the promise of
extracting parallelism while being lock-free, we must justify our choice of a different model (run-to-
completion). For this purpose, we use benchmarks from NetBench [100]2 , and our stateful benchmarks
running on a single thread. As Netbench’s applications require a number of system and library calls, they
cannot be ported easily to our NetThreads embedded target (Chapter 6), so we instead record execution
traces using the PIN tool [123]. We only monitor the processing for each packet and ignore the packet
As seen in Figure 3.2(a), our four applications span the spectrum of latency variability (i.e. jitter)
per packet that is represented by the NetBench benchmarks. Route and ipchain have completely
deterministic behavior (no variability), while table lookup tl and the regular expression matching
Classifier have the most variation across packets. Considering that for those applications the amount
2 Except dh which is not packet based nor ssl because of its inlined console output.
C HAPTER 3. C HOOSING A P ROGRAMMING M ODEL 22
(a) Normalized variability in packet processing latency with a single thread (stddev/mean
latency).
(b) Imbalance in pipeline stages with an emulated pipeline of 8 threads (max stage latency
divided by mean stage latency).
Figure 3.2: Single-threaded computational (a) variability and (b) load imbalance in an emulated 8-stage
pipeline based on instruction clustering for NetBench [100] benchmarks and our benchmarks (marked
with a *).
C HAPTER 3. C HOOSING A P ROGRAMMING M ODEL 23
of computation can be more than doubled depending on the packet, we conclude that they are less
amenable to pipelining. Even if re-circulating a packet from the end to the beginning of a pipeline were
effective at mitigating this huge variation in latency [88], we would also have to effectively divide the
clusters instructions with the highest control and data flow affinity [121] to eliminate cyclic dependences
and minimize communication between pipeline stages. Since NetTM has 8 threads, we cluster the
instructions into 8 pipeline stages based on the profile information3 . In all benchmarks in Figure 3.2(b),
clustering the code into pipeline stages leads to significant load imbalance. url has the largest pipeline
imbalance (i.e. the rate of the pipeline is 7.9 times slower than the average rate of all the stages) because
of the clustering of the Boyer-Moore string search function in a single pipeline stage. Even route
which has a deterministic execution (Figure 3.2(a)) has load imbalance because of the clustering of the
checksum routine in a single longer-latency pipeline stage, and ipchains has similar problems. While
hardware accelerators could be used to accelerate checksum operations, a programmer cannot rely on
them to balance the latency of arbitrary code in stages. To get a better load balance, a programmer would
replicate the slowest stages and move to the hybrid or run-to-completion model, and add synchronization
3.4 Summary
Modern network appliance programmers are faced with larger and more complex systems-on-chip
composed of multiple processor and acceleration cores and must meet the expectation that performance
should scale with the number of compute threads [112, 145]. When the application is composed of
parallel threads, accesses to shared data structures must be synchronized. These dependences make it
difficult to pipeline the code into balanced stages of execution to extract parallelism. We demonstrated
in this chapter that such applications are more suitable to a run-to-completion model of execution, where
a single thread performs the complete processing of a packet from start to finish. The main advantages
of this model is that there is a unique program, it can be reused on new architectures and its scalability
is more predictable. The programmer only writes one program in the most natural way possible and
the compilation infrastructure and the processor architecture ease the parallelization problem. For
performance, the system must be able to execute multiple instances of the program in parallel. We
next investigate ways of improving the efficiency of soft processor cores to later build multicores.
Chapter 4
In an FPGA packet processor, each processing element must be compact and optimized to deliver the
maximum performance, as we want to replicate those processing elements to take advantage of all the
available memory bandwidth. For soft processors in general, especially multicore systems, raw single-
threaded performance is often not as crucial as the aggregate performance of all the cores—otherwise
an off-chip or on-chip (if available) single hard processor would be preferable. Other metrics or their
combination may also be of importance, such as minimizing the area of the processor, matching the
clock frequency of another key component in the same clock domain, handling requests or interrupts
within specified time constraints (i.e., real-time), or processing a stream of requests or data at a sufficient
rate. Flexibility and control over performance/area trade-offs in the soft processor design space are key,
and hence, for comparing soft processor designs, a summarizing metric that combines area, frequency,
In this chapter, we first summarize our work on custom code generation for processors to improve
area efficiency. After, we demonstrate how to make 3, 5, and 7-stage pipelined multithreaded soft
processors 33%, 77%, and 106% more area efficient than their single-threaded counterparts, the result
25
CHAPTER 4. IMPROVING SOFT PROC. AREA EFFICIENCY WITH MULTITHREADING 26
Since soft processors may be easily modified to match application requirements, it is compelling to
go beyond default compilation (e.g., default gcc), and customize compilation aspects such as the code
generation. A first step in our research is to determine the extent to which we can customize the soft
processors to (i) improve the performance of the existing processors and (ii) save area to accommodate
more processors on a chip. As well, we are interested in using a compiler to enable code transformations
that can potentially save reconfigurable hardware resources. We perform our experiments on the
processors generated by the SPREE infrastructure [161, 162]. SPREE takes as input an architectural
description of a processor and generates an RTL implementation of it, based on a library of pre-defined
modules. The RTL currently targets hardware blocks of the Altera FPGAs so the Quartus tools are
integrated in the SPREE infrastructure to characterize the area, frequency and power of the resulting
processors.
In a recent publication [83], we summarize our findings on soft processor customization, notably: (i)
we can improve area efficiency by replacing a variable-amount shifter with two fixed-amount shifters;
(ii) hazard detection logic is a determining factor in the processor’s area and operating frequency; (iii) we
can eliminate load delay slots in most cases; (iv) branch delay slots can be removed in a 7-stage pipeline
even with no branch prediction; (v) 3-operand multiplies are only justified for a 3-stage processor
(and otherwise Hi/Lo registers are best); (vi) unaligned memory loads and stores do not provide a
significant performance benefit for our benchmarks; (vii) we are able to remove one forwarding line
with simple operand scheduling and improve area efficiency; and (viii) we can limit the compiler’s
use of a significant fraction of the 32 architected registers for many benchmarks without degrading
performance. To maximize the efficiency of the customized architecture of our soft processors, we
combined several of these optimizations and obtained a 12% additional area efficiency increase on
average (and up to 47% in the best case). By including instruction subsetting [83] in the processors
and our optimizations, the mean improvement is 13% but the maximum is 51%. For the remainder of
this thesis, we will implement 3-operand multiplies and the removal of hazard detection, delay slots
and unaligned access instructions; all four techniques become even more beneficial in the processors
that we propose in the next chapter. Since subsetting the architecture actually requires recompilation
CHAPTER 4. IMPROVING SOFT PROC. AREA EFFICIENCY WITH MULTITHREADING 27
of the FPGA design, we next investigate another architectural technique that may be more rewarding
for software-only programmers that do not have the option of re-customizing the processor architecture
A promising way to improve area efficiency is through the use of multithreading. Fort et al. [43] present
a 4-way multithreaded soft processor design, and demonstrate that it provides significant area savings
over having four soft processors—but with a moderate cost in performance. In particular, they show
that a multithreaded soft processor need not have hazard detection logic nor forwarding lines, so long
as the number of threads matches the number of pipeline stages such that all instructions concurrently
executing in different stages are independent. Researchers in the CUSTARD project [35] have also
In this chapter, rather than comparing with multiple soft processors, we show that a multithreaded
soft processor can be better than one that is single-threaded. In particular, we demonstrate: (i) that
multithreaded soft processors are more area-efficient and are capable of a better sustained instructions-
per-cycle (IPC) than single-threaded soft processors; (ii) that these benefits increase with the number
of pipeline stages (at least up to and including 7-stage pipelines); (iii) that careful optimization of any
unpipelined multi-cycle paths in the original soft processor is important, and (iv) that careful selection
of certain ISA features, the number of registers, and the number of threads are key to maximizing area-
efficiency.
In this section we briefly describe our infrastructure for designing and measuring soft processors,
including the SPREE system for generating single-threaded pipelined soft processors, our methodology
for comparing soft processor designs, our compilation infrastructure, and the benchmark applications
we study.
SPREE: We use the SPREE system [163] to generate a wide range of soft processor architectures.
SPREE takes as input ISA and datapath descriptions and produces RTL which is synthesized, mapped,
CHAPTER 4. IMPROVING SOFT PROC. AREA EFFICIENCY WITH MULTITHREADING 28
placed, and routed by Quartus 5.0 [9] using the default optimization settings. The generated
processors target Altera Stratix FPGAs, and we synthesize for a EP1S40F780C5 device—a mid-sized
device in the family with the fastest speed grade. We determine the area and clock frequency of each soft
processor design using the arithmetic mean across 10 seeds (which produce different initial placements
before placement and routing) to improve our approximation of the true mean. For each benchmark, the
soft processor RTL design is simulated using Modelsim 6.0b [103] to (i) obtain the total number of
execution cycles, and (ii) to generate a trace which is validated for correctness against the corresponding
Measurement: For Altera Stratix FPGAs, the basic logic element (LE) is a 4-input lookup table plus
a flip-flop—hence we report the area of these processors in equivalent LEs, a number that additionally
accounts for the consumed silicon area of any hardware blocks (e.g. multiplication or block-memory
units). For the processor clock rate, we report the maximum frequency supported by the critical path of
the processor design. To combine area, frequency, and cycle count to evaluate an optimization, we use
a metric of area efficiency, in million instructions per second (MIPS) per thousand equivalent LEs. It is
important to have such a summarizing metric since a system designer may be most concerned with soft
processor area in some cases, or frequency or wallclock-time performance in others. Finally, we obtain
dynamic power metrics for our benchmarks using Quartus’ Power Play tool [9]. The measurement is
consumed in nano-Joules by the number of instructions executed (nJ/instr), discounting the power
Single-Threaded Processors: The single-threaded processors that we compare with are pipelined
with 3 stages (pipe3), 5 stages (pipe5), and 7 stages (pipe7). The 2 and 6 stage pipelines were
previously found to be uninteresting [163], hence we study the 3, 5, and 7 stage pipelines for even
spacing. All three processors have hazard detection logic and forwarding lines for both operands. The
3-stage pipeline implements shift operations using the multiplier, and is the most area-efficient processor
generated by SPREE [163] (at 1256 equiv. LEs, 78.3 MHz). The 5-stage pipeline also has a multiplier-
based shifter, and implements a compromise between area efficiency and maximum operating frequency
(at 1365 equiv. LEs, 86.8 MHz). The 7-stage pipeline has a barrel shifter, leads to the largest processor,
CHAPTER 4. IMPROVING SOFT PROC. AREA EFFICIENCY WITH MULTITHREADING 29
and has the highest frequency (at 1639 equiv. LEs, 100.6 MHz). Pipe3 and pipe5 both take one extra
cycle for shift and multiply instructions, and pipe3 requires an extra cycle for loads from memory.
Compilation: Our compiler infrastructure is based on modified versions of gcc 4.0.2, Binutils
2.16, and Newlib 1.14.0 that target variations of the 32-bits MIPS I [66] ISA; for example, we can trade
support for Hi/Lo registers with 3-operand multiplies, enable or disable branch delay slots, and vary the
Benchmarking: We evaluate our soft processors using the 10 embedded benchmark applications
described in Table 4.1, which are divided into 4 categories: dominated by loads (L), shifts (S), multiplies
(M) or by none of the above (other, O).1 By selecting benchmarks from each category, we also create
1 The benchmarks chosen are a subset of those used in previous work [83] since for now we require both data and
instructions to each fit in separate single MegaRAMs in the FPGA.
CHAPTER 4. IMPROVING SOFT PROC. AREA EFFICIENCY WITH MULTITHREADING 30
3 multiprogrammed mixes (rows of Table 4.2) that each execute until the completion of the shortest
application in the mix. To use these mixes of up to 6 applications, threads available on a processor are
filled by choosing one application per thread from the left to right of the mix along a row of Table 4.2. We
measure multithreaded processors using (i) multiple copies of the same program executing as separate
threads (with separate data memory), and (ii) using the multiprogrammed mixes.
Commercially-available soft processors such as Altera’s NIOS II and Xilinx’s Microblaze are both
single-threaded, in-order, and pipelined, as are our SPREE processors (described in the previous
section). Such processors require hazard detection logic and forwarding lines for correctness and good
performance. These processors can be multithreaded with minimal extra complexity by adding support
for instructions from multiple independent threads to be executing in each of the pipeline stages of the
processor—an easy way to do this is to have as many threads as there are pipeline stages. This approach
is known as Fine-Grained Multithreading (FGMT [113]), and is also the approach adopted by Fort et
In this section we evaluate several SPREE processors of varying pipeline depth that support fine-
grained multithreading. Since each pipe stage executes an instruction from an independent thread, these
processors no longer require hazard detection logic nor forwarding lines—which as we show can provide
improvements in both area and frequency. Focusing on our base ISA (MIPS), we also found that load
and branch delay slots are undesirable, which makes intuitive sense since the dependences they hide
are already hidden by instructions from other threads—hence we have removed them from the modified
To support multithreading, the main challenge is to replicate the hardware that stores state for a
thread: in particular, each thread needs access to independent architected registers and memory. In
contrast with ASICs, for FPGAs the relative latency of on-chip memory vs logic latency grows very
slowly with the size of the memory—hence FPGAs are amenable to implementing the replicated storage
required for multithreading. We provide replicated program counters that are selected in a round-robin
2 The CUSTARD group also investigated Block Multi-Threading where threads are switched only at long-latency events,
but found this approach to be inferior to FGMT.
CHAPTER 4. IMPROVING SOFT PROC. AREA EFFICIENCY WITH MULTITHREADING 31
Figure 4.1: Area efficiency of single-threaded (st) and multithreaded (mt) processors with varying
pipeline depths, where the number of threads is equal to the number of stages for the multithreaded
processors. Results are the mean across all single benchmarks (i.e., not the mixes).
fashion each cycle. Rather than physically replicating the register file, which would require the addition
of costly multiplexers, we index different ranges of a shared physical register file by appropriately
shifting the register numbers. We implement this register file using block memories available on Altera
Stratix devices; in particular, we could use either M4Ks (4096 bits capacity, 32 bits width) or M512s (512
bits capacity, 16 bits width). We choose M4Ks because (i) they more naturally support the required 32-bit
register width; and (ii) we can implement the desired register file using a smaller total number of block
We must also carefully provide separation of instruction and data memory as needed. For the
processors in this chapter, we support only on-chip memory—we support caches and off-chip memory
in Chapter 5. Similar to the register file, we provide only one physical instruction memory and one
physical data memory, but map to different ranges of those memories as needed. In particular, every
thread is always allocated a unique range of data memory. When we execute multiple copies of a single
program, then threads share a range of instruction memory,3 otherwise instruction memory ranges are
unique as well.
Figure 4.1 shows the mean area efficiency, in MIPS per 1000 equivalent LEs, across all single
3 Since each thread needs to initialize its global pointer and stack pointer differently (in software), we create a unique
initialization routine for each thread, but otherwise they share the same instruction memory range.
CHAPTER 4. IMPROVING SOFT PROC. AREA EFFICIENCY WITH MULTITHREADING 32
benchmarks from Table 4.1 (i.e., we do not yet consider the multiprogrammed mixes). We measure
single-threaded (st) and multithreaded (mt) processors with varying pipeline depths, and for the
multithreaded processors the number of threads is equal to the number of pipeline stages. The area
efficiency of the 3, 5, and 7-stage pipelined multithreaded processors is respectively 33%, 77% and
106% greater than each of their single-threaded counterparts. The 5-stage pipeline has the maximum
area efficiency, as it benefits the most from the combination of optimizations we describe next. The
3 and 7-stage pipeline have similar area efficiencies but offer different trade-offs in IPC, thread count,
over single-threaded processors. For the 3, 5, and 7-stage pipelines IPC improves by 24%, 45% and
104% respectively. These benefits are partly due to the interleaving of independent instructions in the
multithreaded processors which reduce or eliminate the inefficiencies of the single-threaded processors
such as unused delay slots, data hazards, and mispredicted branches (our single-threaded processors
predict branches are “not-taken”). We have shown that multithreading offers compelling improvements
in area-efficiency and IPC over single-threaded processors. In the sections that follow we describe the
Figure 4.3: Hi/Lo registers vs 3-operand multiplies for various pipeline depths, normalized to the
corresponding single-threaded processor.
In this section, we identify two architectural features of multithreaded processors that differ significantly
from their single-threaded version and hence must be carefully tuned: the choice of Hi/Lo registers
Optimizing the Support for Multiplication: By default, in the MIPS ISA the 64-bit result of 32-
bit multiplication is stored into two special 32-bit registers called Hi and Lo—the benefit of these being
that multicycle multiplication need not have a write-back path into the regular register file, allowing
also replicate the Hi and Lo registers. Another alternative is to modify MIPS to support two 3-operand
multiply instructions, which target the regular register file and compute the upper or lower 32-bit result
independently. We previously demonstrated that Hi/Lo registers result in better frequency than 3-
operand multiplies but at the cost of extra area and instruction count, and are a better choice for more
deeply-pipelined single-threaded processors [83]. In this chapter we re-investigate this option in the
context of multithreaded soft processors. Figure 4.3 shows the impact on area, frequency, and energy-
per-instruction with Hi/Lo registers or 3-operand multiplies, for multithreaded processors of varying
pipeline stages each relative to the corresponding single-threaded processor. We observe that Hi/Lo
CHAPTER 4. IMPROVING SOFT PROC. AREA EFFICIENCY WITH MULTITHREADING 34
registers require significantly more area than 3-operand multiplies due to their replicated storage but
Since the frequency benefits of Hi/Lo registers are no longer significant, and 3-operand multiplies
also have significantly reduced energy-per-instruction, we chose support for 3-operand multiplies as our
default in all multithreaded experiments in this chapter. In Figure 4.4(a) we show the raw IPC of the
multithreaded processors on all benchmarks as well as the multiprogrammed mixes, demonstrating two
key things: (i) that our results for replicated copies of the individual benchmarks are similar to those of
multiprogrammed mixes; and (ii) that stalls for the multithreaded 7-stage pipeline have been completely
eliminated (since it achieves an IPC of 1). For the three-stage pipeline, the most area efficient for
single-threaded processors, the baseline multithreaded processor is only 5% more area efficient than the
single-threaded processor (see Figure 4.4(b)). Area and frequency of our multithreaded processors are
similar to those of the single-threaded processors, hence the majority of these gains (36% and 106% for
the 5 and 7-stage pipelines) are related to reduction in stalls due to various hazards in the single-threaded
designs. Figure 4.4(b) shows that the reduction of instructions due to the removal of delay slots and 3-
operand multiplies also contributes by 3% on average to the final area efficiency that utilizes the scaled
instruction count of single-threaded processors to compare a constant amount of work. Comparing with
Figure 4.1, we see that single-threaded soft processors favor short pipelines while multithreaded soft
Optimizing Multicycle Paths: Our 3 and 5-stage processors must both stall for certain instructions
(such as shifts and multiplies), which we call unpipelined multicycle paths [163]. It is important
to optimize these paths, since otherwise such stalls will impact all other threads in a multithreaded
processor. Fort et al. [43] address this challenge by queuing requests that stall in a secondary pipeline
as deep as the original, allowing other threads to proceed. We instead attempt to eliminate such stalls by
For the single-threaded 3-stage pipeline, multicycle paths were created by inserting registers to
divide critical paths, improving frequency by 58% [163]; this division could also have been used to
create two pipeline stages such that shifts and multiplies would be pipelined, but this would have created
new potential data hazards (see Figure 4.5), increasing the complexity of hazard detection logic—
hence this option was avoided for the single-threaded implementation. In contrast, for a multithreaded
CHAPTER 4. IMPROVING SOFT PROC. AREA EFFICIENCY WITH MULTITHREADING 35
(a) Raw IPC. The 7-stage pipeline has an IPC of 1 because it never stalls.
(b) Area efficiency normalized to the single-threaded processors computed with the instruc-
tion count on the multithreaded-processors (icount mt) and with the scaled instruction
count on the single-threaded processor (icount st).
Figure 4.4: IPC and area efficiency for the baseline multithreaded processors.
CHAPTER 4. IMPROVING SOFT PROC. AREA EFFICIENCY WITH MULTITHREADING 36
add F E W F E W
shift F E W F E M W
load F E W F E M W
sub F E W F E W
Time
(a) unpipelined (baseline) (b) with intra−stage pipelining
Figure 4.5: Example execution showing multicycle paths in the 3-stage pipeline where: (i) shifts
and loads require two cycles in the execute stage; and (ii) we assume each instruction has a
register dependence on the previous. Assuming a single-threaded processor, forward arrows represent
forwarding lines required, while the backward arrow indicates that a stall would be required. The
pipeline stages are: F for fetch, E for execute, M for memory, and W for write-back.
processor we can pipeline such paths without concern for data hazards (since consecutive instructions
are from independent threads)—hence we do so for the 3-stage processor. The single-threaded 5-stage
pipeline also contains multicycle paths. However, we found that for a multithreaded processor that
pipelining these paths was not worth the cost, and instead opted to revert the multicycle paths to be
single-cycle at a cost of reduced frequency but improved instruction rate. Consequently, eliminating
these multicycle paths results in an IPC of 1 for the 5-stage multithreaded processor. The 7-stage
Figure 4.6 shows the impact on both cycle count and area-efficiency of optimizing multicycle
paths for the 3 and 5-stage pipeline multithreaded processors, relative to the corresponding baseline
multithreaded processors. First, our optimizations reduced area for both processors (by 1% for the 3-
stage, and by 4% for the 5-stage); however, frequency is also reduced for both processors (by 3% for
the 3-stage and by 5% for the 5-stage). Fortunately, in all cases cycle count is reduced significantly,
improving the IPC by 24% and 45% for the 3-stage and 5-stage processors over the corresponding
single-threaded processors. Overall, this technique alone improves area-efficiency by 18% and 15% for
Focusing on the multiprogrammed mixes, we see that the cycle count savings is less pronounced:
when executing multiple copies of a single program, it is much more likely that consecutive instructions
CHAPTER 4. IMPROVING SOFT PROC. AREA EFFICIENCY WITH MULTITHREADING 37
Figure 4.6: Impact on both cycle count and area-efficiency of optimizing multicycle paths for the
3 and 5-stage pipeline multithreaded processors, relative to the corresponding baseline multithreaded
processors.
will require the same multicycle path, resulting in avoided stalls with the use of pipelining4 ; for
multiprogrammed workloads such consecutive instructions are less likely. Hence for multiprogrammed
mixes we achieve only 5% and 12% improvements in area-efficiency for the 3 and 5-stage pipelines.
In this section, we investigate two techniques for improving the area-efficiency of multithreaded soft
Reducing the Register File: In previous work [83] we demonstrated that 8 of the 32 architected
registers (s0-s7) could be avoided by the compiler (such that programs do not target them at all) with
only a minor impact on performance for most applications. Since our multithreaded processors have
a single physical register file, we can potentially significantly reduce the total size of the register file
by similarly removing these registers for each thread. Since our register file is composed of M4K block
memories, we found that this optimization only makes sense for our 5-stage pipeline: only for that
4 As shown in Figure 4.5(b), the transition from an instruction that uses a multicycle path to an instruction that doesn’t
creates a pipeline stall.
CHAPTER 4. IMPROVING SOFT PROC. AREA EFFICIENCY WITH MULTITHREADING 38
Figure 4.7: Impact of having one thread less than the pipeline depth, normalized to processors having
the number of threads equal to the pipeline depth.
processor does the storage saved by removing registers allow us to save entire M4K blocks. In particular,
if we remove 7 of 32 registers per thread then the entire resulting 25-register register file fits in a single
M4K block (since 25 registers × 5 threads × 32 bits < 4096 bits).5 In fact, since our register file is
actually duplicated to provide enough read and write ports, this optimization allows us to use two M4Ks
instead of four. For our 5-stage multithreaded processor this optimization allows us to save 5% area
and improve frequency by 3%, but increases cycle count by 3% on average due to increased register
pressure.
Reducing the Number of Threads: Multithreaded soft processors proposed to date have supported
a number of threads equal to the number of pipeline stages [35, 43]. For systems where long multicycle
stalls are possible (such as with high latency off-chip memory), supporting a larger number of threads
than pipeline stages may be interesting. However, for our work which so far assumes on-chip memory,
it may also be beneficial to have fewer threads than pipeline stages: we actually require a minimum
number of threads such that the longest possible dependence between stages is hidden, which for the
processors we study in this chapter requires one less thread than there are pipeline stages. This reduction
by one thread may be beneficial since it will reduce the latency of individual tasks, result in the same
5 However, rather than simply shifting, we must now add an offset to register numbers to properly index the physical register
file.
CHAPTER 4. IMPROVING SOFT PROC. AREA EFFICIENCY WITH MULTITHREADING 39
overall IPC, and reduce the area by the support for one thread context. Figure 4.7 shows the impact
on our CAD metrics of subtracting one thread from the baseline multithreaded implementation. Area
is reduced for the 3-stage pipeline, but frequency also drops significantly because the computation of
branch targets becomes a critical path. In contrast, for the 5 and 7-stage pipelines we achieve area and
power savings respectively, while frequency is nearly unchanged. Overall this possibility gives more
flexibility to designers in choosing the number of threads to support for a given pipeline, with potential
area or power benefits. When combined with eliminating multicycle paths, reducing the number of
threads by one for the 5-stage pipeline improves area-efficiency by 25%, making it 77% more area-
4.7 Summary
We have shown that, relative to single-threaded 3, 5, and 7-stage pipelined processors, multithreading
can improve overall IPC by 24%, 45%, and 104% respectively, and area-efficiency by 33%, 77%, and
106%. In particular, we demonstrated that (i) intra-stage pipelining is undesirable for single threaded
processors but can provide significant increases in area-efficiency for multithreaded processors; (ii)
optimizing unpipelined multicycle paths is key to gaining area-efficiency; (iii) for multithreaded soft
processors that 3-operand multiplies are preferable over Hi/Lo registers such as in MIPS; (iv) reducing
the registers used can potentially reduce the number of memory blocks used and save area; (v) having
one thread less than the number of pipeline stages can give more flexibility to designers while potentially
saving area or power. Other than removing registers, which can be detrimental to performance, we will
incorporate all the above optimizations in the multithreaded designs that we evaluate in the remainder
of this thesis. In summary, this chapter shows that there are significant benefits to multithreaded soft
processor designs over single-threaded ones, and gives system designers a strong motivation to program
with independent threads. In the next chapter, we investigate the impact of broadening the scope of our
Based on encouraging performance and area-efficiency results in the previous chapter, we are motivated
to better understand ways to scale the performance of such multithreaded systems and multicores
composed of them. In this chapter, we explore the organization of processors and caches connected
to a single off-chip memory channel, for workloads composed of many independent threads. A typical
performance goal for the construction of such a system is to fully-utilize a given memory channel. For
example, in the field of packet processing the goal is often to process packets at line rate, scaling up a
system composed of processors and accelerators to make full use of the available bandwidth to and from
a given packet-buffer (i.e., memory channel). In this chapter, we design and evaluate real FPGA-based
thread contexts to maximize throughput while minimizing area. For now, we consider systems similar to
packet processing where there are many independent tasks/threads to execute, and maximizing system
throughput is the over-arching performance goal. Our main finding is that while a single multithreaded
of single-threaded processors scale better than those composed of multithreaded processors. We next
Scaling Soft Multithreaded Processors In the previous chapter, we demonstrated that a soft multi-
40
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 41
threaded processor can eliminate nearly all of the stall-cycles observed by a comparable single-threaded
processor by executing an independent instruction in every stage of the processor pipeline [76]. In
this chapter, we extend our multithreaded processors to interface with off-chip DDR memory through
caches, which in contrast with uniprocessors presents some interesting challenges and design options. In
particular, we present an approach called instruction replay to handle cache misses without stalling other
threads, and a mechanism for handling the potential live-locks resulting from collisions in the cache from
the replayed instructions. We also evaluate several cache organization alternatives for soft multithreaded
processors, namely shared, private, and partitioned caches, as well as support for different numbers of
hardware contexts. We finally investigate issues related to sharing and conflicts between threads for soft
multithreaded processors. In contrast with previous studies of systems with on-chip memory [43, 76],
we find with off-chip memory that single-threaded processors are generally more area-efficient than
multithreaded processors.
tithreaded processors. We find that, for a system of given area, multiprocessors composed of
multithreaded processors provide a much larger number of thread contexts, but that uniprocessor-based
In this chapter, we are not proposing new architectural enhancements to soft processors: we are
rather trying to understand the trade-offs to give us direction when later building soft multiprocessors
Caches built in the FPGA fabric [160] are routinely utilized to improve the performance of systems
with off-chip memory. The commercial off-the-shelf soft processors Nios-II [7] and Microblaze [154]
both support optional direct-mapped instruction and data caches with configurable cache line sizes.
Both processors allocate a cache line upon write misses (allocate-on-write) but the Microblaze uses
a write-through policy while NIOS-II uses write-back. Both vendors have extended their instruction
set to accomplish tasks such as cache flushing, invalidating and bypassing. Our implementation is
comparable to those designs (other than we do not allocate cache lines on writes) and we did not require
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 42
There is little prior work which has studied soft multiprocessor organization with off-chip memory
in a real FPGA system, with commercial benchmarks. Fort et al. [43] compared their soft multithreaded
processor design to having multiple uniprocessors, using the Mibench benchmarks [50] and a system
with on-chip instruction memory and off-chip shared data storage (with no data cache). They
conclude that multithreaded soft processors are more area efficient than multiple uniprocessors. We
further demonstrated that a soft multithreaded processor can be more area efficient than a single soft
The addition of off-chip memory and caches introduces variable-latency stalls to the processor
pipeline. Handling such stalls in a multithreaded processor without stalling all threads is a challenge.
Fort et. al. [43] use a FIFO queue of loads and stores, while Moussali et. al. [107] use an instruction
scheduler to issue ready instructions from a pool of threads. In contrast with either of these approaches,
In this section we briefly describe our infrastructure for designing and measuring soft processors, our
methodology for comparing soft processor designs, our compilation infrastructure, and the benchmark
Caches The Altera Stratix FPGA that we target provides three sizes of block-memory: M512 (512bits),
M4K (4Kbits) and M-RAM (512Kbits). We use M512s to implement register files. In contrast with
M-RAM blocks, M4K blocks can be configured to be read and written at the same time (using two
ports), such that the read will return the previous value—hence, despite their smaller size, caches built
with M4Ks typically out-perform those composed of M-RAMs, and we choose M4K-based caches for
our processors.
Platform Our RTL is synthesized, mapped, placed, and routed by Quartus 7.2 [9] using the default
optimization settings. The resulting soft processors are measured on the Transmogrifier platform [42],
where we utilize one Altera Stratix FPGA EP1S80F1508C6 device to (i) obtain the total number of
execution cycles, and (ii) to generate a trace which is validated for correctness against the corresponding
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 43
execution by an emulator (the MINT MIPS emulator [141]). Our memory controller connects a 64-bit-
wide data bus to a 1Gbyte DDR SDRAM DIMM clocked at 133 MHz, and configured to transfer two
Measurement For Altera Stratix FPGAs, the basic logic element (LE) is a 4-input lookup table plus a
flip-flop—hence we report the area of our soft processors in equivalent LEs, a number that additionally
accounts for the consumed silicon area of any hardware blocks (e.g. multiplication or block-memory
units). Even if a memory block is partially utilized by the design, the area of the whole block is
nonetheless added to the total area required. For consistency, all our soft processors are clocked at
50 MHz and the DDR remains clocked at 133 MHz. The exact number of cycles for a given experiment
is non-deterministic because of the phase relation between the two clock domains, a difficulty that is
amplified when cache hit/miss behavior is affected. However, we have verified that the variations are
Compilation and Benchmarks Our compiler infrastructure is based on modified versions of gcc 4.0.2,
Binutils 2.16, and Newlib 1.14.0 that target variations of the 32-bit MIPS I [66] ISA; for example,
for multithreaded processors we implement 3-operand multiplies (rather than MIPS Hi/Lo registers [76,
83]), and eliminate branch and load delay slots. Integer division is implemented in software. Table 5.1
shows the selected benchmarks from the EEMBC suite [39], avoiding benchmarks with significant file
I/O that we do not yet support, along with the benchmarks dynamic instruction counts as impacted
by different compiler settings. For systems with multiple threads and/or processors, we run multiple
simultaneous copies of a given benchmark (i.e., similar to packet processing), measuring the time from
the start of execution for the first copy until the end of the last copy.1
The processor and caches are clocked together at 50 MHz while the DDR controller is clocked at
133 MHz. There are three main reasons for the reduced clock speed of the processor and caches: i)
the original 3-stage pipelined processor with on-chip memory could only be clocked at 72 MHz on
the slower speed grade Stratix FPGAs on the TM4; ii) adding the caches and bus handshaking further
reduced the clock frequency to 64 MHz; and iii) to relax the timing constraints when arbitrating signals
crossing clock domains, we chose a 20 ns clock period which can be obtained via a multiplication of
1 We verified that in most cases no thread gets significantly ahead of the others.
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 44
Table 5.1: EEMBC benchmark applications evaluated. ST stands for single-threaded and MT stands for
multithreaded.
Dyn. Instr. Counts (x106 )
Category Benchmark ST MT
Automotive A 2 TIME 01 374 356
AIFIRF 01 33 31
BASEFP 01 555 638
BITMNP 01 114 97
CACHEB 01 16 15
CANRDR 01 38 35
IDCTRN 01 62 57
IIRFLT 01 88 84
PUWMOD 01 17 14
RSPEED 01 23 21
TBLOOK 01 149 140
Telecom AUTCOR 00 DATA 2 814 733
CONVEN 00 DATA 1 471 451
FBITAL 00 DATA 2 2558 2480
FFT 00 DATA 3 61 51
VITERB 00 DATA 2 765 750
Networking IP PKTCHECKB 4 M 42 38
IP REASSEMBLY 385 324
OSPFV 2 49 33
QOS 981 732
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 45
a rational number with the 133 MHz DDR clock period (7.5 ns). In our evaluation in Section 5.4, we
estimate the impact of higher processor clock frequencies that match the actual critical paths of the
underlying circuits, and find that the results do not alter our conclusions.
Our system has a load-miss latency of only 8 processor cycles and our DDR controller uses a closed-
page policy so that every request opens a DRAM row and then closes it [62]. Furthermore, our current
memory controller implementation has room for improvement such as by: (i) setting the column access
latency to 2 instead of 3; (ii) tracking open DRAM pages and saving unnecessary row access latency;
(iii) fusing the single edge conversion, phase re-alignment, and clock crossing which together amount
As prior work has shown [43, 76], multithreaded soft processors can hide processor stalls while
saving area, resulting in more area-efficient soft systems than those composed of uniprocessors or
instruction for a different thread context is fetched each clock cycle in a round-robin fashion. Such
a processor requires the register file and program counter to be logically replicated per thread context.
However, since consecutive instructions in the pipeline are from independent threads, we eliminate the
need for data hazard detection logic and forwarding lines—assuming that there are at least N − 1 threads
for an N-stage pipelined processor [43, 76]. Our multithreaded processors have a 5-stage pipeline that
never stalls: this pipeline depth was found to be the most area-efficient for multithreading in the previous
chapter [76]. In this section we describe the challenges in connecting a multithreaded soft processor to
The workload we assume for this study is comprised of multiple copies of a single task (i.e., similar
to packet processing), hence instructions and an instruction cache are easily shared between threads
without conflicts. However, since the data caches we study are direct-mapped, when all the threads
access the same location in their respective data sections, these locations will all map to the same cache
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 46
entry, resulting in pathologically bad cache behavior. As a simple remedy to this problem we pad the
data sections for each thread such that they are staggered evenly across the data cache, in particular
by inserting multiples of padding equal to the cache size divided by the number of thread contexts
sharing the cache. However, doing so makes it more complicated to share instruction memory between
threads: since data can be addressed relative to the global pointer, we introduce a short thread-specific
initialization routine that adjusts the global pointer by the padding amount; there can also be static
pointers and offsets in the program, that we must adjust to reflect the padding. We find that applying
this padding increases the throughput of our base multithreaded processor by 24%, hence we apply this
When connected to an off-chip memory through caches, a multithreaded processor will ideally not stall
other threads when a given thread suffers a multiple-cycle cache miss. In prior work, Fort et. al. [43]
use a FIFO queue of loads and stores, while Moussali et. al. [107] use an instruction scheduler to issue
ready instructions from a pool of threads. For both instruction and data cache misses, we implement
a simpler method requiring little additional hardware that we call instruction replay. The basic idea is
as follows: whenever a memory reference instruction suffers a cache miss, that instruction fails—i.e.,
the program counter for that thread is not incremented. Hence that instruction will execute again (i.e.,
replay) when it is that thread context’s turn again, and the cache miss is serviced while the instruction
is replaying. Other threads continue to make progress, while the thread that suffered the miss fails and
replays until the memory reference is a cache hit. However, since our processors can handle only a
single outstanding memory reference, if a second thread suffers a cache miss it will itself fail and replay
To safely implement the instruction replay technique we must consider how cache misses from
different threads might interfere. First, it is possible that one thread can load a cache block into the
cache, and then another thread replaces that block before the original thread is able to use it. Such
interference between two threads can potentially lead to live-lock. However, we do not have to provide
a solution to this problem in our processors because misses are serviced in order and the miss latency
is guaranteed to be greater than the latency of a full round-robin of thread contexts—hence a memory
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 47
data path
data cache N
..
{
store intr. F D R/ EX
EX / M WB F D R/ EX
EX / M WB
F D R/
EX F D R/ EX
EX / M WB
load instr.
(a) (b)
Figure 5.1: Cache organizations and the corresponding impact on the execution of a write hit from one
thread followed by a load from a consecutive thread: (a) a shared data cache, for which the load is
aborted and later replays (the hazard is indicated by the squashed pipeline slots marked with “—”); (b)
partitioned and private caches, for which the load succeeds and there is no hazard. The pipeline logic is
divided into fetch F, decode D, register R, execute EX, memory M, and write-back WB.
reference suffering a miss is guaranteed to succeed before the cache line is replaced. However, a second
possibility is one that we must handle: the case of a memory reference that suffers a data cache miss, for
which the corresponding instruction cache block is replaced before the memory reference instruction
can replay. This subtle pathology can indeed result in live-lock in our processors, so we prevent it by
saving a copy of the last successfully fetched instruction for each thread context.
Each thread has its own data section, hence despite our padding efforts (Section 5.3.1), a shared data
cache can still result in conflicts. A simple solution to this problem is to increase the size of the shared
data cache to accommodate the aggregate data set of the multiple threads, although this reduces the
area-saving benefits of the multithreaded design. Furthermore, since our caches are composed of FPGA
memory blocks which have only two ports (one connected to the processor, one connected to the DRAM
channel), writes take two cycles: one cycle to lookup and compare with the tag, and another cycle to
perform the write (on a hit). As illustrated in Figure 5.1(a), this can lead to further contention between
consecutive threads in a multithreaded processor that share a cache: if a second consecutive thread
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 48
1.8
2 single (1D$/1T)
4 shared (1D$/4T)
8 partitioned (2D$/4T)
1.7 16 private (4D$/4T)
1
1.6
CPI (cycles per instruction)
1.5 2
1 4
1.4
2
4
1.3 8
1.2 2
4
8
16
1.1
1
4000 5000 6000 7000 8000 9000 10000 11000 12000 13000
Area (Equivalent LEs)
Figure 5.2: CPI versus area for the single threaded processor and for the multithreaded processors with
shared, partitioned, and private caches. 1D$/4T means there is one data cache and 4 threads total. Each
point is labeled with the total cache capacity in KB available per thread.
attempts a memory reference directly after a write-hit, we apply our failure/replay technique for the
second thread, rather than stall that thread and subsequent threads.
Rather than increasing the size of the shared cache, we consider two alternatives. The first is to
have private caches such that each thread context in the multithreaded processor accesses a dedicated
data cache. The second, if the number of threads is even, is to have partitioned caches such that non-
consecutive threads share a data cache—for example, if there are four threads, threads 1 and 3 would
share a cache and threads 2 and 4 would share a second cache. As shown in Figure 5.1(b), both of
these organizations eliminate port contention between consecutive threads, and reduce (partitioned) or
In this section we compare single-threaded and multithreaded soft processors, and study the impact of
cache organization and thread count on multithreaded processor performance and area efficiency.
In Figure 5.2 we plot performance versus area for the single threaded processor and the three
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 49
possible cache organizations for multithreaded processors (shared, partitioned, and private), and for
each we vary the sizes of their caches. For performance, we plot cycles-per-instruction (CPI), which is
computed as the total number of cycles divided by the total number of instructions executed; we use this
metric as opposed to simply execution time because the single-threaded and multithreaded processors
run different numbers of threads, and because the compilation of benchmarks for the single-threaded
and multithreaded processors differ (as shown in Table 5.1). CPI is essentially the inverse of throughput
for the system, and this is plotted versus the area in equivalent LEs for each processor design—hence
We first observe that the single-threaded and different multithreaded processor designs with various
cache sizes allow us to span a broad range of the performance/area space, giving a system designer
interested in supporting only a small number of threads the ability to scale performance by investing
more resources. The single-threaded processor is the smallest but provides the worst CPI, and this
is improved only slightly when the cache size is doubled (from 2KB to 4KB). Of the multithreaded
processors, the shared, partitioned, and private cache designs provide increasing improvements in CPI
at the cost of corresponding increases in area. The shared designs outperform the single-threaded
processor because of the reduced stalls enjoyed by the multithreaded architecture. The partitioned
designs outperform the shared designs as they eliminate replays due to contention. The private cache
designs provide the best performance as they eliminate replays due to both conflicts and contention, but
for these designs performance improves very slowly as cache size increases.
In Figure 5.2, there are several instances where increasing available cache appears to cost no
additional area: a similar behavior is seen for the single-threaded processor moving from 2KB to 4KB
of cache and for the partitioned multithreaded processor moving from 1KB to 2KB of cache per thread.
This is because the smaller designs partially utilize M4K memories, while in the larger designs they
are more fully utilized—hence the increase appears to be free since we account for the entire area of
an M4K regardless of whether it is fully utilized. For the private-cache multithreaded designs, moving
from 2KB to 4KB of cache per thread actually saves a small amount of area, for similar reasons plus
To better understand the trade-off between performance and area for different designs, it is
instructive to plot their area efficiency as shown in Figure 5.3(a). We measure area efficiency as millions
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 50
7.5
single (1D$/1T)
shared (1D$/4T)
partitioned (2D$/4T)
7 private (4D$/4T)
6.5
Area efficiency (MIPS/1000LEs)
5.5
4.5
3.5
1 2 4 8 16
Cache per thread (Total cache size in kbytes / thread count)
(a) 50MHz processor clock frequency.
10
single (1D$/1T)
shared (1D$/4T)
partitioned (2D$/4T)
private (4D$/4T)
9
Area efficiency (MIPS/1000LEs)
4
1 2 4 8 16
Cache per thread (Total cache size in kbytes / thread count)
(b) Max allowable processor clock frequency.
Figure 5.3: Area efficiency versus total cache capacity per thread for the single-threaded processor and
the multithreaded processors, reported using (a) the implemented clock frequency of 50MHz, and (b)
the maximum allowable clock frequency per processor design.
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 51
1.9
shared 8T
0.5 shared 5T
1.8 shared 4T
partitioned 8T
1.7 0.8 1.0 partitioned 4T
private 8T
CPI (cycles per instruction)
1
1.6 private 5T
1.6 1 private 4T
2.0
3.2
1.5 2
1 4 4.0
2 6.4
1.4 4
2 8
4
1.3 8
1.2 2 2
4 4 8 2
8 4 16 8 16
1.1
1
4000 6000 8000 10000 12000 14000 16000
Area (Equivalent LEs)
Figure 5.4: CPI versus area for multithreaded designs supporting varying numbers of thread contexts.
Each point is labeled with the total available cache capacity per thread.
of instructions per second (MIPS) per 1000 LEs. The single-threaded processors are the most area-
efficient, in contrast with previous work comparing similar processors with on-chip memory and without
caches [76], as we provide a private data cache storage for each thread in the multithreaded core. The
partitioned design with 2KB of cache per thread is nearly as area-efficient as the corresponding single-
threaded processor. The shared-cache designs with 1KB and 2KB of cache per thread are the next most
area-efficient, with the private-cache designs being the least area-efficient. These results tell us to expect
the single-threaded and partitioned-cache multithreaded designs to scale well, as they provide the best
Due to limitations of the Transmogrifier platform, all of our processors are actually clocked at
50MHz, while their maximum possible frequencies (i.e., fmax ) are on average 65MHz. To investigate
whether systems measured using the true maximum possible frequencies for the processors would lead
to different conclusions, we estimate this scenario in Figure 5.3(b). We observe that the relative trends
are very similar, with the exception of the single-threaded processor with 2KB of cache for which the
area efficiency drops below that of the corresponding partitioned design to close to that of the shared
cache design.
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 52
Soft processor 0
... Off−chip
DDR SDRAM
Soft processor N
I−Cache D−Cache
Figure 5.5: Diagram showing an arbiter connecting multiple processor cores in a multiprocessor.
Impact of Increasing Thread Contexts Our multithreaded processors evaluated so far have all
implemented a minimal number of threads contexts: four thread contexts for five pipeline stages.
To justify this choice, we evaluated multithreaded processors with larger numbers of threads for the
different cache designs and for varying amounts of available cache per thread (shown in Figure 5.4).
For shared and partitioned designs we found that increasing the number of thread contexts (i) increases
the CPI, due to increased contention and conflicts, and (ii) increases area, due to hardware support for
the additional contexts. Since the private cache designs eliminate all contention and conflicts, there is
a slight CPI improvement as area increases significantly with additional thread contexts. These results
confirmed that the four-thread multithreaded designs are the most desirable.
For systems with larger numbers of threads available, another alternative for scaling performance is to
instantiate multiple soft processors. In this section we explore the design space of soft multiprocessors,
with the goals of maximizing (i) performance, (ii) utilization of the memory channel, and (iii) utilization
of the resources of the underlying FPGA. To support multiple processors we augment our DRAM
controller with an arbiter that serializes requests by queuing up to one request per processor (shown
in Figure 5.5); note that this simple interconnect architecture does not impact the clock frequencies of
our processors.
In Figure 5.6 we plot CPI versus area for multiprocessors composed of single-threaded or
multithreaded processors; we replicate the processor designs that were found to be the most area-
efficient according to Figure 5.3(a). For each multiprocessor design, each design point has double
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 53
1.8
1 shared 1x4K/4T
partitioned 2x4K/4T
4 private 4x4K/4T
1.6 single 1x4K/1T
1.4 4
CPI (cycles per instruction)
1.2 4
8
1 2
8
0.8 16
8 68
0.6
4 16
44
0.4 16
8
24 16
0.2
0 10000 20000 30000 40000 50000 60000 70000
Area (Equivalent LEs)
Figure 5.6: CPI versus area for multiprocessors composed of single-threaded and multithreaded
processors. 1x4K/4T means that each processor has one 4KB data cache and 4 thread contexts, and
each point is labeled with the total number of thread contexts supported.
the number of processors as the previous, with the exception of the largest (rightmost) for which we plot
the largest design supported by the FPGA—in this case the design that has exhausted the M4K block
memories.
Our first and most surprising observation is that the Pareto frontier (the set of designs that minimize
CPI and area) is mostly comprised of single-threaded multiprocessor designs, many of which out-
perform multithreaded designs that support more than twice the number of thread contexts. For example,
the 16-processor single-threaded multiprocessor has a lower CPI than the 44-thread-context partitioned-
cache multithreaded multiprocessor of the same area. We will pursue further insight into this result
later in this section. For the largest designs, the private-cache multithreaded multiprocessor provides the
second-best overall CPI with 24 thread contexts (over 6 processors), while the partitioned and shared-
cache multithreaded multiprocessor designs perform significantly worse at a greater area cost.
Sensitivity to Cache Size Since the shared and partitioned-cache multithreaded designs have less cache
per thread than the corresponding private-cache multithreaded or single-threaded designs, in Figure 5.7
we investigate the impact of doubling the size of the data caches (from 4KB to 8KB per cache) for those
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 54
1.8
shared 1x4K/4T
shared 1x8K/4T
4 partitioned 2x4K/4T
1.6 partitioned 2x8K/4T
4
1.4
CPI (cycles per instruction) 4
4
1.2
8
1
8
8
0.8 8 16
16 68
0.6 48
16
16
28 44
0.4
0 10000 20000 30000 40000 50000 60000 70000
Area (Equivalent LEs)
Figure 5.7: CPI versus area for the shared and partitioned designs as we increase the size of caches from
4KB to 8KB each. Each point is labeled with the total number of thread contexts supported.
two designs. We observe that this improves CPI significantly for the shared-cache designs and more
modestly for the partitioned cache design, despite the fact that the 4KB designs were the most area-
efficient (according to Figure 5.3(a)). Hence we evaluate the 8KB-cache shared and partitioned-cache
Per-Thread Efficiency In this section, we try to gain an understanding of how close to optimality
each of our architectures performs—i.e., how close to a system that experiences no stalls. The optimal
CPI is 1 for our single-threaded processor, and hence 1/X for a multiprocessor composed of X single-
threaded processors. For one of our multithreaded processors the optimal CPI is also 1, but since
there are four thread contexts per processor, the optimal CPI for X multithreaded processors is 4/X.
In Figure 5.8(a) we plot CPI versus total number of thread contexts for our single and multithreaded
designs, as well as the two ideal curves (as averaged across all of our benchmarks). As expected, for
a given number of threads the single-threaded processors exhibit better CPI than the corresponding
multithreaded designs. However, it is interesting to note that the private-cache multithreaded designs
perform closer to optimally than the single-threaded designs. For example, with 16 threads (the largest
design), the single-threaded multiprocessor has a CPI that is more than 4x greater than optimal, but
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 55
2
shared 1x8K/4T
partitioned 2x8K/4T
1.8 private 4x4K/4T
single 1x4K/1T
4/x
1.6 1/x
1.4
1.2
CPI
1
0.8
0.6
0.4
0.2
0
5 10 15 20 25 30 35 40 45 50
Number of threads
(a) All benchmarks.
shared 1x8K/4T
1.4 partitioned 2x8K/4T
private 4x4K/4T
single 1x4K/1T
4/x
1.2 1/x
1
3 best CPI values
0.8
0.6
0.4
0.2
0
5 10 15 20 25 30 35 40 45 50
Number of threads
(b) Three best performing benchmarks per design.
shared 1x8K/4T
1.4 partitioned 2x8K/4T
private 4x4K/4T
single 1x4K/1T
4/x
1.2 1/x
1
3 best CPI values
0.8
0.6
0.4
0.2
0
5 10 15 20 25 30 35 40 45 50
Number of threads
(c) Three worst performing benchmarks per design.
Figure 5.8: CPI versus total thread contexts across all benchmarks (a), and the three best (b) and worst
(c) performing benchmarks per design.
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 56
3
1P
2P
4P
2.5
2
CPI
1.5
0.5
0
0 0.05 0.1 0.15 0.2
Fraction of load misses
Figure 5.9: CPI versus fraction of load misses. For each surface the upper edge plots the single-threaded
multiprocessor designs (4KB data cache per processor) and the lower edge plots the private-cache
multithreaded multiprocessor designs (4KB data cache per thread context).
regardless, this design provides the lowest CPI across any design. This graph also illustrates that private-
cache designs outperform partitioned-cache designs which in turn outperform shared-cache designs.
Potential for Customization A major advantage of soft systems is the ability to customize the hardware
to match the requirements of the application—hence we are motivated to investigate whether the
multithreaded designs might dominate the single-threaded design for certain applications. However,
we find that this is not the case. To summarize, in Figures 5.8(b) and 5.8(c) we plot CPI versus total
number of thread contexts for the three best performing and three worst performing benchmarks per
design, respectively. For neither extremity do the multithreaded designs outperform the single-threaded
designs. Looking at Figure 5.8(b), we see that for the best performing benchmarks the private-cache
multithreaded designs perform nearly optimally. In the worst cases, the single-threaded designs maintain
their dominance, despite the 16-processor design performing slightly worse than the 8-processor design.
multiprocessor designs, we use a synthetic benchmark that allows us to vary the density of load misses.
In particular, this benchmark consists of a thousand-instruction loop comprised of loads and no-ops,
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 57
and the loads are designed to always miss in the data cache. This benchmark allows us to isolate the
impact of load misses since they can be the only cause of stalls in our processors. In Fig 5.9 we compare
the CPI for single-threaded designs with the private-cache multithreaded designs as the fraction of load
misses increases. In particular, for a certain number of processors we plot a surface such that the top
edge of the surface is the single-threaded design and the bottom edge is the corresponding private-cache
multithreaded design. Hence the three surfaces show this comparison for designs composed of 1, 2, and
4 processors. Looking at the surface for a single processor, as the fraction of load misses increases, the
multithreaded processor (bottom edge) gains a somewhat consistent CPI advantage (of about 0.5 at 10%
misses) over the single-threaded processor (top edge). However, this advantage narrows as the number
of processors–and the resulting pressure on the memory channel–increases, and for four processors,
the multithreaded designs have only a negligible advantage over the single-threaded designs—and the
multithreaded processors require marginally greater area than their single-threaded counterparts.
Exploiting Heterogeneity In contrast with ASIC designs, FPGAs provide limited numbers of certain
resources, for example block memories. This leads to an interesting difference when targeting an FPGA
as opposed to an ASIC: replicating only the same design will eventually exhaust a certain resource
while under-utilizing others. For example, our largest multiprocessor design (the 68-thread shared-cache
multithreaded design) uses 99% of the M4K block memories but only 43% of the available LEs. Hence
we are motivated to exploit heterogeneity in our multiprocessor design to more fully utilize all of the
resources of the FPGA—in this case we consider adding processors with M-RAM based caches despite
the fact that individually they are less efficient than their M4K-based counterparts. In particular, we
extend our two best-performing multiprocessor designs with processors having M-RAM-based caches
as shown in Figure 5.10. In this case, extending the multithreaded processor with further processors
does not improve CPI but only increases total area, i.e. potentially taking away resources available for
other logic in the FPGA. For the single-threaded case, the heterogeneous design improves CPI slightly
but at a significant cost in area—however, this technique does allow us to go beyond the previously
maximal design.
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 58
0.4
0.36
CPI (cycles per instruction)
0.32
0.26
35000 40000 45000 50000 55000 60000 65000 70000 75000
Area (Equivalent LEs)
Figure 5.10: CPI versus area for our two best-performing maximal designs (the 16-thread-context
single-threaded design and the 24-thread-context private-cache multithreaded design), and for those
designs extended with processors with M-RAM-based caches.
CHAPTER 5. UNDERSTANDING SCALING TRADE-OFFS IN SOFT SYSTEMS 59
5.6 Summary
In this chapter we explored architectural options for scaling the performance of soft systems, focussing
on the organization of processors and caches connected to a single off-chip memory channel, for
workloads composed of many independent threads. For soft multithreaded processors, we present the
technique of instruction replay to handle cache misses processors without stalling other threads. Our
investigation of real FPGA-based processor, multithreaded processor, and multiprocessor systems has
led to a number of interesting conclusions. First, we showed that multithreaded designs help span
the throughput/area design space, and that private-cache based multithreaded processors offer the best
perform the best for a given total area, followed closely by private-cache multithreaded multiprocessor
designs. We demonstrated that as the number of processors increases, multithreaded processors lose
their latency-hiding advantage over single-threaded processors, as both designs become bottlenecked
on the memory channel. Finally, we showed the importance of exploiting heterogeneity in FPGA-based
multiprocessors to fully utilize a diversity of FPGA resources when scaling-up a design. Armed with
this knowledge of multiprocessors, we attempt to build our packet processing system on the circuit
board with gigabit network interfaces that we describe in the next chapter.
Chapter 6
To avoid the tedious and complicated process of implementing a networking application in low-level
hardware-description language (which is how FPGAs are normally programmed), we instead propose
to run the application on soft processors – processors composed of programmable logic on the FPGA. To
build such a system, we leverage the compiler optimizations, the instruction replay mechanism and the
multithreading with off-chip memory from Chapters 4 and 5, while still aiming to execute applications
with the run-to-completion paradigm discussed in Chapter 3. In this chapter, we briefly describe the
baseline NetThreads [81] multithreaded multiprocessor system that allows us to program the NetFPGA
platform [91] in software using shared memory and conventional lock-based synchronization; we also
motivate our choice of the NetFPGA board as our experimentation platform for packet processing.
Next, we describe our benchmarks and show the baseline performance of our system. This architecture
and platform serves as a starting point for evaluating new soft systems architectures in the subsequent
chapters. We conclude this section by describing how this infrastructure was also successfully used
as the enabling platform for two other projects led by software programmers in the network research
community.
60
C HAPTER 6. N ET T HREADS : A M ULTITHREADED S OFT M ULTIPROCESSOR 61
synch. unit
4−thread I$ 4−thread I$
processor processor
instr.
data
busses
input mem.
output mem.
to DDR2 SDRAM
Figure 6.1: The architecture of a 2-processor soft packet multiprocessor. The suspension dots indicate
that the architecture can allow for more cores (see Appendix B).
Our base processor is a single-issue, in-order, 5-stage, 4-way multithreaded processor, shown to be the
most area efficient compared to a 3- and 7-stage pipeline in Chapter 5. While the previous chapter also
showed us that single-threaded processors could implement more cores in a fixed amount of area, the
situation is different here. Because we are now using a Xilinx FPGA with coarser 18kbits block RAMs,
and we assign all our thread contexts to perform the same packet processing in a shared—rather than
private—data memory space, the multithreaded processors actually do not require more block RAMs
than single-threaded cores (in this baseline architecture), and block RAMs were the limiting factor to
adding more cores in Chapter 5. Since the FPGA logic elements are not a critical limiting factor either
(see Section 6.2), we are in a situation where it is most advantageous to capitalize on the better CPI of
multithreaded processors (as demonstrated in Chapter 4). Also, in Chapter 5 we assumed independent
application threads, whereas from this chapter onwards we use real packet processing applications
with threads that share data and synchronize, so we expect more stalls related to synchronization that
As shown in Figure 6.1 and summarized in Table 6.1, the memory system of our packet processing
design is composed of a private instruction cache for each processor, and three data memories that are
shared by all processors; this organization is sensitive to the two-port limitation of block RAMs available
on FPGAs. The first memory is an input buffer that receives packets on one port and services processor
C HAPTER 6. N ET T HREADS : A M ULTITHREADED S OFT M ULTIPROCESSOR 62
requests on the other port via a 32-bit bus, arbitrated across processors. The second is an output memory
buffer that sends packets to the NetFPGA output-queues on one port, and is connected to the processors
via a second 32-bit arbitrated bus on the second port. Both input and output memories are 16KB, allow
single-cycle random access and are controlled through memory-mapped registers; the input memory is
read-only and is logically divided into ten fixed-sized packet slots able to hold at most one packet each.
The third is a shared memory managed as a cache, connected to the processors via a third arbitrated
32-bit bus on one port, and to a DDR2 SDRAM controller on the other port. For simplicity, the shared
cache performs 32-bit line-sized data transfers with the DDR2 SDRAM controller (similar to previous
work [134]), which is clocked at 200MHz. The SDRAM controller services a merged load/store queue
in-order; since this queue is shared by all processors it serves as a single point of serialization and
memory consistency, hence threads need only block on pending loads but not stores (as opposed to the
increased complexity of having private data caches). In Chapter 7, the queue has 16 entries but using the
techniques proposed in that chapter, we are able to increase the size of the queue to 64 for the remaining
chapters. Finally, each processor has a dedicated connection to a synchronization unit that implements
16 mutexes. In our current instantiation of NetThreads, 16 mutexes is the maximum number that we
can support while meeting the 125MHz target clock speed. In the NetThreads ISA, each lock/unlock
Similar to other network processors [25, 56], our packet input/output and queuing in the input and
output buffers are hardware-managed. In addition to the input buffer in Figure 6.1, the NetFPGA
framework can buffer incoming packets for up to 6100 bytes (4 maximally sized packets) but the small
C HAPTER 6. N ET T HREADS : A M ULTITHREADED S OFT M ULTIPROCESSOR 63
overall input storage, while consistent with recent findings that network buffers should be small [17],
is very challenging for irregular applications with high computational variance and conservatively caps
the maximum steady-state packet rate sustainable via packets dropped at the input of the system.
Since our multiprocessor architecture is bus-based, in its current form it will not easily scale to
a large number of processors. However, as we demonstrate later in Section 7.3, our applications are
mostly limited by synchronization and critical sections, and not contention on the shared buses; in other
words, the synchronization inherent in the applications is the primary roadblock to scalability.
This section describes our evaluation infrastructure, including compilation, our evaluation platform, and
Compilation: Our compiler infrastructure is based on modified versions of gcc 4.0.2, Binutils
2.16, and Newlib 1.14.0 that target variations of the 32-bit MIPS-I [66] ISA. We modify MIPS
to eliminate branch and load delay slots (as described in Section 4.1 [83]). Integer division and
multiplication, which are not heavily used by our applications, are both implemented in software. To
minimize cache line conflicts in our direct-mapped data cache, we align the top of the stack of each
software thread to map to equally-spaced blocks in the data cache. The processor is big-endian which
avoids the need to perform network-to-host byte ordering transformations (IP information is stored
in packet headers using the big-endian format). Network processing software is normally closely-
integrated with operating system networking constructs; because our system does not have an operating
system, we instead inline all low-level protocol-handling directly into our programs. To implement
time-stamps and time-outs we require the FPGA hardware to implement a device that can act as the
Platform: Our processor designs are inserted inside the NetFPGA 2.1 Verilog infrastructure [91]
that manages four 1GigE Media Access Controllers (MACs), which offer a considerable bandwidth (see
Section 2.1.3). We added to this base framework a memory controller configured through the Xilinx
Memory Interface Generator to access the 64 Mbytes of on-board DDR2 SDRAM clocked at 200MHz.
The system is synthesized, mapped, placed, and routed under high effort to meet timing constraints
C HAPTER 6. N ET T HREADS : A M ULTITHREADED S OFT M ULTIPROCESSOR 64
by Xilinx ISE 10.1.03 and targets a Virtex II Pro 50 (speed grade 7ns). Other than using the network
interfaces, our soft processors can communicate through DMA on a PCI interface to a host computer
when/if necessary. This configuration is particularly well suited for many packet processing applications
because: (i) the load of the soft processors is isolated from the load on the host processor, (ii) the soft
processors suffer no operating system overheads, (iii) they can receive and process packets in parallel,
and (iv) they have access to a high-resolution system timer (much higher than that of the host timer). As
another advantage, this platform is well supported and widely-used by researchers worldwide [90], thus
FPGA resource utilization Our two-CPU full system hardware implementation consumes 165
block RAMs (out of 232; i.e., 71% of the total capacity). The design occupies 15,671 slices (66% of the
total capacity) and more specifically, 23158 4-input LUTs when optimized with high-effort for speed.
Considering only a single CPU, the post-place & route timing results give an upper bound frequency of
129MHz.
Timing: Our processors run at the clock frequency of the Ethernet MACs (125MHz) because there
are no free PLLs (Xilinx DCMs) after merging-in the NetFPGA support components. Due to these
stringent timing requirements, and despite some available area on the FPGA, (i) the private instruction
caches and the shared data write-back cache are both limited to a maximum of 16KB, and (ii) we are
also limited to a maximum of two processors. These limitations are not inherent in our architecture, and
would be relaxed in a system with more PLLs and a more modern FPGA.
Validation: At runtime in debug mode and in RTL simulation (using Modelsim 6.3c), the
processors generate an execution trace that has been validated for correctness against the corresponding
execution by a simulator built on MINT [141]. We validated the simulator for timing accuracy against
Measurement: We drive our design with a modified Tcpreplay 3.4.0 that sends packet traces from
a Linux 2.6.18 Dell PowerEdge 2950 system, configured with two quad-core 2GHz Xeon processors
and a Broadcom NetXtreme II GigE NIC connecting to a port of the NetFPGA used for input and
a NetXtreme GigE NIC connecting to another NetFPGA port used for output. We characterize the
throughput of the system as being the maximum sustainable input packet rate obtained by finding,
through a bisection search, the smallest fixed packet inter-arrival time where the system does not drop
C HAPTER 6. N ET T HREADS : A M ULTITHREADED S OFT M ULTIPROCESSOR 65
any packet when monitored for five seconds—a duration empirically found long enough to predict the
To put the performance of the soft processors in perspective, handling a 109 bps stream (with a
standard inter-frame gap equivalent to the delay of transmitting 12 bytes at 1Gbps), i.e. a maximized
1Gbps link, with 2 processors running at 125 MHz, implies a maximum of 152 cycles per packet for
minimally-sized 64B packets; and 3060 cycles per packet for maximally-sized 1518B packets.
In this section, we present our system performance in terms of latency and throughput, showing the
potential and limitations of our implementation. We then focus on describing where the performance
6.3.1 Latency
When replying to ICMP ping packets (an example of minimum useful computation) with only 1 thread
out of 4 executing in round-robin, we measured a latency of 5.1µs with a standard deviation of 44ns.
By comparison, when using an HP DL320 G5p server (Quad Xeon 2.13GHz) running Linux 2.6.26
and equipped with an HP NC326i PCIe dual-port gigabit network card as the host replying to the
ping requests, we measured an average round-trip time of 48.9µs with a standard deviation of 17.5µs.
NetThreads therefore offers a latency on average 9.6 times shorter, with a standard deviation 397 times
smaller.
6.3.2 Throughput
We begin by evaluating the raw performance that our system is capable of, when performing minimal
packet processing for tasks that are completely independent (i.e., unsynchronized). We estimate this
upper-bound by implementing a packet echo application that simply copies packets from an input port
to an output port. With minimum-sized packets of 64B, the echo program executes 300±10 dynamic
instructions per packet, and a single round-robin CPU can echo 124 thousand packets/sec (i.e., 0.07
Gbps). With 1518B packets, the maximum packet size allowable by Ethernet, each echo task requires
1300±10 dynamic instructions per packet. In this case, the generator software on the host server (see
C HAPTER 6. N ET T HREADS : A M ULTITHREADED S OFT M ULTIPROCESSOR 66
Figure 6.2: Throughput (in packets per second) measured on the NetFPGA with either 1 or 2 CPUs.
Section 6.2) sends copies of the same preallocated ping request packet through Libnet 1.4. With two
CPUs and 64B packets, or either one or two CPUs and 1518B packets, our PC-based packet generator
cannot generate packets fast enough to saturate our system (i.e., cannot cause packets to be dropped).
This amounts to more than 58 thousand packets/sec (>0.7 Gbps). Hence the scalability of our system
will ultimately be limited by the amount of computation per packet/task and the amount of parallelism
across tasks, rather than the packet input/output capabilities of our system.
Figure 6.2 shows the maximum packet throughput of our (real) hardware system with thread
scheduling. We find that our applications do not benefit significantly from the addition of a second
CPU due to increased lock and bus contention and cache conflicts: the second CPU either slightly
To reduce the number of designs that we would pursue in real hardware, and to gain greater insight
into the bottlenecks of our system, we developed a simulation infrastructure. While verified for timing
accuracy, our simulator cannot reproduce the exact order of events that occurs in hardware, hence there
is some discrepancy in the reported throughput. For example, Classifier has an abundance of control
paths and events that are sensitive to ordering such as routines for allocating memory, hash table access,
and assignment of mutexes to flow records. We depend on the simulator only for an approximation of
C HAPTER 6. N ET T HREADS : A M ULTITHREADED S OFT M ULTIPROCESSOR 67
Figure 6.3: Breakdown of how cycles are spent for each instruction (on average) in simulation.
To obtain a deeper understanding of the bottlenecks of our system, we use our simulator to obtain a
breakdown of how cycles are spent for each instruction, as shown in Figure 6.3. In the breakdown,
a given cycle can be spent executing an instruction (busy), awaiting a new packet to process (no
packet), awaiting a lock owned by another thread (locked), squashed due to a mispredicted branch
or a preceding instruction having a memory miss (squashed), awaiting a pipeline hazard (hazard
bubble), or aborted for another reason (other, memory misses or bus contention). The fraction of time
spent waiting for packets (no packet) is significant and we verified in simulation that it is a result of
the worst-case processing latency of a small fraction of packets, since the packet rate is held constant at
In Table 6.2, we measure several properties of the computation done per packet in our system.
First, we observe that task size (measured in dynamic instructions per second) has an extremely large
variance (the standard deviation is larger than the mean itself for all three applications). This high
variance is partly due to our applications being best-effort unpipelined C code implementations, rather
than finely hand-tuned in assembly code as packet processing applications often are. We also note that
the applications spend over 90% of the packet processing time either awaiting synchronization or within
critical sections (dynamic synchronized instructions), which limits the amount of parallelism and the
overall scalability of any implementation, and in particular explains why our two CPU implementation
C HAPTER 6. N ET T HREADS : A M ULTITHREADED S OFT M ULTIPROCESSOR 68
Table 6.2: Application statistics (mean±standard-deviation): dynamic instructions per packet, dynamic
synchronized instructions per packet (i.e., in a critical section) and number of unique synchronized
memory read and write accesses.
Figure 6.4: Throughput in packets per second for NAT as we increase the tolerance for dropping packets
from 0 to 5%, with either 1 or 2 CPUs.
provides little additional benefit over a single CPU in Figure 6.2. These results motivate future work to
Our results so far have focused on measuring throughput when zero packet drops are tolerated
(over a five second measurement). However, we would expect performance to improve significantly for
measurements when packet drops are tolerated. In Figure 6.4, we plot throughput for NAT as we increase
the tolerance for dropping packets from 0 to 5%, and find that this results in dramatic performance
collaborate to perform the same application and synchronization has been inserted in our benchmarks
to keep shared data coherent. With multiple packets serviced at the same time and multiple packet
C HAPTER 6. N ET T HREADS : A M ULTITHREADED S OFT M ULTIPROCESSOR 69
flows tracked inside a processor, the shared data accessed by all threads is not necessarily the same,
and can sometimes be exclusively read by some threads. In those cases, critical sections may be overly
conservative by preventively reducing the number of threads allowed in a critical section. In our baseline
measurements, we observe a large fraction of cycles spent waiting on locks and on packets to arrive in
Figure 6.3, i.e. a low processor utilization, along with the large throughput sensitivity to tolerating
momentary packet contention on the input packet buffer. These factors are indicative that the threads
are often blocked on synchronization, given that we measured that our system can have a CPI of 1
without memory misses (Section 4.5) and that it can sustain an almost full link utilization of network
traffic (Section 6.3.2). We therefore find that in our benchmarks, and we expect this to be the case for
applications with a pool of threads executing the same program, synchronization is a severe bottleneck,
NetThreads is available online [81, 82]. NetThreads has been downloaded 392 times at the time of this
writing and the authors have supported users from sites in Canada, India, and the United States. In a
tutorial in our university, computer science students exposed for the first time to NetThreads were able
to successfully run regression tests within 5 minutes by compiling a C program and uploading it along
with the NetThreads bitfile (no CAD tools required, and other executables are supplied).
With the participation of academic partners, in particular Professor Ganjali’s group and Professor
Jacobsen’s group and at the University of Toronto, we have developed proof-of-concept applications
group in Computer and Electrical Engineering at the University of Toronto [126]. Pub-
lish/subscribe applications are part of an emergent technology that enables among others the
ubiquitous news feeds on the Internet, the timely dissemination of strategic information for rescue
missions and the broadcasting of financial data to trading agents. Network intrusion detection
Ganjali’s group in Computer Science at the University of Toronto [44, 127]. This hardware
accelerated application enables precise traffic flow modeling that is vital to creating accurate
network simulations and for network equipment buyers to make cost-effective decisions. The
bandwidth and precision required for this system exceed what a conventional computer can
deliver.
Chapter 7
Chapters 4 and 5 have demonstrated that support for multithreading in soft processors can tolerate
pipeline and I/O latencies as well as improve overall system throughput—however this earlier work
assumes an abundance of completely independent threads to execute. In this chapter, we show that
for real workloads, in particular packet processing applications, there is a large fraction of processor
cycles wasted while awaiting the synchronization of shared data structures, limiting the benefits of
hardware that allows the multithreaded pipeline to be more fully utilized without significant costs in area
or frequency. We evaluate our technique relative to conventional multithreading using both simulation
and a real implementation on a NetFPGA board, evaluating three deep-packet inspection applications
that are threaded, synchronize, and share data structures, and show that overall packet throughput can be
increased by 63%, 31%, and 41% for our three applications compared to the baseline system presented
Prior work [35, 43, 106] and Chapters 4 and 5 have demonstrated that supporting multithreading can be
very effective for soft processors. In particular, by adding hardware support for multiple thread contexts
(i.e., by having multiple program counters and logical register files) and issuing an instruction from
71
C HAPTER 7. FAST C RITICAL S ECTIONS VIA T HREAD S CHEDULING 72
a different thread every cycle in a round-robin manner, a soft processor can avoid stalls and pipeline
bubbles without the need for hazard detection logic [43, 76]: a pipeline with N stages that supports
N − 1 threads can be fully utilized without hazard detection logic [76]. Such designs are particularly
well-suited to FPGA-based processors because (i) hazard detection logic can often be on the critical path
and can require significant area [83], and (ii) using the block RAMs provided in an FPGA to implement
compelling because it can tolerate memory and I/O latency [84], as well as the compute latency of
custom hardware accelerators [106]. In prior work and Chapters 4 and 5, it is generally assumed
independent benchmark kernels [35, 43, 76, 84, 106]. However, in real systems, threads will likely share
memory and communicate, requiring (i) synchronization between threads, resulting in synchronization
latency (while waiting to acquire a lock) and (ii) critical sections (while holding a lock).
synchronization latency, the simple round-robin thread-issue schemes used previously fall short for
two reasons: (i) issuing instructions from a thread that is blocked on synchronization (e.g., spin-loop
instructions or a synchronization instruction that repeatedly fails) wastes pipeline resources; and (ii) a
thread that currently owns a lock and is hence in a critical section only issues once every N − 1 cycles
(assuming support for N − 1 thread contexts), exacerbating the synchronization bottleneck for the whole
system.
The closest work on thread scheduling for soft processors that we are aware of is by Moussali et.
al. [106] who use a table of pre-determined worst-case instruction latencies to avoid pipeline stalls.
Our technique can handle the same cases while additionally prioritizing critical threads and handling
unpredictable latencies. In the ASIC world, thread scheduling is an essential part of multithreading
with synchronized threads [139]. The IXP [57] family of network processors use non-preemptive thread
scheduling where threads exclusively occupy the pipeline until they voluntarily de-schedule themselves
when awaiting an event. Other examples of in-order multithreaded processors include the Niagara [73]
and the MIPS 34K [72] processors where instructions from each thread wait to be issued in a dedicated
pipeline stage. While thread scheduling and hazard detection are well studied in general (operating
C HAPTER 7. FAST C RITICAL S ECTIONS VIA T HREAD S CHEDULING 73
systems provide thread management primitives [109] and EPIC architectures, such as IA-64 [58],
bundle independent instructions to maximize instruction-level parallelism), our goal is to achieve thread
scheduling efficiently in the presence of synchronization at the fine grain required to tolerate pipeline
hazards.
A multithreaded processor has the advantage of being able to fully utilize the processor pipeline by
issuing instructions from different threads in a simple round-robin manner to avoid stalls and hazards.
However, for real workloads with shared data and synchronization, one or more threads may often
spin awaiting a lock, and issuing instructions from such threads is hence a waste of pipeline resources.
Also, a thread which holds a lock (i.e., is in a critical section) can potentially be the most important,
since other threads are likely waiting for that lock; ideally we would allocate a greater share of pipeline
resources to such threads. Hence in this section we consider methods for scheduling threads that are
more sophisticated than round-robin but do not significantly increase the complexity nor area of our soft
multithreaded processor.
The most sophisticated possibility would be to give priority to any thread that holds a critical lock,
and otherwise to schedule a thread having an instruction that has no hazards with current instructions
in the pipeline. However, this method is more complex than it sounds due to the possibility of nested
critical sections: since a thread may hold multiple locks simultaneously, and more than one thread
may hold different locks, scheduling such threads with priorities is very difficult and could even result
in deadlock. A correct implementation of this aggressive scheduling would likely also be slow and
Instead of giving important threads priority, in our approach we propose to only de-schedule any
thread that is awaiting a lock. In particular, any such thread will no longer have instructions issued
until any lock is released in the system—at which point the thread may spin once attempting to acquire
the lock and if unsuccessful it is blocked again.1 Otherwise, for simplicity we would like to issue
To implement this method of scheduling we must first overcome two challenges. The first is
relatively minor: to eliminate the need to track long latency instructions, our processors replay
instructions that miss in the cache rather than stalling (Chapter 5 [84]). With non-round-robin thread
scheduling, it is possible to have multiple instructions from the same thread in the pipeline at once—
hence to replay an instruction, all of the instructions for that thread following the replayed instruction
The second challenge is greater: to support any thread schedule other than round-robin means that
there is a possibility that two instructions from the same thread might issue with an unsafe distance
between them in the pipeline, potentially violating a data or control hazard. We consider several methods
for re-introducing hazard detection. First, we could simply add hazard detection logic to the pipeline—
but this would increase area and reduce clock frequency, and would also lead to stalls and bubbles
in the pipeline. Second, we could consult hazard detection logic to find a hazard-free instruction to
issue from any ready thread—but this more complex approach requires the addition of an extra pipeline
stage, and we demonstrate that it does not perform well. A third solution, which we advocate in this
chapter, is to perform static hazard detection by identifying hazards at compile time and encoding hazard
information into the instructions themselves. This approach capitalizes on unused bits in block RAM
words on FPGAs2 to store these hazard bits, allowing us to fully benefit from more sophisticated thread
scheduling.
With the ability to issue from any thread not waiting for a lock, the thread scheduler must ensure
that dependences between the instructions from the same thread are enforced, either from the branch
target calculation to the fetch stage, or from the register writeback to the register read. The easiest
way to avoid such hazards is to support forwarding lines. By supporting forwarding paths between the
writeback and register-read stages of our pipeline, we can limit the maximum hazard distance between
Our scheduling technique consists of determining hazards at compile-time and inserting hazard
distances as part of the instruction stream. Because our instructions are fetched from off-chip DDR2
2 Extrabits in block RAMs are available across FPGA vendors: block RAMs of almost all granularities are configured in
widths that are multiples of nine bits, while processors normally have busses multiples of eight bits wide.
C HAPTER 7. FAST C RITICAL S ECTIONS VIA T HREAD S CHEDULING 75
Figure 7.1: Example insertion of hazard distance values. The 3 register operands for the instructions
are, from left to right, a destination and two source registers. Arrows indicate register dependences,
implying that the corresponding instructions must be issued with at least two pipeline stages between
them. A hazard distance of one encoded into the second instruction commands the processor to ensure
that the third instruction does not issue until cycle 3, and hence the fourth instruction cannot issue until
cycle 4.
memory into our instruction cache, it is impractical to have instructions wider than 32 bits. We
therefore compress instructions to accommodate the hazard distance bits in the program executable, and
decompress them as they are loaded into the instruction cache. We capitalize on the unused capacity of
block RAMs, which have a width multiple of 9 bits—to support 32-bit instructions requires four 9-bit
block RAMs, hence there are 4 spare bits for this purpose.
To represent instructions in off-chip memory in fewer than 32 bits, we compress them according to
the three MIPS instruction types [19]: for the R-type, we merge the function bit field into the opcode
field and discard the original function bit field; for the J-type instructions, we truncate the target bit field
to use fewer than 26 bits; and for the I-type, we replace the immediate values by their index in a lookup
table that we insert into our FPGA design. To size this lookup table, we found that there are usually
more than 1024 unique 16-bit immediates to track, but that 2048 entries is sufficient to accommodate
the union of the immediates of all our benchmarks. Therefore, the instruction decompression in the
processor incurs a cost of some logic and 2 additional block RAMs3 , but not on the critical path of the
processor pipeline. After compression, we can easily reclaim 4 bits per instruction: 2 bits are used to
encode the maximum hazard distance, and 2 bits are used to identify lock request and release operations.
Our compiler automatically sets these bits accordingly when it recognizes memory-mapped accesses for
the locks.
3 For ease of use, the immediate lookup table could be initialized as part of the loaded program executable, instead of
currently, the FPGA bit file.
C HAPTER 7. FAST C RITICAL S ECTIONS VIA T HREAD S CHEDULING 76
An example of code with hazard distances is shown in Figure 7.1: the compiler must account for the
distances inserted between the previous instructions to avoid inserting superfluous hazard distances.
It first evaluates hazards with no regard to control flow: if a branch is not-taken, the hazards will
be enforced by the hazard distance bits; if the branch is taken, the pipeline will be flushed from
instructions on the mispredicted path and hazards will be automatically avoided. If we extended our
processors to be able to predict taken branches, we would have to recode the hazard distance bits to
account for both execution paths. As a performance optimization, we insert a hazard distance of 2
for unconditional jumps to prevent the processor from fetching on the wrong path, as it takes 2 cycles
to compute the branch target (indicated by the arrows from the E to F pipeline stages in Figure 7.2).
In our measurements, we found it best not to insert any hazard distance on conditional branches; an
improvement would consist of using profile feedback to selectively insert hazard distances on mostly
taken branches. With more timing margin in our FPGA design, we could explore other possible
refinements in the thread scheduler to make it aware of load misses and potentially instantiate lock
priorities.
At runtime upon instruction fetch, the hazard distances are loaded into counters that inform the
scheduler about hazards between instructions in unblocked threads as illustrated in Figure 7.2. When
no hazard-free instruction is available for issue (Figure 7.2c), the scheduler inserts a pipeline bubble. In
our processors with thread scheduling, when two memory instructions follow each other and the first
one misses, we found that the memory miss signal does not have enough timing slack to disable the
second memory access while meeting our frequency requirements. Our solution is to take advantage of
our hazard distance bits to make sure that consecutive memory instructions from the same thread are
Note that in off-the-shelf soft processors, the generic hazard detection circuitry identifies hazards
at runtime (potentially with a negative impact on the processor frequency) and inserts bubbles in the
pipeline as necessary. To avoid such bubbles in multithreaded processors, ASIC implementations [72,
73] normally add an additional pipeline stage for the thread scheduler to select hazard-free instructions.
We evaluate the performance of this approach along with ours in the next section.
C HAPTER 7. FAST C RITICAL S ECTIONS VIA T HREAD S CHEDULING 77
Thread:Inst. Haz.Dist.
T0:i0 2 F R E M W
T1:i0 0 F R E M W
T2:i0 0 F R E M W
T0:i1 0 F R E M W
T1:i1 0 F R E M W
Time
a) T3 is de−scheduled, awaiting a lock.
T0:i0 2 F R E M W
T1:i0 0 F R E M W
T1:i1 0 F R E M W
T0:i1 F R E M W
0
T1:i2 F R E M W
0
Time
b) T2 and T3 are de−scheduled, awaiting a lock.
T0:i0 2 F R E M W
T0:i1 0 F R E M W
T0:i2 F R E M W
0
T0:i3 0 F R E M W
Time
c) T1, T2, and T3 are de−scheduled, awaiting a lock.
Figure 7.2: Examples using hazard distance to schedule threads. The pipeline stages are: F for fetch,
R for register, E for execute, M for memory, and W for write-back. The arrows indicate potential branch
target (E to F) or register dependences (W to R).
C HAPTER 7. FAST C RITICAL S ECTIONS VIA T HREAD S CHEDULING 78
In this chapter, our application characteristics are unchanged from Table 6.2 and Figure 7.3(a) shows
the maximum packet throughput of our simulated system (Section 6.3.3), normalized to that of a single
round-robin CPU; these results estimate speedups for our scheduling on a single CPU (S1) of 61%,
57% and 47% for UDHCP, Classifier and NAT respectively.4 We also used the simulator to estimate
the performance of an extra pipeline stage for scheduling (E1 and E2, as described in Section 7.2), but
find that our technique dominates in every case: the cost of extra squashed instructions for memory
misses and mispredicted branches for the longer pipeline overwhelms any scheduling benefit—hence
Figure 7.3(b) shows the maximum packet throughput of our (real) hardware system, normalized
to that of a single round-robin CPU. We see that with a single CPU our scheduling technique (S1)
significantly out-performs round-robin scheduling (RR1) by 63%, 31%, and 41% across the three
applications. However, we also find that our applications do not benefit significantly from the addition
of a second CPU due to increased lock and bus contention, and reduced cache locality: for Classifier
two round-robin CPUs (RR2) is 16% better, but otherwise the second CPU either very slightly improves
or degrades performance, regardless of the scheduling used. We also observe that our simulator
(Figure 7.3(a)) indeed captures the correct relative behaviour of the applications and our system.
Comparing two-CPU full system hardware designs, the round-robin implementation consumes 163
block RAMs (out of 232, i.e., 70% of the total capacity) compared to 165 blocks (71%) with scheduling:
two additional blocks are used to hold the lookup table for instruction decoding (as explained in
Section 7.2). The designs occupy respectively 15,891 and 15,963 slices (both 67% of the total capacity)
when optimized with high-effort for speed. Considering only a single CPU, the post-place & route
timing results give an upper bound frequency of 136MHz for the round-robin CPU and 129MHz for
scheduling. Hence the overall overhead costs of our proposed scheduling technique are low, with a
Much like in Section 6.3.3, to obtain a deeper understanding of the bottlenecks of our system,
4 We will report benchmark statistics in this order from this point on.
C HAPTER 7. FAST C RITICAL S ECTIONS VIA T HREAD S CHEDULING 79
Figure 7.3: Throughput (in packets per second) normalized to that of a single round-robin CPU. Each
design has either round-robin scheduling (RR), our proposed scheduling (S), or scheduling via an extra
pipeline stage (E), and has either 1 or 2 CPUs.
C HAPTER 7. FAST C RITICAL S ECTIONS VIA T HREAD S CHEDULING 80
Figure 7.4: Average cycles breakdown for each instruction at the respective maximum packet rates from
Figure 7.3(a).
we use our simulator to obtain a breakdown of how cycles are spent for each instruction, as shown in
Figure 7.4. In the breakdown, a given cycle can be spent executing an instruction (busy), awaiting a new
packet to process (no packet), awaiting a lock owned by another thread (locked), squashed due to a
mispredicted branch or a preceding instruction having a memory miss (squashed), awaiting a pipeline
hazard (hazard bubble), or aborted for another reason (other, memory misses or bus contention).
Figure 7.4 shows that our thread scheduling is effective at tolerating almost all cycles spent spinning
for locks. The fraction of time spent waiting for packets (no packet) is reduced by 52%, 47%, and
48%, a result of reducing the worst-case processing latency of packets: our simulator reports that the
task latency standard deviation decreases by 34%, 33%, and 32%. The fraction of cycles spent on
squashed instructions (squashed) becomes significant with our proposed scheduling: recall that if one
instruction must replay that we must also squash and replay any instruction from that thread that has
already issued. The fraction of cycles spent on bubbles (hazard bubble) also becomes significant: this
indicates that the CPU is frequently executing instructions from only one thread, with the other threads
While our results in this chapter have focused on measuring throughput when zero packet drops are
tolerated (over a five second measurement), we are interested in re-doing the experiment from Figure 6.4.
Again, we would expect performance to improve significantly for measurements when packet drops are
C HAPTER 7. FAST C RITICAL S ECTIONS VIA T HREAD S CHEDULING 81
Figure 7.5: Throughput in packets per second for NAT as we increase the tolerance for dropping packets
from 0 to 5%. Each design has either round-robin scheduling (RR) or our proposed scheduling (S) and
has either 1 or 2 CPUs.
tolerated. In Figure 7.5, we plot throughput for NAT as we increase the tolerance for dropping packets
from 0 to 5%, and find that this results in dramatic performance improvements for both fixed round-
robin (previously shown in Figure 6.4) and our more flexible thread scheduling—again confirming our
7.4 Summary
In this chapter, we show that previously studied multithreaded soft processors with fixed round-robin
thread interleaving can spend a significant amount of cycles spinning for locks when all the threads
contribute to the same application and have synchronization around data dependences. We thus
demonstrate that thread scheduling is crucial for multithreaded soft processors executing synchronized
workloads. We present a technique to implement a more advanced thread scheduling that has minimal
area and frequency overhead, because it capitalizes on features of the FPGA fabric. Our scheme builds
on static hazard detection and performs better than the scheme used in ASIC processors with hazard
detection logic because it avoids the need for an additional pipeline stage. Our improved handling of
critical sections with thread scheduling improves the instruction throughput which results in reduced
processing latency average and variability. Using a real FPGA-based network interface, we measured
packet throughput improvements of 63%, 31% and 41% for our three applications.
For the remainder if this thesis, to eliminate the critical path for hazard detection logic, we employ
C HAPTER 7. FAST C RITICAL S ECTIONS VIA T HREAD S CHEDULING 82
the static hazard detection scheme presented in this chapter [77] both in our architecture and compiler.
While we were able to reclaim an important fraction of processor cycles, Figure 7.4 shows that our
processors are still very under-utilized, motivating more aggressive synchronization techniques to
increasing number of processor and accelerator cores, supporting sharing and synchronization in a
way that is scalable and easy to program becomes a challenge. As we first discuss in this chapter,
Transactional memory (TM) is a potential solution to this problem, and an FPGA-based system provides
the opportunity to support TM in hardware (HTM). Although there are many proposed approaches to
support HTM for ASIC multicores, these do not necessarily map well to FPGA-based soft multicores.
We propose NetTM: support for HTM in an FPGA-based soft multithreaded multicore that matches
the strengths of FPGAs—in particular by careful selection of TM features such as version and contention
management, and with conflict detection via support for application-specific signatures. We evaluate
our system using the NetFPGA [91] platform and four network packet processing applications that
are threaded and shared-memory. Relative to NetThreads [81], an existing two-processor four-way-
multithreaded system with conventional lock-based synchronization, we find that adding HTM support
(i) maintains a reasonable operating frequency of 125MHz with an area overhead of 20%, (ii) can
“transactionally” execute lock-based critical sections with no software modification, and (iii) achieves
6%, 54% and 57% increases in packet throughput for three of four packet processing applications
83
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 84
In this section, we motivate both quantitatively and qualitatively the need for having parallel programs
that can be synchronized efficiently without sacrificing performance, i.e. in which TM could play a
critical role.
Conventional lock-based synchronization is both difficult to use and also often results in frequent
While systems based on shared memory can ease the orchestration of sharing and communication
between cores, they require the careful use of synchronization (i.e., lock and unlock operations).
Consequently, threads executing in parallel wanting to enter the same critical section (i.e., a portion of
code that accesses shared data delimited by synchronization) will be serialized, thus losing the parallel
advantage of such a system. Hence designers face two important challenges: (i) multiple processors
need to share memory, communicate, and synchronize without serializing the execution, and (ii) writing
parallel programs with manually inserted lock-based synchronization is error-prone and difficult to
debug.
Alternatively, we propose that Transactional memory (TM) is a good match to software packet
processing: it both (i) can allow the system to optimistically exploit parallelism between the processing
of packets and reduce false contention on critical sections whenever it is safe to do so, and (ii) offers
an easier programming model for synchronization. A TM system optimistically allows multiple threads
inside a critical section—hence TM can improve performance when the parallel critical sections access
independent data locations. With transactional execution, a programmer is free to employ coarser
critical sections, spend less effort minimizing them, and not worry about deadlocks since a properly
implemented TM system does not suffer from them. To guarantee correctness, the underlying system
dynamically monitors the memory access locations of each transaction (the read set and write set) and
detects conflicts between them. While TM can be implemented purely in software (STM), a hardware
implementation (HTM) offers significantly lower performance overhead. The key question is: how
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 85
with interconnected processor or accelerator cores that synchronize and share memory?
In NetThreads, each processor has support for thread scheduling (see Chapter 7): when a thread
cannot immediately acquire a lock, its slot in the round-robin order can be used by other threads
until an unlock operation occurs—this leads to better pipeline utilization by minimizing the execution
provides a synchronization unit containing 16 hardware mutexes; in our ISA, each lock/unlock operation
specifies a unique identifier, indicating one of these 16 mutexes. In Section 8.7.5, through simulation, we
find that supporting an unlimited number of mutexes would improve the performance of our applications
by less than 2%, except for Classifier which would improve by 12%. In the rest of this section,
we explain that synchronization itself can in fact have a much larger impact on throughput for our
benchmarks and we explain how is this also a challenge for many emerging other packet processing
applications.
Although it depends on the application, in many cases, only the processing of packets belonging to
the same flow (i.e., a stream of packets with common application-specific attributes, such as addresses,
protocols, or ports) results in accesses to the same shared state. In other words, there is often parallelism
available in the processing of packets belonging to independent flows. Melvin et al. [99] show that for
two NLANR [116] packet traces the probability of having at least two packets from the same flow in
a window of 100 consecutive preceding packets is approximately 20% and 40%. Verdú et al. [142]
further show that the distance between packets from the same flow increases with the amount of traffic
aggregation on a link, and therefore generally with the link bandwidth in the case of wide area networks.
To demonstrate the potential for optimistic parallelism in our benchmark applications, we profiled
them using our full-system simulator (Section 6.3.3). In particular, what we are interested in is how
often the processing of packets has a conflict—i.e. for two threads each processing a packet, their
write sets intersect or a write set intersects a read set. In Figure 8.1, we show the average fraction of
packets for which their processing conflicts for varying windows of 2 to 16 packets: while the number
of dependent packets increases with the window size, the number increases very slowly because of
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 86
Figure 8.1: Average fraction of conflicting packet executions for windows of 2 to 16 consecutive packets.
the mixing of many packet flows in the packet stream so the fraction (or ratio) of dependent packets
does diminish with the increasing window size. For three of our applications, NAT, Classifier, and
Intruder2, the fraction of conflicting packet-executions varies from around 20% to less than 10%
as the window considered increases from 2 to 16 packets, indicating two important things: first, that
80% to 90% of the time, and second that there is hence a strong potential for optimistic synchronization
for these applications. For UDHCP, our profile indicates that nearly all packet-executions conflict. In
reality, UDHCP contains several critical sections, some that do nearly always conflict, but many others
that do not conflict—hence the potential for optimistic parallelism exists even for UDHCP. Now with
the knowledge that synchronization is generally conservative for our packet processing applications, we
are interested in knowing if synchronization would also be a problem when parallelizing other, more
The Internet has witnessed a transformation from static web pages to social networking and peer-
to-peer data transfers. This transformation of user behavior patterns requires network connectivity
providers to constantly and proactively adapt their services to adequately provision and secure their
infrastructure. As networking requirements evolve constantly, many network equipment vendors opt for
network processors to implement functions that are likely to change over time. Because many network
protocols are now considered outdated, there is even a desire to have vendors open the hardware to accept
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 87
user/researcher code [41]. However, once faced with board-specific reference code, programmers are
often hesitant to edit it, in part due to the challenge of modifying the synchronization and parallelization
With multicore processors becoming commodity, general-purpose processors are closing the
performance gap with network processors for network-edge applications [59] fueling an increased
industry use of open-source software that than can turn an off-the-shelf computer into a network device;
examples include the Click Modular router [105], Snort [124], Quagga [61], and the XORP project [52].
Those applications are often coded as a single thread of computation with many global data structures
that are unsynchronized—hence porting them to multicore is a substantial challenge when performance
depends on constructing finer-grained synchronized sections. There is therefore a need for simpler
In this section, we show that the processing of different packets is not always independent, and
the application must maintain some shared state. We also show that protecting this shared state with
traditional synchronization leads to conservative serializations of parallel threads in most instances and,
beyond this quantitative argument, there is a demand for easier parallelism to scale the performance
of packet processing applications to multicores. Because transactional memory can improve both
performance and ease of programming, we are motivated to find ways to enable it in our soft processor-
There is a large body of work studying the design space for HTM in ASIC multicores [18, 94]. Earlier
proposed transactional memory implementations [51] buffer transactional writes and make them visible
only at the end of a transaction. They augment their caches with speculatively read and written bits
and rely on iterating in the cache for commits and instantaneous clear of speculative cache tag bits.
Later extensions [12, 120] allow caches to overflow speculative data to a specified location in main
memory. In these proposals, long transactions were required to amortize commit overheads, but were
in turn more likely to conflict. For the Rock processor [34] (a commercial ASIC implementation of
HTM), it was found that most of these proposals were too complex to implement given the other
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 88
types of speculation already present in the processor and a limited version of TM was implemented
that doesn’t support function calls in transactions. LOGTM-SE [158], the most similar approach to
ours, decouples conflict detection from the caches to make commits faster. In contrast with LOGTM-
SE, which handles transaction conflicts in software and is evaluated with only small transactions,
we alleviate the programmer from having to modify lock-based code to support transactions, and we
Most previous FPGA implementations of HTM were used to prototype an ASIC design [48,
134, 147]—as opposed to targeting the strengths of an FPGA as a final product. To provide a low-
overhead implementation, our work also distinguishes itself in the type of TM that we implement and
in the way that we perform conflict detection. To track transactional speculative state, prior FPGA
implementations [48, 63, 147] use (i) extra bits per line in a private cache per thread or in a shared
cache, and (ii) lazy version management (i.e., regular memory is modified only upon commit), and (iii)
lazy conflict detection (i.e., validation is only performed at commit time). These approaches are not a
good match for product-oriented FPGA-based systems because of the significant cache storage overhead
The most closely related work to ours is the CTM system [63] which employs a shared-cache
and per-transaction buffers of addresses to track accesses. However, the evaluation of CTM is
limited to at-most five modified memory locations per transaction—in contrast, NetTM supports more
than a thousand per transaction. Rather than using off-the-shelf soft cores as in CTM and other
work [48, 134, 147] and thus requiring the programmer to explicitly mark each transactional access
in the code, in NetTM we integrate TM with each soft processor pipeline and automatically handle
loads and stores within transactions appropriately. This also allows NetTM to transactionally execute
lock-based critical sections, assuming that the programmer has followed simple rules (described later in
Section 8.3).
Specifying Transactions TM semantics [54] imply that any transaction will appear to have executed
atomically with respect to any other transaction. For NetTM, as in most TM systems, a transactional
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 89
critical section is specified by denoting the start and end of a transaction, using the same instruction API
as the lock-based synchronization for NetThreads—i.e., lock(ID) can mean “start transaction” and
unlock(ID) can mean “end transaction”. Hence existing programs need not be modified, since NetTM
can use existing synchronization in the code and simply interpret critical sections as transactional. We
next describe how the lock identifier, ID in the previous example, is interpreted by NetTM.
Locks vs Transactions NetTM supports both lock-based and TM-based synchronization, since a code
region’s access patterns can favor one approach over the other. For example, lock-based critical sections
are necessary for I/O operations since they cannot be undone in the event of an aborted transaction:
specifically, for processor initialization, to protect the sequence of memory-mapped commands leading
to sending a packet, and to protect the allocation of output packet memory (see later in this section). We
use the identifier associated with lock/unlock operations to distinguish whether a given critical section
should be executed via a transaction or via traditional lock-based synchronization: this mapping of the
identifiers is provided by the designer as a parameter to the hardware synchronization unit (Figure 8.4).
When writing a deadlock-free application using locks, a programmer would typically need to examine
carefully which identifier is used to enter a critical section protecting accesses to which shared variables.
NetTM simplifies the programmer’s task when using transactions: NetTM enforces the atomicity of all
transactions regardless of the lock identifier value. Therefore only one identifier can be designated to be
of the transaction type, and doing so frees the remaining identifiers/mutexes to be used as unique locks.
However, to support legacy software, a designer is also free to designate multiple identifiers to be of the
transaction type.
Composing Locks and Transactions It is desirable for locks and transactions in our system to be
composable, meaning they may be nested within each other. For example, to atomically transfer a
record between two linked lists, the programmer might nest existing atomic delete and insert operations
within some outer critical section. NetTM supports composition as follows. Lock within lock is
straightforward and supported. Transaction within transaction is supported, and the start/end of the
inner transaction are ignored. As opposed to lock-based synchronization, a deadlock is therefore not
possible across transactions. NetTM uses a per-thread hardware counter to track the nesting level of
lock operations to decide when to start/commit a transaction, or acquire/release a lock in the presence of
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 90
(a) Undefined behavior: nest- (b) Undefined behavior: par- (c) Race condition: lock- (d) Race condition: transac-
ing lock-based synchroniza- tial nesting does not provide based critical sections are not tions are not atomic with re-
tion inside transaction. atomicity. atomic with respect to transac- spect to unsynchronized ac-
tions. cesses (i.e., to x).
Figure 8.2: Example mis-uses of transactions as supported by NetTM. ID TX identifies a critical section
as a transaction, and ID MUTEX as a lock. Critical section nesting occurs when a program is inside more
than one critical section is at a particular instant.
nesting. Lock within transaction is not supported as illustrated in Figure 8.2(a), since code within
a lock-based critical section should never be undone, and we do not support making transactions
irrevocable [47]. Transaction within lock is supported, although the transaction must be fully nested
within the lock/unlock, and will not be executed atomically—meaning that the transaction start/end are
essentially ignored, under the assumption that the enclosing lock properly protects any shared data.
Our full-system simulator can assist the programmer by monitoring the dynamic behavior of a program
and identifying the potentially unsafe nesting of transactions inside locked-based critical sections, as
exemplified in Figure 8.2(b). NetTM implements weak atomicity [96], i.e. it guarantees atomicity
between transactions or between lock-based critical sections, but not between transactions and non-
transactional code. Figures 8.2(c) and (d) shows examples of non-atomic accesses to the variable x
that could result in race conditions: while those accesses are supported and are useful in some cases
(e.g. to get a snapshot of a value at a particular instant), the programmer should be aware that they are
not thread safe, just like with traditional synchronization when using locks with different identifiers in
Memory Semantics NetTM comprises several on-chip memories (Table 6.1) with varying properties,
hence it is important to note the interaction of TM with each. The input buffer is not impacted by
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 91
TM since it is read-only, and it’s memory-mapped control registers should not be accessed within a
transaction. The output buffer can be both read and written, however it only contains packets that are
each accessed by only one thread at a time (since the allocation of output buffer packets is protected by a
regular lock). Hence for simplicity the output buffer does not support an undo log, and the programmer
must take into account that it does not roll-back on a transaction abort (i.e., within a transaction, a
program should never read an output buffer location before writing it).
Improving Performance via Feedback TM allows the programmer to quickly achieve a correct
threaded application. Performance can then be improved by reducing transaction aborts, using feedback
that pin-points specific memory accesses and data structures that caused the conflicts. While this
feedback could potentially be gathered directly in hardware, for now we use our cycle-accurate simulator
of NetTM to do so. For example, we identified memory management functions (malloc() and free())
as a frequent source of aborts, and instead employed a light-weight per-thread memory allocator that
Benchmark Applications Table 3.1 describes the nature of the parallelism in each application, and
Table 8.1 shows statistics on the dynamic accesses per transaction for each application (filtering will
be described later in Section 8.6). Note that the transactions comprise significant numbers of loads and
stores; the actual numbers can differ from the data in Table 6.2 because of the code transformations
There are many options in the design of a TM system. For NetTM our over-arching goals are to match
our design to the strengths and constraints of an FPGA-based system by striving for simplicity, and
minimizing area and storage overhead. In this section, we focus on options for version management,
which have a significant impact on the efficiency and overheads of adding support for HTM to an FPGA-
based multicore. Version management refers to the method of segregating transactional modifications
from other transactions and regular memory. For a simple HTM, the two main options for version
Lazy Version Management In a lazy approach, write values are saved in a write-buffer until the
transaction commits, when they are copied/flushed to regular memory. Any read must first check the
write-buffer for a previous write to the same location, which can add latency to read operations. To
support writes and fast reads, a write-buffer is often organized as a cache with special support for
conflicts (e.g., partial commit or spill to regular memory). CTM [63] (see Section 8.2) minimizes cache
line conflicts by indexing the cache via Cuckoo hashing, although this increases read latency. Because
lazy schemes buffer modifications from regular memory: (i) they can support multiple transactions
writing to the same location (without conflict), (ii) conflict detection for a transaction can be deferred
until it commits, (iii) aborts are fast because the write-buffer is simply discarded, and (iv) commits are
Eager Version Management In an eager approach, writes modify main memory directly and are not
buffered—therefore any conflicts must be detected before a write is performed. To support rollback for
aborts, a backup copy of each modified memory location must be saved in an undo-log. Hence when a
transaction aborts, the undo-log is copied/flushed to regular memory, and when a transaction commits,
the undo-log is discarded. A major benefit of an eager approach is that reads proceed unhindered and
can directly access main memory, and hence an undo-log is much simpler than a write-buffer since the
undo-log need only be read when flushing to regular memory on abort. Because eager schemes modify
regular memory directly (without buffering): (i) they cannot support multiple transactions writing to
the same location (this results in a conflict), (ii) conflict detection must be performed on every memory
access, (iii) aborts are slow because the undo-log must be flushed/copied to regular memory, and (iv)
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 93
Our Decision After serious consideration of both approaches, we concluded that eager version
management was the best match to FPGA-based systems such as NetTM for several reasons. First,
while similar in required number of storage elements, a write buffer is necessarily significantly more
complex than an undo-log since it must support fast reads via indexing and a cache-like organization.
Our preliminary efforts found that it was extremely difficult to create a write-buffer with single-cycle
read and write access. To avoid replacement from a cache-organized write-buffer (which in turn must
result in transaction stall or abort), it must be large or associative or both, and these are both challenging
for FPGAs. Second, an eager approach allows spilling transactional modifications from the shared data
cache to next level of memory (in this case off-chip), and our benchmarks exhibit large write sets as
shown in Table 8.1. Third, via simulation we observed that disallowing multiple writers to the same
memory location(s) (a limitation of an eager approach) resulted in only a 1% increase in aborts for our
applications in the worst case. Fourth, we found that transactions commit in the common case for our
A key consequence of our decision to implement eager version management is that we must be able to
detect conflicts with every memory access; hence to avoid undue added sources of stalling in the system,
we must be able to do so in a single cycle. This requirement led us to consider implementing conflict
detection via signatures, which are essentially bit-vectors that track the memory locations accessed by
a transaction via hash indexing [23], with each transaction owning two signatures to track its read and
write sets. Signatures can represent an unbounded set of addresses, and allow us to decouple conflict
detection from version management, and provide an opportunity to capitalize on the bit-level parallelism
of FPGAs. In Appendix A [75], we explored the design space of signature implementations for an
FPGA-based two-processor system (i.e., for two total threads), and proposed a method for creating
application-specific hash functions to reduce signature size without impacting their accuracy. In this
1 When measured at maximum saturation rate in our TM system with the default contention manager, we measure that 78%,
67%, 92% and 79% of the transactions commit without aborting for Classifier, NAT, Intruder, and UDHCP respectively.
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 94
Tx0 TxN
..
..
..
Undo−Log
Rd Wr Version Addr Data
Signature Table bit bit bits
Tx0
..
.. .. ..
.. .. .. ..
.. .. ..
TxN
..
..
..
x ..
row
t?
inde
Figure 8.3: Integration of conflict detection hardware with the processor pipeline for a memory access.
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 95
section, we briefly summarize our signature framework, and describe how we adapted the previously
proposed scheme to support a multithreaded multicore (ie., for eight total threads), as we walk through
Figure 8.3.
Application-Specific Hash Functions Signatures normally have many fewer bits than there are
memory locations, and comparing two signatures can potentially indicate costly false-positive conflicts
between transactions. Hence prior HTMs employ relatively large signatures—thousands of bits long—
to avoid such false conflicts. However, it is difficult to implement such large signatures in an FPGA
without impacting cycle time or taking multiple cycles to compare signatures. Our key insight was to
that could minimize false conflicts, by mapping the most contentious memory locations to different
signature bits, while minimizing the total number of signature bits. Our approach is based on trie
hashing (Appendix A), and we build a trie-based conflict detection unit by (i) profiling the memory
addresses accessed by an application, (ii) using this information to build and optimize a trie (i.e. a tree
based on address prefixes) that allocates more branches to frequently-conflicting address prefixes, and
(iii) implementing the trie in a conflict detection unit using simple combinational circuits, as depicted
by the “hash function” logic in Figure 8.3. The result of the hash function corresponds to a leaf in
the trie, and maps to a particular signature bit. Note that since NetTM implements weak atomicity
(see Section 8.3), only transactional memory accesses are recorded in signatures, which means that the
signature hash function depends on the memory profile of only the critical sections of our applications,
which are the parts of an embedded application which are often the least changing across revisions.
Signature Table Architecture In contrast with prior signature work on FPGAs [75], in NetTM we
store signatures in block RAMs. To detect conflicts, we must compare the appropriate signature bits
(as selected by the hash index) from every transaction in a single cycle—as shown in Figure 8.3, we
decided to have the hash function index the rows of a block RAM (identifying a single signature bit),
and to map the corresponding read and write bits for every transaction/thread-context across each row.
Therefore with one block RAM access we can read all of the read and write signature bits for a given
address for all transactions in parallel. A challenge is that we must clear the signature bits for a given
transaction when it commits or aborts, and it would be too costly to visit all of the rows of the block
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 96
00111100 00111100
synch. unit mutexes signatures
1
0 00
11 00
11 1
0 00
11 00
11
0
1 00
11 00
11 0
1 00
11 00
11
0
1 00
11 00
11
4−thread 0
1 00
11 00
11
4−thread
processor I$ processor I$
instr.
data
busses
input mem.
output mem.
to DDR2 SDRAM
Figure 8.4: The NetThreads architecture, currently with two processors. NetTM supports TM by
extending NetThreads, mainly with signatures and an undo-log.
RAM to do so. Instead we add to the signature table a version number per transaction (incremented
on commit or rollback), that allows us to compare to a register holding the true version number of the
current transaction for that thread context. Comparing version numbers produces a Valid signal that is
used to ignore the result of comparing signature bits when appropriate. We clear signature bits lazily:
for every memory reference a row of the signature table is accessed, and we clear the corresponding
signature bits for any transaction with mismatching version numbers. This lazy-clear works well in
practice, although it is possible that the version number may completely wrap-around before there is an
intervening memory reference to cause a clear, resulting in a false conflict (which hurts performance but
not correctness). We are hence motivated to support version numbers that are as large as possible.
In this section, we describe the key details of the NetTM implementation. As shown earlier in Figure 8.4,
the main additions over the NetThreads implementation are the signature table, the undo-log, and
Signature Table As shown in Figure 8.3, this data structure of fixed size (one read and one write
signature for each hardware thread context) requires concurrent read and write access. For the device
we target, the block RAMs are 36bits wide, and we determined experimentally that a signature table
composed of at most two block RAMs could be integrated in the TM processor pipeline while preserving
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 97
the 125MHz target frequency. We could combine the block RAMs horizontally to allow larger version
numbers, or vertically to allow more signature bits; we determined experimentally that the vertical
option produced the fewest false conflicts. Hence in NetTM each block RAM row contains a read bit, a
write bit, and two version bits (four bits total) for each of eight transactions/thread-contexts, for a total
length ranging from 618 to 904 bits for our applications (limited by the hash function size).
Undo-Log The undo-log is implemented as a single physical structure that is logically partitioned into
equal divisions per thread context. On a transaction commit, a per-thread undo-log can be cleared in one
cycle by resetting the appropriate write-pointer. On a transaction abort, the undo-log requests exclusive
access to the shared data cache, and flushes its contents to the cache in reverse order. This flush is
performed atomically with respect to any other memory access, although processors can continue to
execute non-memory-reference instructions during an undo-log flush. We buffer data in the undo-log
at a word granularity because that matches our conflict detection resolution. In NetTM, the undo-log
must be provisioned to be large enough to accommodate the longest transactions. A simple extension to
support undo-log overflows could consist of aborting all the other transactions to allow the transaction
Undo-Log Filter To minimize the required size of the undo-log as well as its flush latency on aborts,
we are motivated to limit the number of superfluous memory locations saved in the undo-log. Rather
than resort to a more complex undo-log design capable of indexing and avoiding duplicates, we instead
attempt to filter the locations that the undo-log saves. By examining our applications we found that a
large source of undo-log pollution was due to writes that need not be backed-up because they belong to
the uninitialized portion of stack memory. For example, the recursive regular expression matching code
in Classifier results in many such writes to the stack. It is straightforward to filter addresses backed-
up by the undo-log using the stack pointer of the appropriate thread context; to maintain the clock
frequency of our processors, we keep copies of the stack pointer values in a block RAM near the undo-
log mechanism. We found that such filtering reduced the required undo-log capacity requirement for
Classifier from 32k entries to less than 1k entries, as shown in Table 8.1 by comparing the maximum
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 98
number of stores per transaction and the filtered number (in parenthesis in the same column).2
Pipeline Integration To accommodate the additional signature logic and for NetTM hit/miss
processing, a memory access is split into two cycles (as shown in Figure 8.3)—hence NetTM has a
6-stage pipeline, although the results for non-memory instructions are ready in the 5th stage. Because
the data cache and the signature table are accessed in two different cycles, NetTM can potentially suffer
from additional stalls due to contention on the shared cache lines and signature read-after-write hazards.
While NetThreads’ non-transactional data cache does not allocate a cache line for write misses, NetTM
is forced to perform a load from memory on transactional write misses so that the original value may be
Transactional Registers As with writes to memory, all transactional modifications to the program
counter and registers must be undone on a transaction abort. The program counter is easily backed-up
in per-thread registers. The register file is actually composed of two copies, where for each register
a version bit tracks which register file copy holds the committed version and another bit tracks if the
register value was transactionally modified. This way all transactional modifications to the register file
can be committed in a single cycle by toggling the version bit without performing any copying.
Thread Scheduling The thread scheduler implemented in NetThreads allows better pipeline utilization:
when a thread cannot immediately acquire a lock, its slot in the round-robin order can be used by other
threads until an unlock operation occurs [77]. For NetTM we extended the scheduler to similarly de-
schedule a thread that is awaiting the contention manager to signal a transaction restart.
In this section, we evaluate the benefits of TM for soft multicores by comparing resource utilization
and throughput of NetTM (supporting TM and locks) relative to NetThreads (supporting only locks).
Because the underlying NetFPGA board is a network card, our application domain is packet processing
where threads execute continuously and only have access to a shared DDR2 SDRAM memory as the
2A programmer could further reduce the occupation of the undo-log with the alloca() function to lower the stack of a
transaction (in case the checkpoint event is deeply nested in a function call) and/or defer the allocation of stack variables until
after the checkpoint and alloca() calls.
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 99
last level of storage. We report the maximum sustainable packet rate for a given application as the
packet rate with 90% confidence of not dropping any packet over a five-second run—thus our results are
conservative given that network appliances are typically allowed to drop a small fraction of packets.3
We start by describing the TM resource utilization, then show how the baseline TM performance can be
improved via tuning and we finally compare our TM implementation with other approaches of exploiting
additional parallelism.
In total, NetTM consumes 32 more block RAMs than NetThreads, so its block RAM utilization is 71%
(161/232) compared to 57% (129/232) for NetThreads. The additional block RAMs are used as follows:
2 for the signature bit vectors, 2 for the log filter (to save last and checkpoint stack pointers) and 26 for
the for undo log (1024 words and addresses for 8 threads). NetThreads consumes 18980 LUTs (out of
47232, i.e. 40% of the total LUT capacity) when optimized with high-effort for speed; NetTM design
variations range from 3816 to 4097 additional LUTs depending on the application-specific signature
size, an overhead of roughly 21% over NetThreads. The additional LUTs are associated with the extra
In Figure 8.5, the first bar for each application (CM1) reports the packet rate for NetTM normalized to
that of NetThreads. NetTM improves packet throughput by 49%, 4% and 54% for Classifier, NAT,
and UDHCP respectively, by exploiting the optimistic parallelism available in critical sections. The TM
speedup is the result of reduced time spent awaiting locks, but moderated by the number of conflicts
between transactions and the time to recover from them. Classifier has occasional long transactions
that do not always conflict, providing an opportunity for reclaiming the corresponding large wait times
with locks. Similarly, UDHCP has an important fraction of read-only transactions that do not conflict.
NAT has a less pronounced speedup because of less-contended shorter transactions. Despite having a
high average commit rate, for Intruder TM results in lower throughput due to bursty periods of large
3 We use a 10-point FFT smoothing filter to determine the interpolated intercept of our experimental data with the 90%
threshold (see Figure 8.6 for an example).
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 100
Figure 8.5: Packet throughput of NetTM normalized to NetThreads, with varying contention managers
(see Section 8.7.3) that restart: at most one aborted transaction at a time (CM1), conditionally one or two
(CM1/2), at most two (CM2), or at most four (CM4).
transactions during which there are repeated aborts, leading to limited parallelism and a reduced packet
rate. Two indicators could have predicted this behavior: (i) across our benchmarks, Intruder has
the highest CPU utilization with locks-only, meaning that wait times for synchronization are smaller
and true TM conflicts will be harmful when large transactions are aborted; and (ii) Intruder has a
accesses, limiting the effectiveness of application-specific signatures and leading to increased false
conflicts. Furthermore, the throughput of Intruder with locks-only can be improved by 30% by
reducing contention through privatizing key data structures: we named this optimized version of the
benchmark Intruder2 (Table 8.1). Despite having a high average commit rate, Intruder2 has a
throughput reduction of 8% (Figure 8.7 in Section 8.7.4): optimistic parallelism is difficult to extract
in this case because Intruder2 has short transactions and an otherwise high CPU utilization such that
any transaction abort directly hinders performance. This demonstrates that TM isn’t necessarily the best
option for every region of code or application. One advantage of NetTM is that the programmer is free
to revert to using locks since NetTM integrates support for both transactions and locks.
UDHCP: A Closer Look Figure 8.6 shows for UDHCP the probability of not dropping packets as a
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 101
Figure 8.6: For UDHCP, the probability of no packet drops vs packet throughput for both NetThreads
and NetTM, normalized to NetThreads. The packet throughput matches the incoming packet rate for a
system with no packet drops: as the incoming packet rate is increased the probability of not dropping
any packet is reduced. For each implementation, the point of 90% probability is highlighted by a vertical
line.
function of the normalized packet throughput (i.e. the inter-arrival packet rate). In addition to providing
higher throughput than NetThreads, it is apparent that NetTM also provides a more graceful degradation
of packet drops versus throughput, and does so for Classifier and NAT as well.
When a conflict between two transactions is detected there are a number of ways to proceed, and the
best way is often dependent on the access patterns of the application—and FPGA-based systems such
as NetTM provide the opportunity to tune the contention management strategy to match the application.
In essence, one transaction must either stall or abort. There are also many options for how long to
stall or when to restart after aborting [13]. Stalling approaches require frequent re-validation of the
read/write sets of the stalled transaction, and can lead to live-lock (if a stalled transaction causes conflicts
with a repeatedly retrying transaction), hence for simplicity we unconditionally abort and restart any
The decision of when to restart must be made carefully, and we studied several options as shown in
Figure 8.5 with the goals of (i) requiring minimal hardware support, (ii) deciding locally (per-processor,
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 102
Figure 8.7: Throughput improvement relative to locks-only (NetThreads) for flow-affinity scheduling
and TM (NetTM). In its current form, UDHCP is unable to exploit flow-affinity scheduling, which
explains the missing bars in the graph.
rather than centrally), and (iii) guaranteeing forward progress. On a commit event, CM1 and CM2 allow
only one or two transactions to restart respectively (other transactions must await a subsequent commit).
CM1/2 adaptively restarts up to two transactions when they are the only aborted transactions in the
system, or otherwise restarts only one transaction when there are more than two aborted transactions in
the system. CM4 makes all aborted transactions await some transaction to commit before all restarting
simultaneously.
As shown in Figure 8.5, UDHCP significantly benefits from the CM1 contention manager: UDHCP has
writer transactions that conflict with multiple reader transactions, and CM1 minimizes repeated aborts in
that scenario. In contrast, Classifier has more independent transactions and prefers a greater number
of transactions restarted concurrently (CM4). NAT shows a slight preference for CM2. In summary, in
Figure 8.5, once we have tuned contention management we can achieve speedups of 57% (CM4), 6%
A way to avoid lock contention that is potentially simpler than TM is attempting to schedule packets
from the same flow (i.e., that are likely to contend) to the same hardware context (thread context or
CPU). Such an affinity-based scheduling strategy could potentially lower or eliminate the possibility of
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 103
lock contention which forces critical sections to be executed serially and threads to wait on each other.
We implement flow-affinity scheduling by modifying the source code of our applications such that, after
receiving a packet, a thread can either process the packet directly or enqueue (in software with a lock)
the packet for processing by other threads. In Figure 8.7 we evaluate two forms of flow-affinity, where
packets are either mapped to a specific one of the two CPUs (CPU-Affinity), or to a specific one of
Flow-affinity is determined for NAT and Classifier by hashing the IP header fields, and for
Intruder2 by extracting the flow identifier from the packet payload. We cannot evaluate flow-
affinity for UDHCP because we did not find a clear identifier for flows that would result in parallel
packet processing, since UDHCP has many inter-related critical sections (as shown in Figure 8.1).
To implement a reduced lock contention for the flow-affinity approaches we (i) replicated shared
data structures when necessary, in particular hash-tables in NAT and Classifier and (ii) modified
the synchronization code such that each CPU operates on a separate subset of the mutexes for
Figure 8.7 shows that flow-affinity scheduling only improves Classifier. NAT shows a slight
improvement for CPU-based affinity scheduling and otherwise NAT and Intruder2 suffer slowdowns
due to load-imbalance: the downside of flow-affinity scheduling is that it reduces flexibility in mapping
packets to threads, and hence can result in load-imbalance. The slowdown due to load-imbalance is less
pronounced for CPU-Affinity because the packet workload is still shared among the threads of each
CPU. Overall, Figure 8.7 shows that TM outperforms the best performing flow-affinity approach by 4%
for NAT and 31% for Classifier, while requiring no special code modification.
Other than the true data dependences in Figure 8.1, we are motivated to verify if applications are
not serialized because of the round-robin assignment of mutexes to a larger number of shared data
structures, i.e. two independent flows that have been assigned the same mutex identifier. While there are
NetThreads, like FPGA multicores made out of off-the-shelf NIOS-II or Microblaze processors, does
not yet support them. Figure 8.8 shows that an unlimited number of mutexes improves the performance
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 104
Figure 8.8: Simulated normalized throughput resulting from the use of an unlimited number of mutexes,
relative to the existing 16 mutexes.
by less than 2%, except for Classifier (12%). While this is a motivation to implement mechanisms to
8.8 Summary
In this chapter we describe, implement, and evaluate four threaded, stateful, control-flow-intensive
networking applications that share memory and synchronize, a real implementation of an HTM system
called NetTM on the NetFPGA [91] platform, and compare with a two-processor, eight-threaded base
system that implements only conventional lock-based synchronization (called NetThreads [81]). We
describe the selection of TM features that match the strengths of FPGAs, including an eager version
management mechanism that allows for longer transactions with more writes and can support both
Also, we show that an eager TM can be integrated into a multithreaded processor pipeline without
On a NetFPGA board, we measure that NetTM outperforms flow-affinity scheduling, but that
NetTM could be extended to exploit a flow affinity approach, assuming the code has such affinity and
the programmer is able and willing to replicate and provide scheduling for global data structures as
necessary. All in all, we have demonstrated that transactional memory (TM) provides the best overall
performance, by exploiting the parallelism available in the processing of packets from independent
CHAPTER 8. NETTM: IMPROVING NETTHREADS WITH HTM 105
flows, while allowing the most flexible packet scheduling and hence load balancing. Our NetTM
implementation makes synchronization (i) easier, by allowing more coarse-grained critical sections and
eliminating deadlock errors, and (ii) faster, by exploiting the optimistic parallelism available in many
concurrent critical sections. For multithreaded applications that share and synchronize, we demonstrate
that NetTM can improve throughput by 6%, 55%, and 57% over our NetThreads locks-only system,
although TM is inappropriate for one application due to short transactions that frequently conflict. As
an extension to this study, in Appendix B, we show that NetTM can scale up to 8 4-threaded cores, and
that NetTM performs best for coarse-grained critical sections with low conflict intensity.
Chapter 9
Conclusions
As the capacity of FPGAs continues to grow, they are increasingly used by embedded system designers
to implement complex digital systems, especially in telecommunication equipment, the core market
of FPGAs. Since software is the method of choice for programming applications with elaborate and
a possible solution, one can already license off-the-shelf soft processors with the caveat that they
are geared towards running sequential control tasks rather then throughput-oriented applications. By
improving soft processors, designers will be able to instantiate a number of them for efficiently executing
In this thesis, we make soft processors more amenable to exploit the parallelism inherent in packet
processing applications by (ii) increasing the area efficiency of the cores; (ii) describing the design space
of multiprocessor architecture; and (iii) providing techniques to manage the synchronization overheads,
in particular using transactional memory. We show that multithreaded multicore organizations can
significantly improve the throughput of single-threaded soft processors and that transactional memory
can be used to program multiple threads while avoiding the difficulties traditionally associated with
This thesis led to the publication of a number of papers [75–78, 80, 83, 84] and the public release
of NetThreads [81], NetThreads-RE [82] and NetTM [79] along with a their documentation, which
were welcomed by the NetFPGA networking research community. Our experimental approach consists
of collaborating with such experts in the networking area to do full-scale studies with real hardware
106
C HAPTER 9. C ONCLUSIONS 107
and packets, as opposed to focusing on micro-benchmarks with simulated inputs and outputs. Our
NetThreads infrastructure was used as the enabling technology by non-FPGA adepts to create a precise
packet generator [127] that was extended to operate in closed-loop, a system that is also publicly
released [44]. Finally, our infrastructure was used to build a flexible system for low-latency processing
of stock quotes in algorithmic financial trading [126]. We consider our system architecture successful
if it can leverage the configurable fabric of FPGAs to: (i) provide options for application-specific area
efficiency tuning; (ii) use knowledge about the application behavior to optimize the synchronization
wait times; and (iii) scale the performance of a given (fixed) program to a larger number of threads. To
9.1 Contributions
1. Design Space Exploration of Multithreaded and Multicore Soft Systems To demonstrate that
parallel software threads can execute efficiently on FPGAs, we show that for 3, 5, and 7-stage
pipelined processors, careful tuning of multithreading can improve overall instructions per cycle
by 24%, 45%, and 104% respectively, and area-efficiency by 33%, 77%, and 106%. When
considering a system with off-chip memory, the block RAM requirements of a multithreaded
system grows much faster per core when each thread requires its own cache space and register
file. Therefore we find that single-threaded cores are able to more fully utilize the area of an
FPGA and deliver the best performance when the number of block RAMs is the limiting factor.
We show that our processor designs can span a broad area versus performance design space,
we demonstrate a technique to handle stalls in the multithreaded pipeline that is able to deliver
if the single-threaded core has a larger cache size. We point to a number of designs that can be
of interest to a designer when he must trade off area, performance and frequency metrics in a
full-system implementation.
Because of their added performance and compact size in a shared memory setting, we elect
C HAPTER 9. C ONCLUSIONS 108
multithreaded cores for packet processing. In their most basic form, multithreaded cores are
limited to executing instructions from all of their thread contexts in round-robin. Imposing this
limitation is what allows baseline multithreaded cores to have no hazard detection and therefore
an improved area and clock frequency. We find that in real applications, synchronization across
the threads leads to phases where some threads must wait on mutexes for extended periods of time.
As this phenomenon was measured to be the leading source of contention, we present a technique
that allows to suspend an arbitrary number threads, and recuperate their pipeline slots without
re-introducing runtime hazard detection in the pipeline. This technique specifically exploits the
multiple-of-9bits width of block RAMs. Using a real FPGA-based network interface, we measure
packet throughput improvements of 63%, 31% and 41% for our three applications because of the
tional Memory for Packet Processing Because there are many programming models available
for packet processing, we evaluate a number of them with our soft core platform to determine
if our platform supports the most desirable one. We demonstrate that many complex packet
processing applications written in a high-level language, especially those that are stateful, are
unsuitable to pipelining. We also show that scheduling packets to different clusters of threads can
improve throughput at the expense of code changes that are only possible when such an affinity of
packets to threads is possible. We demonstrate that transactional memory (TM) provides the best
overall performance in most cases, by exploiting the parallelism available in the processing of
packets from independent flows, while allowing the most flexible packet scheduling and hence
load balancing. Our implementation of transactional memory also supports locking and can
4. The First Hardware Transactional Memory Design Integrated with Soft Processor Cores
Our NetTM implementation makes synchronization (i) easier, by allowing more coarse-grained
critical sections and eliminating deadlock errors, and (ii) faster, by exploiting the optimistic
parallelism available in many concurrent critical sections. Integrating TM directly with the
processor cores makes the processor architectural changes seamless for the programmer and allow
C HAPTER 9. C ONCLUSIONS 109
for efficient checkpoint, rollback and logging mechanisms. For multithreaded applications that
share and synchronize, we demonstrated that a 2-core system with transactional memory can
improve throughput by 6%, 55%, and 57% over a similar system with locks only. We also studied
aspects that can hinder the benefits of speculative execution and identified that TM is inappropriate
lization and area efficient soft cores for packet processing through releases [79, 81, 82] which
have been downloaded 504 times at the time of this writing. These infrastructures can reach close
to full utilization of a 1Gbps link for light workloads [127] (see Section 6.3.2)1 , have been used to
demonstrate real uses of software packet processing and are also an enabling platform for future
research (Section 6.4). Our released FPGA implementation also has a simulator counterpart [82]
that can be used to perform extensive application-level debugging and performance tuning.
In this section, we present three avenues to improve on our implementation to further increase its
performance.
1. Custom Accelerators Because soft processors do not have the high operating frequency of ASIC
processors, it is useful for some applications to summarize a block of instructions into a single
custom instruction [125]. The processor can interpret that new instruction as a call to a custom
logic block (potentially written in a hardware description language or obtained through behavioral
synthesis). Hardware accelerators are common in network processing ASICs (e.g. NFP-32xx,
Octeon and Advanced PayloadPlus ): in the context of FPGAs, accelerators will require special
area constraints, threads may have to share accelerators, pipeline them and/or make them multi-
purpose to assist in different operations in the packet processing. We envision that this added
hardware could also be treated like another processor on chip, with access to the shared memory
1 The actual bandwidth attained is a function of the amount of computation per packet.
C HAPTER 9. C ONCLUSIONS 110
busses and able to synchronize with other processors. Because of the bit-level parallelism of
FPGAs, custom instructions can provide significant speedup to some code sections [64, 95].
2. Maximizing the Use of On-Chip Memory Bandwidth. Our research on packet processing so
far has focused on a particular network card with fast network links, but containing an FPGA
that is a few generations old [81, 82]. While we achieved acceptable performance with only
two processors, another of our studies has shown that more recent FPGAs could allow scaling
to a much larger number of processors and computing threads (Appendix B). In a newer
FPGA, memory contention would gradually become a bottleneck as a designer would connect
more processors to the shared data cache of our current architecture. While an eager version
management scheme scales to multiple data caches [158], a previous implementation of coherent
data caches on an FPGA [151] increases considerably the access latency to the data cache, e.g. 18
cycles pure coherence overhead per read miss. Our intuition is thus that a shared data cache
performs better for a small number of processors and an interesting future work consists of
investigating alternatives in scaling to more processors (possibly with an efficient and coherent
system of caches) should focus specifically to match the strengths and weaknesses of FPGAs.
3. Compiler optimizations Despite using a number of compiler techniques to improve the area-
efficiency of our cores (Section 4.1), we have not tuned gcc strictly to deliver more performance.
There are many possible automatic compiler optimizations that could, e.g.: (i) schedule
instructions based on our pipeline depth and memory latency; (ii) schedule instruction and
organize data layout based on the interaction of threads inside a single multi-threaded core; (iii)
maximize the benefits of transactional memory by hoisting conflicting memory accesses closer to
the beginning of a transaction, optimizing the data layout to ease conflict detection or provide a
method for feedback-directed selective parallelization of the code that is least prone to conflict.
Appendix A
increasing number of processor and accelerator cores, supporting sharing and synchronization in a
way that is scalable and easy to program becomes a challenge. As we explained in Chapter 8,
Transactional memory (TM) is a potential solution to this problem, and an FPGA-based system provides
the opportunity to support TM in hardware (HTM). Although there are many proposed approaches
to support HTM for ASICs, these do not necessarily map well to FPGAs. In particular, in this
work we demonstrate that while signature-based conflict detection schemes (essentially bit vectors)
should intuitively be a good match to the bit-parallelism of FPGAs, previous schemes result in either
for HTM conflict detection. Using both real and projected FPGA-based soft multiprocessor systems that
support HTM and implement threaded, shared-memory network packet processing applications, relative
to signatures with bit selection, we find that our application-specific approach (i) maintains a reasonable
operating frequency of 125MHz, (ii) has an area overhead of only 5%, and (iii) achieves a 9% to 71%
increase in packet throughput due to reduced false conflicts. We start off by introducing signatures and
111
A PPENDIX A. A PPLICATION -S PECIFIC S IGNATURES FOR T RANSACTIONAL M EMORY 112
In this chapter, we focus on implementing TM for an FPGA-based soft multiprocessor. While TM can
hardware (HTM) with much lower performance overhead than an STM. There are many known methods
for implementing HTM for an ASIC multicore processor, although they do not necessarily map well to
focus specifically on the design of the conflict detection mechanism, and find that an approach based on
signatures [23] is a good match for FPGAs because of the underlying bit-level parallelism. A signature
is essentially a bit-vector [128] that tracks the memory locations accessed by a transaction via hash
indexing. However, since signatures normally have many fewer bits than there are memory locations,
comparing two signatures can potentially indicate costly false-positive conflicts between transactions.
Hence prior HTMs employ relatively large signatures—thousands of bits long—to avoid such false
conflicts. One important goal for our system is to be able to compare signatures and detect conflicts
in a single pipeline stage, otherwise memory accesses would take an increasing number of cycles and
large signatures in the logic-elements of an FPGA can be detrimental to the processor’s operating
frequency. Or, as an equally unattractive alternative, one can implement large and sufficiently fast
signatures using block RAMs but only if the indexing function is trivial—which can itself exacerbate
the resulting false conflicts. We capitalize on the reconfigurable nature of the underlying FPGA and
propose a method for implementing an application-specific signature mechanism that achieves these
goals. An application-specific signature is created by (i) profiling the memory addresses accessed by an
application, (ii) using this information to build and optimize a trie (a tree based on address prefixes) that
allocates more branches to frequently-conflicting address prefixes, and (iii) implementing the trie in a
As described in Chapter 6, our evaluation system is built on the NetFPGA platform [91], comprising
a Virtex II Pro FPGA, 4 × 1GigE MACs, and 200MHz DDR2 SDRAM. On it, we have implemented
a dual-core multiprocessor (the most cores that our current platform can accommodate), composed of
125MHz MIPS-based soft processors, that supports an eager HTM [158] via a shared data cache. We
have programmed our system to implement several threaded, shared-memory network packet processing
We use a cycle-accurate simulator to explore the signature design space, and implement and evaluate
the best schemes in our real dual-core multiprocessor implementation. We focus on single-threaded
processor cores (the updated system diagram can be found in Figure A.2) and, for comparison, we also
report the FPGA synthesis results for a conflict detection unit supporting 4 and 8 threads. Relative to
signatures with bit selection, the only other signature implementation that can maintain a reasonable
operating frequency of 125MHz, we find that our application-specific approach has an area overhead of
only 5%, and achieves a 9% to 71% increase in packet throughput due to reduced false conflicts.
There is an abundance of prior work on TM and HTM. Most prior FPGA implementations of HTM
were intended as fast simulation platforms to study future multicore designs [48, 147], and did not
specifically try to provide a solution tuned for FPGAs. Conflict detection has been previously
implemented by checking extra bits per line in private [48, 147] or shared [63] caches. In contrast
with caches with finite capacity that require complex mechanisms to handle cache line collisions for
speculative data, signatures can represent an unbounded set of addresses and thus do not overflow.
Signatures can be efficiently cleared in a single cycle and therefore advantageously leverage the bit-level
parallelism present in FPGAs. Because previous signature work was geared towards general purpose
processors [119, 128, 159], to the best of our knowledge there is no prior art in customizing signatures
on a per-application basis.
A PPENDIX A. A PPLICATION -S PECIFIC S IGNATURES FOR T RANSACTIONAL M EMORY 114
A TM system must track read and write accesses for each transaction (the read and write sets), hence
an HTM system must track read and write sets for each hardware thread context. The signature method
of tracking read and write sets implements Bloom filters [128], where an accessed memory address is
represented in the signature by asserting the k bits indexed by the results of k distinct hashes of the
address, and a membership test for an address returns true only if all k bits are set. Since false conflicts
can have a significant negative impact on performance, the number and type of hash functions used must
be chosen carefully. In this chapter, we consider only the case where each one of the k hash functions
indexes a different partition of the signature bits—previously shown to be more efficient [128]. The
following reviews the four known hash functions that we consider in this chapter.
Bit Selection [128] This scheme directly indexes a signature bit using a subset of address bits. An
example 2-bit index for address a = [a3 a2 a1 a0 ] could simply be h = [a3 , a2 ]. This is the most simple
scheme (i.e., simple circuitry) and hence is important to consider for an FPGA implementation.
H3 [128] The H3 class of hash functions is designed to provide a uniformly-distributed hashed index
for random addresses. Each bit of the hash result h = [h1 , h0 ] consists of a separate XOR (⊕) tree
determined by the product of an address a = [a3 a2 a1 a0 ] with a fixed random matrix H as in the following
Page-Block XOR (PBX) [159] This technique exploits the irregular use of the memory address space
to produce hash functions with fewer XOR gates. An address is partitioned into two non-overlapping
bit-fields, and selected bits of each field are XOR’ed together with the purpose of XOR’ing high entropy
bits (from the low-order bit-field) with lower entropy bits (from the high order bit-field). Modifying the
previous example, if the address is partitioned into 2 groups of 2 bits, we could produce the following
Locality-sensitive XOR [119] This scheme attempts to reduce hash collisions and hence the probability
of false conflicts by exploiting memory reference spatial locality. The key idea is to make nearby
memory locations share some of their k hash indices to delay filling the signature. This scheme produces
k H3 functions that progressively omit a larger number of least significant bits of the address from the
computation of the k indices. When represented as H3 binary matrices, functions require an increasing
number of lower rows to be null. Our implementation, called LE-PBX, combines this approach with
the reduced XOR’ing of PBX hashing. In LE-PBX, we XOR high-entropy bits with low-entropy bits
within a window of the address, then shift the window towards the most significant (low entropy) bits
All the hashing functions listed in the previous section create a random index that maps to a signature
bit range that is a power of two. In Section A.4, we demonstrate that these functions require too many
bits to be implemented without dramatically slowing down our processor pipelines. To minimize the
hardware resources required, the challenge is to reduce the number of false conflicts per signature bit,
motivating us to more efficiently utilize signature bits by creating application-specific hash functions.
Our approach is based on compact trie hashing [111]. A trie is a tree where each descendant of
a node has in common the prefix of most-significant bits associated with that node. The result of the
hash of an address is the leaf position found in the tree, corresponding to exactly one signature bit.
Because our benchmarks can access up to 64 Mbytes of storage (16 million words), it is not possible to
explicitly represent all possible memory locations as a leaf bit of the trie. The challenge is to minimize
false conflicts by mapping the most contentious memory locations to different signature bits, while
problem [70]. In the first step, we record in our simulator a trace of the read and write sets of a
benchmark functioning at its maximum sustainable packet rate. We organize the collected memory
addresses in a trie in which every leaf represents a signature bit. This signature is initially too large to be
practical (Figure A.1(b)) so we truncate it to an initial trie (Figure A.1(c)), selecting the most frequently
A PPENDIX A. A PPLICATION -S PECIFIC S IGNATURES FOR T RANSACTIONAL M EMORY 116
Combined read and Trie of all addresses Trie with low false positive Resulting address signature
write address sets based on compacted trie
Initial trie
root root
a2 a1 a0
a2=1 a2=0 a2=1 a2=0 sig[0]= a2 & a0;
* 0 0 0
0 1 1 1xx 0xx 1xx 0xx sig[1]= a2 & ~a0;
1 0 0 a1=1 a1=0 a1=1 a1=0 sig[2]= ~a2;
1 0 1
* 1 1 0 11x 10x 01x 00x 11x 00x
* 1 1 1
a0=1 a0=0 a0=1 a0=0
* most frequent 111 110 101 100 011 000 111 110 000
accesses
* * * * * *
(a) (b) (c) (d)
Figure A.1: Example trie-based signature construction for 3-bit addresses. We show (a) a partial address
trace, where * highlights frequently accessed addresses, (b) the full trie of all addresses, (c) the initial
and final trie after expansion and pruning to minimize false positives, and (d) the logic for computing
the signature for a given address (i.e., to be AND’ed with read and write sets to detect a conflict).
synch. unit
single−thread I$ single−thread I$
processor processor
instr.
data
busses
input mem.
output mem.
input data output
packet buffer cache buffer packet
input output
to DDR2 SDRAM
Figure A.2: The architecture of our soft multiprocessor with 2 single-threaded processor cores.
accessed branches. To reduce the hardware logic to map an address to a signature (Figure A.1(d)), only
the bits of the address that lead to a branch in the trie are considered. For our signature scheme to be
safe, an extra signature bit is added when necessary to handle all addresses not encompassed by the hash
function. We then replay the trace of accesses and count false conflicts encountered using our initial
hashing function. We iteratively expand the trie with additional branches and leaves to eliminate the
most frequently occurring false-positive conflicts (Figure A.1(c)). Once the trie is expanded to a desired
false positive rate, we greedily remove signature bits that do not negatively impact the false positive rate
(they are undesirable by-products of the expansion). Finally, to further minimize the number of signature
bits, we combine signature bits that are likely (> 80%) to be set together in non-aborted transactions.
A PPENDIX A. A PPLICATION -S PECIFIC S IGNATURES FOR T RANSACTIONAL M EMORY 117
Transactional memory support The single port from the processors to the shared cache in Figure A.2
implies that memory accesses undergo conflict detection one by one in transactional execution, therefore
a single trie hashing unit suffices for both processors. Our transactional memory processor uses a
shadow register file to revert its state upon rollback (versioning [2] avoids the need for register copy).
Speculative memory-writes trigger a backup of the overwritten value in an undo-buffer [158] that we
over-provision with storage for 2048 values per thread. Each processor has a dedicated connection to a
synchronization unit that triggers the beginning and end of speculative executions when synchronization
is requested in software.
A.4 Results
In this section, we first evaluate the impact of signature scheme and length on false-positive conflicts,
application throughput, and implementation cost. These results guide the implementation and evaluation
of our real system. For reference, we report in Table A.1 the statistics of our benchmarks (Table 3.1)
that specifically pertain to the synchronization (note that some columns do not have the same units as in
Table 6.2, and some numbers reflect code the optimizations explained in Section 8.3).
Resolution of Signature Mechanisms Using a recorded trace of memory accesses obtained from a
cycle-accurate simulation of our TM system that models perfect conflict detection, we can determine
the false-positive conflicts that would result from a given realistic signature implementation. We use
a recorded trace because the false positive rate of a dynamic system cannot be determined without
affecting the course of the benchmark execution: a dynamic system cannot distinguish a false-positive
conflict from a later true conflict that would have happened in the same transaction, if it was not aborted
immediately. We compute the false positive rate as the number of false conflicts divided by the total
30 45
BitSel BitSel
H3 40 H3
25 PBX PBX
LE−PBX 35 LE−PBX
False Positive Rate (%)
25
15
20
10 15
10
5
5
0 0 1 2 3 4 5
0 0 1 2 3 4 5
10 10 10 10 10 10 10 10 10 10 10 10
Signature Bit Length (bits) Signature Bit Length (bits)
25 12
BitSel BitSel
H3 H3
PBX 10 PBX
20 LE−PBX LE−PBX
False Positive Rate (%)
trie
False Positive Rate (%) trie
8
15
10
4
5
2
0 0 1 2 3 4 5
0 0 1 2 3 4 5
10 10 10 10 10 10 10 10 10 10 10 10
Signature Bit Length (bits) Signature Bit Length (bits)
Figure A.3: False positive rate vs signature bit length. Trie-based signatures were extended in length up
to the length that provides zero false positives on the training set.
The signatures that we study are configured as follows. The bit selection scheme selects the least
significant word-aligned address bits, to capture the most entropy. For H3, PBX and LE-PBX, we
found that increasing the number of hash functions caused a slight increase in the false positive rate for
short signatures, but helped reduce the number of signature bits required to completely eliminate false
positives. We empirically found that using four hash functions is a good trade-off between accuracy and
complexity, and hence we do so for all results reported. To train our trie-based hash functions, we use a
Figure A.3 shows the false positive rate for different hash functions (bit selection, H3, PBX, LE-
A PPENDIX A. A PPLICATION -S PECIFIC S IGNATURES FOR T RANSACTIONAL M EMORY 119
200 6000
2T 2T
4T 4T
180
8T 5000 8T
160
Frequency (MHz)
140
3000
120
2000
100
1000
80
60 0
0 50 100 150 200 250 0 50 100 150 200 250
Signature Bit Length (bits) Signature Bit Length (bits)
Figure A.4: Impact of increasing the bit length of trie-based signatures on (a) frequency and (b) LUT
usage of the conflict detection unit for 2, 4, and 8-thread (2T,4T,8T) systems. The results for H3, PBX
and LE-PBX are similar. In (a) we highlight the system operating frequency of 125MHz.
PBX and trie-based) as signature bit length varies. The false positive rate generally decreases with
longer signatures because of the reduced number of collisions on any single signature bit—although
small fluctuations are possible due to the randomness of the memory accesses. Our results show that
LE-PBX has a slightly lower false positive rate than H3 and PBX for an equal number of signature bits.
Bit selection generally requires a larger number of signature bits to achieve a low false positive rate,
except for UDHCP for which most of the memory accesses point to consecutive statically allocated
data. Overall, the trie scheme outperforms the others for C LASSIFIER, NAT and UDHCP by achieving
close to zero false positive rate with less than 100 bits, in contrast with several thousand bits. For
I NTRUDER, the non-trie schemes have a better resolution for signatures longer than 100 bits due to
the relatively large amount of dynamic memory used, which makes memory accesses more random.
has an entropy 1.7 times higher on average than the other benchmarks, thus explaining the difficulty in
Implementation of a Signature Mechanism Figure A.4 shows the results of implementing a signature-
based conflict detection unit using solely the LUTs in the FPGA for a processor system with 2 threads
A PPENDIX A. A PPLICATION -S PECIFIC S IGNATURES FOR T RANSACTIONAL M EMORY 120
like the one we implemented (2T) and for hypothetical transactional systems with 4 and 8 threads (4T
and 8T). While the plot was made for a trie-based hashing function, we found that H3, PBX and LE-
PBX produced similar results. As we will explain later, the bit selection scheme is better suited to
a RAM-based implementation. In Figure A.4(a) we observe that the CAD tools make an extra effort
to meet our 125 MHz required operating frequency by barely achieving it for many designs. In a
2-thread system, two signatures up to 200 bits will meet our 125MHz timing requirement while a 4-
thread system can only accommodate four signatures up to 100 bits long. For 8-threads, the maximum
number of signature bits allowed at 125MHz is reduced to 50 bits. Figure A.4(b) shows that the area
requirements grow linearly with the number of bits per signature. In practice, for 2-threads at 200 bits,
signatures require a considerable amount of resources: approximately 10% of the LUT usage of the
total non-transactional system. When the conflict detection unit is incorporated into the system, we
found that its area requirements—by putting more pressure on the routing interconnect of the FPGA—
lowered the maximum number of bits allowable to less than 100 bits for our 2-thread system (Table A.2).
Re-examining Figure A.3, we can see that the trie-based hashing function delivers significantly better
performance across all the hashing schemes proposed for less than 100 signature bits.
signature bit corresponding to a line in a block RAM. On that line, we store the corresponding read
and write signature bit for each thread. To preserve the 125MHz clock rate and our single-cycle conflict
detection latency, we found that we could only use one block RAM and that we could only use bit
selection to index the block RAM—other hashing schemes could only implement one hash function
with one block RAM and would perform worse than bit selection in that configuration. Because the
data written is only available on the next clock cycle in a block RAM, we enforce stalls upon read-
after-write hazards. Also, to emulate a single-cycle clear operation, we version the read and write sets
with a 2-bit counter that is incremented on commit or rollback to distinguish between transactions. If a
signature bit remains untouched and therefore preserves its version until a transaction with an aliasing
version accesses it (the version wraps over a 2-bit counter), the bit will appear to be set for the current
transaction and may lead to more false positives. The version bits are stored on the same block RAM
line as their associated signature bits, thus limiting the depth of our 16Kb block RAM to 2048 entries (8-
bits wide). Consequently, our best bit selection implementation uses a 11 bit-select of the word-aligned
A PPENDIX A. A PPLICATION -S PECIFIC S IGNATURES FOR T RANSACTIONAL M EMORY 121
1.05 1.05
1 1
Normalized throughput
UDHCP
0.9 0.9
Classifier
0.85 0.85 NAT
0.8 0.8
0.75 0.75
0.65
UDHCP 0.65
Classifier
0.6 0.6
NAT
0.55 0.55
0 20 40 60 80 100 120 140 160 180 200 0 20 40 60 80 100 120 140 160 180 200
Signature Bit Length (bits) Signature Bit Length (bits)
Figure A.5: Throughput with signatures using trie-based hashing with varying signature sizes
normalized to the throughput of an ideal system with perfect conflict detection (obtained using our
cycle-accurate simulator).
Impact of False Positives on Performance Figure A.5 shows the impact on performance in a full-
system simulation of a varying signature length, when using either a trie-based hashing function or
LE-PBX, the scheme with the second-lowest false positive rate. The jitter in the curves is again
explained by the unpredictable rollback penalty and rate of occurrence of the false positives, varying
the amount contention on the system. Overall, we can see that signatures have a dramatic impact on
system throughput, except for I NTRUDER for which the false positive rate varies little for this signature
size range (Figure A.3(d)). We observe that for C LASSIFIER, UDHCP and NAT, although they achieve
a small false positive rate with 10 bits on a static trace of transactional accesses (Figure A.3), their
performance increases significantly with longer signatures. We found that our zero-packet drop policy
to determine the maximum throughput of our benchmarks is very sensitive to the compute-latency of
packets since even a small burst of aborts and retries for a particular transaction directly impacts the size
of the input queue which in turn determines packet drops. The performance of NAT plateaus at 161
bits because that is the design that achieves zero false positives in training (Figure A.3(b)). As expected,
Figure A.5(b) shows that there is almost no scaling of performance for LE-PBX in the possible signature
implementation size range because the false positive rate is very high.
A PPENDIX A. A PPLICATION -S PECIFIC S IGNATURES FOR T RANSACTIONAL M EMORY 122
Table A.2: Size, LUT usage, LUT overhead and throughput gain of our real system with the best
application-specific trie-based hash functions over bit selection.
Measured Performance on the Real System As shown in Table A.2 and contrarily to the other
schemes presented, the size of the trie-based signatures can be adjusted to an arbitrary number of bits
to maximize the use of the FPGA fabric while respecting our operating frequency. The maximum
signature size is noticeably smaller for NAT because more address bits are tested to set signature bits,
which requires more levels of logic and reduces the clock speed. In all cases the conflict detection with a
customized signature outperforms the general purpose bit selection. This is coherent with the improved
false positive rate observed in Figure A.3. We can see that bit selection has the best performance
when the data accesses are very regular like in UDHCP, as indicated by the low false positive rate in
Figure A.3(c). Trie-based hashing improves the performance of I NTRUDER the most because the bit
CAD Results Comparing two-processor full system hardware designs, the system with trie-based
conflict detection implemented in LUTs consumes 161 block RAMs and the application-specific LUT
usage reported in Table A.2. Block-RAM-based bit selection requires one additional block RAM (out
of 232, i.e., 69% of the total capacity) and consumes 19546 LUTs (out of 47232, i.e. 41% of the total
capacity). Since both kinds of designs are limited by the operating frequency, trie-based hashing only
has an area overhead of 4.5% on average (Table A.2). Hence the overall overhead costs of our proposed
conflict detection scheme are low and enable significant throughput improvements.
A.5 Summary
In this chapter, we describe the first soft processor cores integrated with transactional memory, and
evaluated on a real (and simulated) system with threaded applications that share memory. We study
A PPENDIX A. A PPLICATION -S PECIFIC S IGNATURES FOR T RANSACTIONAL M EMORY 123
several previously-proposed signature-based conflict detection schemes for TM and demonstrate that
previous signature schemes result in implementations with either multicycle stalls, or unacceptable
operating frequencies or false conflict rates. Among the schemes proposed in the literature, we find that
bit selection provides the best implementation that avoids degrading the operating frequency or stalling
the processors for multiple cycles. Improving on this technique, we present a method for implementing
more efficient signatures by customizing them to match the access patterns of an application. Our
scheme builds on trie-based hashing, and minimizes the number of false conflicts detected, improving
the ability of the system to exploit parallelism. We demonstrate that application-specific signatures
can allow conflict detection at acceptable operating frequencies (125MHz), single cycle operation,
and improved false conflict rates—resulting in significant performance improvements over alternative
12%, 58%, 9% and 71% for four applications, demonstrating that application-specific signatures are
a compelling means to facilitate conflict detection for FPGA-based TM systems. With this powerful
NetFPGA and its VirtexII-Pro FPGA limit the scaling of NetTM to two 4-threaded cores (the most cores
that our current platform can accommodate). To evaluate the performance and resource usage of larger-
scale designs, we target a newer mid-range Virtex5 FPGA (XC5VLX110-2). However, we do not yet
have an full system board and hence cannot yet consider the entire NetTM design: hence we elide the
support for network and PCI access, and measure only the multiprocessor system. Additional processors
are added to the NetTM architecture (Figure 6.1) by simply widening the interconnect arbiters and
linearly scaling the load-store queue, the signature tables, and the undo-log. We assume the default
CM4 contention manager (Section 8.7.3), and include the application-specific signature hash function
Figure B.1 shows the resource usage of transactional and non-transactional multiprocessors with 2, 4
and 8 cores, when synthesized with high effort for speed and with a constraint of 125MHz for the core
frequency. The ∼3100 LUT difference between the TM2P and NT2P systems corresponds roughly to
our earlier results in Section 8.7.1, accounting for the change in LUT architecture between the VirtexII-
Pro and Virtex5 FPGAs. We can see that the TM processors consume approximately the area of a
system with twice the number of non-transactional cores: additional LUTs are associated with the extra
pipeline stage per core, conflict detection, the undo-log and pipelining in the interconnect between the
cores and the cache. The BRAMs also scale to almost entirely fill the Virtex 5 processor with 8 TM
124
A PPENDIX B. S CALING N ET TM TO 8 CORES 125
Figure B.1: CAD metrics. Processors are named as follows: NT for non-transactional or TM for
transactional, followed by the number of 4-threaded cores.
1: for 1...N do
2: for 1...IT ER NON T M do
3: work();
4: end for
5: acquire global lock();
6: critical access at(rand()% RANGE);
7: for 1..IT ER T M do
8: work();
9: end for
10: release global lock();
11: end for
cores. However, note that only a small portion of the capacity of the BRAMs that implement the 16 TM
log filters is utilized. Lastly, we can see that both 8-core systems fail to meet the 125MHz constraint and
have roughly the same achieved frequency: in both cases, the multiplexing in the interconnect becomes
significant. However, 125MHz could be achieved by adjusting the architecture to further pipeline cache
B.2 Performance
Our evaluation so far has focused on realistic workloads for an FPGA embedded in a network card.
Since systems with more than two cores cannot be synthesized on the NetFPGA, in this section we
explore the performance of these larger systems via cycle-accurate RTL simulation. Since our real
A PPENDIX B. S CALING N ET TM TO 8 CORES 126
Figure B.2: Left: speedup of TM over non-TM (NT) systems with an equal number of cores (2, 4 or 8)
as the ratio of critical work is increased. Middle and right: average number of active contexts for the
corresponding processors and application parameters.
applications require a full system including off-chip memory and host interaction, we instead focus on
the performance of a parametric application (Algorithm 1) that allows us to vary the relative size of its
critical section. For this study we measure N iterations of the benchmark, where N is sufficiently large
to minimize the impact of warm-up and tear-down activity. Each thread executes the same code (except
for the rand() seeding) and is de-scheduled after completing the execution1 . The work() routine only
performs computation and stack-based memory accesses that contribute to filling the TM undo-log.
Each critical section contains only one memory access that might conflict, at a random location within
a parametric range. Each potential location corresponds to one distinct signature bit and by default we
use a range of 1024 locations. To minimize the impact of cache contention as we scale the number of
processors, we designed the work() routine such that the shared-cache bus utilization does not exceed
Varying the fraction of critical code Figure B.2 shows the impact of varying the relative size of critical
ITER TM
sections compared to the non-critical code ( ITER NON TM in Algorithm 1). We observe that the speedup of
TM increases with the ratio of critical code. The speedups can be attributed to the average number of
active contexts (i.e. not de-scheduled because of pending on a lock or pending on a transaction restart)
1 This is ensured by forcing each hardware context into a deadlock, which is possible because our TM architecture supports
traditional locks.
A PPENDIX B. S CALING N ET TM TO 8 CORES 127
in each processor measured at steady-state, as illustrated in the two right-side graphs in Figure B.2.
ITER TM
With mostly non-critical code ( ITER NON TM ≪ 1), almost all the available contexts are active, resulting
in a speedup of 1 for TM processors compared to their non-TM counterpart. With mostly critical code
ITER TM
( ITER NON TM ≫ 1), all non-TM processors can only keep 1 thread context active. With multithreading,
a core can retire 1 instruction per cycle (IPC = 1, other than when there is a cache miss); with a single
thread to schedule, a processor core experiences pipeline hazards and can only retire ∼0.5 instructions
every cycle for this application. This result explains why the fully utilized TM2P has a speedup of
4.05 over NT2P, and not 2 (i.e. 2 TM cores with IPC of ∼1 each, versus 1 NT context with IPC of ∼0.5).
Because the 8-cores TM processor experiences a reduction from 32 to 14.6 active contexts due to aborts,
it too experiences pipeline hazards and the speedup is 8.0 (i.e. 8 TM cores with IPC of ∼0.5 versus 1 NT
Comparing Similar-Area Designs From Figure B.1, we observe that TM support costs roughly 2×
LUTs compared to the corresponding non-TM design. This implies that for roughly equivalent area, a
designer could choose between TM2P and NT4P, or between TM4P and NT8P—hence we are motivated to
compare the performance of these designs. From Figure B.2, we observe that for small critical sections,
TM2P and TM4P would each be half the performance of their respective area counterpart (NT4P and
NT8P). However, for large critical sections, TM2P would be 4.05× faster than NT4P, and TM4P would be
5.4× faster than NT8P. Large critical sections are inherently easier to program, hence TM provides a
Varying the static conflict intensity In Figure B.3, we control the transactional aborts by varying the
range of the random accesses, where each location points to one signature bit. The probability of a
conflict when all the contexts are active and have their signature loaded with a potentially conflicting
range!
access is given by: 1 − (range−#contexts)!∗range#contexts . The probability thus increases with the number of
contexts and decreases with a larger range. An 8-core system with 32 contexts is therefore the most
impacted system, and the conflict probability is 1 with a memory access range of 8 and 0.39 with a
range of 1024 (the default range in Figure B.2). For a 2-core system, those probabilities become 1.00
and 0.27. We respectively label these settings as high-conflict and low-conflict intensity in Figure B.3.
With mostly critical code, the 8-core TM system experiences a speedup reduction from 8.0 to 5.1,
A PPENDIX B. S CALING N ET TM TO 8 CORES 128
Figure B.3: Left: speedup of TM over non-TM (NT) processors in high (Hi-C) and low (Lo-C) conflict
intensity settings, as the ratio of critical work is increased. Middle: average number of aborts per
transaction-instance. Right: average number of active contexts. For reference, the right-most graphs
show the appropriate NT systems, although NT processors are not affected by conflict intensity.
which can be explained by an increased average number of aborts per transaction from 12.7 to 14.9,
and a reduction in the average number of active contexts from 14.6 to 8.0: still significantly better than
the 1 active context for NT processors, which are not visibly affected by the conflicting range. The
2 and 4-cores systems experience similar but more moderate slowdowns under high-conflict intensity.
Interestingly, all systems experience a reduction in the abort rate when the code is mostly transactional.
This can be attributed to contention manager de-scheduling aborted threads for a longer period of time,
preventing them from repeating aborts. This reduces the potential parallelism and the 4-core system
even experiences a speedup reduction when the ratio of critical code is increased.
129
Bibliography
[1] 1-CORE T ECHNOLOGIES . Soft CPU cores for FPGA. http://www.1-core.com, 2010.
[2] A ASARAAI , K., AND M OSHOVOS , A. Towards a viable out-of-order soft core: Copy-free,
checkpointed register renaming. In Proc. of FPL (2009).
[3] A LBRECHT, C., D ORING , A., P ENCZEK , F., S CHNEIDER , T., AND S CHULZ , H. Impact of
coprocessors on a multithreaded processor design using prioritized threads. In 14th Euromicro
International Conference on Parallel, Distributed, and Network-Based Processing (Germany,
February 2006).
[4] A LBRECHT, C., F OAG , J., KOCH , R., AND M AEHLE , E. DynaCORE : A dynamically
reconfigurable coprocessor architecture for network processors. In 14th Euromicro International
Conference on Parallel, Distributed, and Network-Based Processing (Germany, Frebruary 2006),
pp. 101–108.
[5] A LLEN , J., BASS , B., BASSO , C., B OIVIE , R., C ALVIGNAC , J., DAVIS , G., F RELECHOUX ,
L., H EDDES , M., H ERKERSDORF, A., K IND , A., L OGAN , J., P EYRAVIAN , M., R INALDI , M.,
S ABHIKHI , R., S IEGEL , M., AND WALDVOGEL , M. PowerNP network processor hardware
software and applications. IBM Journal of Research and Development 47, 2 (2003).
[6] A LMEROTH , K. The evolution of multicast: from the MBone to interdomain multicast to
Internet2 deployment. Network, IEEE 14, 1 (Jan/Feb 2000), 10 –20.
[7] A LTERA C ORP. Nios II Processor Reference Handbook v7.2. http://www.altera.com, 2007.
[11] A LTERA C ORPORATION. Altera Corportation 2009 annual report (form 10-k).
http://investor.altera.com, February 2010.
[12] A NANIAN , C. S., A SANOVIC , K., K USZMAUL , B. C., L EISERSON , C. E., AND L IE , S.
Unbounded transactional memory. In Proc. of the 11th International Symposium on High-
Performance Computer Architecture (HPCA) (Washington, DC, USA, 2005), IEEE Computer
Society, pp. 316–327.
130
B IBLIOGRAPHY 131
[13] A NSARI , M., KOTSELIDIS , C., L UJ ÁN , M., K IRKHAM , C., AND WATSON , I. On the
performance of contention managers for complex transactional memory benchmarks. In Proc.
of the 8th International Symposium on Parallel and Distributed Computing (ISPDC) (July 2009).
[15] BAKER , Z., AND P RASANNA , V. A methodology for synthesis of efficient intrusion detection
systems on FPGAs. In 12th Annual IEEE Symposium on Field-Programmable Custom
Computing Machines (April 2004), pp. 135 – 144.
[16] BALL , J. The Nios II Family of Configurable Soft-Core Processors. In Proc. of Hot Chips (CA,
USA, August 2005), Altera.
[17] B EHESHTI , N., G ANJALI , Y., G HOBADI , M., M C K EOWN , N., AND S ALMON , G. Experimen-
tal study of router buffer sizing. In Proc. of the 8th ACM SIGCOMM conference on Internet
measurement (IMC) (New York, NY, USA, 2008), ACM, pp. 197–210.
[18] B OBBA , J., M OORE , K. E., VOLOS , H., Y EN , L., H ILL , M. D., S WIFT, M. M., AND W OOD ,
D. A. Performance pathologies in hardware transactional memory. SIGARCH Comput. Archit.
News 35, 2 (2007), 81–91.
[19] B ONNY, T., AND H ENKEL , J. Instruction re-encoding facilitating dense embedded code. In
Proc. of DATE ’08 (2008), pp. 770–775.
[21] C AMPI , F., C ANEGALLO , R., AND G UERRIERI , R. IP-reusable 32-bit VLIW RISC core. In
Proc. of the 27th European Solid-State Circuits Conf (2001), pp. 456–459.
[22] C AO M INH , C., C HUNG , J., KOZYRAKIS , C., AND O LUKOTUN , K. STAMP: Stanford
transactional applications for multi-processing. In Proc. of The IEEE International Symposium
on Workload Characterization (IISWC) (Sept. 2008).
[23] C EZE , L., T UCK , J., T ORRELLAS , J., AND C ASCAVAL , C. Bulk disambiguation of speculative
threads in multiprocessors. In Proc. of the 33rd annual international symposium on Computer
Architecture’06 (Washington, DC, USA, 2006), IEEE Computer Society, pp. 227–238.
[25] C ISCO S YSTEMS. The Cisco QuantumFlow processor: Cisco’s next generation network
processor. http://www.cisco.com/en/US/prod/collateral/routers/ps9343/solution overview c22-
448936.html, 2008.
[26] C LARK , D. Marvell bets on ’plug computers’. Wall Street Journal February 23 (2009).
[28] C OOPERATIVE A SSOCIATION FOR I NTERNET DATA A NALYSIS. A day in the life of the Internet.
WIDE-TRANSIT link, Jan. 2007.
B IBLIOGRAPHY 132
[29] C ORPORATION , A. M. C. nP7510 –10 Gbps Network Processor (OC-192c / 4xOC-48 / 10GE /
10x1GE), 2002.
[30] C RISTEA , M.-L., DE B RUIJN , W., AND B OS , H. FPL-3: Towards language support for
distributed packet processing. In NETWORKING (2005), pp. 743–755.
[31] C ROWLEY, P., AND BAER , J.-L. A modeling framework for network processor systems.
Network Processor Design : Issues and Practices 1 (2002).
[32] C ROWLEY, P., F IUCZYNSKI , M., BAER , J.-L., AND B ERSHAD , B. Characterizing processor
architectures for programmable network interfaces. Proc. of the 2000 International Conference
on Supercomputing (May 2000).
[33] DAI , J., H UANG , B., L I , L., AND H ARRISON , L. Automatically partitioning packet processing
applications for pipelined architectures. SIGPLAN Not. 40, 6 (2005), 237–248.
[34] D ICE , D., L EV, Y., M OIR , M., AND N USSBAUM , D. Early experience with a commercial
hardware transactional memory implementation. In Proc. of the 14th international conference on
Architectural support for programming languages and operating systems (ASPLOS) (New York,
NY, USA, 2009), ACM, pp. 157–168.
[36] D ITTMANN , G., AND H ERKERSDORF, A. Network processor load balancing for high-
speed links. In International Symposium on Performance Evaluation of Computer and
Telecommunication Systems (SPECTS) (San Diego, California, July 2002), pp. pp. 727–735.
[37] D OBRESCU , M., E GI , N., A RGYRAKI , K., C HUN , B.-G., FALL , K., I ANNACCONE , G.,
K NIES , A., M ANESH , M., AND R ATNASAMY, S. Routebricks: exploiting parallelism to scale
software routers. In Proc. of the ACM SIGOPS 22nd symposium on Operating systems principles
(SOSP) (New York, NY, USA, 2009), ACM, pp. 15–28.
[38] D ORING , A., AND G ABRANI , M. On networking multithreaded processor design: hardware
thread prioritization. In 46th IEEE International Midwest Symposium on Circuits and Systems
(Switzerland, 2003), vol. 1, pp. 520–523.
[41] F ELDMANN , A. Internet clean-slate design: what and why? SIGCOMM Comput. Commun. Rev.
37, 3 (2007), 59–64.
[42] F ENDER , J., ROSE , J., AND G ALLOWAY, D. The Transmogrifier-4: an FPGA-based hardware
development system with multi-gigabyte memory capacity and high host and memory bandwidth.
In Proc. of FPT’05 (december 2005), pp. 301– 302.
B IBLIOGRAPHY 133
[43] F ORT, B., C APALIJA , D., V RANESIC , Z. G., AND B ROWN , S. D. A multithreaded soft
processor for SoPC area reduction. In Proc. of FCCM ’06 (Washington, DC, USA, 2006), IEEE
Computer Society, pp. 131–142.
[44] G HOBADI , M., L ABRECQUE , M., S ALMON , G., A ASARAAI , K., Y EGANEH , S. H., G ANJALI ,
Y., AND S TEFFAN , J. G. Caliper: A tool to generate precise and closed-loop traffic. In
SIGCOMM Conference (August 2010).
[45] G IOIA , A. FCC jurisdiction over ISPs in protocol-specific bandwidth throttling. 15 Mich.
Telecomm. Tech. Law Review 517, 2009.
[46] G OEL , S., AND S HAWKY, H. A. Estimating the market impact of security breach announcements
on firm values. Inf. Manage. 46, 7 (2009), 404–410.
[47] G OTTSCHLICH , J. E., C ONNORS , D., W INKLER , D., S IEK , J. G., AND VACHHARAJANI , M.
An intentional library approach to lock-aware transactional memory. Tech. Rep. CU-CS-1048-08,
University of Colorado at Boulder, October 2008.
[48] G RINBERG , S., AND W EISS , S. Investigation of transactional memory using FPGAs. In Proc.
of IEEE 24th Convention of Electrical and Electronics Engineers in Israel (Nov. 2006), pp. 119–
122.
[49] G R ÜNEWALD , M., N IEMANN , J.-C., P ORRMANN , M., , AND R ÜCKERT, U. A framework for
design space exploration of resource efficient network processing on multiprocessor SoCs. In
Proc. of the 3rd Workshop on Network Processors & Applications (2004).
[51] H AMMOND , L., W ONG , V., C HEN , M., C ARLSTROM , B. D., DAVIS , J. D., H ERTZBERG , B.,
P RABHU , M. K., W IJAYA , H., KOZYRAKIS , C., AND O LUKOTUN , K. Transactional memory
coherence and consistency. SIGARCH Comput. Archit. News 32, 2 (2004), 102.
[52] H ANDLEY, M., KOHLER , E., G HOSH , A., H ODSON , O., AND R ADOSLAVOV, P. Designing
extensible IP router software. In Proc. of the 2nd conference on Symposium on Networked Systems
Design & Implementation (Berkeley, CA, USA, 2005), USENIX Association, pp. 189–202.
[54] H ERLIHY, M., AND M OSS , J. E. B. Transactional memory: architectural support for lock-free
data structures. SIGARCH Comput. Archit. News 21, 2 (1993), 289–300.
[55] H ORTA , E. L., L OCKWOOD , J. W., TAYLOR , D. E., AND PARLOUR , D. Dynamic hardware
plugins in an FPGA with partial run-time reconfiguration. In Proc. of the 39th conference on
Design automation (DAC) (New York, NY, USA, 2002), ACM Press, pp. 343–348.
[56] I NTEL C ORPORATION. IXP2850 Network Processor: Hardware Reference Manual, July 2004.
[57] I NTEL C ORPORATION. IXP2800 Network Processor: Hardware Reference Manual, July 2005.
[58] I NTEL C ORPORATION. Intel R Itanium R Architecture Software Developer’s Manual, January
2006.
B IBLIOGRAPHY 134
[60] IP S EMICONDUCTORS. OC-48 VPN solution using low-power network processors, March 2002.
[61] JAKMA , P., JARDIN , V., OVSIENKO , D., S CHORR , A., T EPPER , H., T ROXEL , G., AND
YOUNG , D. Quagga routing software suite. http://www.quagga.net.
[62] JEDEC. JESD79: Double data rate (DDR) SDRAM specification. JEDEC Solid State
Technology Association, 2003.
[63] K ACHRIS , C., AND K ULKARNI , C. Configurable transactional memory. In Proc. of the
15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM)
(Washington, DC, USA, 2007), IEEE Computer Society, pp. 65–72.
[64] K ACHRIS , C., AND VASSILIADIS , S. Analysis of a reconfigurable network processor. In Proc.
of IPDPS (Los Alamitos, CA, USA, 2006), IEEE Computer Society, p. 173.
[65] K ACHRIS , C., AND VASSILIADIS , S. Performance evaluation of an adaptive FPGA for
network applications. In Proc. of the Seventeenth IEEE International Workshop on Rapid System
Prototyping (RSP) (Washington, DC, USA, 2006), IEEE Computer Society, pp. 54–62.
[66] K ANE , G., AND H EINRICH , J. MIPS RISC Architecture. Prentice Hall, 1992.
[67] K AUSHIK R AVINDRAN , NADATHUR S ATISH , Y. J., AND K EUTZER , K. An FPGA-based soft
multiprocessor system for IPv4 packet forwarding. In 15th International Conference on Field
Programmable Logic and Applications (FPL-05) (August 2005), pp. pp 487–492.
[68] K EISTER , S. TSP3 traffic stream processor – M27483. Mindspeed Technologies, Inc., May
2004.
[69] K ENNY, J. R., AND K RIKELIS , A. Digital beam former coefficient management using advanced
embedded processor technology. In HPEC (2007).
[70] K HULLER , S., M OSS , A., AND NAOR , J. S. The budgeted maximum coverage problem. Inf.
Process. Lett. 70, 1 (1999), 39–45.
[71] K INGYENS , J., AND S TEFFAN , J. G. A GPU-inspired soft processor for high-throughput
acceleration. In Reconfigurable Architectures Workshop (2010).
[72] K ISSELL , K. D. MIPS MT: A multithreaded RISC architecture for embedded real-time
processing. In Proc. of HiPEAC ’08 (Sweden, January 2008), pp. 9–21.
[73] KONGETIRA , P., A INGARAN , K., AND O LUKOTUN , K. Niagara: a 32-way multithreaded Sparc
processor. Micro, IEEE 25, 2 (March-April 2005), 21–29.
[74] L ABRECQUE , M. Towards a compilation infrastructure for network processors. Master’s thesis,
University of Toronto, 2006.
[75] L ABRECQUE , M., J EFFREY, M., AND S TEFFAN , J. G. Application-specific signatures for
transactional memory in soft processors. In Proc. of ARC (2010).
B IBLIOGRAPHY 135
[76] L ABRECQUE , M., AND S TEFFAN , J. G. Improving pipelined soft processors with multithread-
ing. In Proc. of FPL ’07 (August 2007), pp. 210–215.
[77] L ABRECQUE , M., AND S TEFFAN , J. G. Fast critical sections via thread scheduling for
FPGA-based multithreaded processors. In Proc. of 19th International Conference on Field
Programmable Logic and Applications (FPL) (Prague, Czech Republic, Sept. 2009).
[78] L ABRECQUE , M., AND S TEFFAN , J. G. The case for hardware transactional memory in software
packet process ing. In ACM/IEEE Symposium on Architectures for Networking and Communicat
ions Systems (La Jolla, USA, October 2010).
[79] L ABRECQUE , M., AND S TEFFAN , J. G. NetTM.
http://www.netfpga.org/foswiki/bin/view/NetFPGA/OneGig/NetTM, February 2011.
[80] L ABRECQUE , M., AND S TEFFAN , J. G. NetTM: Faster and easier synchronization for soft
multicores via transactional memory. In International Symposium on Field-Programmable Gate
Arrays (FPGA) (Monterey, Ca, US, February 2011).
[81] L ABRECQUE , M., S TEFFAN , J. G., S ALMON , G., G HOBADI , M., AND G ANJALI , Y.
NetThreads: Programming NetFPGA with threaded software. In NetFPGA Developers Workshop
(August 2009).
[82] L ABRECQUE , M., S TEFFAN , J. G., S ALMON , G., G HOBADI , M., AND G ANJALI , Y.
NetThreads-RE: Netthreads optimized for routing applications. In NetFPGA Developers
Workshop (August 2010).
[83] L ABRECQUE , M., Y IANNACOURAS , P., AND S TEFFAN , J. G. Custom code generation for soft
processors. SIGARCH Computer Architecture News 35, 3 (2007), 9–19.
[84] L ABRECQUE , M., Y IANNACOURAS , P., AND S TEFFAN , J. G. Scaling soft processor systems.
In Proc. of FCCM ’08 (April 2008), pp. 195–205.
[85] L AI , Y.-K. Packet Processing on Stream Architecture. PhD thesis, North Carolina State
University, 2006.
[86] L EE , B. K., AND J OHN , L. K. NpBench: A benchmark suite for control plane and data plane
applications for network processors. In 21st International Conference on Computer Design (San
Jose, California, October 2001).
[87] L EVANDOSKI , J., S OMMER , E., S TRAIT, M., C ELIS , S., E XLEY, A., AND K ITTREDGE , L.
Application layer packet classifier for Linux. http://l7-filter.sourceforge.net.
[88] L IDINGTON , G. Programming a data flow processor. http://www.xelerated.com, 2003.
[89] L IU , Z., Z HENG , K., AND L IU , B. FPGA implementation of hierarchical memory architecture
for network processors. In IEEE International Conference on Field-Programmable Technology
(2004), pp. 295 – 298.
[90] L OCKWOOD , J. W. Experience with the NetFPGA program. In NetFPGA Developers Workshop
(2009).
[91] L OCKWOOD , J. W., M C K EOWN , N., WATSON , G., G IBB , G., H ARTKE , P., NAOUS , J.,
R AGHURAMAN , R., AND L UO , J. NetFPGA - an open platform for gigabit-rate network
switching and routing. In Proc. of MSE ’07 (June 3-4 2007).
B IBLIOGRAPHY 136
[92] L OCKWOOD , J. W., T URNER , J. S., AND TAYLOR , D. E. Field programmable port extender
(FPX) for distributed routing and queuing. In Proc. of the 2000 ACM/SIGDA eighth international
symposium on Field programmable gate arrays (FPGA) (New York, NY, USA, 2000), ACM
Press, pp. 137–144.
[93] L UO , Y., YANG , J., B HUYAN , L., AND Z HAO , L. NePSim: A network processor simulator
with power evaluation framework. IEEE Micro Special Issue on Network Processors for Future
High-E nd Systems and Applications (Sept/Oct 2004).
[94] L UPON , M., M AGKLIS , G., AND G ONZ ÁLEZ , A. Version management alternatives for hardware
transactional memory. In Proc. of the 9th workshop on memory performance (MEDEA) (New
York, NY, USA, 2008), ACM, pp. 69–76.
[95] LYSECKY, R., AND VAHID , F. A Study of the Speedups and Competitiveness of FPGA Soft
Processor Cores using Dynamic Hardware/Software Partitioning. In Proc. of the conference on
Design, Automation and Test in Europe (DATE) (Washington, DC, USA, 2005), IEEE Computer
Society, pp. 18–23.
[96] M ARTIN , M., B LUNDELL , C., AND L EWIS , E. Subtleties of transactional memory atomicity
semantics. IEEE Comput. Archit. Lett. 5, 2 (2006), 17.
[97] M AXIAGUINE , A., UNZLI , S., C HAKRABORTY, S., AND T HIELE , L. Rate analysis for
streaming applications with on-chip buffer constraints. In Asia and South Pacific Design
Automation Conference (ASP-DAC) (Japan, January 2004), pp. 131–136.
[98] M C L AUGHLIN , K., O’C ONNOR , N., AND S EZER , S. Exploring CAM design for network
processing using FPGA technology. In International Conference on Internet and Web
Applications and Services (February 2006).
[99] M ELVIN , S., AND PATT, Y. Handling of packet dependencies: a critical issue for highly parallel
network processors. In Proc. of the 2002 international conference on Compilers, architecture,
and synthesis for embedded systems (CASES) (New York, NY, USA, 2002), ACM, pp. 202–209.
[100] M EMIK , G., AND M ANGIONE -S MITH , W. H. Evaluating network processors using NetBench.
ACM Trans. Embed. Comput. Syst. 5, 2 (2006), 453–471.
[101] M EMIK , G., M ANGIONE -S MITH , W. H., AND H U , W. NetBench: A benchmarking suite
for network processors. In IEEE/ACM International Conference on Computer-Aided Design
(ICCAD) (San Jose, CA, November 2001).
[102] M EMIK , G., M EMIK , S. O., AND M ANGIONE -S MITH , W. H. Design and analysis of a layer
seven network processor accelerator using reconfigurable logic. In Proc. of the 10th Annual
IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM) (Washington,
DC, USA, 2002), IEEE Computer Society, p. 131.
[104] M INIWATTS M ARKETING G ROUP. Internet usage statistics - the big picture.
http://www.internetworldstats.com/stats.htm, 2010.
[105] M ORRIS , R., KOHLER , E., JANNOTTI , J., AND K AASHOEK , M. F. The Click modular router.
SIGOPS Oper. Syst. Rev. 33, 5 (1999), 217–231.
B IBLIOGRAPHY 137
[106] M OUSSALI , R., G HANEM , N., AND S AGHIR , M. Microarchitectural enhancements for
configurable multi-threaded soft processors. In Proc. of FPL ’07 (Aug. 2007), pp. 782–785.
[109] M UELLER , F. A library implementation of POSIX threads under UNIX. In Proc. of USENIX
(1993), pp. 29–41.
[110] M UNTEANU , D., AND W ILLIAMSON , C. An FPGA-based network processor for IP packet
compression. In SCS SPECTS (Philadelphia, PA, July 2005), pp. 599–608.
[111] OTOO , E. J., AND E FFAH , S. Red-black trie hashing. Tech. Rep. TR-95-03, Carleton University,
1995.
[112] PASSAS , S., M AGOUTIS , K., AND B ILAS , A. Towards 100 gbit/s Ethernet: multicore-based
parallel communication protocol design. In Proc. of the 23rd international conference on
Supercomputing (ICS) (New York, NY, USA, 2009), ACM, pp. 214–224.
[113] PATTERSON , D. A., AND H ENNESSY, J. L. Computer Organization and Design: The
Hardware/Software Interface. Morgan Kaufmann, 2008.
[114] PAULIN , P., P ILKINGTON , C., AND B ENSOUDANE , E. StepNP: A system-level exploration
platform for network processors. IEEE Design & Test of Computers 19, 6 (November 2002),
17–26.
[115] PAULIN , P., P ILKINGTON , C., B ENSOUDANE , E., L ANGEVIN , M., AND LYONNARD , D.
Application of a multi-processor SoC platform to high-speed packet forwarding. Proc. of Design,
Automation and Test in Europe Conference and Exhibition (DATE) 3 (2004).
[116] P ITTSBURGH S UPERCOMPUTING C ENTER (PSC). National laboratory for applied networking
research (NLANR), 1997.
[117] P LAVEC , F. Soft-core processor design. Master’s thesis, University of Toronto, 2004.
[119] Q UISLANT, R., G UTIERREZ , E., AND P LATA , O. Improving signatures by locality exploitation
for transactional memory. In Proc. of 18th International Conference on Parallel Architectures
and Compilation Techniques (PACT) (Raleigh, North Carolina, Sept. 2009), pp. 303–312.
[120] R AJWAR , R., H ERLIHY, M., AND L AI , K. Virtualizing transactional memory. In ISCA
(Washington, DC, USA, 2005), IEEE Computer Society, pp. 494–505.
[121] R AMASWAMY, R., W ENG , N., AND W OLF, T. Application analysis and resource mapping
for heterogeneous network processor architectures. In Proc. of Third Workshop on Network
Processors and Applications (NP-3) in conjunction with Tenth International Symposium on High
Performance Computer Architecture (HPCA-10) (Madrid, Spain, Feb. 2004), pp. 103–119.
B IBLIOGRAPHY 138
[122] R AVINDRAN , K., S ATISH , N., J IN , Y., AND K EUTZER , K. An FPGA-Based Soft
Multiprocessor System for IPv4 Packet Forwarding. Proc. of the International Conference on
Field Programmable Logic and Applications (2005), 487–492.
[123] R EDDI , V. J., S ETTLE , A., C ONNORS , D. A., AND C OHN , R. S. Pin: a binary instrumentation
tool for computer architecture research and education. In WCAE ’04: Proceedings of the 2004
workshop on Computer architecture education (New York, NY, USA, 2004), ACM, p. 22.
[124] ROESCH , M. Snort - lightweight intrusion detection for networks. In Proc. of the 13th USENIX
conference on System administration (LISA) (Berkeley, CA, USA, 1999), USENIX Association,
pp. 229–238.
[125] ROSINGER , H.-P. Connecting customized IP to the MicroBlaze soft processor using the Fast
Simplex Link (FSL) channel. XAPP529, 2004.
[126] S ADOGHI , M., L ABRECQUE , M., S INGH , H., S HUM , W., AND JACOBSEN , H.-A. Efficient
event processing through reconfigurable hardware for algori thmic trading. In International
Conference on Very Large Data Bases (VLDB) (Singapore, September 2010).
[127] S ALMON , G., G HOBADI , M., G ANJALI , Y., L ABRECQUE , M., AND S TEFFAN , J. G.
NetFPGA-based precise traffic generation. In Proc. of NetFPGA Developers Workshop’09
(2009).
[128] S ANCHEZ , D., Y EN , L., H ILL , M. D., AND S ANKARALINGAM , K. Implementing signatures
for transactional memory. In Proc. of the 40th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO) (Washington, DC, USA, 2007), IEEE Computer Society, pp. 123–
133.
[129] S CHELLE , G., AND G RUNWALD , D. CUSP: a modular framework for high speed network
applications on FPGAs. In FPGA (2005), pp. 246–257.
[131] S HADE , L. R. Computer networking in canada: From ca*net to canarie. Canadian Journal of
Communication 19, 1 (1994).
[132] S HAFER , J., AND R IXNER , S. RiceNIC: a reconfigurable network interface for experimental
research and education. In Proc. of the 2007 workshop on Experimental computer science
(ExpCS) (New York, NY, USA, 2007), ACM, p. 21.
[133] S HANNON , L., AND C HOW, P. Standardizing the performance assessment of reconfigurable
processor architectures. In Proc. of FCCM ’03 (2003), pp. 282–283.
[134] T EODORESCU , R., AND T ORRELLAS , J. Prototyping architectural support for program rollback
using FPGAs. In Proc. of FCCM ’05 (April 2005), pp. 23–32.
[135] T HIELE , L., C HAKRABORTY, S., G RIES , M., AND K ÜNZLI , S. Network Processor Design: Is-
sues and Practices. First Workshop on Network Processors at the 8th International Symposium on
High-Performance Computer Architecture (HPCA8). Morgan Kaufmann Publishers, Cambridge
MA, USA, February 2002, ch. Design Space Exploration of Network Processor Architectures,
pp. 30–41.
B IBLIOGRAPHY 139
[136] T HIES , W. Language and Compiler Support for Stream Programs. PhD thesis, Massachusetts
Institute of Technology, 2009.
[137] T URNER , J. S. A proposed architecture for the geni backbone platform. In Proc. of the 2006
ACM/IEEE symposium on Architecture for networking and communications systems (New York,
NY, USA, 2006), ANCS ’06, ACM, pp. 1–10.
[138] UL -A BDIN Z AIN - UL A BDIN , Z., AND S VENSSON , B. Compiling stream-language applications
to a reconfigurable array processor. In ERSA (2005), pp. 274–275.
[139] U NGERER , T., ROBI Č , B., AND Š ILC , J. A survey of processors with explicit multithreading.
ACM Computing Surveys (CSUR) 35, 1 (March 2003).
[140] VASSILIADIS , S., W ONG , S., G AYDADJIEV, G. N., B ERTELS , K., K UZMANOV, G., AND
PANAINTE , E. M. The molen polymorphic processor. IEEE Transactions on Computers
(November 2004), 1363– 1375.
[141] V EENSTRA , J., AND F OWLER , R. MINT: a front end for efficient simulation of shared-memory
multiprocessors. In Proc. of the Second International Workshop on Modeling, Analysis, and
Simulation of Computer and Telecommunication Systems (NC, USA, January 1994), pp. 201–
207.
[142] V ERD Ú , J., G ARCIA , J., N EMIROVSKY, M., AND VALERO , M. Analysis of traffic traces for
stateful applications. In Proc. of the 3rd Workshop on Network Processors and Applications
(NP-3) (2004).
[143] V ERD Ú , J., N EMIROVSKY, M., AND VALERO , M. Multilayer processing - an execution model
for parallel stateful packet processing. In Proc. of the 4th ACM/IEEE Symposium on Architectures
for Networking and Communications Systems (ANCS) (New York, NY, USA, 2008), ACM,
pp. 79–88.
[145] WANG , J., C HENG , H., H UA , B., AND TANG , X. Practice of parallelizing network applications
on multi-core architectures. In Proc. of the 23rd international conference on Supercomputing
(ICS) (New York, NY, USA, 2009), ACM, pp. 204–213.
[146] WANG , Y., L U , G., , AND L I , X. A study of Internet packet reordering. In ICOIN (2004).
[147] W EE , S., C ASPER , J., N JOROGE , N., T ESYLAR , Y., G E , D., KOZYRAKIS , C., AND
O LUKOTUN , K. A practical FPGA-based framework for novel CMP research. In Proc. of the
ACM/SIGDA 15th international symposium on Field programmable gate arrays (FPGA) (New
York, NY, USA, 2007), ACM, pp. 116–125.
[148] W ENG , N., AND W OLF, T. Pipelining vs. multiprocessors – choosing the right network processor
system topology. In Proc. of Advanced Networking and Communications Hardware Workshop
(ANCHOR 2004) in conjunction with The 31st Annual International Symposium on Computer
Architecture (ISCA 2004) (Munich, Germany, June 2004).
B IBLIOGRAPHY 140
[149] W ENG , N., AND W OLF, T. Pipelining vs. multiprocessors ? Choosing the right network
processor system topology. In Advanced Networking and Communications Hardware Workshop
(Germany, June 2004).
[150] W OLF, T., AND F RANKLIN , M. CommBench - a telecommunications benchmark for network
processors. In Proc. of IEEE International Symposium on Performance Analysis of Systems and
Software (ISPASS) (Austin, TX, April 2000), pp. 154–162.
[151] W OODS , D. Coherent shared memories for FPGAs. Master’s thesis, University of Toronto, 2009.
[152] X ILINX I NC . MicroBlaze RISC 32-Bit Soft Processor, August 2001.
[153] X ILINX I NC . AccelDSP synthesis tool v9.2.01, 2007.
[154] X ILINX I NC . MicroBlaze Processor Reference Guide v8.0. http://www.xilinx.com, 2007.
[155] X ILINX I NC . Virtex-5 family overview, June 2009.
[156] X ILINX , I NC . 2010 form 10-k and proxy. http://investor.xilinx.com, June 2010.
[157] X ILINX I NC . Zynq-7000 epp product brief - xilinx, February 2011.
[158] Y EN , L., B OBBA , J., M ARTY, M. R., M OORE , K. E., VOLOS , H., H ILL , M. D., S WIFT,
M. M., AND W OOD , D. A. LogTM-SE: Decoupling hardware transactional memory from
caches. In Proc. of the IEEE 13th International Symposium on High Performance Computer
Architecture (HPCA) (Washington, DC, USA, 2007), IEEE Computer Society, pp. 261–272.
[159] Y EN , L., D RAPER , S., AND H ILL , M. Notary: Hardware techniques to enhance signatures. In
Proc. of Micro’08 (Nov. 2008), pp. 234–245.
[160] Y IANNACOURAS , P., AND ROSE , J. A parameterized automatic cache generator for FPGAs.
In IEEE International Conference on Field-Programmable Technology (FPT) (December 2003),
pp. 324– 327.
[161] Y IANNACOURAS , P., ROSE , J., AND S TEFFAN , J. G. The microarchitecture of FPGA-based
soft processors. In Proc. of the 2005 international conference on Compilers, architectures and
synthesis for embedded systems (CASES) (New York, NY, USA, 2005), ACM Press, pp. 202–212.
[162] Y IANNACOURAS , P., S TEFFAN , J. G., AND ROSE , J. Application-specific customization of soft
processor microarchitecture. In Proc. of the internation symposium on Field programmable gate
arrays (FPGA) (New York, NY, USA, 2006), ACM Press, pp. 201–210.
[163] Y IANNACOURAS , P., S TEFFAN , J. G., AND ROSE , J. Exploration and customization of FPGA-
based soft processors. IEEE Transactions on Computer Aided Design of Integrated Circuits and
Systems (February 2007).
[164] Y IANNACOURAS , P., S TEFFAN , J. G., AND ROSE , J. Fine-grain performance scaling of soft
vector processors. In Proc. of the 2009 international conference on Compilers, architecture, and
synthesis for embedded systems (CASES) (New York, NY, USA, 2009), ACM, pp. 97–106.
[165] Y UJIA J IN , NADATHUR S ATISH , K. R., AND K EUTZER , K. An automated exploration
framework for FPGA-based soft multiprocessor systems. In Proc. of the 2005 International
Conference on Hardware/Software Codesign and System Synthesis (CODES) (September 2005),
pp. 273–278.