Efficient Per-Flow Queueing in DRAM at OC-192 Line Rate Using Out-of-Order Execution Techniques
Aristides A. Nikologiannis
Institute of Computer Science (ICS)
Foundation of Research & Technology – Hellas (FORTH)
P.O. Box 1385
Heraklio, Crete, GR-711-10 GREECE
Tel: +30 81 391 655, fax: +30 81 391 661
anikol@ics.forth.gr, anikol@ellemedia.com
Abstract
The explosive growth of Internet traffic has created an acute demand for networks of
ever increasing throughput. Besides raw throughput, modern multimedia
applications also demand Quality of Service (QoS) guarantees. Both of these
requirements result in a new generation of switches and routers, which use
specialized hardware to support high speeds and advanced QoS.
This thesis studies one of the subsystems of such switches/routers, namely
queue management in the ingress and egress line cards at OC-192 (10 Gbps) line
rate. The provision of QoS guarantees usually requires flow isolation, which is often
achieved using per-flow queueing. The implementation of a queue manager,
supporting thousands of flows and operating at such high speed, is challenging. We
study this issue thoroughly and show that the queue manager implementation is
feasible by using advanced hardware techniques, similar to those employed in the
supercomputers of the 60's and in modern microprocessors. We use DRAM
technology for buffer memory in order to provide large buffer space. To effectively
deal with bank conflicts in the DRAM buffer, we have to use multiple pipelined
control processes, out-of-order execution and operand renaming techniques. To
avoid off-chip SRAM, we maintain the queue management pointers in the buffer
memory, using free buffer preallocation and free list bypassing. We have described
our architecture using behavioral Verilog (Hardware Description Language), at a
clock cycle accurate level, assuming Rambus DRAM, and we have partially verified
it by short simulation runs.
Table of Contents
1 Introduction..................................................................................... 7
1.1 Motivation ..............................................................................................................................7
1.1 Switch Evolution....................................................................................................................8
1.1.1 Switch Generations ........................................................................................................8
1.1.2 Switching Fabric Topologies .......................................................................................10
1.1.3 Queueing Architectures & Performance ......................................................................11
1.2 Related Work on High-Speed Switches .............................................................................14
1.3 Thesis Contribution.............................................................................................................15
1.4 Thesis Organization.............................................................................................................16
2 Ingress/Egress Interface Module Architecture .......................... 17
2.1 Ingress/Egress Module Functionality ................................................................................17
2.1.1 Ingress Module main Functions ...................................................................................17
2.1.2 Egress Module main Functions....................................................................................18
2.2 Ingress Module Chip Partitioning......................................................................................18
2.3 Datapath and Queue Management Chip ...........................................................................20
2.4 Header Protocol Processing Chip ......................................................................................20
2.4.1 Flow Classification ......................................................................................................20
2.4.2 Short Label Forwarding 1: ATM .................................................................................21
2.4.3 Short Label Forwarding 2: IP over ATM.....................................................................22
2.4.4 Short Label Forwarding 3: MPLS................................................................................23
2.5 Scheduling-Policing Chip....................................................................................................24
2.5.1 Basic Disciplines on Scheduling..................................................................................24
2.5.2 Scheduling Best-Effort Connections............................................................................25
2.5.3 Scheduling Guaranteed-Service Connections ..............................................................26
2.5.4 Leaky Bucket ...............................................................................................................26
2.5.5 Calendar Queue............................................................................................................28
2.5.6 Heap Management .......................................................................................................28
2.5.7 An Advanced Scheduler Architecture..........................................................................28
3 Datapath & Queue Management Chip Architecture ................ 31
3.1 Queue Management Data Structures.................................................................................31
3.1.1 Fragmentation loss .......................................................................................................32
3.1.2 Queue Management Operations...................................................................................32
3.2 Buffer Memory Technology................................................................................................33
3.2.1 DRAM versus SRAM ..................................................................................................33
3.2.2 Rambus DRAM Technology .......................................................................................34
3.2.3 Out-of-Order DRAM Accesses....................................................................................35
3.3 Multi-Queue Management Architecture at High-Speed (10Gbps) .................................35
3.3.1 Queue Management Architecture Overview................................................................35
3.3.2 Why Pipelined Queue Manager ...................................................................................37
3.3.3 Why Multiple Control Processes .................................................................................37
3.4 Queue Management Pipeline Dependencies......................................................................41
3.4.1 Successive Enqueue and Dequeue Operations for the same flow................................42
3.4.2 Successive Enqueue Operations of packet segments ...................................................42
3.4.3 Buffer Memory Module Dependencies........................................................................43
3.5 Pipeline Dependencies Handling ........................................................................................43
3.5.1 Operands Renaming (Tomasulo) [15, chapter 4], [16] ................................................43
3.5.2 Applying Operand Renaming Techniques to the Queue Management Architecture ...43
3.6 Queue Pointer Management & Architecture Modifications............................................46
3.6.1 Next-Pointers in the DRAM Buffer Memory ..............................................................46
3.6.2 Buffer Preallocation technique [29].............................................................................46
3.6.3 Link Throughput Saturation.........................................................................................48
3.6.4 Free List Bypassing technique [29]..............................................................................48
3.6.5 Per-memory bank Queueing Free List Organization ...................................................49
3.6.6 Free Buffer Cache ........................................................................................................50
3.7 The Overall Queue Management Architecture.................................................................50
4 Queue Management Micro-Architecture ................................... 53
4.1 Hardware Implementation of the QM Data Structures...................................................53
4.1.1 Queue Table.................................................................................................................53
4.1.2 Pending Write Table and Transit Buffer......................................................................54
4.1.3 Pending Read Table.....................................................................................................55
4.1.4 Free List Table and Free List Cache ............................................................................56
4.1.5 Control and data Buffer ...............................................................................................56
4.2 The Pipelined Control Processes Micro-Architecture......................................................57
4.2.1 Packet Fetching Process Micro-Architecture...............................................................57
4.2.2 Enqueue Operation Issuing Process Micro-Architecture .............................................59
4.2.3 Enqueue Execution Process Micro-Architecture .........................................................61
4.2.4 Handling Exceptional Cases during an Enqueue Operation ........................................65
4.2.5 Dequeue Operation Issuing Process Micro-Architecture.............................................66
4.2.6 Dequeue Operation Execution Process Micro-Architecture ........................................66
4.2.7 Handling Exceptional Cases during a Dequeue Operation ..........................................69
4.2.8 Queue Manager Interface Process Micro-Architecture................................................69
4.2.9 Resource Conflicts among Queue Management Processes..........................................72
4.2.10 Search Engines Architecture........................................................................................73
4.2.11 Free List Organization Alternatives.............................................................................77
4.3 Rambus Memory Technology ............................................................................................79
4.3.1 Read and Write Operations in a Pipelined Fashion .....................................................80
4.3.2 Rambus Memory Device Architecture ........................................................................81
4.3.3 Rambus Memory Module Architecture .......................................................................82
4.3.4 Rambus Memory Interface ..........................................................................................82
4.3.5 Rambus Memory Controller ........................................................................................83
5 Verilog Description & Simulation ............................................... 87
5.1 Hardware Implementation Cost ........................................................................................87
5.2 Verification ..........................................................................................................................89
6 Conclusions and Open Topics...................................................... 91
7 Appendix A .................................................................................... 93
7.1 Flow Classification ..............................................................................................................93
7.1.1 Recursive Flow Classification (RFC) ..........................................................................93
7.1.2 Flow Classification by using Hashing functions..........................................................94
7.2 IP Routing Lookup..............................................................................................................95
7.2.1 Multi-stage IP routing by using Small SRAM Blocks.................................................95
7.2.2 Two-stage IP routing by using Large DRAM Blocks..................................................96
8 Appendix B .................................................................................... 97
8.1 Block Diagrams of Queue Management Processes...........................................................97
9 References .................................................................................... 103
List of Figures
Chapter 1
Figure 1. 1 First generation switch architecture.....................................................................................8
Figure 1. 2 Second generation switch architecture ................................................................................9
Figure 1. 3 Third generation switch architecture ...................................................................................9
Figure 1. 4 Crossbars ...........................................................................................................................10
Figure 1. 5 Banyan...............................................................................................................................10
Figure 1. 6 Benes .................................................................................................................................11
Figure 1. 7 Benes construction ............................................................................................11
Figure 1. 8 Batcher-Banyan .................................................................................................................11
Figure 1. 9 Output Queueing ................................................................................................12
Figure 1. 10 Input Queueing .................................................................................................12
Figure 1. 11 Head of Line Blocking ....................................................................................................13
Figure 1. 12 Advanced Input Queueing ...............................................................................................13
Figure 1. 13 Switches with internal speedup .......................................................................................13
Chapter 2
Figure 2. 1 Ingress Module Chip Partitioning......................................................................................19
Figure 2. 2 ATM Translation Table.....................................................................................................22
Figure 2. 3 IPv4, IPv6 packet format ...................................................................................23
Figure 2. 4 MPLS Hierarchical Network.............................................................................................23
Figure 2. 5. Rate-controlled Scheduler ................................................................................................26
Figure 2. 6 Leaky Bucket.....................................................................................................................27
Figure 2. 7. The two leaky buckets traffic shaping mechanism...........................................................27
Figure 2. 8 Calendar queue structure ...................................................................................................28
Figure 2. 9 A two stage scheduler........................................................................................................29
Chapter 3
Figure 3. 1 Queue Manager Data Structures........................................................................................31
Figure 3. 2 Enqueue Operation ............................................................................................................33
Figure 3. 3 Dequeue Operation............................................................................................................33
Figure 3. 4 Buffer Memory Throughput ..............................................................................................34
Figure 3. 5 Non-Interleaved versus Interleaved Transaction ...............................................................35
Figure 3. 6 Multi-Queue Management Block Diagram .......................................................................36
Figure 3. 7 . Incoming segment entry process .....................................................................................39
Figure 3. 8 Enqueue Issuing Process ...................................................................................................39
Figure 3. 9 Enqueue Execution Process...............................................................................................40
Figure 3. 10 Queue Management Interface Process.............................................................................41
Figure 3. 11 Successive Enqueue Operations of packet segments.......................................................42
Figure 3. 12 Pending Lists ...................................................................................................................43
Figure 3. 13 Segment list per packet arrival ........................................................................................44
Figure 3. 14 Operand renaming technique for successive enqueue operations....................................45
Figure 3. 15 Per-flow pending lists......................................................................................................45
Figure 3. 16 Operand renaming technique for successive enqueue operations....................................46
Figure 3. 17 No free buffer preallocation ............................................................................................47
Figure 3. 18 Buffer preallocation.........................................................................................................47
Figure 3. 19 Read and Write transactions of an enqueue and a dequeue operation at the same time slot .....48
Figure 3. 20 Free List Bypassing (memory transactions) ....................................................................49
Figure 3. 21 Multi-Queue Rambus Controller block diagram .............................................52
Chapter 4
Figure 4. 1 Queue Table ......................................................................................................................53
Figure 4. 2 Pending Write Table..........................................................................................................54
Figure 4. 3 Pending Read Table ..........................................................................................................55
Figure 4. 4 Free List Table and Free Buffer Cache .............................................................................56
Figure 4. 5 The Control Buffer format ................................................................................................57
Figure 4. 6 Packet fetching process block diagram (mode 1) ..............................................................58
Figure 4. 7 Packet fetching process block diagram (mode 2) ..............................................................59
Figure 4. 8 Enqueue issuing process datapath (not-pending state) ......................................................60
Figure 4. 9 Enqueue issuing process datapath (pending state).............................................................61
Figure 4. 10 Enqueue Execution process (first stage)..........................................................................62
Figure 4. 11 Free buffer extraction ......................................................................................................63
Figure 4. 12 Second stage (execute an enqueue operation) .................................................................64
Figure 4. 13 Dequeue Operation Issuing Process ................................................................................66
Figure 4. 14 First Stage........................................................................................................................67
Figure 4. 15 Second stage....................................................................................................................68
Figure 4. 16 Detection circuit ..............................................................................................................71
Figure 4. 17 Ingress Module Output Access Conflict..........................................................................71
Figure 4. 18 Queue Management Interface Process in Pipeline Fashion.............................................72
Figure 4. 19 The Search Engine Block Diagram .................................................................................75
Figure 4. 20 Priority Alternatives ........................................................................................................76
Figure 4. 21 T1 and T2 Priority Encoder Cells....................................................................................76
Figure 4. 22 Two-bit Priority Encoder .................................................................................77
Figure 4. 23 Eight-bit Priority Encoder................................................................................77
Figure 4. 24 The Modified Priority Encoder .......................................................................................77
Figure 4. 25 Bitmap Free List organization.........................................................................................78
Figure 4. 26 Free list organization as a linked list ...............................................................................78
Figure 4. 27 Per-bank queueing Free List Organization......................................................................79
Figure 4. 28 Rambus Technology........................................................................................................79
Figure 4. 29 Read Transaction.............................................................................................................80
Figure 4. 30 Write Transaction............................................................................................................80
Figure 4. 31 Interleaved read and write transactions ...........................................................................81
Figure 4. 32 Transaction Insertion FSM ..............................................................................................84
Figure 4. 33 Read Transaction time-diagram ......................................................................................85
Figure 4. 34 Write Transaction time-diagram......................................................................................85
Chapter 5
Figure 5. 1 Datapath chip memory requirements.................................................................................88
Figure 5. 2 hardware complexity of our architecture...........................................................................88
Appendix A
Figure 2b. 1 Recursive Flow Classification.........................................................................................93
Figure 2b. 2 Flow Classification by Hashing.......................................................................................94
Figure 2b. 3 Multi-stage IP routing .....................................................................................................95
Figure 2b. 4 Two-stage IP routing.......................................................................................................96
Appendix B
Figure 8. 1 Block Diagram of Packet Entry Process ...........................................................................97
Figure 8. 2 Block Diagram of Enqueue Issue Process........................................................................98
Figure 8. 3 Block Diagram of Enqueue Execution Process (first stage)..............................................98
Figure 8. 4 Block Diagram of Enqueue Execution Process (second stage) .........................................99
Figure 8. 5 Block Diagram of Dequeue Issue Process.........................................................................99
Figure 8. 6 Block Diagram of Dequeue Execution Process (first stage)............................................100
Figure 8. 7 Block Diagram of Dequeue Execution Process (second stage) .......................................100
1 Introduction
1.1 Motivation
Within the past few years, there has been a rapid growth in network traffic. New
applications, particularly multimedia applications, have placed increased demands
on the speed and Quality of Service (QoS) guarantees of the network infrastructure. These
requirements are expressed using the following QoS-related parameters:
• Bandwidth - the rate at which an application's traffic must be carried by the network
• Latency - the delay that an application can tolerate in the delivery of a packet of data
• Jitter - the variation in latency
• Loss - the percentage of lost data
Today, the most common network technologies are IP and ATM. IP technology
offers low-cost and flexible service on network resource distribution, but it offers no
QoS guarantees, at least in its traditional form. On the other hand, ATM technology
offers QoS guarantees by using admission control based on statistical properties of
policed connections, and by static sharing of network resources among these
connections. This works well for long-lived connections of limited burstiness (voice,
video), but performs poorly for short-lived, highly bursty transmissions (datagrams).
Hence, the modification of IP technology in order to provide QoS guarantees has
become an acute and challenging demand of today's network design.
In order to meet the increasing demands on network resources (bandwidth),
networking companies are called upon to design and manufacture the fastest
possible switches and routers. Line (port) speed is one parameter that must grow,
and the number of ports is the other such parameter. Port speed is in the OC-12 to
OC-48 (622 Mbps to 2.5 Gbps) range today, and will grow to OC-192 (10 Gbps) in
the next few years. The number of ports is in the tens to hundreds range, and will
need to grow to thousands, soon.
In this thesis, we describe a switch-router architecture that can support the two
trends: rising bandwidth demand, and rising demand for QoS guarantees. We focus
on router mechanisms that can support differentiated service to different types of
traffic (data, voice, video) using the same infrastructure. We describe effective
implementations of these mechanisms, such as per-flow queueing, by using
hardware in order to achieve such high speed rates. We discuss the functionality
of a switch-router interface at 10 Gbps line rate, and we propose advanced
techniques for the queue management implementation at such high speed. Finally,
we implement a behavioral model of the queue management subsystem, at a clock-
cycle-accurate level, using the Verilog HDL and we estimate the hardware
complexity of such an architecture in terms of gates, flip-flops and SRAM bit count.
Second generation switches are more scalable than first generation ones because the
critical path of routing and buffering is performed locally in the line cards.
Additionally, the traffic of each input line passes only once through the common bus.
The only bottleneck of this scheme is the I/O bus and its central arbiter, which can
work properly only for a limited number of interface cards.
The third generation of switches replaces the I/O bus with a switching fabric. The
routing and buffering functions are performed locally in the line cards. The
switching fabric transfers packets from the inputs to the outputs in parallel. The
central scheduler controls the line card access to the switching fabric and updates
their routing tables. This scheme is scalable in both the number of supported line cards
and the line rate. Figure 1.3 presents the architecture of third-generation
switches. Implementing switching fabrics and their interface cards at high speed is a
challenging issue. Section 2.2 presents switching fabrics and section 2.3 describes
queueing architectures.
The simplest self-routing switch fabric topology is the Banyan. An NxN Banyan
switch fabric built from 2x2 elements consists of log2N stages with N/2 elements per
stage [1, chapter 8]. Routing in a Banyan network is internally non-blocking only if
the packets at the inputs are sorted by destination output and gaps and replications
are eliminated.
An alternative to the Banyan topology is the Benes topology, presented in
figure 1.6. Like the Banyan, Benes networks are constructed recursively, as
shown in figure 1.7. The routing of input packets to the correct output lines requires
off-line evaluation, because some paths can only be determined after some other
paths are entirely defined. Additionally, Benes networks are rearrangeably non-
blocking; this means that when a flow is torn down or set up, potentially all routing
paths may have to be reconfigured in order to avoid blocking. The routing
complexity of Benes networks is proportional to N (log2 N - 1/2) [1, chapter 8],
where N is the number of network inputs/outputs.
Figure 1. 8 Batcher-Banyan
dedicated to its outgoing line, where it will wait until its departure from the switch.
This scheme achieves full throughput utilization but requires that the fabric and memory
of an N x N switch run N times¹ as fast as the line rate. This implies that output
queueing is impractical for switches with high line rates, or with a large number of
ports. For example, consider a 32x32 output queueing switch operating at a line rate
of 10Gbps. If we use a 64-byte datapath, we require memory devices that can
perform a write and a read operation every 1.6 ns.
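One way to arrive at the 1.6 ns figure (our own arithmetic, based on the numbers above): a 64-byte cell is 512 bits, so at 10 Gbps one cell time is 512 bits / 10 Gbps = 51.2 ns; in the worst case all 32 inputs send to the same output within one cell time, so that output's memory must sustain about 32 accesses in 51.2 ns, i.e. one memory operation every 51.2 ns / 32 = 1.6 ns.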
Another architecture, input queueing, locates the buffer memory at the inputs, as
shown in figure 1.10. When a packet arrives, it is immediately placed in its input
line queue and waits until it reaches the head of the queue. Then, it waits until the
scheduler of packet departures forwards it to the appropriate output. This scheme
requires a fabric that operates as fast as the input link rate and input link buffer
memory that provides throughput twice² the line rate. For example, consider a 32x32
input queueing switch operating at a line rate of 10Gbps. In this case, the input line
buffer memory must provide a write and a read operation every 51.2 ns
(throughput=20 Gbps). However, input queueing suffers from head of line (HOL)
blocking: if the packet at the head of an input queue is destined to a busy output, the
subsequent packets in the same queue must wait (are blocked) even if they are
destined to non-busy outputs - see figure 1.11. HOL blocking reduces the packet
delivery rate through the switch by more than one third of the input link rate.
A modified scheme of the input queueing, which overcomes the head of line
blocking, is presented in figure 1.12. This scheme is called "Advanced Input
Queueing" or "Virtual Output Queueing". In this scheme each input maintains a
separate queue for each output; thus, each incoming packet is enqueued to the
corresponding queue of its destination output. Even if we can theoretically achieve
100% packet delivery rate through the switch by using advanced input queueing, we
cannot achieve it in practice because the scheduler of packet departures must
operate at least N times³ as fast as the input link rate [2].
¹ When all the packets of the N inputs are destined to the same output concurrently, then the fabric must deliver N packets within a time interval and the memory must provide N times the throughput of each line.
² Write an incoming packet to the queue, and read a departing packet from the queue.
³ The number of input/output links of the switch is N.
Another method that reduces the effects of HOL blocking is to provide some
internal speedup [3], [4] of the switch fabric. A switch with a speedup of S can
deliver S cells⁴ from each input and S cells to each output within a time slot⁵. If the
input link rate is a cell per time slot and the switching fabric can deliver S (S>1)
cells per time slot through the switching fabric, we can choose a value for S that
achieves delivery rates comparable or equal to the link rate. This implies that the switching
fabric will operate at faster rates than the system input/output link rates. It has been
proved in [5] that a speedup of 2-1/N is both necessary and sufficient for a switch with
advanced input queueing to achieve full throughput utilization. Switches with internal speedup require
buffering at the inputs before switching as well as at the outputs after switching, as
shown in figure 1.13. Input buffering is required because multiple cells may arrive
for the same output and only S of them can be delivered; the remainder must be
buffered at the inputs until they are forwarded to the output. Output buffering is
required because the switching fabric feeds each output with cells at a higher rate (due
to internal speedup) than the rate at which the output transmits cells to the network.
⁴ Modern high-speed switches manipulate fixed-size cells; variable-size packets are segmented into fixed-size cells.
⁵ The time slot is the time between cell arrivals at the input ports of a switch.
support high-speed interfaces use combined input and output queueing schemes and
internal speedup [12], [13] in order to achieve the output queueing performance and
the input queueing scalability. Input buffering is organized as per-flow queueing
(Virtual Output Queueing) in order to eliminate HOL blocking. Per-flow queues
share the same memory in each input/output interface module in order to maximize
the utilization efficiency of a fixed amount of buffer memory [12]. Furthermore, to
increase the number of buffer cells that can be integrated within a given silicon area,
the shared buffer memory is implemented in DRAM rather than SRAM technology
[1, chapter 9].
⁶ Verilog is a hardware description language.
input queueing¹, which implies that the queue manager must operate at least² twice
as fast as the input link rate (enqueue an incoming packet and dequeue a departing
packet).
All the queues in the ingress module dynamically share the space of a single buffer
memory, thus efficiently utilizing this buffer space. The shared buffer memory is
organized into fixed-size blocks, because this simplifies memory management; thus,
variable size packets are segmented into fixed-size segments.
The provision of quality of service guarantees also requires traffic shaping and
scheduling of packet departures, and in some cases may also require policing of the
incoming stream. Traffic shaping is required in order to conform the incoming
traffic to its traffic descriptor parameters³. The most commonly used traffic shaping
mechanism is the leaky bucket. In order to police the traffic of the majority of the
system flows, we use a leaky bucket for each flow. When thousands of flows are
supported, shaping becomes expensive due to the need for thousands of leaky
buckets. In addition, a good scheduler is required in order to service the different
flows according to their service class as well as to service flows of the same class
with fairness. The scheduler must keep state information for all the system queues,
which makes it expensive as well.
¹ An extreme of per-flow queueing is the advanced input queueing, while the other extreme is to keep multiple flows per output.
² It will operate at higher rates due to the internal speedup.
³ A traffic descriptor is a set of parameters that describes the behavior of a data source.
⁴ Packets are segmented into fixed-size segments at the ingress module.
functions and may receive flow control backpressure from the switching fabric. The
flow control backpressure informs about congested outputs, so that packets to other
outputs are preferentially scheduled. The following sections describe the architecture
of the ingress module chips in more detail.
of the connection and the new VPI/VCI value of the connection on that link. The
switch then transmits the cell on that outgoing link with the appropriate connection
identifiers. Because VPI/VCI labels have only local significance across a particular
link, these values are remapped, as necessary, at each switch. Each ATM switch
maintains a routing table that it updates whenever a connection is set up or torn
down. The table has one entry per connection. The entry has the following format:
incoming link, incoming VPI/VCI label, outgoing link, and outgoing VPI/VCI label,
as figure 2.2 illustrates. Note that the Line In, VP In, and VC In fields form the index of
the translation table.
The final parameter in designing a scheduling discipline is the order in which the
scheduler serves packets from connections at a given priority level. There are three
fundamental choices: FCFS, weighted Round Robin, and out-of-order according to a
per-packet service tag. Servicing packets in the order of their arrival is easy to
implement, but it is neither a flexible nor a fair policy. The round-robin scheme is a
fair solution with an easy implementation. Finally, out-of-order packet service incurs
significant overhead for handling the packet tags on packet arrivals and requires
specialized hardware data structures, such as sorted linked lists, to support out-of-order
service. By using out-of-order service we are able to provide differentiated service to
the different connections of the same priority level.
The leaky bucket is such a regulator. It maintains a pool of tokens - a token bucket -
in which fixed-size tokens accumulate. An incoming packet can be transmitted only if
the bucket has enough tokens; otherwise, the packet waits in a buffer until the bucket
has enough tokens for the length of the packet. Figure 2.6 illustrates the leaky bucket
operation: the regulator adds tokens to the bucket at the average rate (a), and on a
packet departure the leaky bucket removes the appropriate number of tokens. If we
consider that the incoming packets have a fixed size or are segmented into fixed-size
units, and that one token is removed from the bucket per packet departure, then the
size of the bucket corresponds to the burst size (bs). By replenishing tokens in the
bucket at the average rate (a) and permitting the departure of (bs) contiguous packets,
we control two of the three traffic parameters: the average rate and the burst size. In order to control
the peak rate, a second leaky bucket must be introduced. If the token replenishment
interval corresponds to the peak rate, and the token bucket size is set to one token,
then the second leaky bucket is a peak rate regulator. The second leaky bucket is
located before the first one, so that the traffic fed into the first bucket conforms to
the peak rate. This leaky bucket does not have a buffer; instead of dropping the
non-conformant packets, it marks them and transmits them to the next leaky bucket.
In case of buffer overflow, the marked packets are dropped. If the next leaky bucket
does not have a buffer to hold the non-conforming packets, it is called a policer; a
policer drops the non-conforming or marked packets. Figure 2.7 shows the two
leaky buckets. A leaky bucket can be implemented as a calendar queue, see section
2.5.5.
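To make the token mechanism concrete, the following is a minimal behavioral Verilog sketch of a single token-bucket regulator for fixed-size segments (one token per segment). It is only an illustration of the mechanism described above: the module name, signal names, and parameter values are our own assumptions, not taken from the thesis.

// Minimal single leaky (token) bucket regulator: one token per fixed-size segment.
module leaky_bucket_sketch
  #(parameter BUCKET_SZ = 64,      // burst size (bs), in tokens
    parameter RATE_DIV  = 100)     // one token is added every RATE_DIV cycles (average rate a)
   (input  wire clk,
    input  wire rst,
    input  wire seg_waiting,       // a segment is waiting in the per-flow buffer
    output wire seg_depart);       // asserted when a waiting segment may depart

  reg [15:0] tokens;
  reg [15:0] divider;

  wire add_token = (divider == RATE_DIV - 1);
  assign seg_depart = seg_waiting && (tokens != 0);

  always @(posedge clk) begin
    if (rst) begin
      tokens  <= 0;
      divider <= 0;
    end else begin
      divider <= add_token ? 16'd0 : divider + 16'd1;
      case ({add_token, seg_depart})
        2'b10:   tokens <= (tokens == BUCKET_SZ) ? tokens : tokens + 1;  // replenish, saturate at bs
        2'b01:   tokens <= tokens - 1;                                   // consume one token on departure
        2'b11:   tokens <= (tokens == BUCKET_SZ) ? tokens - 1 : tokens;  // replenish and consume together
        default: tokens <= tokens;
      endcase
    end
  end
endmodule

A peak-rate regulator would be the same structure with the bucket size set to one token and the replenishment interval set according to the peak rate.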
The first stage distinguishes the guaranteed-service and best effort flows. It also
aggregates the flows into sets in order to minimize the control state information that
the scheduler must keep and advertise to the remaining network. For example, flows
for the same output and of the same priority may belong to the same set. In a scheme
where there are 64 outputs and four priority levels, the 64k flows may be aggregated
into 256 sets of flows. In the case of guaranteed service flows, the rate-controlled
scheduling discipline performs well. As mentioned above, a rate-controlled
scheduler consists of a regulator and a scheduler. More precisely, there must be a
leaky bucket per flow, which implies that a few thousand leaky buckets must be
supported. After the shaping of incoming traffic, an earliest-due-date scheduler
assigns a time stamp to the conforming packets. These time stamps, then, are
inserted in a priority queue (heap). The heap prepares an eligible packet to be
forwarded to the second stage of scheduling. On the other hand, a weighted round-
robin scheduler must service the best effort set of flows.
The second stage of scheduling has to serve the aggregated flows of the previous
stage. The sets of guaranteed flows have higher priority than the sets of best effort
flows. A weighted round robin scheduler is quite simple and works efficiently in that
stage of scheduling. The main goal of this stage is fairness. This stage also receives
flow control information from the switching fabric, which reports the state of
the output link traffic and any congestion. This stage stalls the service of the
congested sets of flows. Finally, weighted round-robin scheduling with priorities
would perform well if applied to this second stage of scheduling.
¹ By placing the queue management pointers in the buffer memory, as proposed in section 3.6.1, the unit size is reduced to 60 bytes.
We define a buffer as the memory space required to store one packet segment in the
buffer memory (i.e., 64 bytes). The queue manager handles buffers as units
when performing queue operations. For this purpose one pointer, the next pointer, is
associated with each buffer, and the queue manager executes instructions that only use
such pointers as arguments. Note here that the memory contains two types of
buffers: free and occupied buffers. The free buffers do not store any segment, while
the occupied buffers store segments. The queue manager organizes the free buffers
in a single linked list, called the Free List, by linking their associated pointers, as shown
in figure 3.1. The occupied buffers are organized in queues by linking their
associated pointers in linked lists (see figure 3.1). Occupied buffers that store
segments of the same flow are organized in the same queue. Each pointer in a list
indicates the address of the next buffer. Apart from the next pointers, the queue
manager needs two additional pointers: one pointing to the head and one pointing to
the tail of each list. The head and tail pairs of all the system queues are maintained
in the Queue table, while the head/tail pair of the Free List is maintained in the Free
List head/tail register². Additionally, it is necessary to store one bit per list, the
Empty bit, to indicate whether the corresponding list is empty or not.
² In order to handle free buffers more efficiently, we organize them in per-memory-bank lists, as proposed in section 3.6.5.
fragmentation loss = 1 - packet_size / ( ⌈ packet_size / segment_size ⌉ × segment_size )
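As a worked example (ours): a 65-byte packet segmented into 64-byte units occupies ⌈65/64⌉ = 2 segments, i.e. 128 bytes of buffer space, so its fragmentation loss is 1 - 65/128 ≈ 0.49, whereas a packet whose size is an exact multiple of the segment size suffers no loss.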
is to store the segment to the extracted free buffer and to link this buffer to the
queue. By writing the free buffer's pointer to the next pointer field of the Q1 tail
buffer, the free buffer is linked to the queue. Finally, the head of the free list and the
tail of the queue must be updated. Figure 3.2 (right) shows the state of queue
manager data structures after the enqueue.
Similar to the enqueue operation, we describe the dequeue operation by using the
example in figure 3.3. Consider a memory of eight buffers; four of them are
occupied and belong to queue Q1, and the remainder belong to the free list (figure 3.3
left). When the scheduler decides to forward a segment from Q1, the queue manager
must perform a dequeue operation. The first step is to read the pointer to the buffer
at the head of Q1 (head pointer=0). The next step is to retrieve the buffered segment
body from the memory and to link the corresponding buffer to the free list. Writing the
buffer’s pointer to the next pointer field of the free list tail performs this linking.
Finally, the tail pointer of the free list and the head pointer of Q1 must be updated.
Figure 3.3 (right) shows the state of the data structures after the dequeue.
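To make the pointer manipulation of figures 3.2 and 3.3 concrete, here is a minimal behavioral Verilog sketch of this baseline scheme (before the modifications of section 3.6), for a toy memory of eight buffers and a single queue Q1; the module, task, and signal names are our own illustrative choices.

// Baseline enqueue/dequeue pointer manipulation for a toy memory of eight buffers.
module queue_ops_sketch;
  reg [2:0] next_ptr [0:7];       // next pointer associated with each buffer
  reg [2:0] q_head, q_tail;       // head and tail of queue Q1
  reg       q_empty;              // Empty bit of Q1
  reg [2:0] fl_head, fl_tail;     // head and tail of the Free List

  // Enqueue one segment to Q1: extract the buffer at the free list head,
  // link it behind the current tail, then update the free list head and the queue tail.
  task enqueue;
    reg [2:0] new_buf;
    begin
      new_buf = fl_head;
      fl_head = next_ptr[fl_head];          // the free list loses its head buffer
      if (q_empty) begin
        q_head  = new_buf;                  // first buffer of a previously empty queue
        q_empty = 1'b0;
      end else begin
        next_ptr[q_tail] = new_buf;         // link the new buffer behind the old tail
      end
      q_tail = new_buf;
      // (the segment body itself is written into buffer new_buf in the DRAM)
    end
  endtask

  // Dequeue one segment from Q1: unlink the head buffer and append it to the
  // free list (the empty-queue corner case is omitted for brevity).
  task dequeue;
    reg [2:0] old_buf;
    begin
      old_buf = q_head;
      q_head  = next_ptr[old_buf];          // the queue loses its head buffer
      next_ptr[fl_tail] = old_buf;          // the departing buffer joins the free list tail
      fl_tail = old_buf;
      // (the segment body is read from buffer old_buf and forwarded)
    end
  endtask
endmodule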
system, we assumed DRAM rather than SRAM technology [25]. Among DRAM
technologies, we chose Rambus [26] over DDR or SDRAM, because Rambus offers
higher throughput while using fewer pins.
The queue manager must operate at least twice as fast as the link rate, because it must
simultaneously buffer the incoming packets in the buffer memory and retrieve the
buffered packets from the memory in order to forward them toward the switching
fabric. Assuming the OC-192 physical interface, the queue manager must
handle 12 Gbps of input traffic and 12 Gbps of output traffic, due to fragmentation loss. In
order to support this 24 Gbps throughput, two RIMM memory modules are required.
The total throughput of 25.6 Gbps provided by the two RIMM modules allows a
⁵ During a read operation, a memory device loads the data results onto the Rambus channel after a constant delay from the corresponding read command insertion.
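For reference, one way to arrive at these throughput figures (our arithmetic, assuming the nominal 1.6 Gbyte/s, i.e. 12.8 Gbps, peak transfer rate of a single Rambus channel): the required throughput is roughly 12 Gbps of writes plus 12 Gbps of reads, i.e. 24 Gbps, while two RIMM modules provide 2 x 12.8 Gbps = 25.6 Gbps.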
transmitting only a small fraction of the incoming traffic (only some fields of each
packet header) to the header-processing chip, which requires the header processor to
operate at much lower rate than the input link rate. The second is achieved by using
out-of-order memory access techniques.
The datapath consists of three parts as shown in figure 3.6. In the first part, the
incoming traffic is temporarily stored in the “transit buffer”. In the second part,
packets are moved from the transit buffer to the buffer memory. In the third part, the
traffic is moved from the buffer memory to the switching fabric. In the first part, the
incoming packets are kept in the transit buffer until the header processor classifies
them into the proper flow. An incoming packet also has to wait in the transit buffer
until the queue manager identifies a free buffer and the memory is available for
access. In conclusion, by keeping the incoming packets in the transit buffer, we
avoid moving packets from processing stage to processing stage, and thus avoid
additional memory throughput and power consumption.
⁶ The 40 ns time slot is the time interval for reading or writing a 64-byte segment from/to the Rambus buffer memory.
⁷ In the case of an enqueue operation, the common resource is the Pending Write Table; in the case of a dequeue operation, the common resource is the Pending Read Table.
incoming traffic to the transit buffer, while the second retrieves the buffered packets
and loads them to the buffer memory. Hence, both stages can implement two
independent, pipelined processes. Note that the second process is embedded in the
second enqueue operation process.
In order to interface the queue management operation processes to the buffer
memory (Rambus) controller, we introduce an additional process. This process
inserts write and read transactions to the memory controller and receives the
memory data responses in order to forward retrieved packets to the switching fabric.
In conclusion, the pipelined queue management architecture consists of six
parallel and fully pipelined control processes. Three of them are dedicated to
manipulate the incoming traffic (enqueue operation). The first, which we call
“packet entry” process, buffers the incoming packets to the transit buffer. The
second, which we call “enqueue issuing” process, issues enqueue operations and
keeps their arguments and the corresponding write transactions in the pending write
table. The third, which we call “enqueue execution” process, selects an eligible write
transaction from the PWT, and executes a pending enqueue operation by transferring
(writing) the buffered packet body from the transit buffer to the memory.
Additionally, two processes are dedicated to manipulate the outgoing traffic
(dequeue operation). The first, which we call “dequeue issuing” process, issues
dequeue operations and keeps their arguments and the corresponding read
transaction in the pending read table. The second, which we call “dequeue
execution” process, selects an eligible read transaction from the PRT, and executes a
pending dequeue operation by retrieving (reading) the buffered packet body from
the memory and forwarding it to the switching fabric. Finally, we call the process
that interfaces the queue management processes to the memory controller the
“queue management interface” process. All the queue management processes will be
described more thoroughly in the subsequent sections. In order to simplify our
description, we consider fixed-size (i.e. 64-byte⁷) packets.
⁷ We consider 64-byte packets in order to avoid segmentation; because the segment size is 64 bytes, the packets and the segments are identical quantities of traffic.
⁸ The write command is sent to the memory controller via the queue management interface process.
⁹ The read command is sent to the memory controller via the queue management interface process.
3.4.1 Successive Enqueue and Dequeue Operations for the same flow
A major effect of pipelining is to change the relative timing of queue manager
operations by overlapping their execution. This introduces data hazards. Data
hazards occur when the pipeline changes the order of read/write accesses to the
queue manager data structures so that the order differs from the order seen by
sequentially executing operations on an unpipelined architecture. More precisely, an
enqueue operation reads the queue tail address during its issuing phase, and later, it
updates the new queue tail address during its execution phase. Since the time
interval between the operation issuing and execution may last more than one time
slot, the queue tail address may be non-updated (pending) during this interval. When a
newly issued enqueue operation finds the queue tail address pending, a data
dependence occurs. Similar to the enqueue operation, successive dequeue operations
of the same flow may be dependent because the latter operation tries to read the
queue head address while the former has not updated it yet. These data dependencies
introduce stalls in the pipeline and no further operations can be issued until the data
dependencies are removed. This condition decreases the pipeline’s performance and
must be eliminated.
pending lists. Each operation in the list forwards the required resource values to the
next pending operation in the list. The pending lists for the enqueue operations are
kept in the Pending Write Table, while the pending lists for the dequeue operations
are kept in the Pending Read Table. There are so many pending lists in the operation
tables as the number of active flows in the system. Additionally, the pending
enqueue operations are categorized into two types: those initiated by the header
processor and those initiated by the queue manager (see section 3.4.2). Recall
that the header processor initiates enqueue operations per packet¹⁰, while the queue
manager initiates enqueue operations per segment¹¹. Due to the existence of two
types of pending enqueue operations, we keep two pending lists per active flow: a
per-packet pending list and a per-segment pending list, as figure 3.12 shows.
We explain how we can construct the enqueue pending lists in the PWT by using a
simple example. Figure 3.13a shows an incoming packet consisting of four
segments, waiting for the packet entry process to assign a transit_id to each segment
and to store them in the transit buffer. The entry process organizes the packet segments
in a pending list as figure 3.13b shows. We use two pointers in order to keep the
pending lists in the PWT: a pointer to the next segment, the “next segment pointer”,
and a pointer to the last segment, the “last segment pointer”. The next segment
pointer indicates the transit_id of the next packet segment (or alternatively the
transit_id of the corresponding pending enqueue operation in the PWT). The last
segment pointer indicates the transit_id of the last packet segment. In conclusion, the
packet entry process organizes a segment list per packet arrival.
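A behavioral sketch of how the packet entry process could build such a per-packet segment list follows; it is our own illustration of the linking step, with table sizes, field widths, and names chosen only for the example.

// Linking the segments of one arriving packet into a per-packet pending list.
module pending_list_sketch;
  parameter TB_ENTRIES = 16;                  // transit buffer / PWT entries (toy size)
  reg [3:0] next_seg [0:TB_ENTRIES-1];        // next segment pointer per PWT entry
  reg       ns_flag  [0:TB_ENTRIES-1];        // NS flag: next segment pointer is valid
  reg [3:0] first_seg;                        // transit_id of the packet's first segment
  reg [3:0] last_seg;                         // last segment pointer of the list
  reg       list_empty;

  // Called once per segment, with the transit_id just assigned to it.
  task link_segment(input [3:0] tid);
    begin
      ns_flag[tid] = 1'b0;                    // the new tail has no successor yet
      if (list_empty) begin
        first_seg  = tid;                     // first segment of the packet
        list_empty = 1'b0;
      end else begin
        next_seg[last_seg] = tid;             // the previous last segment now points to tid
        ns_flag[last_seg]  = 1'b1;
      end
      last_seg = tid;                         // tid becomes the last segment
    end
  endtask
endmodule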
¹⁰ The header processor initiates the enqueue operation for the first segment of each packet.
¹¹ The queue manager initiates enqueue operations for the remaining segments of each packet.
operation, which has acquired the Q1 tail pointer. Then, the newly issued enqueue
operation is linked to the corresponding pending list in the PWT, as figure 3.14b
shows. Additionally, the newly issued enqueue operation leaves its transit_id in the
Q1 tail pointer field (in the Queue table) in order to indicate to a successive enqueue
operation of the same flow that it was the last to access this field.
figure 3.15. Since the scheduler of segment departures initiates dequeue
operations per segment, the pending list requires only a pointer that indicates the
transit_id of the next pending dequeue operation. Figure 3.16 shows how a newly
issued pending dequeue operation is linked: figure 3.16a shows the state of the
pending list before the linking of the new dequeue operation, while figure 3.16b
shows the state after the linking.
More precisely, in the case of an enqueue operation that uses the buffer preallocation
technique, two memory accesses must be performed to the buffer memory: the
writing of the segment body to the reserved buffer of the target queue, and the
reading of the next pointer field of the newly extracted buffer in order to obtain
the address of the next free buffer in the free list. Another issue that arises now is
that the two buffer memory accesses must be directed to different RIMM modules in
order to avoid memory module conflicts. This implies that the newly extracted free
buffer and the reserved buffer must belong to different memory modules. This issue
is addressed in section 3.6.5. In the case of a dequeue operation, two memory
accesses must be performed: the reading of the departing buffer body and the
linking¹² of this buffer to the free list. Figure 3.19 shows the required buffer
memory accesses when an enqueue operation for flow Q2 and a dequeue
operation for flow Q1 take place.
As mentioned in section 3.4.1, the queue manager must perform an enqueue and a
dequeue operation per time slot. By locating the next pointers in the buffer memory,
four memory accesses must be performed to the buffer memory per time slot. This
implies that the required memory throughput is at least twice the provided throughput.
The free list bypassing technique can reduce this memory throughput requirement. In
this technique, rather than dequeueing a departing buffer from an output queue and
enqueueing that buffer into the free list, and rather than extracting a buffer from the
free list and enqueueing it into another queue upon arrival, we combine the two
operations: the buffer into which an arriving segment is placed, is precisely the
buffer from which a segment is departing during the same time slot. Therefore, there
is no free list operation, which implies that the required memory throughput equals
the provided throughput. Figure 3.20 shows the reduction in the number of
memory accesses achieved by the free list bypassing technique.
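The bypass decision itself is simple; the small combinational sketch below (ours) illustrates it, with illustrative signal names and the 22-bit buffer address width assumed elsewhere in this chapter.

// Free list bypassing: when an enqueue and a dequeue are executed in the same
// time slot, the buffer freed by the dequeue is reused for the arriving segment,
// so the free list is neither read nor written.
module fl_bypass_sketch
   (input  wire        enq_valid, deq_valid,
    input  wire [21:0] deq_buf_addr,    // buffer being freed by this slot's dequeue
    input  wire [21:0] fl_head,         // buffer at the head of the free list
    output reg  [21:0] wr_buf_addr,     // buffer the arriving segment is written to
    output reg         fl_pop,          // extract a buffer from the free list
    output reg         fl_push);        // return the departing buffer to the free list
  always @* begin
    wr_buf_addr = fl_head;
    fl_pop      = 1'b0;
    fl_push     = 1'b0;
    if (enq_valid && deq_valid)
      wr_buf_addr = deq_buf_addr;       // bypass: no free list transaction at all
    else if (enq_valid)
      fl_pop  = 1'b1;                   // arrival only: take a buffer from the free list
    else if (deq_valid)
      fl_push = 1'b1;                   // departure only: give the buffer back
  end
endmodule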
because the free buffer at the head of the free list may belong to the same RIMM
module as the reserved buffer of the targeted queue. In order to ensure that the
extracted free buffer and the targeted queue reserved buffer belong to different
RIMM modules, we organize the free list in a per-bank queueing scheme. This
organization also ensures that the extracted buffer does not belong to a busy bank.
The only expense of the above organization is the requirement for keeping the head
and tail pointers of the 512¹³ different bank-queues in the Free List table.
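As an illustration of the selection that this organization enables, the sketch below (ours) picks a free-list bank from the RIMM module opposite to the reserved buffer's module, skipping busy banks and banks with empty free lists; the mapping of banks 0-255 to module 0 and 256-511 to module 1, as well as the linear scan, are our own simplifications - a hardware implementation would use a priority encoder instead of a loop.

// Pick a free-list bank in the RIMM module opposite to the reserved buffer.
module fl_bank_select
  #(parameter BANKS = 512)                    // 256 banks per RIMM module, two modules
   (input  wire [8:0]       reserved_bank,    // bank of the target queue's reserved buffer
    input  wire [BANKS-1:0] bank_busy,        // a transaction is in flight to this bank
    input  wire [BANKS-1:0] fl_empty,         // this bank's free list is empty
    output reg  [8:0]       chosen_bank,
    output reg              found);

  // Assumed mapping: banks 0..255 belong to module 0, banks 256..511 to module 1.
  wire other_module = (reserved_bank < BANKS/2);   // 1 => pick from the upper half

  integer i;
  always @* begin
    found       = 1'b0;
    chosen_bank = 9'd0;
    for (i = 0; i < BANKS; i = i + 1)
      if (!found && ((i >= BANKS/2) == other_module)
          && !bank_busy[i] && !fl_empty[i]) begin
        chosen_bank = i;                           // first suitable bank found
        found       = 1'b1;
      end
  end
endmodule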
However, the per-bank queueing organization complicates the free list update for a
dequeue operation. During a dequeue operation the departing buffer must be enqueued
into the proper free list bank-queue. The issue is that this bank-queue is located in
the same memory module as the departing buffer, which means that during a dequeue
operation both memory accesses (read and write) would have to be performed on the
same memory module. Both accesses cannot be performed simultaneously; thus, one of
them must be delayed. Since reading the departing buffer has higher priority than the
free list update, the free list update is deferred.
13 Both RIMM modules contain 512 banks.
14 An isolated buffer is a buffer that is not linked to the free list.
15 An enqueue and a dequeue operation per time slot.
last operation that accessed this field (operand renaming) and is linked to the
corresponding pending list in the PRT.
Similarly to the enqueue operation, a search engine traces the pending dequeue/read
operations in the PRT and selects a dequeue operation that will not cause a memory
bank conflict. Then, the queue manager sends a read transaction to the memory
controller. This read transaction retrieves the segment body of the departing buffer
and the next pointer value, which is kept in the departing buffer and indicates the
next buffer of this queue. As soon as the buffer memory responds with the data of the
departing buffer, the segment body is forwarded to the switching fabric, while the
next pointer value updates the head pointer field of the corresponding queue in the
Queue table. Additionally, the address of the departing buffer is stored in the free
buffer cache.
The queue management architecture described above achieves high operation rates (an
enqueue and a dequeue operation per time slot, where a time slot is 40 ns) by using
advanced pipelining and by applying dynamic scheduling techniques originating in the
supercomputers of the 60's. The detailed micro-architecture of the queue management
block is presented in chapter 4.
Similar to the Head table, the Tail table consists of the tail field and its state flag: the
Pending Tail (PT) flag. The Tail table does not need an Almost Ready flag because
the update of the tail table entries (queue tail pointers) is performed during the
enqueue execution phase. The queue tail update does not require a buffer memory
access, because the new tail pointer is the pointer of the extracted free buffer from
the free list and is already available. The system supports 64K flows; thus the Head
and Tail tables each maintain 64K entries. The Head table entry width is 24 bits: 1
bit for the PH flag, 1 bit for the AR flag and 22 bits for the head pointer1.
Respectively, the Tail table entry width is 23 bits: 1 bit for the PT flag and 22 bits
for the tail pointer. Additionally, an empty (E) flag is required for each entry in
the Queue table in order to indicate whether a queue is empty or not.
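For illustration only, the entry formats described above could be modeled with C bitfields as follows; the actual design keeps these fields as raw bit vectors in on-chip memories, and attaching the E flag to the tail entry is just an assumption of this sketch.

#include <stdint.h>

#define NUM_FLOWS (64 * 1024)         /* 64K flows -> 64K entries per table */

typedef struct {
    uint32_t head_ptr : 22;           /* head buffer address (2^22 buffers), or a
                                         transit_id when the head is pending      */
    uint32_t ar       : 1;            /* Almost Ready flag                        */
    uint32_t ph       : 1;            /* Pending Head flag                        */
} head_entry_t;                       /* 24 bits of payload                       */

typedef struct {
    uint32_t tail_ptr : 22;           /* tail buffer address, or a transit_id when
                                         the tail is pending                      */
    uint32_t pt       : 1;            /* Pending Tail flag                        */
    uint32_t e        : 1;            /* per-queue Empty flag (placement here is
                                         illustrative)                            */
} tail_entry_t;                       /* 23 bits plus the E flag                  */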
The operand table contains three fields: the write address, the flow identifier and the
Free List Update (FLU) flag. The write address field keeps the address of the buffer
to which a waiting segment will be written. The flow_id field keeps the identifier of
the queue into which the waiting segment will be enqueued. The FLU flag indicates
whether the pending write operation originated from an enqueue operation or from a
free list update operation. The next segment table has two fields: the next segment
field and the next segment (NS) flag. The next segment field keeps the transit_id of
the next segment in the pending list, while the NS flag indicates whether the next
segment field holds a valid value. The last segment table also has two fields: the
last segment field and the last segment (LS) flag. The last segment field keeps the
transit_id of the last segment in the pending list, while the LS flag indicates
whether this PWT entry holds the enqueue operation that was the last to access the
queue tail field in the Tail table.
1 Recall that the two RIMM modules together contain 2^22 (about 4 million) 64-byte buffers.
We define the last enqueue operation that has accessed the queue tail pointer as the
"server" of the corresponding pending list in the PWT. The name "server" indicates
that this enqueue operation will assist a successive enqueue operation of the same
flow in being linked into the pending list. Note that there is only one server for
each pending list. If a pending enqueue operation is the "server" of the pending list,
it has to update the Tail table when it is executed. As mentioned above, only the last
operation in a pending list updates the tail pointer field in the Queue table; the
intermediate pending operations update only the next operation in the pending list.
The Pending Write Table contains 128 entries. The width of the state table is 3 bits,
the width of the operand table is 39 bits (22 for the write address, 16 for the
flow_id and 1 for the FLU flag), the width of the next segment table is 8 bits (7 for
the transit_id and 1 for the NS flag), and the width of the last segment table is also
8 bits (7 for the transit_id and 1 for the LS flag).
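Again for illustration (the design splits these fields into four separate on-chip tables), one PWT entry could be modeled as the following C struct; the field names are taken from the text, the packing is an assumption.

#include <stdint.h>

#define PWT_ENTRIES 128               /* hence a transit_id is 7 bits wide */

typedef struct {
    /* state table: 3 bits */
    uint32_t valid      : 1;
    uint32_t ready      : 1;
    uint32_t busy       : 1;
    /* operand table: 39 bits */
    uint32_t write_addr : 22;         /* buffer to which the segment is written */
    uint32_t flow_id    : 16;         /* target queue identifier                */
    uint32_t flu        : 1;          /* 1 = free list update, 0 = enqueue      */
    /* next segment table: 8 bits */
    uint32_t next_seg   : 7;          /* transit_id of the next pending op      */
    uint32_t ns         : 1;          /* next segment field is valid            */
    /* last segment table: 8 bits */
    uint32_t last_seg   : 7;          /* transit_id of the last op in the list  */
    uint32_t ls         : 1;          /* this entry is the list "server"        */
} pwt_entry_t;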
The Transit Buffer consists of 128 entries, corresponding to the 128 entries of the
PWT. The size of each entry is 64 bytes (it stores a 64-byte segment). The 128 entries
are kept in a 16-byte-wide on-chip memory. This memory width is selected in order to
achieve a 16-byte memory access granularity, which yields write and read transaction
rates of 12.8 Gbps (16 bytes per 10 ns clock cycle).
address field keeps the address of the departing buffer, while the flow_id field
contains the identifier of the serviced flow. The next segment table has an identical
format to its counterpart in the PWT, but it refers to the pending list of dequeue
operations. The Pending Read Table maintains 128 entries. This length is independent
of the PWT length, but it is a reasonable size for keeping the state of a significant
number of pending operations. The width of the state table is 3 bits, the width of
the operand table is 39 bits (22 bits for the read address and 16 bits for the
flow_id), and the width of the next segment table is 8 bits (7 for the transit_id and
1 for the NS flag).
generation until its completion. Because there are two memory RIMM modules, two memory
accesses may take place in each time slot, which implies that two control buffers may
be active simultaneously.
Each control buffer consists of seven fields: the transaction type (R/W), the Valid
flag, the source of the transaction (the table and the transit_id of the table entry
that keeps the transaction), the update destination (the table and the identifier of
the table entry), and the transaction address. Note that the destination update fields
are valid only in the case of a read transaction. The control buffer size is 51 bits:
1 bit for the Valid flag, 1 bit for the operation type, 2 bits for the destination
table, 16 bits for the destination table entry (for the worst case of updating the
Queue table), 2 bits for the source table, 7 bits for the source table entry (PWT or
PRT), and 22 bits for the operation address. The format of the control buffer is
illustrated in figure 4.5.
In the case of a write transaction, we move the segment body from the Transit Buffer
to a local buffer near the memory controller. The memory controller splits the writing
of the segment body into four phases because it pipelines the memory transactions. By
keeping the segment body in a local buffer near the memory controller, we ensure that
the segment body remains available to the memory controller. We call this local buffer
the "Data Buffer"; its size equals the segment body size (64 bytes).
Note: the transactions that take place in the first, second, third, or fourth cycle
are represented in the figures of this chapter with black, red, blue, and green
arrows, respectively.
segments of a packet into a single linked list. The information for these lists is
kept in the next and last segment fields of the pending list table, which is part of
the PWT. The second mode initiates and issues a free list update operation by
allocating an entry in the PWT and obtaining the proper operands for this task.
The first mode is illustrated in figure 4.6. The inputs of this process are the
transit_id of a free entry in the Transit Buffer and an incoming segment body. During
the first cycle, the first 16-byte part of the segment body is stored in the Transit
Buffer entry indexed by the transit_id. The PWT entry indexed by the transit_id is
also allocated by setting its valid flag to 1. If the incoming segment is the first
segment of a packet, it writes its transit_id into the head and middle registers. The
head register stores the transit_id of the first segment of a packet, while the middle
register stores the transit_id of each intermediate segment as it arrives. The
information in the head and middle registers is used by the subsequent segments of the
same packet in order to be linked into the pending list, see section 3.5.2. If the
incoming segment is an intermediate segment of a packet, it obtains the transit_id of
the previous segment from the middle register and is then linked into the pending list
by writing its own transit_id to the next segment field of the previous segment. If
the incoming segment is the last segment of a packet, it obtains the transit_id of the
previous segment and the transit_id of the first segment from the middle and head
registers, respectively. It is then linked into the list by updating the next segment
field of the previous segment, and it also updates the last segment field of the first
segment. In this way we organize the incoming segments into pending lists as they
arrive.
During the second cycle, the second 16-byte part of the segment body is stored in the
Transit Buffer entry indexed by the transit_id augmented by one. Additionally, a
search engine searches for the next free entry in the PWT and holds its transit_id.
During the third and fourth cycles, the third and fourth 16-byte parts of the segment
body are stored in the Transit Buffer entries indexed by the current transit_id
augmented by two and by three, respectively.
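The first-cycle linking described above can be sketched in C as follows; this is a behavioral illustration only (not the figure 4.6 datapath), and updating the middle register on every intermediate segment is our reading of the text.

#include <stdint.h>

/* Behavioral sketch of first-cycle packet linking: incoming segments of the
 * same packet are chained into a pending list via the head/middle registers. */
enum seg_type { SEG_FIRST, SEG_MIDDLE, SEG_LAST };

typedef struct { uint8_t next_seg, ns, last_seg, ls, valid; } pwt_link_t;

static pwt_link_t pwt[128];
static uint8_t head_reg, middle_reg;   /* transit_ids of first / previous segment */

void link_segment(uint8_t tid, enum seg_type type)
{
    pwt[tid].valid = 1;
    switch (type) {
    case SEG_FIRST:                     /* remember the packet head              */
        head_reg = middle_reg = tid;
        break;
    case SEG_MIDDLE:                    /* chain behind the previous segment     */
        pwt[middle_reg].next_seg = tid;
        pwt[middle_reg].ns       = 1;
        middle_reg = tid;
        break;
    case SEG_LAST:                      /* chain, and record the packet tail in  */
        pwt[middle_reg].next_seg = tid; /* the first segment's last segment field */
        pwt[middle_reg].ns       = 1;
        pwt[head_reg].last_seg   = tid;
        pwt[head_reg].ls         = 1;
        break;
    }
}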
The second mode is illustrated in figure 4.7. The inputs of the packet fetching
process for this mode are the transit_id of the current free entry in the Transit
Buffer and the address of a free buffer from the Free List Cache (FIFO). During the
first cycle, the address of the free buffer is obtained and kept in a register. During
the second cycle, the address of the tail buffer of the free list is acquired by
accessing the Free List Table. As in the first mode, in this cycle a search engine
locates the next free entry in the PWT. During the third cycle, the operands of the
free list update task are written to the proper fields of the PWT. More precisely, the
fields of the PWT entry indexed by the current transit_id are updated as follows: the
Valid, Busy and Ready flags are set to 1; the address of the tail free buffer is
written to the write address field; the flow_id field is not updated because it is not
used; the next and last segment fields are set to zero in order to indicate that this
entry does not belong to a pending list; and the FLU flag is set to 1 in order to
indicate that this operation is a free list update. Additionally, the address of the
free buffer is written to the 32 most significant bits of the Transit Buffer entry
indexed by the current transit_id. Finally, the tail pointer in the Free List Table is
updated with the free buffer address. The fourth cycle is idle. The latency of the
packet fetching process equals one time slot, so its pipeline has only one stage. A
block diagram of this process is presented in Appendix B.
The datapath in the case that the queue tail state is not pending is presented in
figure 4.8. During the second cycle, the proper queue tail pointer is written to the
write address field of the PWT entry indexed by the input transit_id, and the Ready
flag is set to 1. The flow_id field is also updated with the input flow_id, while the
FLU flag is set to 0 to indicate that this entry keeps an enqueue operation. During
the third cycle, the transit_id of the newly issued enqueue operation is written to
the corresponding queue tail pointer field (operand renaming), while the PT flag is
set to 1 (indicating the pending state). This access informs a successive enqueue
operation of the same flow that the queue tail state is pending and that the tail
field keeps the transit_id of the last enqueue operation that accessed it. During the
fourth cycle, the LS flag in the last segment table is set to 1 in order to indicate
to the operation kept in this entry that it was the last to access the queue tail
pointer, and that it has to update the new queue tail pointer or forward this value to
the successive pending enqueue operation.
The datapath in the case that the queue tail state is pending is presented in figure
4.9. If the state of the tail pointer is pending, then the queue tail pointer field
keeps the transit_id of the last enqueue operation that accessed the queue tail. The
PWT entry that stores this enqueue operation keeps information about the pending list
to which it belongs. During the second cycle, we access the last segment field of this
entry in order to acquire the transit_id of the last entry in the pending list. Having
this information, we link the newly issued enqueue operation to the tail of the
corresponding pending list during the third cycle. The linking is performed by writing
the transit_id of the newly issued enqueue operation to the next segment field of the
last entry of the pending list. During the same (third) cycle, the transit_id of the
newly issued enqueue operation is written to the queue tail pointer field of the Queue
table, and the queue tail state remains pending. The "server" of this pending list is
now the newly issued enqueue operation. The Queue table holds this information, but we
also have to inform the enqueue operation itself that it is the server, by setting the
LS flag of the PWT entry that keeps the newly issued enqueue operation to 1 and by
resetting this flag of the previous server enqueue operation. The former task is
performed in the third cycle and the latter in the fourth cycle. The latency of this
process equals one time slot, so its pipeline has only one stage. A block diagram of
this process is presented in Appendix B.
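As a rough illustration of this issue logic (figures 4.8 and 4.9), the following C fragment models the two cases; it deliberately collapses the distinction between the "server" and the tail of the pending list, omits the Busy flag, the cycle-level timing and the Queue table empty flag, and all names are our own, so it is a simplified sketch rather than the actual datapath.

#include <stdint.h>

typedef struct { uint32_t tail; uint8_t pt; } tail_entry_t;          /* Tail table   */
typedef struct {
    uint8_t  valid, ready, flu, ns, ls;
    uint8_t  next_seg;
    uint16_t flow_id;
    uint32_t write_addr;
} pwt_entry_t;                                                       /* PWT (subset) */

static tail_entry_t tail_tbl[65536];
static pwt_entry_t  pwt[128];

void enqueue_issue(uint8_t tid, uint16_t fid)
{
    pwt[tid].valid   = 1;
    pwt[tid].flow_id = fid;
    pwt[tid].flu     = 0;

    if (!tail_tbl[fid].pt) {
        /* Tail pointer is valid: it is the preallocated buffer address,
         * so the operation is immediately ready for execution.            */
        pwt[tid].write_addr = tail_tbl[fid].tail;
        pwt[tid].ready      = 1;
    } else {
        /* Tail field holds the transit_id of the previous enqueue (operand
         * renaming): link behind it and demote it from "server".          */
        uint8_t prev = (uint8_t)tail_tbl[fid].tail;
        pwt[prev].next_seg = tid;
        pwt[prev].ns       = 1;
        pwt[prev].ls       = 0;
    }
    /* In both cases the Tail table is renamed to the new operation, which
     * becomes the server (LS = 1) of the flow's pending list.             */
    tail_tbl[fid].tail = tid;
    tail_tbl[fid].pt   = 1;
    pwt[tid].ls        = 1;
}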
scheduler", has four inputs: write_tr_id1, write_tr_id2, read_tr_id1, and
read_tr_id2; the dynamic scheduler chooses the transit_ids of a read and a write
transaction that are directed to different memory modules. At the end of the third
cycle the transit_id of the selected write transaction is available. During the fourth
cycle, we collect the remaining information related to the selected write transaction
from the PWT. More precisely, we learn whether this write operation belongs to a
pending list, whether it has successive operations in the list or is the last one, or,
alternatively, whether it is the "server" of the pending list. This information is
obtained by accessing the next and last segment fields of the corresponding entry. The
results of the access are kept in temporary registers. Simultaneously, the Busy flag
of this entry is set to 1 in order to indicate that this operation is in execution.
Finally, during the fourth cycle of the first stage, a free buffer is extracted; note
that this buffer must not belong to a busy bank.
The operation of extracting a new free buffer is presented in figure 4.11. There are
two sources for extracting a free buffer: the Free List and the Free Buffer Cache. The
Free List keeps buffers organized in queues, while the Free Buffer Cache keeps
isolated (unlinked) buffers. The choice of the free buffer source depends on the
dynamic scheduler results. If a write transaction (originating from an enqueue
operation) and a read transaction (originating from a dequeue operation) are selected
to be performed concurrently in the next time slot, the free buffer is extracted from
the Free Buffer Cache, due to the free list bypassing technique. Otherwise, the free
buffer is extracted from the Free List. In the case of extracting a buffer from the
Free List, this buffer must belong to a non-busy bank, because it will be accessed as
described in section 3.6.4. To satisfy this constraint, we activate a search engine
that finds a buffer belonging to a non-busy bank; this is an easy task because the
Free List is organized in a per-bank queueing scheme, so we can extract an eligible
buffer in O(1).
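The following C fragment sketches this extraction under the per-bank organization; the sequential loop stands in for the parallel search engine, and read_next is a placeholder (hypothetical callback) for the DRAM read of the buffer's next pointer.

#include <stdint.h>

#define NUM_BANKS 512                    /* 2 RIMM modules x 16 devices x 16 banks */

typedef struct {
    uint32_t head, tail;                 /* 22-bit buffer addresses                */
    int      empty;                      /* 1 if this bank-queue holds no buffer   */
} bank_queue_t;

static bank_queue_t free_list[NUM_BANKS];
static uint8_t      bank_busy[NUM_BANKS];   /* marked busy by the dynamic scheduler */

/* Extract a free buffer from any non-empty, non-busy bank-queue.
 * read_next models the DRAM read of the buffer's next pointer (the second
 * memory access of the enqueue). Returns -1 if no bank is eligible.        */
long extract_free_buffer(uint32_t (*read_next)(uint32_t buf))
{
    for (int b = 0; b < NUM_BANKS; b++) {        /* a parallel search engine     */
        if (free_list[b].empty || bank_busy[b])  /* does this selection in O(1)  */
            continue;
        uint32_t buf = free_list[b].head;
        if (buf == free_list[b].tail)
            free_list[b].empty = 1;              /* bank-queue is now empty      */
        else
            free_list[b].head = read_next(buf);  /* advance to the next buffer   */
        return (long)buf;
    }
    return -1;
}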
Between the first and the second stage we use a buffer to isolate the two stages; we
call this buffer the "pipeline buffer". The pipeline buffer consists of 10 fields: the
transit_id (7 bits), the write address (22 bits), the flow_id (16 bits), the FLU flag
(1 bit), the next segment (7 bits), the NS flag (1 bit), the last segment (7 bits),
the pending list server flag (1 bit), the free buffer address (22 bits), and the free
buffer source (1 bit). The pipeline buffer contains all the information needed by the
second stage, which executes the selected write transaction.
transaction is written to the corresponding control buffers during the first cycle of
this pipeline stage, as figure 4.12 shows. The information kept in the control buffers
of the write and the read transaction is presented in tables 4.1 and 4.2,
respectively. Additionally, the first 16-byte part of the segment body is moved from
the Transit Buffer to the data buffer during the first cycle.
Table 4.1  Control buffer for a write transaction
  Valid        1
  R/W          W
  oper. addr.  Write address (from the PWT entry indexed by the transit_id)

Table 4.2  Control buffer for a read transaction
  Valid        1
  R/W          R
  oper. addr.  Free buffer addr[21:0]
During the second, third, and fourth cycles, the second, third, and fourth 16-byte
parts of the segment body are moved from the Transit Buffer to the data buffer,
respectively (write transaction). At the end of this stage, the PWT entry that keeps
the executed write transaction is released.
In the case that the write transaction originates from a free list update operation,
the accompanying control information of this transaction is loaded into the
corresponding control buffer during the first cycle. The information that the control
buffer keeps for this write transaction is presented in table 4.3. This write
transaction writes the address of an unlinked free buffer to the next pointer field of
the free list tail buffer. The address of the unlinked free buffer is kept in the 32
most significant bits of the corresponding Transit Buffer entry. The data of this
Transit Buffer entry are loaded into the data buffer during the four cycles of the
second stage.
Table 4.3  Control buffer for a free list update (write transaction)
  Valid        1
  R/W          W
  Dst. Tbl.    idle
  Dst. Entry   idle
  Src. Tbl.    PWT
  Src. Entry   current transit_id
  oper. addr.  Write address (from the PWT entry indexed by the transit_id)
The datapath of this process is illustrated in figure 4.13. The first cycle of this
stage is idle (resource access scheduling, see section 4.2.9). During the second
cycle, the flow_id indexes the Head table entry that keeps the queue head pointer and
its state fields (PH and AR flags). Simultaneously, a search engine looks for the next
free entry in the PRT. If the PH flag is 0, the acquired queue head pointer is up to
date (correct), so the read address and flow_id fields of the PRT entry of the newly
issued operation are written with the correct values. This operation is then ready for
execution in a subsequent time slot; thus, the Ready and Valid flags of this entry are
set to 1. Otherwise, if the PH flag is 1, we distinguish two cases: AR set to 0 or AR
set to 1. The former case means that the "server" of the corresponding pending list in
the PRT is located in the PRT entry indexed by the value of the acquired head pointer
(operand renaming). The server of a pending list in the PRT is the last dequeue
operation that accessed the queue head pointer field in the Head table. Using this
information, the newly issued dequeue operation is linked to the tail of this pending
list; this is performed by writing its transit_id to the next segment field of the
pending list's tail entry (in the PRT). The latter case means that the "server"
operation of the corresponding pending list is in execution: the "server" read
operation has been sent to the buffer memory, but the buffer memory has not yet
responded with the results and the queue head pointer is still pending.
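A much-simplified C model of this decision is sketched below; it collapses the "server" and the tail of the pending list into one entry, ignores the Busy flag and the cycle-level timing, and only indicates the AR = 1 case with a comment, since the actual update happens when the memory responds. All names are illustrative.

#include <stdint.h>

typedef struct { uint32_t head; uint8_t ph, ar; } head_entry_t;
typedef struct {
    uint8_t  valid, ready, ns;
    uint8_t  next_seg;
    uint16_t flow_id;
    uint32_t read_addr;
} prt_entry_t;

static head_entry_t head_tbl[65536];
static prt_entry_t  prt[128];

void dequeue_issue(uint8_t tid, uint16_t fid)
{
    prt[tid].valid   = 1;
    prt[tid].flow_id = fid;

    if (!head_tbl[fid].ph) {
        /* Head pointer is correct: the operation can execute in a later slot. */
        prt[tid].read_addr = head_tbl[fid].head;
        prt[tid].ready     = 1;
        head_tbl[fid].head = tid;       /* operand renaming: head now names    */
        head_tbl[fid].ph   = 1;         /* the last dequeue that accessed it   */
    } else if (!head_tbl[fid].ar) {
        /* Head is pending and names the list "server": link behind it.        */
        uint8_t server = (uint8_t)head_tbl[fid].head;
        prt[server].next_seg = tid;     /* simplified: server taken as tail    */
        prt[server].ns       = 1;
        head_tbl[fid].head   = tid;     /* this operation becomes the server   */
    } else {
        /* AR = 1: the server's read is already in flight; this dequeue is
         * completed when the memory responds (not modeled here).              */
    }
}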
structures.
We remind that control information related to the read operation (read address) and
information related to the corresponding dequeue operation (flow_id, transit_id of
the next operation in the pending list) is kept in local registers. The information that
is kept in the local registers is loaded to the pipeline buffer, at the pipeline clock
edge. The pipeline buffer has 6 fields: the read address, the flow_id, the next
segment of the pending list, the NS flag and the transit_id of the current read
operation.
Table 4.4  Control buffer
  Valid       1
  R/W         R
  Dst. Entry  Flow_id field (from the pipeline buffer)
  Src. Tbl.   PRT
  Src. Entry  transit_id field (from the pipeline buffer)
  read addr.  Read address field (from the pipeline buffer)

Table 4.5  Control buffer
  Valid       1
  R/W         R
  Dst. Entry  next segment field (from the pipeline buffer)
  Src. Tbl.   PRT
  Src. Entry  transit_id field (from the pipeline buffer)
  read addr.  Read address field (from the pipeline buffer)
commands to the memory controller. The second task is to receive the data response of
the buffer memory in the case of a read access. The third task is to forward the
received data of a dequeue operation to the output and to update the data structures
(Queue table, Free List table) and the operation table (PRT). This process is
pipelined and consists of five stages; each stage has a latency equal to one time
slot.
The first stage undertakes the insertion of the memory access operations into the
memory controller. An access operation consists of three elements: the command field,
the address field and the data field; a read operation has no data field. The command
can be either write or read. The address field consists of five parts: the module
address (1 bit), the device address (4 bits), the bank address (4 bits), the row
address (9 bits) and the column address (6 bits). Note that the Rambus memory
organization manipulates fixed-size 16-byte units. Each memory row contains 64 units
of 16 bytes; thus the column address has 6 bits (2^6 = 64). However, the queue manager
manipulates 64-byte units (segments), which implies that each memory access addresses
four consecutive memory units and the 2 least significant bits of the column address
are set to 00. In the case of a write operation, the data field consists of four data
buffers; each buffer size is 64 bytes.
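For illustration, the address decomposition could be written as follows; placing the module bit at the top, followed by device, bank, row and column, matches the statement elsewhere in this chapter that the 9 most significant bits of a buffer address identify the module, device and bank, while the exact row/column placement is an assumption of this sketch.

#include <stdint.h>
#include <stdio.h>

typedef struct {
    unsigned module;   /* 1 bit  */
    unsigned device;   /* 4 bits */
    unsigned bank;     /* 4 bits */
    unsigned row;      /* 9 bits */
    unsigned column;   /* 6 bits; two LSBs are 00 for 64-byte segments */
} rdram_addr_t;

/* Split a 22-bit buffer address into the Rambus address fields: a segment
 * spans four consecutive 16-byte units, so the unit address is buf_addr << 2. */
rdram_addr_t split_address(uint32_t buf_addr)
{
    uint32_t unit_addr = (buf_addr & 0x3FFFFF) << 2;   /* 24-bit unit address */
    rdram_addr_t a;
    a.column = unit_addr         & 0x3F;    /* bits  5..0  */
    a.row    = (unit_addr >> 6)  & 0x1FF;   /* bits 14..6  */
    a.bank   = (unit_addr >> 15) & 0xF;     /* bits 18..15 */
    a.device = (unit_addr >> 19) & 0xF;     /* bits 22..19 */
    a.module = (unit_addr >> 23) & 0x1;     /* bit  23     */
    return a;
}

int main(void)
{
    rdram_addr_t a = split_address(0x155555);           /* arbitrary example  */
    printf("mod %u dev %u bank %u row %u col %u\n",
           a.module, a.device, a.bank, a.row, a.column);
    return 0;
}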
The remaining stages of the interface process serve only the read operations. The only
requirement of a write transaction is to be inserted into the controller; the writing
of the memory core is then performed by three modules: the memory controller, the
Rambus memory core interface (RAC), and the memory core. A read transaction, on the
other hand, requires the queue management interface process to insert a read command
and to receive and handle the memory responses. The difficulty of the data-receiving
task is that the memory response becomes available within a window of one time slot
after the read transaction's insertion. The minimum delay of a memory response is
110 ns (2.75 time slots), the maximum is 140 ns (3.5 time slots), and thus the window
size is 40 ns. The latency of a Rambus memory access is constant; the variability is
caused by the shifting of transactions inside the memory controller. Even if memory
transactions are inserted into the memory controller at the beginning of a time slot,
the controller may initiate them in a subsequent cycle due to the memory turnaround
overhead. This overhead occurs because read-after-write and write-after-read
operations cannot be initiated back to back; instead, a time interval of 5 ns must
intervene between alternating reads and writes. The main characteristic of this
transaction shifting is that it is cumulative, which means that the memory controller
must remember the transaction history. More details on this subject are given in
section 4.3.5.
The main purpose of the second stage is to provide a delay of one time slot. Note that
the insertion of a memory access operation is performed during the first cycle of the
first stage; the remaining three cycles provide delay. The first two cycles of the
third stage are idle in order to provide additional delay. This delay holds each read
operation until the time when the memory responds with the data. The tasks performed
during the third and fourth cycles of the third stage and during the first and second
cycles of the fourth stage are very similar, because these cycles belong to the
critical window time slot within which the memory will respond with the data. Each of
these tasks is split into three functions: the detection of a new data block from the
memory, the forwarding of the received data toward the output link, and the update of
the queue manager data structures.
The detection function can be performed by the circuit illustrated in figure 4.16. The
memory controller interface has an output for each memory module that indicates the
reception of a new data block from the corresponding memory module. Each of the two
outputs is 1 bit wide and toggles its value at the beginning of the reception of a new
data block. Recall that each data block is 64 bytes. Each time a read response
originates from a dequeue operation, the received data block must be forwarded to the
output, or alternatively loaded into the output buffer. However, it is possible for
both memory modules to respond with data almost simultaneously; the example of figure
4.17 illustrates this case. This means that the data of both memory modules want to
access the output buffer simultaneously, causing a conflict. As the figure shows, this
conflict is not a common case, because we do not send two read operations (or two
write operations) toward the two memory modules in the same time slot. This conflict
resembles the classical problem of multiple processes accessing a critical section in
computer operating systems. Similarly to operating systems, we implement an
arbitration process that allows only one process at a time to access the output buffer
(the critical section). This arbiter architecture is described in section 4.2.9.
Finally, the update of the queue management data structures can be performed only
during the fourth cycle of the third or the fourth pipeline stage, in order to
schedule this task and use the resources more efficiently, see section 4.2.9. The
overall process pipeline is illustrated in figure 4.18.
consists of two bits: the first bit indicates whether the critical section is busy or
not (key state) and the second bit indicates the process that is accessing the
critical section (key owner). We also use some state registers: the owner state, the
pending request, and the pending request_id. A process may remain in the critical
section for at most four clock cycles (clock cycle: 10 ns); thus the owner state
register, which indicates how long the process has been in the critical section, is 2
bits wide. The pending request register indicates whether a process is waiting to gain
access to the critical section while another process is already in it. The pending
request_id indicates the identifier of the waiting process. An arbitration cycle lasts
one clock cycle (10 ns). The arbiter also uses a finite state machine in order to
schedule the incoming requests; the FSM of our arbiter is illustrated in table 4.6.
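As an illustration, the arbiter could be modeled in C as follows; this is a behavioral sketch under the assumptions stated in the comments, not the exact FSM of table 4.6, and all identifiers are our own.

#include <stdint.h>

typedef struct {
    uint8_t key_busy;        /* critical section occupied                    */
    uint8_t key_owner;       /* which of the two processes owns it (0 or 1)  */
    uint8_t owner_state;     /* cycles spent in the critical section (0..3)  */
    uint8_t pending;         /* a second request is waiting                  */
    uint8_t pending_id;      /* identifier of the waiting process            */
} arbiter_t;

/* Called once per 10 ns clock cycle with the requests of the two processes. */
void arbiter_cycle(arbiter_t *a, int req0, int req1)
{
    if (a->key_busy) {
        /* Record at most one waiting request from the other process.        */
        int other = !a->key_owner;
        if (((other == 0 && req0) || (other == 1 && req1)) && !a->pending) {
            a->pending    = 1;
            a->pending_id = (uint8_t)other;
        }
        /* Release the key after at most four cycles of ownership.           */
        if (++a->owner_state == 4) {
            a->key_busy    = 0;
            a->owner_state = 0;
        }
    }
    if (!a->key_busy) {
        if (a->pending) {                        /* serve the waiting request */
            a->key_busy = 1; a->key_owner = a->pending_id; a->pending = 0;
        } else if (req0) {
            a->key_busy = 1; a->key_owner = 0;
        } else if (req1) {
            a->key_busy = 1; a->key_owner = 1;
        }
    }
}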
search to find a write or a read transaction that will not cause a memory bank
conflict. A search operation can therefore be split into two simple search functions,
which are performed simultaneously: the first function searches for a match on a
fraction of the examined bits, while the second function searches the remaining bits
for a non-match. The non-matching search covers the conflicting cases.
which means that it is not already in execution. Additionally, it examines the 9 most
significant bits of the write address in order to avoid a conflict; these bits
identify the module, the device, and the bank to which the write buffer belongs (1 bit
for the RIMM module, 4 bits for the device and 4 bits for the bank).
In the case of a free buffer extraction, 10 bits must be examined in the Free List
Table. One empty bit is examined to check whether a bank-queue of the free list is
empty of buffers or not. Next, the 9 most significant bits of the free buffer address
are examined in order to avoid a bank conflict. As above, these 9 bits identify the
module (1), the device (4), and the bank (4) to which the free buffer belongs.
The first and the third search engines of the dequeue operation examine only a 1-bit
field in the PRT entries, the Valid flag: the former searches for V=0 and the latter
for V=1. The second search engine of the dequeue operation looks up 12 bits in the PRT
entries. It looks for an entry that is valid (V=1) and contains a ready dequeue
operation (R=1). This operation must not be busy (B=0), which means that it is not
already in execution. Additionally, it examines the 9 most significant bits of the
read address in order to avoid a conflict; these bits identify the module, the device,
and the bank to which the buffer being read belongs (1 bit for the RIMM module, 4 bits
for the device and 4 bits for the bank).
The examined fields of all operation tables must be extracted from these tables and
placed in CAMs in order to perform parallel lookups over the table entries. The total
memory requirements are 128x12 bits for the PWT, 128x12 bits for the PRT, and 512x10
bits for the Free List Table. The results of each search operation are stored in a
one-dimensional array; this array contains as many entries as the number of searched
table entries. The array elements that correspond to matching table entries are set to
1, while the remaining elements are set to 0. Next, the search results, which are kept
in the array, are driven to a priority encoder in order to identify the matching entry
with the highest priority, as figure 4.19 shows. The structure of a priority encoder
is described in section 4.4.11.
The priority encoder is a function that takes an N-bit binary number as input and
produces a log2(N)-bit binary number as output. The output number points to the
location of the most significant 1 in the input number; for example, if the input
number is 00101101, the output is the number 110. In the case of a search engine, the
priority encoder is required to index the first match with the highest priority. For a
binary number the highest priority corresponds to the most significant bit; the search
engine, however, needs the flexibility to determine the highest-priority position
itself. Figure 4.20 illustrates this flexibility: in the leftmost scheme the priority
encoder treats the top element as having the highest priority and the bottom element
the lowest, while in the rightmost scheme the middle element has the highest priority
and the priority of the subsequent elements decreases in a circular order.
The hardware implementation of the priority encoder is modular. The priority encoder
primitive (cell) has two versions, as figure 4.21 shows; the priority encoder cell T1
is the complement of its T2 counterpart and vice versa. T1 and T2 are combined to
build a two-bit priority encoder (figure 4.22), and in the same way a multi-bit
priority encoder can be built. Figure 4.23 shows an 8-bit priority encoder which
treats the top element as having the highest priority.
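A behavioral sketch of such a priority encoder with a programmable highest-priority position might look as follows; the hardware of figures 4.21-4.23 computes the same result combinationally, and the function name and circular-order interpretation are our own.

#include <stdint.h>

/* Return the index of the first set bit, scanning from position 'start' in
 * circular order, or -1 if the input is all zeros. */
int priority_encode(const uint8_t *bits, int n, int start)
{
    for (int i = 0; i < n; i++) {
        int pos = (start + i) % n;       /* circular priority order */
        if (bits[pos])
            return pos;
    }
    return -1;                           /* no matching entry */
}

For instance, a search engine could call priority_encode(match_array, 128, p) to pick the first matching PWT or PRT entry starting from an arbitrary position p.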
Bitmap Organization
We explain the bitmap organization by means of an example, and then consider its
application to our memory system. For the example, assume that the free list consists
of 64 buffers. The state of these buffers is kept in an 8x8 array, as figure 4.25
shows: a "1" indicates that a buffer is free and a "0" the opposite. By applying the
OR operation to the elements of each row and storing the result in the corresponding
entry of a 2x4 array, shown in the middle part of the figure, we compress the
information of the initial array: the middle array indicates which rows of the initial
array contain at least one free buffer. Continuing this process, by applying the OR
operation to the first and the second row of the middle array, we obtain the rightmost
array. Selecting an element equal to 1 from a one-dimensional array is trivial: we can
use a priority encoder. Having selected an element with 1 from the rightmost array, we
go back to the corresponding row of the middle array and select an element with 1 from
that row. Continuing similarly, we choose a row of the leftmost array that has at
least one free buffer; the previous steps guarantee that such a row exists. Finally,
we select an element with 1 from this row; the location of this element in the array
points to a free buffer. Applying this bitmap organization to our buffer memory, which
contains 4 million buffers, would be extremely expensive.
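Under the simplifying assumption of only two levels, the 64-buffer example could be modeled as follows; the three-level variant of figure 4.25 works the same way, and all names are illustrative.

#include <stdint.h>

static uint8_t buf_free[8][8];   /* 1 = buffer is free                      */
static uint8_t row_any[8];       /* OR of each row of buf_free              */

static int first_set(const uint8_t *v, int n)   /* simple priority encoder  */
{
    for (int i = 0; i < n; i++)
        if (v[i]) return i;
    return -1;
}

/* Return a free buffer index in 0..63, or -1 if none is free. */
int find_free_buffer(void)
{
    int r = first_set(row_any, 8);
    if (r < 0) return -1;
    int c = first_set(buf_free[r], 8);
    return r * 8 + c;
}

void mark_allocated(int idx)
{
    int r = idx / 8, c = idx % 8;
    buf_free[r][c] = 0;
    row_any[r] = 0;
    for (int i = 0; i < 8; i++)          /* recompute the row summary bit   */
        row_any[r] |= buf_free[r][i];
}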
Per-Bank Queueing
Another alternative implementation of the free list is to organize the free buffers in
single linked lists (queues). A queueing organization requires two pointers: a pointer
to the queue head and a pointer to the queue tail. It also requires to assign a pointer
to each free buffer, which we call next pointer. Each free buffer next pointer field
indicates to the next free buffer in the list, as the figure 4.26 shows. In order to build
a more flexible scheme, we organize free buffers of the memory in a per- bank
queueing scheme, as the figure 4.27 shows.
Write transaction timing is very much like that of read transactions: the control
packets are sent in the same way as for a read command. However, one significant
difference between the RDRAM and a conventional DRAM is that write data is delayed to
match the timing of a read transaction, in order to maximize the usable bandwidth on
the data pins. On a conventional SDRAM, alternating write-read transactions cause a
gap on the data bus between the write data and the read data; the RDRAM avoids this
gap by sending the data later in time. A write command on the COL pins tells the RDRAM
that data will be written to the device an exact number of clock cycles later; the
data is written to the core as soon as it is received. The write transaction is shown
in figure 4.30.
Each of the commands on the control bus may be pipelined, allowing much higher
throughput. ACT commands can completely occupy the ROW pins, allowing 16-byte random
transfers to occur. In order to completely fill the data bus, column commands must be
sent continuously on the COL pins. Except for small gaps of 5 ns required for bus
turnaround when going from a write to a read, these buses can be fully utilized.
Figure 4.31 presents an example of interleaved write and read transactions; the
transaction transfer granularity is 64-byte data blocks.
combined with the type of the current transaction in order to determine the state of
the new transaction; determining the state of a new operation is equivalent to
determining the cycle in which it will be inserted. The FSM that describes this
function is illustrated in figure 4.32. As the figure shows, the state is split into
two fields: the cycle number (2 bits) and the clock edge (1 bit). If the clock edge
field is 1, it corresponds to the rising clock edge; if it is 0, it corresponds to the
falling clock edge.
Recall that a new read or write transaction may be inserted at the rising or the
falling edge of any of the four clock cycles in a time slot. Figure 4.33 shows a
timing diagram of a read transaction. The diagram shows that the signaling is
identical regardless of the exact time at which the read transaction is inserted. Note
that Trowsel, Tcolsel, Tdatasel and Rdatasel are control signals that indicate the
time at which a row command, a column command, a transmitted data packet, or a
received data packet is loaded into the RAC interface. Figure 4.34 shows the timing
diagram of a write transaction.
We estimated the on-chip memory requirements of the datapath chip; table 5.1 shows
this estimation per memory block. The PWT, the PRT, as well as the Head/Tail tables
are split into multiple separate tables in order to allow parallel accesses.
We also estimated the hardware complexity of our architecture in terms of gates and
flip-flops for 64K flows, as shown in table 5.2.
Processes                     Gates    Flip-Flops
Packet entry                  3 K      4 K
Enqueue issue                 7 K      10 K
Enqueue execution             12 K     15 K
Dequeue issue                 10 K     14 K
Dequeue execution             13 K     15 K
Queue management interface    15 K     22 K
Total                         60 K     80 K
5.2 Verification
In order to verify the design, we simulated the queue management architecture at a
cycle-accurate level, using test patterns that emulate incoming traffic at 10 Gbps
maximum load. We assumed that packet segmentation is performed externally to the
architecture model, i.e. the test patterns contain segment arrivals rather than packet
arrivals. The test pattern parameters are:
- the input load
- the segment arrival distribution
- the maximum packet size
- the flow identifiers of the incoming packets
- the header processing delay variability for incoming packets
The test patterns were generated by using the C programming language and stored in
files. The files’ format is the following:
time slot   packet_id   segment_id   segment type   flow_id   header processing delay
    1           1           1             0           1500      5 (time slots)
    2           1           2             1           1500
    3           1           3             3           1500
    4           2           1             3          16383      1
    5           3           1             3              0      2
The test pattern file above reads as follows: at time slot 1 the first segment of the
first packet arrived. The segment type identifies the type of the incoming segment,
i.e. whether it is the first, an intermediate, or the last segment of a packet. This
information is required in order to organize the incoming segments into packet queues
at the time of segment arrival; otherwise, we would have to wait for the arrival of
the next segment in order to identify the tail of the previous packet and the head of
a new one. The flow_id and processing delay fields identify the flow to which the
packet (packet segment) belongs and the delay that the packet suffers during its
header processing.
In addition to the test pattern generation, the header processor and the scheduler
must also be simulated. Both can be simulated as devices that schedule the enqueue and
dequeue operations, respectively. The input of the header processor device is the
triplet of packet identifier, flow identifier and processing delay; the header
processor schedules the incoming enqueue operations according to their processing
delays. The input of the scheduler is the state of the system queues; the scheduler
defines the order of the packet departures from the active flows (non-empty queues).
Both the header processor and the scheduler can be simulated using calendar queue data
structures, as sketched below.
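The following is a minimal sketch of such a calendar-queue-based header processor model; it assumes the processing delay never exceeds the calendar length, and all names are hypothetical rather than taken from our Verilog testbench.

#include <stdlib.h>

#define CAL_SLOTS 64                     /* calendar length (wraps around)  */

typedef struct event { int packet_id, flow_id; struct event *next; } event_t;
static event_t *calendar[CAL_SLOTS];

/* Schedule an enqueue operation 'delay' time slots after 'now' (delay < CAL_SLOTS). */
void schedule_enqueue(int now, int delay, int packet_id, int flow_id)
{
    event_t *e = malloc(sizeof *e);
    if (!e) return;
    e->packet_id = packet_id;
    e->flow_id   = flow_id;
    int slot = (now + delay) % CAL_SLOTS;
    e->next = calendar[slot];
    calendar[slot] = e;
}

/* At each simulated time slot, issue the enqueue operations that are due. */
void run_time_slot(int now, void (*issue)(int packet_id, int flow_id))
{
    int slot = now % CAL_SLOTS;
    for (event_t *e = calendar[slot]; e; ) {
        event_t *next = e->next;
        issue(e->packet_id, e->flow_id);
        free(e);
        e = next;
    }
    calendar[slot] = NULL;
}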
The system architecture verification is split into four stages. In the first stage we
implemented and verified the Rambus memory controller. The verification is
performed by writing memory segments and then reading them back, comparing the written
and the read data. In the second stage we implemented the six control processes of the
queue manager and verified each process separately by using short
simulation runs. In the third stage we verified all the enqueue control processes and
all the dequeue control processes, each group separately. In the fourth stage, we
verified all the system processes together by using short simulation runs.
Verification consists of the following steps:
1. we generate test patterns that simulate the incoming traffic, using the C language
2. we organize the incoming packet segments into queues according to the flows to
   which they belong and save the result to fileA
3. we apply these test patterns to the queue manager architecture model in Verilog
4. we organize the outgoing packet segments of the Verilog model into queues according
   to the flows to which they belong and save the result to fileB
5. we compare fileA and fileB. The results of this comparison verify:
   - packet/segment loss
   - the output order of the packet segments
   - whether the incoming segments were linked to the proper queues
The test patterns that were successfully run through the queue manager behavioral
model were short1, due to the time constraints of this master's thesis. As a
consequence, our design has been only partially debugged up to now.
1 Test patterns consisting of 128 segments pass through our queue manager behavioural
model successfully. We detected some bugs in the way the memory blocks of our
architecture are updated (queue management interface process).
7 Appendix A
The RFC performance can be tuned with two parameters: the number of phases, and the
way the memory access results of one phase are combined to index the memories of the
next phase in order to yield the best reduction. The latter can be achieved by
combining the memory access results with the most correlation, without causing
unreasonable memory consumption. As the number of phases increases, the total amount
of memory decreases but the number of memory accesses per classification increases. An
important disadvantage of this algorithm is the long time required for preprocessing
and for updating the memory contents. RFC achieves classification rates of 30 million
packets per second (for 40-byte minimum-size packets).
A hash table entry contains an index into a second-level table, the address table,
where the full packet identifier is stored together with the forwarding information
for the destination. When the lookup engine has detected a hit in the hash table, the
packet identifier is compared to the original identifier in the table, making sure
that they are the same. The hash calculation, the memory lookup, the table lookup and
the comparison are all independent operations and can work in parallel; thus the
lookup can easily be pipelined to increase the throughput. The pipelined modification
of the hash architecture is shown in figure 2b.2. The performance of this pipeline is
determined by the slowest pipeline stage, which is the hash memory stage; thus, this
pipeline is capable of performing one lookup per memory cycle. The most commonly used
hash memories are Content Addressable Memories (CAMs), which can perform parallel
lookups. Modern CAMs provide up to 100 million searches per second, and a single
module can handle up to half a million entries.
The performance of the design is mainly limited by the speed of the memory accesses.
Since the design can be pipelined with one memory access per pipeline stage, it can
perform lookups at the rate of one lookup per memory cycle. Furthermore, the maximum
delay of a lookup is the memory cycle time times the number of stages in the pipeline.
By using SRAM with a memory cycle time of 100 ns, it is possible to process 10 million
packets per second. Assuming that an average IP packet is 1000-2000 bits long, this
means that the scheme can handle 10-20 Gbps of traffic.
Although (in general) two memory accesses are required per routing lookup, these
accesses are made to separate memories, allowing the scheme to be pipelined. By
pipelining this scheme we can achieve a routing rate of one lookup per memory access.
This longest prefix matching architecture provides high performance with simple
hardware at the cost of using memory inefficiently; the total memory requirement is 33
Mbytes of DRAM. Additionally, an important advantage of this scheme is that the
routing table update operation remains quite simple, requiring only simple hardware
logic.