5.1 Evaluation Methodology
The evaluations are performed using Booksim [
15], a cycle-accurate interconnection network simulator, with synthetic traffic patterns as well as traces from real workloads. The detailed configuration is listed in Table
1. We adopt an
\(8\times 8\) 2D mesh with 64 nodes as the baseline network topology. The
\(4\times 4\) and
\(16\times 16\) 2D meshes are also evaluated to study the scalability of our design. Our evaluations focus on 2D mesh topologies due to the benefit of mapping well to the 2D layout and the fact that this topology has been implemented in commercial and experimental manycore systems. All channels have one cycle delay and each router connects one terminal node. Traditional VC flow control [
4] is used and there are 4 VCs in each input port. Different numbers of VCs are also evaluated, and each VC has a buffer size of eight flits in all the evaluations. We choose Footprint [
10] routing for the mesh, because it is the most efficient adaptive routing in NoCs. Duato’s theory [
8] is utilized in Footprint to avoid routing deadlock. The router architecture is described in Section 4, and credits are transmitted upstream in two cycles. We use one-iteration iSLIP as the baseline allocator for Packet Chaining, TS-Router, and our design. In the evaluation, we implement the second type of Packet Chaining (same port, different VCs). Unless indicated otherwise, our evaluation is performed with single-flit packets.
Synthetic traffic patterns, namely uniform, transpose, tornado, and bitrev, are used in the evaluation. In addition, a hotspot traffic pattern is designed to evaluate the performance of our design under endpoint congestion. For the hotspot traffic pattern, 10% of the nodes are randomly selected as hotspot receivers that accept traffic from other nodes. Each of the other nodes has a 20% probability of sending a packet to a hotspot node and an 80% probability of sending it to a randomly selected node.
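The hotspot pattern described above can be sketched as follows. This is only an illustrative model, not the simulator's implementation: the function names `pick_hotspots` and `pick_destination` are hypothetical, the hotspot count is assumed to be rounded to the nearest integer, and the "randomly selected node" is assumed to exclude the source itself.

```python
import random

def pick_hotspots(num_nodes, fraction=0.10, rng=random):
    """Randomly choose the hotspot receivers (10% of the nodes in the text).

    Assumption: the count is rounded to the nearest integer, at least one.
    """
    count = max(1, round(num_nodes * fraction))
    return sorted(rng.sample(range(num_nodes), count))

def pick_destination(src, num_nodes, hotspots, p_hotspot=0.20, rng=random):
    """Pick a destination for a packet injected at a non-hotspot node `src`."""
    if rng.random() < p_hotspot:
        # 20% of packets target one of the hotspot receivers.
        return rng.choice(hotspots)
    # Remaining 80%: uniform over all nodes except the source (assumption).
    # The uniform choice may itself land on a hotspot node.
    dest = rng.randrange(num_nodes - 1)
    return dest if dest < src else dest + 1

if __name__ == "__main__":
    rng = random.Random(0)
    hotspots = pick_hotspots(64, rng=rng)   # 8x8 mesh
    src = next(n for n in range(64) if n not in hotspots)
    print("hotspots:", hotspots)
    print("sample destinations:",
          [pick_destination(src, 64, hotspots, rng=rng) for _ in range(8)])
```

Passing an explicit `random.Random` instance keeps a run reproducible, which helps when comparing allocators on the same traffic.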
To evaluate the performance of our design with real workloads, trace-driven simulations [
12] are performed under PARSEC benchmarks [
2]. The traces are gathered from full-system simulations of 64 in-order-issue 2-way SMT cores. Dsent [
32] is used to evaluate area and power consumption. Area and power evaluations are performed at the 22-nm technology node with a 1.0-V operating voltage. The operating frequency is set to 1.0 GHz, and both the link data width and the flit width are set to 128 bits.
5.2 Network Performance of Synthetic Traffic
We first evaluate the performance of Eca-Router and CoD-Router in the mesh under four different traffic patterns. We also compare our design with other advanced SA strategies. Average packet latency is measured and the result is presented in Figure 9. As can be seen from the figure, Eca-Router outperforms most other allocators and CoD-Router achieves even better performance than Eca-Router. Our design not only achieves higher saturation throughput but also achieves lower latency when the injection rate is below the saturation injection rate. Compared with TS-Router, Eca-Router achieves 6.7%, 6.2%, 8.9%, and 5.8% improvement in saturation throughput under uniform, transpose, tornado, and bitrev traffic patterns, respectively. The throughput improvements of CoD-Router under these four traffic patterns are 7.2%, 7.8%, 12.8%, and 11.5%, respectively. Under each traffic pattern, Eca-Router achieves much lower packet latency because the impact of congestion is relieved during the SA process. CoD-Router builds on Eca-Router and further feeds SA information into the RC stage, achieving better performance than Eca-Router and other SA strategies because more conflict-free requests are provided to the switch allocator. It should be noted that the overall performance improvement cannot compensate for the decreased frequency (13.6%) in this evaluation. This is because the more complicated SA process lowers the router frequency, which is a limitation of our design. However, our design offers a novel method to improve the performance of SA, and further optimization may offset the router frequency overhead. Moreover, under certain traffic patterns or in larger NoCs, the performance gain of our design is large enough to offset the frequency overhead. Detailed experimental results are presented in the following paragraphs.
We further evaluate the impact of the number of VCs on performance and the result is shown in Figure
10. In this evaluation, we vary the number of VCs and compare our design with TS-Router. To avoid deadlock in RC, at least two VCs are required in the input buffer [
8]. Therefore, in this evaluation, we adopt two VCs, four VCs, and eight VCs in the input buffer. More VCs can provide more requests to the switch allocator, resulting in more matchings in the SA process. As shown in Figure
10, for all SA strategies, performance improves as the number of VCs increases. However, once the number of VCs is large enough, the number of requests is no longer the performance bottleneck for SA; the allocation efficiency of the switch allocator is. Therefore, the performance difference between SA strategies is more obvious when a larger number of VCs is used. As shown in Figure 10, given the same number of VCs, CoD-Router and Eca-Router outperform TS-Router under different traffic patterns. Under the uniform traffic pattern, Eca-Router improves the saturation throughput by 5.2% with 2 VCs and by 7.3% with 8 VCs, and the corresponding improvements achieved by CoD-Router are 6.7% and 8.1%, respectively.
One reason why our design is efficient lies in its ability to relieve the impact of endpoint congestion. We therefore evaluate our design and compare it with other SA strategies under a mixed hotspot and uniform traffic pattern, described earlier in Section
5.1. The hotspot traffic will create a congestion tree and congest other uniform traffic due to the HoL blocking, which can reduce network throughput. We present the latency-throughput curve in Figure
11 to demonstrate the benefit of our design in relieving the impact of endpoint congestion. As shown in Figure
11, TS-Router saturates when the injection rate reaches approximately 34%. Compared with TS-Router, the saturation injection rate of Eca-Router and CoD-Router can be increased by 4.6% and 8.8%, respectively. In addition to increasing network throughput, our design also reduces the transmission latency of packets under low load. As shown in Figure
11, when the injection rate is lower than the saturation point, CoD-Router achieves much lower packet latency than the other SA strategies. When the injection rate is 0.2 flits/cycle, CoD-Router reduces average packet latency by 37.5% and 30.2% compared with TS-Router and Eca-Router, respectively.
We also compare our design with TS-Router at different network scales, and the result is shown in Figure
12. The throughput of Eca-Router and CoD-Router has been normalized to TS-Router under each traffic pattern. As shown in Figure
12, the performance improvement of our design is larger in the
\(16\times 16\) mesh than in the
\(4\times 4\) mesh, since congestion is more severe in a larger network. For the uniform traffic pattern, the throughput gained by Eca-Router over TS-Router in
\(4\times 4\) and
\(16\times 16\) meshes is 3.6% and 15.4%, respectively. The corresponding improvement achieved by CoD-Router in these two meshes is 4.9% and 23.1%, respectively. Compared with TS-Router, the improvement in throughput of Eca-Router in
\(4\times 4\) and
\(16\times 16\) meshes is 3.8% and 12.5%, respectively, while the corresponding improvement of CoD-Router in these two meshes is 4.5% and 17.5%, respectively. Considering the overhead of router frequency, the performance improvement of our design is more pronounced for larger NoCs, where it can offset the overhead of router frequency. For this evaluation, neither the number of VCs nor the depth of each VC is increased as the network size increases. Although adopting more VCs or deeper VCs implies better performance in NoCs, the increased buffer size and the complex scheduling logic are difficult to implement in an on-chip router. Because the input buffer consumes a considerable amount of area and power in a NoC router, it is difficult to adopt a large buffer in on-chip routers [
20].
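To make the trade-off against router frequency concrete, the following sketch converts a per-cycle throughput gain into a net gain per unit time. It rests on a simplifying assumption that is not stated in the text: absolute throughput is taken to scale linearly with clock frequency, so a 13.6% frequency decrease multiplies throughput by 0.864. The function names are hypothetical, and only the uniform-traffic gains quoted above are used.

```python
FREQ_LOSS = 0.136  # router frequency decrease reported in this section

def net_gain(cycle_gain, freq_loss=FREQ_LOSS):
    """Net throughput change per unit time for a given per-cycle gain.

    Assumption: throughput per second = throughput per cycle * frequency.
    """
    return (1.0 + cycle_gain) * (1.0 - freq_loss) - 1.0

def break_even_gain(freq_loss=FREQ_LOSS):
    """Per-cycle gain needed to exactly cancel the frequency decrease."""
    return 1.0 / (1.0 - freq_loss) - 1.0

# Uniform-traffic throughput gains over TS-Router quoted in the text.
reported = {
    ("Eca-Router", "4x4"): 0.036,
    ("Eca-Router", "16x16"): 0.154,
    ("CoD-Router", "4x4"): 0.049,
    ("CoD-Router", "16x16"): 0.231,
}

if __name__ == "__main__":
    print(f"break-even per-cycle gain: {break_even_gain():.1%}")
    for (router, mesh), gain in reported.items():
        print(f"{router:10s} {mesh:6s} per-cycle {gain:6.1%} "
              f"-> net {net_gain(gain):+.1%}")
```

Under this simple model the break-even per-cycle gain is roughly 15.7%, so the 23.1% gain of CoD-Router in the \(16\times 16\) mesh clears it, while the smaller gains do not.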
The performance of a switch allocator is determined by the number of matchings it produces: the more matchings achieved per cycle, the better the allocator performs. We compare the number of matchings in a single router for CoD-Router, Eca-Router, and TS-Router by increasing the injection rate from 0 to 0.9 flits/cycle/node. The result is presented in Figure
13, and it is collected from 1,000 consecutive stable cycles in the mesh network under uniform traffic pattern. As Figure
13 shows, the network saturates when the injection rate exceeds 0.38 flits/cycle/node. Moreover, beyond the saturation point, the matching number no longer increases with the injection rate but decreases. This is because when an output port is congested, fewer valid requests are provided to the switch allocator, which in turn reduces the number of matchings. However, owing to their advantage in the SA process, CoD-Router and Eca-Router achieve more matchings when the injection rate is near or above the saturation injection rate.
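For readers unfamiliar with how a matching number arises, the sketch below models one request-grant-accept round of an iSLIP-style allocator at port level and counts the resulting matchings. It is a simplified illustration, not the allocator of TS-Router, Eca-Router, or CoD-Router: VCs are collapsed into a set of requested outputs per input port, the number of inputs is assumed equal to the number of outputs, and the function name is hypothetical.

```python
def islip_one_iteration(requests, grant_ptr, accept_ptr):
    """One request-grant-accept round of a port-level iSLIP-style allocator.

    requests:   dict mapping input port -> set of requested output ports
    grant_ptr:  round-robin pointer per output port (updated in place)
    accept_ptr: round-robin pointer per input port (updated in place)
    Returns a dict mapping each matched input port to its output port.
    """
    n = len(grant_ptr)

    # Grant: each output picks the first requesting input at or after its pointer.
    grants = {}  # input -> list of outputs that granted it
    for out in range(n):
        for step in range(n):
            inp = (grant_ptr[out] + step) % n
            if out in requests.get(inp, ()):
                grants.setdefault(inp, []).append(out)
                break

    # Accept: each input picks the first granting output at or after its pointer.
    matches = {}
    for inp, outs in grants.items():
        chosen = min(outs, key=lambda o: (o - accept_ptr[inp]) % n)
        matches[inp] = chosen
        # Pointers advance only for accepted grants.
        grant_ptr[chosen] = (inp + 1) % n
        accept_ptr[inp] = (chosen + 1) % n
    return matches

if __name__ == "__main__":
    # Input 0 requests outputs 0 and 1; input 1 requests only output 0.
    g, a = [0, 0], [0, 0]
    m = islip_one_iteration({0: {0, 1}, 1: {0}}, g, a)
    print("matches:", m, "matching number:", len(m))
```

In the usage example a single iteration matches only one input although a matching of size two exists (input 0 to output 1, input 1 to output 0), which illustrates why the quality of the requests presented to the allocator affects the matching number.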
We further demonstrate the advantage of our design in increasing the number of matchings and present the result in Figure
14. Figure
14 presents the improved matching number of the SA process for CoD-Router and Eca-Router compared with TS-Router. We collect the matching number of the switch allocator for CoD-Router, Eca-Router, and TS-Router in 40,000 cycles in an
\(8\times 8\) mesh network with the injection rate changing from 0.1 to 1.0 flits/cycle/node. As presented in Figure
14, CoD-Router and Eca-Router achieve more matchings than TS-Router at each injection rate, and CoD-Router delivers more matchings than Eca-Router. When the injection rate is less than 0.5 flits/cycle/node, the number of additional matchings of both methods grows as the injection rate increases. This is because the matching number relies on the number of requests: the more requests provided to the switch allocator, the more matchings can be achieved. However, when the injection rate exceeds 0.5 flits/cycle/node, the number of additional matchings begins to decrease. This is because the excessive load can block the output ports of the router, which decreases the number of valid requests and in turn reduces the allocation efficiency of the switch allocator.
Eca-Router achieves lower average packet latency by relieving the impact of endpoint congestion. In this way, the long waiting latency of packets caused by endpoint congestion is reduced, resulting in lower average packet latency. Although CoD-Router changes the execution process of the routing algorithm, it does not negatively affect performance. We evaluate the latency distributions of TS-Router, Eca-Router, and CoD-Router and present the resulting cumulative distribution function (CDF) in Figure
15. The evaluation is performed in an
\(8\times 8\) mesh network and packets are injected at the saturation injection rate. As can be seen in the figure, the proportion of packets with low latency is larger in our design than in TS-Router. Moreover, our design does not introduce significant tail latency. That is, our design can avoid the long packet latency caused by unnecessary waiting when faced with endpoint congestion.
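A latency CDF and a tail percentile of the kind discussed above can be computed from per-packet latencies as in the following minimal sketch. The latency values in the usage example are made up for illustration and are not measurements from this evaluation; the nearest-rank percentile definition is an assumption, as the text does not state which one is used.

```python
import math

def empirical_cdf(latencies):
    """Return sorted latencies and, for each, the fraction of packets at or below it."""
    xs = sorted(latencies)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def percentile(latencies, q):
    """Nearest-rank percentile for q in (0, 100]."""
    xs = sorted(latencies)
    rank = max(1, math.ceil(q / 100 * len(xs)))
    return xs[rank - 1]

if __name__ == "__main__":
    # Hypothetical per-packet latencies in cycles.
    sample = [10, 12, 11, 50, 13]
    xs, cdf = empirical_cdf(sample)
    for x, p in zip(xs, cdf):
        print(f"latency <= {x:3d} cycles: {p:.0%} of packets")
    print("median:", percentile(sample, 50), "p99:", percentile(sample, 99))
```

Comparing a high percentile (for example the 99th) across routers is one simple way to check whether a design introduces additional tail latency.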
We next evaluate the impact of packet length on the performance of our design. We evaluate the average packet latency of CoD-Router, Eca-Router, and TS-Router with the packet length changing from 1 flit to 16 flits. Results are presented in Figure
16. We set the injection rate to the saturation injection rate for each test. As can be observed from the figure, Eca-Router outperforms TS-Router when the packet length is smaller than 8 flits and performs worse when the packet length exceeds 8 flits. The performance improvement of Eca-Router degrades as the packet length keeps increasing. This is because long packets reduce the number of requests provided to the switch allocator, which limits the optimization of the SA process. For TS-Router, there is an abrupt change once the packet length exceeds 8 flits. This is because we set the capacity of a VC to 8 flits in our evaluation. When the packet length exceeds 8 flits, the probability that two packets reside in the same VC is very small. Consequently, the opportunity to obtain information about subsequent requests is also reduced, which degrades the performance of TS-Router. In most cases, CoD-Router achieves better performance than the other two methods, especially when the packet length is less than 7 flits. This indicates that CoD-Router is more suitable for NoCs where a large proportion of packets are short packets [
25].
Our design strives to find ECC requests and limit the allocation of these ECC requests to relieve the impact of endpoint congestion. To prove the effectiveness of our design, we have counted the changes in ECC requests when endpoint congestion occurs, as well as the changes in the average packet latency of our design and the baseline router. The result has been presented in Figure
17. In this evaluation, hotspot traffic, which introduces endpoint congestion, starts at the 20,000th cycle. The background load is a uniform random traffic pattern with an injection rate of 40%, and the hotspot traffic is injected for 500 cycles starting at the 20,000th cycle with an injection rate of 50%. As shown in the figure, when endpoint congestion occurs, the average packet latency of both CoD-Router and the baseline router increases sharply, but CoD-Router recovers to normal levels more quickly, while the baseline router takes a long time to eliminate the impact of endpoint congestion on network performance. This is because CoD-Router uses the collected ECC request information to limit the allocation of requests that cause endpoint congestion. The figure also shows that when congestion occurs in the network, the number of ECC requests detected by CoD-Router increases rapidly. This provides the information required for performance optimization and indicates that our design can effectively use ECC request information to optimize network performance.
5.3 Application-Level Performance
In addition to synthetic traffic patterns, we also compare our design with TS-Router using network traces from PARSEC 2.0 workloads [
2] in the mesh network. We select eight PARSEC benchmarks to evaluate the performance of our design in representative application scenarios. For example, the application domains of bodytrack, canneal, ferret, and fluidanimate are computer vision, engineering, similarity search, and animation, respectively. The application domain represented by blackscholes and swaptions is financial analysis, but blackscholes has a small working set while swaptions has a large working set, which allows us to evaluate the impact of working set size. Both vips and x264 are media processing applications, but vips uses a data-parallel model while x264 adopts a pipeline parallelization model, which allows us to evaluate the impact of different parallelization models. All the traces cover the whole execution of the applications, so both the parallel regions and the serial phases of the benchmarks are taken into account. For each benchmark, results are collected after 1,000,000 cycles of continuous trace injection.
Application-level results are presented in Figure
18, and the latency reduction of CoD-Router and Eca-Router is normalized to TS-Router. As can be seen from the figure, CoD-Router and Eca-Router outperform TS-Router in all benchmarks, and the performance of CoD-Router is much better than that of Eca-Router. For Eca-Router, the average improvement over all the benchmarks is 4.1%, while the maximum improvement is 9.2% on the bodytrack benchmark. The average improvement of CoD-Router over all the benchmarks is 11.7%, while the maximum improvement is 28.5% on the bodytrack benchmark. For most benchmarks in Figure
18, the performance improvement of Eca-Router is not significant; this is because the average injection rate of PARSEC benchmarks is relatively low, which results in fewer requests to the switch allocator and thus reduces the performance of Eca-Router in relieving the impact of endpoint congestion. However, CoD-Router can achieve much more performance improvements than Eca-Router. This is because most packets in PARSEC benchmarks are short packets, and CoD-Router can achieve much better performance with short packets (as shown in Figure
16).
We also calculate the average length of the packets in PARSEC benchmarks and present the result in Figure
19. As shown in Figure
19, the average packet length varies from 2 to 4 flits for different benchmarks, and the overall average packet length is 2.8 flits. It can be seen from Figure
16 that CoD-Router reduces packet latency the most when the packet length is between 1 and 6 flits. This is the reason why CoD-Router achieves a significant reduction in packet latency under PARSEC benchmarks, whose average packet length ranges from 2 to 4 flits. For the same reason, the performance of Eca-Router and CoD-Router under different benchmarks depends strongly on the packet length. Eca-Router adopts a fine-granularity endpoint congestion control strategy, and traffic dominated by short packets is more suitable for this strategy. For CoD-Router, short packets provide more precise feedback from the SA stage to the RC stage, which improves performance by providing more conflict-free requests to the switch allocator. The three benchmarks with the largest performance improvement are bodytrack, ferret, and swaptions. Accordingly, as shown in Figure 19, the average packet length of these three benchmarks is also the shortest. The performance of Eca-Router and CoD-Router is also related to other factors, such as the number of synchronization operations.
Synchronization operations contribute most to endpoint congestion. Eca-Router alleviates the impact of endpoint congestion directly, while CoD-Router mitigates competition for the same output port during SA. Therefore, Eca-Router and CoD-Router perform better in benchmarks that contain more synchronization operations. For this reason, Eca-Router and CoD-Router achieve the maximum performance improvement under the bodytrack benchmark, owing to its large number of barrier synchronization operations. The performance improvement of Eca-Router is less pronounced because the average injection rate of PARSEC is relatively low (less than 0.005 flits per cycle) and the congestion in the network is not serious. The degree of performance improvement is also correlated with the injection rate: the greater the injection rate, the more obvious the improvement. This is because a higher injection rate raises the probability of congestion, which gives our design more opportunity to improve performance. Among the PARSEC benchmarks, bodytrack and swaptions have the highest average injection rates and thus achieve the highest performance improvement.
5.4 Power and Area
We adopt a lightweight implementation in our design to minimize the overhead. For Eca-Router, only a simple RS process is added before SA. The main process of RS can be executed in parallel with the RC stage, and the remaining part of RS is performed by marking the ECC request that has been selected to be discarded. The main overhead of Eca-Router comes from the added registers. In Eca-Router, to track the destination of the request in each VC, a \(\log _{2}(N)\)-bit register is needed for each VC, where N is the network size. For an \(8\times 8\) 2D mesh with four VCs per physical channel, this results in only 24 bits of storage per port, and 120 bits in total for a 5-port router. Given that NoC flits are often very wide (e.g., 128 [
11] or 256 [
7] bits), the additional storage overhead is approximately equal to another flit buffer entry at the router. Compared with Eca-Router, CoD-Router adds two new datapaths to prefetch SA information from the SA and the VA stages. However, there is no need to set up new registers to store this information, as it can be obtained directly from the SA and the VA processes. This information is used to calculate the SA contention number for each output port, which is the basis on which CoD-Router's RC process selects an output port that does not compete with future SA requests. Therefore, only a few registers need to be added to record the SA contention number of each output port. For an n-port router, a \(\lceil \log _{2}(n)\rceil\)-bit register is needed for each output port, and thus the total capacity in a router is \(n\times \lceil \log _{2}(n)\rceil\) bits. For a five-port router, the storage overhead of CoD-Router over Eca-Router is therefore only 15 bits. CoD-Router has a small storage overhead, and the impact on the critical datapath of its pipeline is small. As our design only adds some registers and the corresponding logic circuits, the overhead of resource consumption is marginal.
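The storage figures above can be checked with a few lines of arithmetic. The sketch below assumes that register widths are rounded up to whole bits, which is what makes the 15-bit figure for a five-port router come out; the function names are hypothetical.

```python
def ceil_log2(x):
    """Smallest number of bits that can encode x distinct values (x >= 2)."""
    return (x - 1).bit_length()

def eca_storage_bits(num_nodes, vcs_per_port, ports):
    """Destination-tracking registers of Eca-Router: one per VC.

    Returns (bits per port, bits per router).
    """
    per_port = ceil_log2(num_nodes) * vcs_per_port
    return per_port, per_port * ports

def cod_extra_bits(ports):
    """Contention counters added by CoD-Router: one per output port."""
    return ports * ceil_log2(ports)

if __name__ == "__main__":
    per_port, total = eca_storage_bits(num_nodes=64, vcs_per_port=4, ports=5)
    print("Eca-Router:", per_port, "bits per port,", total, "bits per router")
    print("CoD-Router extra:", cod_extra_bits(5), "bits per router")
```

For the \(8\times 8\) mesh with four VCs and five ports this reproduces the 24 bits per port, 120 bits per router, and 15 additional bits quoted in the text.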
Dsent is utilized to estimate the power and area of our design. Our design is modeled based on the router model in the simulator. The baseline router model has five pipeline stages and five input/output ports. In the baseline router, there are four VCs in each input port and each VC has a buffer size of eight flits. We modify the configuration parameters and the implementation details in the router model to simulate our design. Additional buffer space is added to emulate the storage overhead of the registers. Figure
20 presents the increased power consumption of a router for CoD-Router and Eca-Router compared with the baseline router, and the power result includes the corresponding leakage power and the dynamic power. It should be noted that we put the cost of registers into the buffer. Therefore, the main increase in power consumption comes from the switch allocator and the buffer. In total, power consumption is increased by 2.6% and 3.8% for Eca-Router and CoD-Router, respectively. Because we have added the cost of RS process to that of switch allocator, the switch allocator contributes the most in power consumption. We also present the area overhead of our design in Figure
21. In general, the area overhead of our design is acceptable and follows a trend similar to that of power consumption. In total, the increase in area of Eca-Router and CoD-Router is 2.4% and 4.1%, respectively. The increase in area of the switch allocator of Eca-Router and CoD-Router is 9.7% and 10.0%, respectively.