MPEG-4 Performance Analysis For A CDMA Network-on-Chip: Manho Kim, Daewook Kim and Gerald E. Sobelman
Abstract: Realistic traffic patterns for a multi-processor MPEG-4 architecture are used to evaluate the performance of network-on-chip (NoC) implementations. In particular, we study the characteristics of a design that is based on CDMA switching techniques and a star-network topology. The results are compared to those for a more conventional mesh-topology NoC. We evaluate metrics for bandwidth requirements, latency and area overhead and show that the CDMA star design is a good candidate for the implementation of these systems.
I. INTRODUCTION
Multimedia applications are widespread and will become even more important in the future. Video telephony, digital television, video games, virtual reality simulators, etc. are growth areas, and the MPEG-4 standard has emerged as a key ingredient in many of these systems [1], [2], [3]. Therefore, efficient hardware platforms to perform the set of algorithms within the standard are of great interest. Since these computations are varied, it is necessary to include a range of hardware resources in the system such as a DSP processor, RISC CPU, graphics engine, etc. [10]. As a result, there is a need for high-throughput communication links between these blocks, and these links can become a performance bottleneck. A bus-based interconnect scheme is a shared medium which does not scale well to large systems requiring very high aggregate bandwidth. Networks-on-chip have been proposed as a way to overcome this limitation and provide a scalable interconnect environment [4].

Several types of network switches and topologies have been proposed, but most of the performance analysis is done using random traffic models where the computational blocks are simply modeled as random number generators without respect to any particular application. These types of analyses are only of limited utility, since they do not address the traffic requirements that one would find in an actual application. Some recent papers have used more realistic traffic models. For example, Varatkar et al. proposed on-chip traffic analysis using self-similar processes [7]. Ching et al. discussed integrated modeling and traffic generation for a reconfigurable NoC [8]. Murali and De Micheli [5] proposed automatically mapping cores onto a network architecture using a mesh/torus topology by making use of the Xpipes library [6]. Wiklund et al. use both random and data-plus-control traffic models in their analyses [9].

In this paper, we study the performance and area overhead
of NoCs for MPEG-4 system implementations. In particular, we focus on the properties of a recently proposed NoC that is based on using CDMA switching techniques to concurrently route multiple data streams between computational resources [11]. We use the average data rates between the blocks in a realistic MPEG-4 implementation that uses a CDMA star network topology. In addition, we compare the performance and area overhead of the CDMA NoC to that of a more conventional mesh topology NoC for this application.

The remainder of this paper is organized as follows. In Section II, we describe the mapping methodology that is used to create the MPEG-4 system implementation and traffic model. In Section III, we briefly describe the properties of our CDMA-based star network topology. Then, in Section IV, we give the specific mapping that results for the MPEG-4 application onto our NoC, as well as onto a baseline mesh topology NoC. Section V presents our results for performance and area overhead, and our conclusions are given in Section VI.

II. MAPPING METHODOLOGY
Our overall procedure for characterizing the performance of the NoC for a particular application such as MPEG-4 is illustrated in Figure 1. We start with the given communication characteristic parameters such as bandwidth requirements, payload size, buffer size and operating clock period. The traffic generator, which is actually a resource (IP) model written in a hardware description language, generates traffic with a given probability of a packet being sent out to a given destination resource. The generated input traffic data is used to simulate the application on the NoC platform. Using the specified communication characteristics, a mapper groups the resources which communicate frequently onto the same switch in order to reduce latency. The transmitted and received packet traffic is traced in a log file. In a post-processing phase, we use the log file to analyze the performance. In particular, the latency of packet transmission and the FIFO buffer full signal are monitored. These steps can be iterated until all the requirements are met. After finding the best-case structure and mapping for the NoC platform, we can then synthesize it and thereby obtain the estimated frequency and area overhead of the system.
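The traffic generator in this flow is a resource model written in a hardware description language. The Python sketch below only illustrates the underlying idea of sending a packet on each path with a given probability. The function name `generate_traffic`, the independent per-cycle draw and the `(cycle, source, destination)` log layout are assumptions made for illustration, not details taken from the paper.

```python
import random


def generate_traffic(tx_prob, num_cycles, seed=0):
    """Return a list of (cycle, src, dst) packet records.

    tx_prob maps a (src, dst) path to the probability, in [0, 1],
    that a packet is sent on that path in any given cycle.
    """
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    log = []
    for cycle in range(num_cycles):
        # Iterate in sorted order so the result does not depend on dict order.
        for (src, dst), prob in sorted(tx_prob.items()):
            # rng.random() lies in [0, 1): prob 1.0 always sends, 0.0 never does.
            if rng.random() < prob:
                log.append((cycle, src, dst))
    return log
```

A path with probability 1.0 produces one packet per cycle, which corresponds to a path that always carries traffic.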
The entries in this table can be sorted to indicate the most frequently communicating blocks in the design. These blocks can then be given the highest priority in terms of scheduling. Also, once this table has been constructed, the relative rate of transmitting a packet can be determined for each of the IP blocks and a corresponding packet traffic file can be generated for use in simulations. The transmission probabilities are calculated as follows. The highest bandwidth requirement for this application is the 455 MBytes/sec between the SDRAM and the upsampling unit. All other bandwidth values in the table are then normalized relative to this value, so that this path has a transmission rate of 100%, while all of the others have a lower rate. In other words, the path between those two IP blocks always carries traffic, whereas the other paths carry traffic less often, according to their transmission percentages. The normalized values for the various paths are given in Figure 3.
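The normalization described above can be sketched in a few lines of Python. Only the 455 MBytes/sec value for the SDRAM to upsampling path comes from the text; the function name `normalize_rates`, the block names used as keys and the 91 MBytes/sec entry are hypothetical placeholders.

```python
def normalize_rates(bandwidth):
    """Map each path's bandwidth to a percentage of the peak bandwidth."""
    peak = max(bandwidth.values())
    return {path: 100.0 * bw / peak for path, bw in bandwidth.items()}


# Directed bandwidths in MBytes/sec.
example = {
    ("SDRAM", "UPSAMPLE"): 455.0,  # highest requirement quoted in the text
    ("BLOCK_A", "BLOCK_B"): 91.0,  # hypothetical value, for illustration only
}
rates = normalize_rates(example)
```

With these inputs the peak path is assigned 100% and the hypothetical path one fifth of that.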
Fig. 1. Mapping methodology.
We consider a particular MPEG-4 video processing application that was presented in [12]. We have used the IP resources and the inter-resource bandwidth requirements specified in that paper as our starting point. There are a total of 12 IP blocks in this design. They are listed as follows, where the number in parentheses is used as an index to identify each of the blocks: audio output processor (1), audio DSP processor (2), media CPU (3), video output processor (4), 3D graphics processor (5), SDRAM (6), SRAM1 (8), quantization unit (9), SRAM2 (10), RISC CPU (11), scaling unit (12) and upsampling unit (13). The communication requirements between these blocks are specified in the data structure of Figure 2. In this table, the entry (i, j) specifies the average communication bandwidth from block i to block j. (The bandwidth data specified in Ref. [12] are the total bidirectional bus traffic between pairs of blocks. For simplicity, we have split the total value equally between the two directions of data transfer for each pair of IP blocks.)
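The equal split of bidirectional totals into directed table entries can be sketched as follows. The function name `split_bidirectional` and the dictionary layout are assumptions. The single example entry uses the block indices from the list above and the 910 MBytes/sec annotation that survives in the figure residue, which is consistent with the 455 MBytes/sec per-direction figure quoted for the SDRAM and upsampling pair.

```python
def split_bidirectional(totals):
    """Turn bidirectional totals {(i, j): MBytes/sec} into directed entries.

    Each total is divided equally between the i -> j and j -> i directions.
    """
    directed = {}
    for (i, j), total in totals.items():
        half = total / 2.0
        directed[(i, j)] = half
        directed[(j, i)] = half
    return directed


# SDRAM is block 6 and the upsampling unit is block 13.
table = split_bidirectional({(6, 13): 910.0})
```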
Fig. 3.
III. CDMA STAR NOC ARCHITECTURE
In a wired CDMA communication network, each data bit is represented as either an L-bit Walsh codeword or its one's complement, depending on whether the bit is a 0 or a 1, respectively [11]. We refer to this process as modulation. While this leads to an increase in the number of bits to be transmitted by each resource, it is offset by the fact that up to L − 1 resources can transmit concurrently through a switch. Each packet is composed of group identification, source address, destination address and payload fields. The transmitter module selects a codeword depending on the destination field of the packet and sends out a corresponding modulated codeword. The modulated codewords from the different sources attached to a switch are then summed together using a code adder block. At the receiving side, the demodulation module recovers the original transmitted data using the same codeword that was used for transmission.

Each transmitter module includes a FIFO buffer. This buffer is used for storing packets when other transmitters also wish to send a packet to the same destination at the same time. In this case, the scheduler controls which packet to send according to a predefined scheduling algorithm. If the FIFO is full, the packet is not dropped. Rather, the transmitter sends a buffer full signal to the corresponding resources. Those resources stop sending packets until that signal is deasserted.
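The modulation, code addition and demodulation steps can be illustrated with a small Python model. This is a sketch under stated assumptions, not the hardware design of [11]: codewords are held in bipolar form (+1/-1), so that the one's complement becomes a negation and demodulation becomes a correlation; the function names are hypothetical; and the all-ones Walsh code is left unused, which is one plausible reading of the L − 1 limit.

```python
def walsh_codes(length):
    """Return the rows of a Sylvester-Hadamard matrix as lists of +1/-1."""
    assert length >= 1 and length & (length - 1) == 0, "length must be a power of two"
    rows = [[1]]
    while len(rows) < length:
        # H_2n = [[H, H], [H, -H]]
        rows = [r + r for r in rows] + [r + [-c for c in r] for r in rows]
    return rows


def modulate(bit, code):
    """Bit 0 sends the codeword, bit 1 sends its complement (negation here)."""
    return [c if bit == 0 else -c for c in code]


def code_adder(signals):
    """Sum the modulated codewords chip by chip."""
    return [sum(chips) for chips in zip(*signals)]


def demodulate(summed, code):
    """Correlate with the codeword; return None if nothing was sent on it."""
    corr = sum(s * c for s, c in zip(summed, code))
    if corr == 0:
        return None
    return 0 if corr > 0 else 1


codes = walsh_codes(8)
sent = {1: 0, 2: 1, 5: 1}  # code index -> data bit; index 0 (all ones) unused
summed = code_adder([modulate(bit, codes[k]) for k, bit in sent.items()])
recovered = {k: demodulate(summed, codes[k]) for k in sent}
```

Because distinct Walsh codes are orthogonal, each receiver sees only the contribution of its own codeword in the summed signal, so `recovered` matches `sent`.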
Fig. 2.
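Fig. 5. Mapping of the MPEG-4 decoder system onto the CDMA star NoC. [Diagram not recovered. Surviving labels: Video out, Audio out, Media CPU, 3D CPU, SDRAM, SRAM1, SRAM2, Audio DSP, RISC CPU, Scaling, Up Sample, QUANT; CDMA SW 1 and CDMA SW 2, with Scheduler, Code Adder and TX/RX module labels; bandwidth annotations in MBytes/sec: 0.5, 0.5, 32, 40, 40, 60, 173, 190, 250, 500, 600, 670, 910.]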
Fig. 4. [Diagram not recovered. Surviving labels: TX 4, TX 5, RX 4, RX 5; Xbar 1, Xbar 5, Xbar 6, Xbar 7, Xbar 8; SRAM1.]
IV. MAPPING OF MPEG-4
In this section we consider the mapping of the MPEG-4 application onto two types of network-on-chip structures, namely our proposed CDMA star network and a crossbar mesh network.

A. Mapping onto Star NoC with CDMA Switch
Fig. 5 shows a proposed mapping of the MPEG-4 decoder system onto our CDMA NoC architecture. In the figure, solid lines represent actual links between a switch and a processing element (PE). A dotted line represents the required communication bandwidth in MBytes/sec. From the given communication characteristic parameters such as bandwidth requirements, payload size, buffer size and operating clock period, the traffic generator creates packets with the specified probability. The generated input traffic data is used to simulate the MPEG-4 computations mapped onto the NoC platform.

B. Mapping onto Mesh NoC with Crossbar Switch
To compare our CDMA NoC platform with another implementation, we considered mapping onto a crossbar-based mesh topology NoC, as shown in Fig. 6. The crossbar switch has the same input buffer scheme and a buffer size of 8. Because there are a total of 12 IP blocks, we used a 4-by-3 mesh topology to accommodate all of these resources. The shortest-path routing scheme is used for this mesh topology.

V. SIMULATION RESULTS AND PERFORMANCE ANALYSIS
In order to generate the results of interest, we logged all input and output packet transactions into a file. In a post-processing phase, we use a script to analyze the average packet delivery time and the buffer utilization.
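For the 4-by-3 mesh with shortest-path routing described in Section IV-B, the number of routers a packet is forwarded through can be sketched as below. The row-major node numbering, the function names and the choice to count both the source and the destination router are assumptions made for this illustration.

```python
def mesh_coordinates(node, cols):
    """Return (column, row) of a node under row-major numbering."""
    return node % cols, node // cols


def routers_traversed(src, dst, cols=4, rows=3):
    """Count the routers on a shortest path, source and destination included."""
    size = cols * rows
    if not (0 <= src < size and 0 <= dst < size):
        raise ValueError("node index outside the mesh")
    sx, sy = mesh_coordinates(src, cols)
    dx, dy = mesh_coordinates(dst, cols)
    # Manhattan distance gives the number of links; routers = links + 1.
    return abs(sx - dx) + abs(sy - dy) + 1
```

Under these assumptions, a packet crossing the mesh from one corner to the opposite corner passes through six routers, while two resources on neighboring routers need two.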
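Fig. 6. Mapping of the MPEG-4 decoder system onto the crossbar mesh NoC. [Diagram not recovered. Surviving labels: Xbar 10, Xbar 11, BAB, scaling context calc; bandwidth annotations in MBytes/sec: 32, 173, 670.]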
We also compared the hop count for both platforms. For the area comparison, we synthesized both the CDMA star and the crossbar mesh switches using the Synplify ASIC tool with the Chip Express CX4001 0.25 µm structured library.

A. Hop Count
TABLE I. HOP COUNT COMPARISON
The hop count for a packet is defined as the number of routers it has been forwarded through. Table I shows the average number of hops for both platforms. The table indicates that the CDMA star topology has favorable (i.e., lower) hop count values compared to the crossbar mesh topology.

B. Area Overhead
The synthesized area includes the total cell area for either network. In other words, it compares the total area for the 2 required CDMA star switches with the total area for the 12 required crossbar mesh switches. We used a buffer size of 8 in both platforms. The estimated maximum frequency is 76 MHz. Table II shows that our CDMA NoC platform is about two times larger than the crossbar mesh topology platform. The reason is that our prototype CDMA switch uses a more complex algorithm in the TX and RX modules compared to the simple crossbar input and output buffers.
TABLE II. AREA COMPARISON
the average latency for packet transmission. In addition, we compared our proposed NoC implementation to one based on a traditional mesh topology and crossbar switches. It was determined that the latency for the CDMA design is about one-ninth of that for the crossbar mesh network. However, synthesis results show that the area overhead is about 2 times as much for the CDMA network. We also illustrated the basic mapping and traffic generation techniques that can be applied to any multimedia application.

VII. ACKNOWLEDGMENTS
We thank Sang Woo Rhim, Bumhak Lee and Euiseok Kim of Samsung Advanced Institute of Technology (SAIT) for their help with this manuscript. This research work is supported by a grant from SAIT.

REFERENCES
Area [µm²]
CDMA star NoC    Crossbar Mesh NoC
109,090.0        42,383.5
165,003.6        69,535.8
273,014.0        128,946.0
498,256.4        232,813.8
C. Latency
After analyzing the traffic log file, we can obtain the travel times of the packets during transmission. Note that the latency values include the effects of contention between packets destined for the same address at the same time. Latency is one of the important performance parameters of the NoC platform and is computed as follows:

\[ \text{average latency} = \frac{1}{N} \sum_{i=1}^{N} \left( T_i^{\text{received}} - T_i^{\text{transmitted}} \right) \]

where N is the total number of received packets. Table III shows that our CDMA star topology is around 9 times faster than the general crossbar mesh topology.
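The average latency defined above could be computed from the logged timestamps with a post-processing script along the following lines. The function name `average_latency` and the `(transmitted, received)` record layout are assumptions, since the format of the log file is not given.

```python
def average_latency(records):
    """Mean of (received - transmitted) over all received packets.

    records is an iterable of (t_transmitted, t_received) pairs,
    one pair per received packet, in the same time unit.
    """
    latencies = [t_rx - t_tx for t_tx, t_rx in records]
    if not latencies:
        raise ValueError("no received packets in the log")
    return sum(latencies) / len(latencies)
```

For example, two packets sent at times 0 and 10 and received at times 5 and 13 have latencies of 5 and 3, giving an average of 4.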
TABLE III. LATENCY COMPARISON
D. Bandwidth Constraints
The highest bandwidth requirement in our system, as given in Figure 2, is 455 MBytes/sec. We would like to determine whether our CDMA NoC can meet that constraint. The largest possible bandwidth is the maximum clock frequency, which was found to be 76 MHz, multiplied by the payload size in bytes. Of the cases considered, only the 64-bit payload size is sufficient to meet this constraint: 76 MHz × 8 bytes = 608 MBytes/sec. This is a best-case value that does not take into account possible effects due to contention. However, the number is sufficiently high to strongly suggest that the network is fast enough to meet the MPEG-4 throughput requirements.

VI. CONCLUSIONS
We have obtained performance and area overhead results for a high-performance CDMA-based network-on-chip implementation of an MPEG-4 processor. Realistic traffic rates between the IP resources in the design were used to determine
[1] T. Ebrahimi and F. Pereira, The MPEG-4 Book, Prentice Hall PTR, 2002.
[2] I. Richardson, H.264 and MPEG-4 Video Compression: Video Coding for Next Generation Multimedia, John Wiley & Sons, 2003.
[3] P. Kuhn, Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation, Kluwer Academic Publishers, 1999.
[4] H. Tenhunen and A. Jantsch, Networks on Chip, Kluwer Academic Publishers, 2003.
[5] S. Murali and G. De Micheli, Bandwidth-Constrained Mapping of Cores onto NoC Architectures, Design, Automation and Test in Europe Conference and Exhibition, Vol. 2, pp. 896-901, 2004.
[6] D. Bertozzi and L. Benini, Xpipes: A Network-on-Chip Architecture for Gigascale Systems-On-Chip, IEEE Circuits and Systems Magazine, Vol. 4, No. 2, pp. 18-31, 2004.
[7] G. V. Varatkar and R. Marculescu, On-chip Traffic Modeling and Synthesis for MPEG-2 Video Applications, IEEE Transactions on Very Large Scale Integrated Systems, Vol. 12, No. 1, pp. 108-118, 2004.
[8] D. Ching, P. Schaumont and I. Verbauwhede, Integrated Modeling and Generation of a Reconfigurable Network-on-Chip, 18th International Parallel and Distributed Processing Symposium, pp. 139-145, 2004.
[9] D. Wiklund, S. Sathe and D. Liu, Network on Chip Simulations for Benchmarking, 4th IEEE International Workshop on System-on-Chip for Real-Time Applications, pp. 269-274, 2004.
[10] M. Pastrnak, P. Poplavko, P. N. H. de With and D. S. Farin, Data-flow Timing Models of Dynamic Multimedia Applications for Multiprocessor Systems, 4th IEEE International Workshop on System-on-Chip for Real-Time Applications, pp. 206-209, 2004.
[11] D. Kim, M. Kim and G. E. Sobelman, CDMA-Based Network-on-Chip Architecture, IEEE Asia Pacific Conference on Circuits and Systems, pp. 137-140, 2004.
[12] E. B. van der Tol and E. G. Jaspers, Mapping of MPEG-4 Decoding on a Flexible Architecture Platform, Proceedings of SPIE - Media Processors, Vol. 4674, pp. 1-13, 2002.