# Low Power Gated-Clock Design for Multi-core DSP Based SDR Platform

Xu Li, An Peng, Wang Yu and Li Jun

School of Electronic and Information Engineering, Ningbo University of Technology, Ningbo 315016, China
Cnxl1984@gmail.com

### Abstract

With the rapid development of wireless communication systems, multi-core DSPs with high computing performance have become an important part of SDR (Software-Defined Radio) platforms. Because SDR platforms are sensitive to power consumption, research has focused on their low-power design. Based on the baseband digital signal processing features of SDR applications, this paper proposes a data-driven and task-driven gated-clock architecture. With this architecture, the DSP cores in a multi-core DSP can be turned on and off at the appropriate time. Experiments show that the proposed low-power gated-clock architecture reduces the average power of a multi-core DSP in an SDR platform by up to 33.27% for preamble processing of an IEEE 802.11a receiver.

**Keywords**: software-defined radio; low-power design; gated clock

### 1. Introduction

With the development of communication technology, especially in wireless fields, various standards and protocols have been applied to wireless communications. Meanwhile, the implementation of communication systems has become increasingly complex. To accommodate these changes, software-defined radio (SDR) has attracted increasing attention. SDR generally adopts a programmable processor, such as a DSP, to realize the wireless communication system, which improves the flexibility of the implementation. At present, multi-core DSPs are often used to build SDR platforms because of their high computational capabilities [1-2]. An SDR platform is sensitive to power consumption; therefore, low-power design for multi-core DSP based SDR platforms is essential [3-4].

This paper proposes a data-driven and task-driven gated-clock architecture, together with related special instructions, for low-power design that exploits the features of baseband digital signal processing. First, the master DSP core can turn the slave DSP cores on or off via the data-driven gated-clock architecture when data from the network load arrives or when its processing is completed. Second, baseband digital signal processing is divided into tasks, which are assigned to the DSP cores according to the dependencies between them. A DSP core can turn itself off via the task-driven gated-clock architecture when its assigned task is completed. Hence, the gated-clock architecture flexibly activates and deactivates the DSP cores, which achieves the goal of low power consumption.

The rest of the paper is organized as follows. Section 2 presents related work on multi-core DSP low-power design and describes the features of baseband digital signal processing. Section 3 details the design of the gated-clock architecture and the related special instructions that realize the data-driven and task-driven gated clocks. Section 4 uses a multi-core DSP implementation of an IEEE 802.11a receiver as an example to evaluate and analyze the gated-clock architecture. Section 5 concludes the paper.

ISSN: 2005-4246 IJUNESST Copyright © 2016 SERSC

### 2. Related Work

Figure 1 shows the model of a multi-core DSP architecture, which includes the DSP cores, the interconnection between DSP cores, and the on-chip memory system. Low-power design for multi-core DSPs therefore mainly addresses these three parts.



Figure 1. Multi-Core DSP Architecture Model

The independent processor cores of a multi-core processor are connected by an interconnection structure, and each processor core has its own independent instruction stream. A single-core processor achieves high performance with a complex architecture, whereas a multi-core processor includes many low-complexity processor cores to improve parallelism. To reduce power consumption, multi-core processors usually use a heterogeneous multi-core architecture or a dynamic voltage and frequency scaling mechanism.

The interconnection structure is an important part of a multi-core processor; it carries data and control information between processor cores. As shown in Figure 2, interconnection structures can generally be classified into several types, such as hierarchical and mesh networks, among others [5]. The interconnection structure also incurs a certain amount of power overhead, which arises from its switch units and data traffic. Tran et al. proposed a multilayer interconnection structure in which distant processor cores of a mesh interconnection are directly connected to each other through an additional programmable switch unit [6,7]. This avoids the situation in a traditional mesh interconnection where distant processor cores must communicate through a large number of switch units. The multilayer interconnection structure reduces the overhead of the switch units, thereby reducing the power consumption of multi-core processors. L. Hang et al. adopted a dynamic voltage/frequency scaling mechanism for the communication links in the interconnection structure, which dynamically adjusts the link voltage according to link usage to lower power consumption [8].



Figure 2. Interconnection Structure

Because off-chip memory access adds latency and power consumption, multi-core processors often integrate a certain number of on-chip storage units to speed up memory access and reduce power consumption [9-10]. On-chip memory optimization mainly involves allocating on-chip memory sensibly to reduce off-chip memory accesses. G. Suh et al. dynamically allocated the second-level cache among different processes, based on cache miss rates collected in real time, to reduce off-chip memory accesses [11-12]. N. Vujic et al. proposed an adaptive and predictive mechanism for code optimization on the CELL multi-core processor to reduce program execution time [13-14].

These studies show that low-power design of multi-core processor architectures has attracted considerable attention. As one type of multi-core processor, the multi-core DSP in software-defined radio is often used in mobile devices with demanding low-power requirements. A multi-core DSP can therefore optimize its power consumption based on the characteristics of baseband digital signal processing.

As shown in Figure 3, baseband digital signal processing in a wireless communication system can be divided into a receiver and a transmitter. The receiver demodulates the signal from the radio front-end module and then sends the bit-stream data to the MAC layer. The transmitter modulates the MAC layer bit-stream data into physical layer frames and then sends them through the wireless front end. According to its processing purpose, baseband digital signal processing can be divided into three parts: digital front end, modem, and codec. The currently popular wireless communication standards, such as WLAN, WiMAX, *etc.*, are data-stream dominant and have synchronous data flow characteristics in digital processing [15-16].



Figure 3. Baseband Digital Signal Processing

### 3. Multi-core DSP Low-Power Design

#### 3.1. Data Driven Gated-clock

In a wireless communication system, the A/D converter in the receiver does not sample physical frames continuously, and the MAC layer does not send bit-stream data to the physical layer continuously. When the baseband digital signal processing tasks are assigned to the DSP cores of the multi-core DSP, a DSP core that has no data to process may be deactivated until it is required. This paper therefore designs a data-driven gated clock for activating and deactivating the DSP cores; its architecture is shown in Figure 4. In this data-driven gated-clock architecture, one DSP core is selected as the master DSP core, which monitors the communication payload. In addition to the basic DSP core architecture, the master DSP core includes a control unit, a MUX unit, and a state machine.



Figure 4. Data Driven Gated-Clock Architecture

Table 1 shows the format of the special instruction that the master DSP core uses to turn the other DSP cores on or off. The master DSP core fetches the special instruction and decodes it in the corresponding control unit. The control unit sends the last two bits of the special instruction, known as the clock control signal, to the state machine. Based on these two bits, the state machine generates the enabling signal for the MUX. Finally, via the MUX, the enabling signal turns the other DSP cores on or off. Figure 5 shows the state machine.

Table 1. Special Instruction on Master DSP Core

| Special instruction flag (n-2 bits) | Parameter bits (2 bits) | Operation                    |
|-------------------------------------|-------------------------|------------------------------|
| flag                                | 1 1                     | Turn on the other DSP cores  |
| flag                                | 1 0                     | Turn off the other DSP cores |
| flag                                | 0 1                     | Idle                         |
| flag                                | 0 0                     | Idle                         |



Figure 5. State Machine of Data Driven Gated-clock Architecture

By issuing this special instruction on the master DSP core, programmers can turn the other DSP cores on or off, so that the other DSP cores are turned on only when valid data is available, which reduces average power consumption.
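The decode path described in this subsection can be illustrated with a minimal behavioral sketch. It is not the authors' Verilog implementation: the function and class names are hypothetical, the "last two bits" are assumed to be the two least significant bits of the instruction word, and the slave cores are assumed to start gated off.

```python
def decode_clock_control(instruction: int) -> str:
    """Map the two parameter bits of the special instruction to an action.

    Assumption: the parameter bits are the two least significant bits.
    Encoding follows Table 1: 11 = turn on, 10 = turn off, 01/00 = idle.
    """
    bits = instruction & 0b11
    if bits == 0b11:
        return "on"
    if bits == 0b10:
        return "off"
    return "idle"


class GateStateMachine:
    """Holds the enabling signal that the MUX applies to the other DSP cores."""

    def __init__(self) -> None:
        # Assumption: the other cores are gated off until data arrives.
        self.enabled = False

    def step(self, action: str) -> bool:
        """Update the enabling signal; 'idle' keeps the previous state."""
        if action == "on":
            self.enabled = True
        elif action == "off":
            self.enabled = False
        return self.enabled


def gated_clock(clk: int, enabled: bool) -> int:
    """Pass the clock level through only while the enabling signal is high."""
    return clk if enabled else 0
```

In this model an idle instruction leaves the enabling signal unchanged, so the other cores stay in whatever state the last turn-on or turn-off instruction selected.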

#### 3.2. Task Driven Gated-clock

When the multi-core DSP obtains valid data for processing, the baseband digital signal processing is mapped to the DSP cores in the multi-core DSP. Given the features of baseband digital signal processing, the processing is generally split into several tasks that are statically mapped to the cores. This step takes advantage of the multi-core DSP to achieve task-level parallelism. In Figure 6, assume that the baseband digital signal processing is split into four tasks, which are mapped to four DSP cores. Each DSP core processes the received data packets in a pipelined manner. At time t4 in Figure 6, DSP core 1 processes the first task of the fourth packet, DSP core 2 processes the second task of the third packet, DSP core 3 processes the third task of the second packet, and DSP core 4 processes the fourth task of the first packet, all simultaneously. The multi-core DSP can thus handle four packets in parallel, each at a different task, and each DSP core processes only the task assigned to it, without changing tasks online.



Figure 6. Parallel Task Execution on Multi-core DSP
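The pipelined mapping in Figure 6 can be expressed as a short sketch. The function names and the `num_packets` parameter are hypothetical; the sketch assumes that core k always runs task k and that one packet enters the pipeline per time slot.

```python
from typing import Dict, Optional


def packet_on_core(core: int, slot: int, num_packets: int) -> Optional[int]:
    """Return the 1-based packet that `core` handles in time slot `slot`.

    Core k runs task k, so packet p reaches core k in slot p + k - 1.
    None means the core has nothing to do (pipeline fill or drain).
    """
    packet = slot - core + 1
    if 1 <= packet <= num_packets:
        return packet
    return None


def schedule(slot: int, num_cores: int, num_packets: int) -> Dict[int, Optional[int]]:
    """Map every core (1..num_cores) to its packet in the given slot."""
    return {
        core: packet_on_core(core, slot, num_packets)
        for core in range(1, num_cores + 1)
    }
```

With four cores, `schedule(4, 4, 6)` gives `{1: 4, 2: 3, 3: 2, 4: 1}`, which matches the situation at t4 described above. In the first and last slots of a frame some cores map to `None`, which is where a gated clock has room to switch them off.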

Figure 6 indicates that the multi-core DSP can execute the tasks of baseband digital signal processing as a parallel pipeline. The pipeline interval depends on the longest task execution time. However, task mapping cannot guarantee that the task execution times on the DSP cores are exactly the same. As a result, the DSP cores with shorter task execution times must wait for a certain period before executing the next task. In addition, processing the start and end of a physical layer frame requires only some of the DSP cores. For these two reasons, this paper introduces a task-driven gated clock and extends the special instructions of the data-driven gated clock so that the two mechanisms can be combined. Once the multi-core DSP has obtained valid data, the DSP cores can be dynamically turned off and statically turned on to reduce power consumption.

Static activation means that each DSP core is turned on according to its fixed task assignment, which does not change online. A master DSP core (DSP core 1) turns on the other DSP cores at every pipeline interval. For example, the master DSP core turns on DSP cores 2, 3, and 4 at t4, as shown in Figure 6. When the processing of a physical layer data frame starts or ends, the master DSP core turns on the DSP cores in the proper sequence. Dynamic deactivation means that each DSP core other than the master turns itself off once it has completed its assigned task.
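The effect of static activation and dynamic deactivation on clock activity can be estimated with a small sketch. It counts clocked cycles only and does not model power. The function name and the example cycle counts are hypothetical, and the sketch assumes that the pipeline interval equals the longest task time, that the master core stays clocked for the whole interval, and that every other core is turned on at the start of the interval and gates itself off as soon as its task finishes.

```python
from typing import List, Tuple


def clock_activity(task_cycles: List[int], master: int = 0) -> Tuple[int, List[float], int]:
    """Return (interval, clocked fraction per core, gated cycles per interval).

    task_cycles[i] is the execution time of the task mapped to core i.
    """
    interval = max(task_cycles)  # the longest task sets the pipeline interval
    fractions: List[float] = []
    gated_cycles = 0
    for core, cycles in enumerate(task_cycles):
        if core == master:
            # The master keeps running so it can turn the other cores on.
            fractions.append(1.0)
        else:
            fractions.append(cycles / interval)
            gated_cycles += interval - cycles
    return interval, fractions, gated_cycles


# Hypothetical task times in cycles for four cores (index 0 is the master).
interval, fractions, gated = clock_activity([40, 100, 60, 80])
```

For these illustrative numbers the interval is 100 cycles, the two shorter slave tasks are clocked for 60% and 80% of it, and 60 slave clock cycles per interval are gated off.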

#### 3.3. Dynamic Turn Off and Static Turn On

To realize dynamic turn-off and static turn-on, this paper designs a special instruction for the task-driven gated clock and extends the special instruction of the data-driven gated clock. The hardware architecture that combines the task-driven and data-driven gated clocks is shown in Figure 7. Each DSP core has an internal control unit, which decodes the special instructions to generate a clock control signal. Each DSP core other than the master also contains a strobe unit and a state machine that switch the core on and off. As shown in Figure 8, this state machine differs from the one in the data-driven gated clock in that it can generate the enabling signal that turns off the DSP core by itself. The special instructions for the master DSP core and for the other DSP cores are listed in Tables 2 and 3.



Figure 7. Task Driven Gated-clock Architecture



Figure 8. State Machine of the Task Driven Gated-clock Architecture

Table 2. Modified Special Instruction on Master DSP Core

| Special instruction flag (n-6 bits) | Identify bits (4 bits)                                           | Parameter bits (2 bits) | Operation                                                                    |
|-------------------------------------|------------------------------------------------------------------|-------------------------|------------------------------------------------------------------------------|
| flag                                | ID of the DSP core (0000 is global turn on or global turn off)   | 1 1                     | Turn on DSP core 2 through the DSP core with the identified ID               |
| flag                                | ID of the DSP core (0000 is global turn on or global turn off)   | 1 0                     | Turn on the DSP core with the identified ID through the DSP core with the maximum ID |
| flag                                | ID of the DSP core (0000 is global turn on or global turn off)   | 0 1 or 0 0              | Idle                                                                         |
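A decoder for the modified master instruction might look like the following sketch. The bit layout is an assumption (parameter bits in bits 1..0, identify bits in bits 5..2, flag in the remaining upper bits), as are the function name and the return format. The table does not say how the two parameter values select between global turn on and global turn off for ID 0000; the sketch assumes 11 means global on and 10 means global off, by analogy with Table 1.

```python
from typing import List, Tuple


def decode_master_instruction(instruction: int, max_core_id: int) -> Tuple[str, List[int]]:
    """Return (action, affected core IDs) for a modified special instruction.

    Core 1 is the master; the other cores are numbered 2..max_core_id.
    """
    params = instruction & 0b11            # assumed position: bits 1..0
    core_id = (instruction >> 2) & 0b1111  # assumed position: bits 5..2
    if params in (0b01, 0b00):
        return ("idle", [])
    if core_id > max_core_id:
        raise ValueError("identify bits name a core that does not exist")
    others = list(range(2, max_core_id + 1))
    if core_id == 0:
        # Assumption: 11 = global turn on, 10 = global turn off.
        return ("global_on" if params == 0b11 else "global_off", others)
    if params == 0b11:
        # Turn on DSP core 2 through the core with the identified ID.
        return ("on", list(range(2, core_id + 1)))
    # params == 0b10: turn on the identified core through the maximum ID.
    return ("on", list(range(core_id, max_core_id + 1)))
```

For a four-core system, an instruction with identify bits 0011 and parameter bits 11 selects cores 2 and 3, which corresponds to the gradual turn-on sequence at the start of a frame.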

### 4. Evaluation and Analysis

The heterogeneous multi-core DSP and the low-power design architecture are described in Verilog HDL, and their power consumption is evaluated and analyzed with EDA tools (Synopsys Design Compiler) and a Xilinx Virtex-4 FPGA, as shown in Figure 9. The digital signal processing workload on the heterogeneous multi-core DSP is IEEE 802.11a baseband receiver processing [17].



Figure 9. FPGA Implementation Platform

The heterogeneous multi-core DSP used for evaluation and analysis is based on a ring interconnection, as illustrated in Figure 10 [18-19]. The heterogeneous processor has two processor cores, and all processor cores run at 400 MHz. Each processor core has a different computational capacity. The memory system of each processor core includes a 32×32-bit register file and a number of private memories (PM). A shared memory (SM) is implemented between two processor cores and is used to transfer data between the DSP cores. The IEEE 802.11a baseband receiver processing is divided into a preamble part and a data symbol part. Preamble processing includes frame synchronization, carrier frequency offset estimation, carrier frequency offset compensation, guard removal, channel estimation, and so on. Data symbol processing includes carrier frequency offset compensation, guard removal, 64-point FFT, channel equalization, pilot removal, demodulation, de-puncturing, Viterbi decoding, de-scrambling, and so on.



Figure 10. Architecture of Heterogeneous Multi-core DSP

The IEEE 802.11a baseband receiver processing tasks are mapped onto the multi-core DSP as shown in Figure 11. The heterogeneous multi-core DSP, which includes a Viterbi hardware accelerator, has nine DSP cores and is synthesized in a 45 nm logic technology with EDA tools [20]. The die area of the multi-core DSP is 3.43 mm² without the gated clock and 3.4302 mm² with it. The average power of the multi-core DSP without the gated clock is 1244.42 mW at a global voltage of 1.1 V and a clock frequency of 400 MHz. The results of the experiment show that the gated clock increases the die area of the multi-core DSP by 0.005%. The multi-core DSP remains effective for IEEE 802.11a baseband receiver processing. With the gated clock, the multi-core DSP can turn off idle DSP cores at the proper time. For the processing of the preamble and data symbols, the average power consumption of the heterogeneous multi-core DSP is 830.35 mW and 830.35 mW respectively. Compared with the multi-core DSP without the gated clock, the average power is reduced by 33.27% and 27.54%. Unlike the MCMT multi-core DSP and the Godson III multi-core DSP, which use an in-core gated clock, the proposed gated clock is data-driven and task-driven and turns entire DSP cores off and on. Moreover, the proposed gated-clock low-power design method can be combined with other low-power techniques to achieve even lower power.
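The relative figures in this paragraph follow from simple ratios, which the sketch below reproduces from the reported values. The function and constant names are hypothetical. Note that the text reports the same 830.35 mW for both the preamble and the data symbols; the sketch therefore only derives, as an assumption, the average power that the stated 27.54% reduction would imply, and does not treat it as a measured result.

```python
def reduction_percent(baseline: float, value: float) -> float:
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return (baseline - value) / baseline * 100.0


def increase_percent(baseline: float, value: float) -> float:
    """Relative increase of `value` with respect to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


BASELINE_MW = 1244.42  # reported average power without the gated clock
PREAMBLE_MW = 830.35   # reported average power for preamble processing

preamble_saving = reduction_percent(BASELINE_MW, PREAMBLE_MW)  # about 33.27

# Derived, not reported: power implied by the stated 27.54% reduction.
implied_data_symbol_mw = BASELINE_MW * (1.0 - 27.54 / 100.0)   # about 901.7

# Die area overhead from the reported 3.43 mm^2 and 3.4302 mm^2.
area_overhead = increase_percent(3.43, 3.4302)                 # about 0.0058
```

The preamble figure reproduces the reported 33.27%, and the area ratio evaluates to roughly 0.0058%, in line with the reported 0.005% overhead.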



Figure 11. IEEE 802.11 Baseband Receiver Processing Model

### 5. Conclusions

Based on the characteristics of OFDM systems, this paper proposes a novel data-driven and task-driven gated-clock architecture to improve the power and energy efficiency of multi-core DSP based SDR platforms. With this architecture, the DSP cores in the multi-core DSP can be turned on and off at the appropriate time. The results show that the architecture is feasible in a multi-core DSP based SDR platform and that, as expected, it can reduce energy and power by up to 33.27% in OFDM preamble processing. Future work involves applying the technique to other DSPs at different process nodes, investigating the margins that exist at smaller feature sizes, and exploring how this data-driven and task-driven gated-clock architecture can be incorporated into an energy-aware operating system.

### **Acknowledgments**

This work was supported by the National Natural Science Foundation of China (61502256), the Zhejiang Provincial Natural Science Foundation of China (LY15F020011, LY12F01002), and the Foundation of Zhejiang Educational Committee of China (Y201327685, Y201224456).

### References

- [1] L. J. Karam, I. AlKamal, A. Gatherer, G. A. Frantz, D. V. Anderson and B. L. Evans, "Trends in multicore DSP platforms", IEEE Signal Processing Magazine, vol. 26, no. 6, (2009), pp. 38-49.
- [2] H. Jiang-Zhou, W.-Guang Chen, C. Guang-Ri, W.-M. Zheng, Z.-Z. Tang and H.-D. Ye, "OpenMDSP: Extending OpenMP to Program Multi-Core DSPs", Journal of Computer Science and Technology, vol. 29, no. 2, (2014), pp. 316-331.
- [3] R. Kumar, D. M. Tullsen and N. P. Jouppi, "Core Architecture Optimization for Heterogeneous Chip Multiprocessors", the 15th International Conference on Parallel Architectures and Compilation Techniques, Seattle, USA, (2006).
- [4] J. Howard, S. Dighe and S. Vangal, "A 48-Core IA-32 Processor in 45 nm CMOS Using On-Die Message-Passing and DVFS for Performance and Power Scaling", IEEE Journal of Solid-State Circuits, vol. 46, no. 1, (2011), pp. 173-183.
- [5] R. Berg, L. König, J. Rühaak, R. Lausen and B. Fischer, "Highly efficient image registration for embedded systems using a distributed multicore DSP architecture", Journal of Real-Time Image Processing, (2014), pp. 1-21.
- [6] A. Tran, D. Truong and B. Baas, "A Reconfigurable Source-Synchronous On-Chip Network for GALS Many-Core Platforms", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 29, no. 6, (2010), pp. 897-910.
- [7] A. T. Tran and B. M. Baas, "Achieving High-Performance On-Chip Networks With Shared-Buffer Routers", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 22, no. 6, (2014), pp. 1391-1403.
- [8] L. Hang, L. S. Peh and N. K. Jha, "Dynamic voltage scaling with links for power optimization of interconnection networks", the Ninth International Symposium on High-Performance Computer Architecture, Anaheim, USA, (2003).
- [9] T. Moreshet, R. I. Bahar and M. Herlihy, "Energy-aware microprocessor synchronization: Transactional memory vs. locks", Workshop on Memory Performance Issues, Austin, USA, (2006).
- [10] E. Gaona, J. R. Titos-Gil, J. Fernández and M. E. Acacio, "Selective dynamic serialization for reducing energy consumption in hardware transactional memory systems", The Journal of Supercomputing, vol. 68, no. 2, (2014), pp. 914-934.
- [11] G. Suh, L. Rudolph and S. Devadas, "Dynamic partitioning of shared cache memory", Journal of Supercomputing, vol. 28, no. 1, (2004), pp. 7-26.
- [12] R. Quislant, E. Gutierrez, E. L. Zapata and O. Plata, "Improving Signature Behavior by Irrevocability in Transactional Memory Systems", 26th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), (2014).
- [13] N. Vujic, L. Alvarez and M. Tallada, "Adaptive and Speculative Memory Consistency Support for Multi-core Architectures with On-Chip Local Memories", Languages and Compilers for Parallel Computing, (2010), pp. 218-232.
- [14] F. L. Teixeira, M. L. Pilla, A. R. Du Bois and D. Mosse, "Profiling Patterns of Bit Flipping for Software Transactional Memories", 26th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), (2014), pp. 136-143.
- [15] E. Lee and D. Messerschmitt, "Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing", IEEE Transactions on Computers, vol. 32, no. 1, (1987), pp. 24-35.
- [16] W. Ahmad, R. de Groote, P. K. F. Holzenspies, M. Stoelinga and J. Van de Pol, "Resource-constrained optimal scheduling of synchronous dataflow graphs via timed automata", 14th International Conference on Application of Concurrency to System Design (ACSD), (2014), pp. 72-81.
- [17] "IEEE Std 802.11a/D7.0-1999", Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: High Speed Physical Layer in the 5 GHz Band. http://standards.ieee.org/getieee802/download/802.11a-1999.pdf, (2007).
- [18] W. Ye, J. Heidemann and D. Estrin, "Medium access control with coordinated adaptive sleeping for wireless sensor networks", IEEE/ACM Transactions on Networking, vol. 12, no. 3, (2004), pp. 493-506.
- [19] L. Xu, Q. Wang and S. Shi, "A task mapping and scheduling algorithm for heterogeneous multicore processor based SDR platform", Journal of Computational Information Systems, (2011).
- [20] "FreePDK 45nm", http://www.eda.ncsu.edu/wiki/FreePDK45:Contents, (2014).

### **Authors**



**Xu Li** received the Ph.D. degree in computer system structure from the University of Science and Technology, Beijing, China, in 2012. He is currently a lecturer at Ningbo University of Technology. His main research interests include computer architecture, VLSI, and the Internet of Things.



**An Peng** is currently an associate professor at Ningbo University of Technology. His main research interests include embedded systems and the Internet of Things.



**Wang Yu** is currently a professor at Ningbo University of Technology. Her main research interests include machine vision and computer graphics.



**Li Jun** is currently a lecturer at Ningbo University of Technology. Her main research interests include machine learning and computer graphics.