Chip multiprocessors (CMPs) combine increasingly many general-purpose processor cores on a single chip. These cores run several tasks with unpredictable communication needs, resulting in uncertain and often-changing traffic patterns. This unpredictability leads network-on-chip (NoC) designers to plan for the worst-case traffic patterns and to significantly over-provision link capacities. In this paper, we provide NoC designers with an alternative statistical approach. We first present the traffic-load distribution plots (T-Plots), illustrating how much capacity over-provisioning is needed to service 90%, 99%, or 100% of all traffic patterns. We prove that in the general case, plotting T-Plots is #P-complete, and therefore extremely complex. We then show how to determine the exact mean and variance of the traffic load on any edge, and use these to provide Gaussian-based models for the T-Plots, as well as guaranteed performance bounds. We also explain how to practically approximate T-Plot...
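As a rough illustration of the Gaussian-based idea described above, the following sketch computes the link capacity needed to cover a given fraction of traffic patterns when the load on one edge is modeled as a normal distribution with a known mean and variance. The function name `capacity_for_coverage` and the example numbers are assumptions made for illustration and do not come from the paper. A Gaussian has unbounded support, so this simple model cannot express the 100% case.

```python
from math import sqrt
from statistics import NormalDist


def capacity_for_coverage(mean_load, variance, coverage):
    """Capacity that serves `coverage` of traffic patterns under a Gaussian load model.

    Hypothetical helper: mean_load and variance describe the load on one edge,
    coverage is a fraction strictly between 0 and 1 (e.g. 0.99).
    """
    if not 0.0 < coverage < 1.0:
        raise ValueError("coverage must be strictly between 0 and 1")
    if variance < 0.0:
        raise ValueError("variance must be non-negative")
    if variance == 0.0:
        # Degenerate case: the load is always exactly the mean.
        return float(mean_load)
    # The required capacity is the `coverage` quantile of N(mean, variance).
    return NormalDist(mu=mean_load, sigma=sqrt(variance)).inv_cdf(coverage)


if __name__ == "__main__":
    # Illustrative numbers only: mean load 10 units, variance 4.
    for target in (0.90, 0.99):
        needed = capacity_for_coverage(10.0, 4.0, target)
        print(f"{target:.0%} coverage needs capacity {needed:.2f}")
```

Under this model, the capacity grows with the standard deviation of the load rather than with the worst case, which is the intuition behind replacing worst-case provisioning with a statistical target.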
The Centre for Telecommunications Value Chain Driven Research (CTVR) approach to software radio is to focus on the use of a general-purpose processor (GPP). Using a GPP to perform signal processing for communications applications presents the developer with challenges, but it also presents some opportunities. We argue that new classes of algorithms are required that exploit the advantages and negate the disadvantages of using a GPP. Indeed, other researchers have already started this programme of ‘algorithmic advances’. This paper discusses the issues involved and reviews some existing developments. We present our own progress in developing a noise-adaptive symbol synchroniser, and we discuss some initial thoughts on how these techniques may be applied to radio functions generally.
The Linux operating system, extended with the Real Time Application Interface (RTAI), makes it possible to realize industrial motion controls. This requires General Purpose Processors (GPPs) instead of dedicated DSPs or microcontrollers. RTAI offers the possibility of designing and developing motion controls using high-level object languages (e.g., Simulink). The main advantages of using Linux-RTAI and a PC as a controller platform are: greatly reduced time for development, diagnostics, and control design; cost reduction offered by the open-source status of the operating system; and high availability of the hardware. This paper shows that such a system offers adequate digital programming and signal processing capabilities to implement real-time motion control applications. An industrial-PC-based acceleration control is presented, and the strengths of the RTAI solution are shown.
Software Defined Radios are becoming more and more prevalent. Especially in the radio amateur community, Software Defined Radios are a big success. The wireless industry also has considerable interest in the dynamic reconfigurability and other advantages of Software Defined Radios. Our research focuses on the latency of Software Defined Radios and its impact on throughput in modern wireless protocols. Software Defined Radio systems often employ a bus system to transfer the samples from a radio frontend to the processor, which introduces a non-negligible latency. In addition, the signal processing calculations on general-purpose processors introduce further latencies that are not found on conventional radios. This work concentrates on one particular Software Defined Radio system called GNU Radio, an open source Software Defined Radio application, and one of its hardware components, the Universal Software Radio Peripheral (USRP), and analyzes its receive and transmit latencies. We ...
Abstract. Data cache hit ratio has a major impact on execution performance of programs by effectively reducing average data access time. Prefetching mechanisms improve this ratio by fetching data items that shall soon be required by the running program. Software-driven prefetching ...
Abstract—This paper presents three reconfigurable radio systems developed within CTVR, The Telecommunications Research Centre and demonstrated at the IEEE International Dynamic Spectrum Access Networks (DySPAN) symposium held in Chicago in October 2008. All three systems were developed using the Iris cognitive radio network architecture. Each system employs a different processing platform. Today's radio communication standards feature increasing levels of flexibility and reconfigurability as designers strive to ...
Modern embedded systems are being modeled as Heterogeneous Reconfigurable Computing Systems (HRCS), in which reconfigurable hardware, i.e. Field Programmable Gate Arrays (FPGAs), and soft-core processors act as computing elements. An efficient task distribution methodology is therefore essential for obtaining high performance in such systems. In this paper, we present a novel task distribution methodology, the Minimum Laxity First (MLF) algorithm, which takes advantage of the runtime reconfiguration of the FPGA to utilize the available resources effectively. The MLF algorithm is a list-based dynamic scheduling algorithm that uses attributes of the tasks as well as of the computing resources as a cost function to distribute the tasks of an application to the HRCS. An on-chip HRCS computing platform is configured on a Virtex 5 FPGA using Xilinx EDK. The real-time applications (JPEG and OFDM transmitters) are represented as task graphs, and the tasks are then distributed, statically as well as dynamically, to the HRCS platform in order to evaluate the performance of the designed task distribution model. Finally, the performance of the MLF algorithm is compared with existing static scheduling algorithms. The comparison shows that the MLF algorithm outperforms them in terms of efficient utilization of on-chip resources and also speeds up application execution.
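To make the notion of laxity concrete, the sketch below uses the common textbook definition (laxity = deadline - current time - remaining execution time) and selects the ready task with the smallest value. This is only a minimal illustration of laxity-based selection. It is not the paper's MLF cost function, which also takes attributes of the computing resources into account. The `Task` fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task record; the field names are illustrative assumptions."""
    name: str
    deadline: float   # absolute deadline
    remaining: float  # remaining execution time


def laxity(task, now):
    """Slack left before the task can no longer meet its deadline."""
    return task.deadline - now - task.remaining


def pick_min_laxity(ready, now):
    """Return the ready task with the smallest laxity (ties go to the first one)."""
    if not ready:
        raise ValueError("no ready tasks")
    return min(ready, key=lambda t: laxity(t, now))


if __name__ == "__main__":
    ready = [
        Task("decode", deadline=10.0, remaining=3.0),
        Task("fft", deadline=6.0, remaining=4.0),
        Task("filter", deadline=8.0, remaining=1.0),
    ]
    chosen = pick_min_laxity(ready, now=0.0)
    print(chosen.name, laxity(chosen, 0.0))
```

A dynamic scheduler would re-evaluate the laxities whenever a task finishes or a resource becomes free, since laxity shrinks as time passes for tasks that are not running.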
Smart cameras are important components in human-computer interaction. In any remote surveillance scenario, smart cameras have to make intelligent decisions to select frames with significant changes in order to minimize communication and processing overhead. Among the many algorithms for change detection, one based on a clustering scheme was proposed for smart camera systems. However, such an algorithm achieves only a low frame rate, far from real-time requirements, on the general-purpose processors (such as the PowerPC) available on FPGAs. This paper proposes a hardware accelerator capable of detecting changes in a scene in real time using the clustering-based change detection scheme. The system is designed and simulated in VHDL and implemented on a Xilinx XUP Virtex-II Pro FPGA board. The resulting frame rate is 30 frames per second at QVGA resolution in grayscale.
Abstract Low-Density Parity-Check (LDPC) codes are among the best error correcting codes known and have been adopted by data transmission standards, such as DVB-S2 or WiMax. They are based on binary sparse parity check matrices and usually represented by ...
Microprocessors are applicable to a wide range of information processing tasks, ranging from general computing to real-time monitoring systems. The microprocessor enables new ways of communicating and of using the vast information available online and offline, both at home and in the workplace. Most electronic devices, from computers, remote controls, washing machines, and microwaves to cell phones and iPods, contain a built-in microprocessor. Microprocessors are at the core of personal computers, laptops, mobile phones, and complex military and space systems. This work presents the general applications of microprocessors.
General-purpose processors have utilized complex and energy-inefficient techniques to accelerate performance. In embedded DSP designs, power constraints have precluded general-purpose microarchitectural techniques. Rather than minimize average execution time, embedded DSP processors require the worst-case execution time to be minimized. Subsequently, Very Long Instruction Word (VLIW) processors have been employed, but architecturally visible side effects have imposed restrictions on parallelism due to interrupt and latency considerations, particularly if all loads must complete prior to servicing interrupts. In this paper, we present a low-power multithreaded interlocked (transparent) processor capable of parallelizing non-associative DSP arithmetic. We describe specific memory and logic techniques for reducing power dissipation and discuss how multithreading enables low-power optimization. We further describe the programming environment for our SDR DSP processor. Finally, we ...
MicroSIMD architectures incorporating subword parallelism are very efficient for application-specific media processors as well as for fast multimedia information processing in general-purpose processors. This paper addresses the unsolved problem of the need to permute the ...
This paper evaluates the Raw microprocessor. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-...
For some time, the networking community has assumed that it is impossible to do IP routing lookups in software fast enough to support gigabit speeds. IP routing lookups must find the routing entry with the longest matching prefix, a task that has been thought to require hardware support at lookup frequencies of millions per second. We present a forwarding table data structure designed for quick routing lookups. Forwarding tables are small enough to fit in the cache of a conventional general purpose processor. With the table in cache, a 200 MHz Pentium Pro or a 333 MHz Alpha 21164 can perform a few million lookups per second. This means that it is feasible to do a full routing lookup for each IP packet at gigabit speeds without special hardware. The forwarding tables are very small; a large routing table with 40,000 routing entries can be compacted to a forwarding table of 150-160 Kbytes. A lookup typically requires less than 100 instructions on an Alpha, using eight memory references ac...
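The longest-matching-prefix requirement mentioned above can be illustrated with a plain binary trie over IPv4 address bits. This is a minimal sketch of the lookup problem itself, not the compact forwarding-table data structure the paper proposes. The class name `PrefixTrie`, its methods, and the example routes are assumptions chosen for illustration.

```python
import ipaddress


class PrefixTrie:
    """Uncompressed binary trie for IPv4 longest-prefix matching (illustrative only)."""

    def __init__(self):
        # Each node is a dict with optional children under keys 0 and 1,
        # and an optional "hop" entry if a prefix ends at that node.
        self.root = {}

    def insert(self, prefix, next_hop):
        net = ipaddress.ip_network(prefix)
        bits = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            bit = (bits >> (31 - i)) & 1  # walk from the most significant bit
            node = node.setdefault(bit, {})
        node["hop"] = next_hop

    def lookup(self, address):
        bits = int(ipaddress.ip_address(address))
        node = self.root
        best = node.get("hop")  # default route, if one was inserted
        for i in range(32):
            node = node.get((bits >> (31 - i)) & 1)
            if node is None:
                break
            if "hop" in node:
                best = node["hop"]  # remember the deepest match seen so far
        return best


if __name__ == "__main__":
    table = PrefixTrie()
    table.insert("0.0.0.0/0", "default-gw")
    table.insert("10.0.0.0/8", "router-a")
    table.insert("10.1.0.0/16", "router-b")
    print(table.lookup("10.1.2.3"))     # matches the /16 entry
    print(table.lookup("192.168.1.1"))  # falls back to the default route
```

A trie like this may touch up to 32 nodes per lookup and is far from cache-friendly, which is the kind of cost a compact, cache-resident forwarding table is designed to avoid.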
Floating-point matrix multiplication is arguably the most important kernel routine in many scientific applications. Therefore, its efficient implementation is crucial for the overall performance of any computer system targeting scientific computations. In this paper, we propose a holistic solution to accelerate matrix multiplication on reconfigurable hardware using the MOLEN polymorphic processor. The MOLEN polymorphic processor consists of a general purpose processor (GPP) tightly coupled with a reconfigurable coprocessor. The latter can be used to implement arbitrary functions in hardware using custom computing units (CCUs). We implemented matrix multiplication as a CCU for the MOLEN processor and realized it on real reconfigurable hardware. The software interface is defined by the MOLEN programming paradigm, which enables trivial integration of the hardware accelerator at the application level. A matrix multiplication is initiated on the CCU by a MOLEN execute instruction and the...
This paper presents the implementation of a 16-bit interval floating-point unit on a soft-core processor to allow interval computations for embedded systems. The distributed localization of a source using a network of sensors is presented to compare the performance of the proposed processor with that obtained with a general-purpose processor.
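For readers unfamiliar with interval computations, the sketch below shows the basic interval operations in software: each value is a pair of bounds, and every operation returns bounds that enclose all possible results. It is a simplified illustration using ordinary Python floats. It does not model the 16-bit format of the paper's unit, and it omits the outward (directed) rounding that a rigorous implementation needs. The `Interval` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi]; illustrative only, without directed rounding."""
    lo: float
    hi: float

    def __post_init__(self):
        if self.lo > self.hi:
            raise ValueError("lower bound exceeds upper bound")

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # Subtracting the largest value gives the smallest result, and vice versa.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # The extremes of a product lie among the four endpoint products.
        products = (
            self.lo * other.lo,
            self.lo * other.hi,
            self.hi * other.lo,
            self.hi * other.hi,
        )
        return Interval(min(products), max(products))


if __name__ == "__main__":
    x = Interval(1, 2)
    y = Interval(-1, 3)
    print(x + y, x - y, x * y)
```

In a localization setting, sensor measurements with bounded error can be represented as intervals so that the computed position is guaranteed to lie inside the resulting bounds, provided rounding is handled conservatively.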