The challenges of writing portable, correct and high performance libraries for GPUs
Graphics Processing Units (GPUs) are widely used to accelerate scientific applications. Many successes have been reported with speedups of two or three orders of magnitude over serial implementations of the same algorithms. These speedups typically ...
Power profiling and optimization for heterogeneous multi-core systems
Processing speed and energy efficiency are two of the most critical issues for computer systems. This paper presents a systematic approach for profiling the power and performance characteristics of applications targeting heterogeneous multi-core ...
GPU accelerated CAE using open solvers and the cloud
More than five years after GPUs were first used as accelerators for general scientific computations, the field of General Purpose GPU computing, or GPGPU, has finally reached the mainstream. Developers now have access to a mature hardware and software ...
Design space exploration of adaptive beamforming acceleration for bedside and portable medical ultrasound imaging
The use of adaptive beamforming is a viable solution to provide high-resolution real-time medical ultrasound imaging. However, the increase in image resolution comes at the expense of a significant increase in compute requirements over conventional ...
GPU implementation and optimization of electromagnetic simulation using the FDTD method for antenna designing
This paper describes electromagnetic field simulation using the 3D-FDTD method for antenna designing on a CUDA-compatible GPU. We use the Split Perfectly Matched Layer as an absorbing boundary condition. As is well known, the 3D-FDTD method is a kind ...
CoreSymphony: an efficient reconfigurable multi-core architecture
This paper describes CoreSymphony, a cooperative and reconfigurable superscalar processor architecture that improves single-thread performance in chip multiprocessors. CoreSymphony enables some narrow-issue cores to be fused into a single wide-issue ...
An FPGA-based scalable simulation accelerator for tile architectures
FPGA-based simulation systems can simulate processor behavior in realistic time. In order to practically simulate tile many-core architectures, we propose ScalableCore for prototyping system development using multiple FPGAs. In this paper, we present an ...
Domain-specific programmable design of scalable streaming-array for power-efficient stencil computation
This paper presents the domain-specific programmable design of custom computing machines for high-performance stencil computation. Stencil computation is one of the typical kernels in scientific computations; however, its low operational intensity makes ...
An implementation of out-of-order execution system for acceleration of computational fluid dynamics on FPGAs
CFD is an important tool for designing aircraft components. FaSTAR is one of the most recent CFD program packages, with various solvers and automatic generation of grid data. However, FaSTAR is difficult to execute on parallel machines because of its ...
Embedded architecture with hardware accelerator for target recognition in driver assistance system
This paper presents a new radar-based recognition system that is able to identify obstacles while a vehicle is moving. Obstacle recognition helps avoid false alarms and allows generating alarms that take into account the ...
Surviving the end of frequency scaling with reconfigurable dataflow computing
Over the past decade, x86 processors have come to dominate the world's largest supercomputers. However, in the future, conventional multicore processors are unlikely to be able to deliver the necessary performance per $ and per W to achieve exascale ...
KPN2GPU: an approach for discovery and exploitation of fine-grain data parallelism in process networks
With advances in manycore and accelerator architectures, the high performance and embedded spaces are rapidly converging. Emerging architectures feature different forms of parallelism. Polyhedral Process Networks (PPNs) are a proven model of ...
High speed CRC with 64-bit generator polynomial on an FPGA
Deployment of jumbo frame sizes beyond 9000 bytes for storage systems is limited by the 32-bit Cyclic Redundancy Checks used by a network protocol. In order to overcome this limitation, we study the possibility of using 64-bit polynomials in software and ...
A biologically plausible real-time spiking neuron simulation environment based on a multiple-FPGA platform
Neurological research has revealed that neurons encode information in the timing of spikes. Spiking neural network simulations are a flexible and powerful method for investigating the behaviour of such neuronal systems. The spiking neuron models which ...
Augmenting DR-ASIP flexibility through multi-mode custom instructions
This paper introduces a simple method called multi-mode custom instructions, which aims at reducing the power consumption of the register file of tightly coupled dynamically reconfigurable application-specific instruction set processors (DR-ASIPs). To ...
Automatic fusions of CUDA-GPU kernels for parallel map
When implementing a function mapping on a contemporary GPU, several contradictory performance factors affecting the distribution of computation into GPU kernels have to be balanced. A decomposition-fusion scheme suggests decomposing the computational ...
A discussion on calculating eigenvalues of real symmetric tridiagonal matrices on a GPU
While GPUs are attracting attention as accelerators in wide-ranging application areas, compatibility between the architecture and the selected algorithm is important to effectively bring out their potential performance. This paper focuses on eigenvalue ...
Multicore reconfiguration platform: an alternative to RAMPSoC
The current state of the art in processor performance improvement is multicore-processor systems. These systems offer a number of homogeneous and static processor cores for the parallel distribution of computational tasks. A novel idea in this research ...
Parallelism Level Impact on Energy Consumption in Reconfigurable Devices
Nowadays, System-on-Chip architectures are composed of several execution resources which support complex applications. As it shares silicon area and limits the cost of the global circuit, the embedding of a reconfigurable resource in these SoCs provides ...
Power and area optimisation in heterogeneous 3D networks-on-chip architectures
Three-dimensional Network-on-Chip (3D NoC) architectures have attracted considerable interest as a way to address the on-chip communication delays of modern SoC systems. However, the vertical interconnections between layers are more power- and area-hungry compared ...