Digital Signal Processing (DSP) architectures have emerged as major driving forces in consumer electronics, communications, entertainment, medical devices, video games, and computing in general. These technologies have gained even more significance with recent developments in Internet and wireless communication, portable multimedia, smart sensors, and cognitive systems. As market and consumer demands have grown rapidly, new key technologies have had to be developed in DSP algorithms, circuits, architectures, implementation, design methods, and prototyping. Implementation and prototyping of DSP systems have become very sophisticated because of the increasing demands at each level and the dynamics of the market.

This special issue focuses on the implementation and prototyping of DSP architectures for multimedia communications. Several case studies are given in various implementation technologies. They show the impact of the implementation technology on the performance of a system, and they demonstrate the significance of prototyping to the success of the final product. The first three papers are case studies of prototyping in ASIC technologies. They demonstrate how to improve system performance at the circuit level.

The first paper, Efficient 45 nm ASIC Architecture for Full-Search Free Intra Prediction in Real-Time H.264/AVC Decoder, presents an ASIC architecture for a high-throughput Full-Search Free (FSF) intra mode selection and direction prediction algorithm for an H.264/AVC decoder. The target application is mobile video, so low power and simplicity are the main design goals. A prototype of the proposed architecture is implemented in 45 nm CMOS technology. The overall power consumption is 9.01 mW at 140 MHz. In the second paper, Full-Hardware Architectures for Data-Dependent Superimposed Training Channel Estimation, two hardware channel estimator architectures are developed for a data-dependent superimposed training (DDST) receiver with perfect synchronization and nonexistent DC-offset. These implementations demonstrate that the performance required for commercial applications can be achieved. The proposed architectures are prototyped on an FPGA and then implemented in 90 nm CMOS technology. The overall power consumption is 3.7 mW and 2.74 mW at 187 MHz and 247 MHz, respectively. The third paper, VLSI Architecture for MIMO Soft-Input Soft-Output Sphere Detection, introduces the first hardware implementation of a soft-input soft-output (SISO) tuple search detector (TSD). This computational module is scalable in constellation size and number of antennas, and it is highly parallel and pipelined. The SISO-TSD architecture is implemented in 65 nm CMOS technology for 4 × 4 MIMO transmission with a 64-QAM constellation. The prototype consumes 58.2 mW to 73.9 mW when operating at 454 MHz.

The next two papers use FPGAs for prototyping and evaluating the proposed algorithms and architectures. The fourth paper, Multi-source Neural Activity Estimation and Sensor Scheduling: Algorithms and Hardware Implementation, presents an FPGA prototype of a sensor scheduling algorithm for electroencephalography (EEG)/magnetoencephalography (MEG) that is aimed at reducing power. On a Xilinx Virtex-5 platform, it takes only 10 ms to process 100 data samples using 6400 particles. This performance can support real-time processing of an EEG/MEG neural activity system with a sampling rate of up to 10 kHz.
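The 10 kHz figure follows from simple throughput arithmetic. The short Python sketch below reproduces it from the two numbers quoted above; the variable and function names are illustrative and do not come from the paper.

```python
# Figures quoted in the text for the Virtex-5 prototype.
samples_per_batch = 100   # data samples processed per batch
batch_time_ms = 10        # time to process one batch, in milliseconds

# Sustained throughput in samples per second (1000 ms per second).
throughput_hz = samples_per_batch * 1000 / batch_time_ms


def is_real_time(sampling_rate_hz: float) -> bool:
    """Return True if samples arrive no faster than they can be processed."""
    return sampling_rate_hz <= throughput_hz


print(throughput_hz)          # 10000.0
print(is_real_time(10_000))   # True
print(is_real_time(20_000))   # False
```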

In the fifth paper, Hardware Acceleration for Neuromorphic Vision Algorithms, an application-specific architecture for accelerating a neuromorphic vision system for object recognition is presented. The architecture is based on HMAX, a biologically inspired model of the visual cortex. The neuromorphic accelerators are validated on a multi-FPGA system. Results show that the neuromorphic accelerators are 13.8 times more power efficient than a CPU implementation and 2.6 times more power efficient than a GPU implementation.

The next two papers are examples of prototyping at the processor or module level. In the sixth paper, Integration of Dataflow-based Heterogeneous Multiprocessor Scheduling Techniques in GNU Radio, a heterogeneous multiprocessor platform is used to explore design spaces efficiently for Software Defined Radio (SDR) system implementation and to examine the overhead of different solutions. The seventh paper, Scalable Low-Power Computing via Scheduling on Subsets of Multicore Processors, presents a novel approach to power management for multi-core processor systems that exploits the operating system scheduler. Using the various power levels and the utilization statistics of the cores, the execution of software threads is limited to a subset of the available cores, while the others are left idle so that they can enter deeper power-saving states. The experimental results show significant thermal power reduction (up to 61%) in a variety of scenarios, while system performance is sustained in most cases.

The last three papers use simulation and experimentation to explore the design space and evaluate the proposed algorithms and architectures. The eighth paper, A Hardware-Efficient Algorithm for Real-Time Computation of Zadoff-Chu Sequences, presents a reconfigurable hardware architecture that implements a new algorithm for computing Zadoff-Chu (ZC) sequence elements on-line using the CORDIC algorithm. This architecture is applied in a searcher block for detecting the physical random access channel (PRACH) in Long-Term Evolution (LTE). Simulation tools have been employed for evaluation, and the results demonstrate that the proposed architecture is capable of achieving detection error rates for LTE PRACH that are close to the ideal rates achieved using floating-point precision. The ninth paper, Fast Likelihood Computation in Speech Recognition using Matrices, explores acoustic modeling using mixtures of multivariate Gaussians. Two approaches are evaluated: direct low-rank approximation of the Gaussian parameter matrix, and indirect derivation of low-rank factors of the Gaussian parameter matrix by optimum approximation of the likelihood matrix. Experiments show that both methods lead to similar speedups, but the latter has far less impact on recognition accuracy. Experiments on the 1138-word vocabulary RM1 task and the 6224-word vocabulary TIMIT task using the Sphinx 3.7 system show that, in a typical case, the matrix-multiplication-based approach leads to an overall speedup of 46% on the RM1 task and 115% on the TIMIT task. The final paper, Soft-Decision Error Correction of NAND Flash Memory with a Turbo Product Code, presents an error correction scheme for NAND flash memory based on a turbo product code (TPC) with multi-precision output. Experimental results, based on a rate-0.907 (36116, 32768) extended TPC constructed for 2-bit MLC NAND flash memory and decoded with the Chase-Pyndiah algorithm, are presented for a simulated flash memory channel.
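The rate quoted for the turbo product code can be checked directly from its (n, k) parameters. The sketch below is plain arithmetic on the figures given above, assuming the usual convention that n is the codeword length and k the number of information bits; the variable names are illustrative only.

```python
# (n, k) parameters quoted in the text for the extended TPC.
n = 36116   # codeword length in bits (assumed convention)
k = 32768   # information bits per codeword (assumed convention)

rate = k / n          # code rate
parity_bits = n - k   # redundancy added per codeword

print(f"{rate:.3f}")   # 0.907
print(parity_bits)     # 3348
print(k == 2**15)      # True: the payload is 4096 bytes
```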