2017 IEEE International Workshop on Signal Processing Systems (SiPS), 2017
The evolution of convolutional neural networks (CNNs) into more complex forms of organization, with additional layers, larger convolutions and increasing connections, established the state of the art in accuracy for image detection and classification challenges. Moreover, as they evolved to a point where gigabytes of memory are required for their operation, we have reached a stage where it becomes fundamental to understand how their inference capabilities can be impaired if data elements somehow become corrupted in memory. This paper introduces fault-injection in these systems by simulating failing bit-cells in hardware memories, brought on by relaxing the 100% reliable operation assumption. We analyze the behavior of these networks when computing inference under severe fault-injection rates and apply fault-mitigation strategies to improve the CNNs' resilience. For the MNIST dataset, we show that 8x less memory is required for the feature-maps memory space, and that under sub-100% reliable operation, fault-injection rates up to 10⁻¹ (with most-significant-bit protection) incur only a 1% degradation in error probability. Furthermore, considering the offload of the feature-maps memory to an embedded dynamic RAM (eDRAM) system, using technology nodes from 65 down to 28 nm, power-efficiency improvements of up to 73–80% can be obtained.
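A minimal sketch of the kind of fault-injection experiment described above, not code from the paper: feature-map values are assumed to be quantized to unsigned 8-bit fixed point, each stored bit-cell fails with a given probability, and the most significant bit can optionally be protected. Function and parameter names such as inject_faults and fault_rate are illustrative.

```python
import numpy as np

def inject_faults(fmap_q, fault_rate, protect_msb=True, bits=8, rng=None):
    """Flip each stored bit of an 8-bit quantized feature map with
    probability `fault_rate`, optionally sparing the most significant bit."""
    rng = np.random.default_rng() if rng is None else rng
    data = fmap_q.astype(np.uint8)
    # Bits eligible for corruption: all of them, or all but the MSB.
    faulty_bits = bits - 1 if protect_msb else bits
    for b in range(faulty_bits):
        flips = rng.random(data.shape) < fault_rate      # failing bit-cells
        data = np.where(flips, data ^ np.uint8(1 << b), data)
    return data

# Example: a feature map quantized to unsigned 8-bit values,
# corrupted at a fault-injection rate of 1e-1 with MSB protection.
fmap = np.random.randint(0, 256, size=(32, 32), dtype=np.uint8)
faulty = inject_faults(fmap, fault_rate=1e-1, protect_msb=True)
print("mean absolute error:", np.mean(np.abs(faulty.astype(int) - fmap.astype(int))))
```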
2021 International Conference on Graphics and Interaction (ICGI), 2021
With the ever-increasing demand for virtualizing every aspect of life, engineers design solutions to create immersion in virtual and augmented reality scenarios. Most virtual reality (VR) and augmented reality (AR) headsets require a connection to an external computing system, and therefore demand the acquisition of additional hardware before the headsets can be used. Following Moore's law, transistors have become smaller and computers more powerful. As a result, new hardware and programming languages have been developed to achieve high-performance graphics while requiring low power. This paper compares OpenGL and Vulkan implementations on Nvidia's Jetson development kits equipped with edge GPUs that can generate 3D graphics under 5 W (Jetson Nano), 15 W (Jetson TX2) and 30 W (Jetson Xavier), providing a cost-effective headset solution with in-house processing and no external processing unit. We report that efficiency can be two times higher than that of desktop graphics processing units (GPUs) while maintaining a reasonable amount of rendering power.
This article proposes to address, in a tutorial style, the benefits of using Open Computing Language (OpenCL) [1] as a quick way to allow programmers to express and exploit parallelism in signal processing algorithms, such as those used in error-correcting code systems. In particular, we will show how multiplatform kernels can be developed straightforwardly using OpenCL to perform computationally intensive...
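As a taste of the portability the article advocates, the sketch below uses the Python pyopencl bindings (an assumption; the article's own code is not reproduced here) to build and launch a trivial vector-addition kernel. The kernel body is plain OpenCL C and runs unchanged on CPUs and GPUs; the kernel name vadd is hypothetical.

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()          # pick any available OpenCL device
queue = cl.CommandQueue(ctx)

a = np.random.rand(1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

# The kernel source is standard OpenCL C: one work-item per output element.
prg = cl.Program(ctx, """
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *out) {
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}
""").build()

prg.vadd(queue, a.shape, None, a_buf, b_buf, out_buf)
out = np.empty_like(a)
cl.enqueue_copy(queue, out, out_buf)
assert np.allclose(out, a + b)
```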
ACM Transactions on Embedded Computing Systems, 2015
The design cycle for complex special-purpose computing systems is extremely costly and time-consuming. It involves a multiparametric design-space exploration for optimization, followed by design verification. Designers of special-purpose VLSI implementations often need to explore parameters, such as optimal bitwidth and data representation, through time-consuming Monte Carlo simulations. A prominent example of this simulation-based exploration process is the design of decoders for error-correcting systems, such as the Low-Density Parity-Check (LDPC) codes adopted by modern communication standards, which involves thousands of Monte Carlo runs for each design point. Currently, high-performance computing offers a wide set of acceleration options that range from multicore CPUs to Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). The exploitation of diverse target architectures is typically associated with developing multiple code versions, often using distinct...
In this chapter we present a pair of kernels to decode a class of powerful error-correcting codes, known as low-density parity-check (LDPC) codes, on graphics processing units (GPUs). The proposed parallel implementations adopt a compact data-structure representation to access data, while the processing on the GPU grid efficiently exploits a balanced distribution of the computational workload. Moreover, we have analyzed different levels of thread coarsening and propose an efficient one, which balances computation and memory accesses. This case study shows that by adopting these practical techniques, it is possible to use the GPU's computational power to perform this type of processing and achieve significant throughputs, which until recently could be obtained only by developing very large scale integration (VLSI) dedicated microelectronics systems.
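To give a flavor of the computation that such GPU kernels parallelize, here is a brief, purely illustrative sketch of one check-node update over a toy parity-check matrix. It is not the kernel from the chapter, and the min-sum variant is an assumption, chosen only because it is a common LDPC decoding rule; each row of the loop is the kind of independent work a GPU thread would handle.

```python
import numpy as np

# Toy parity-check matrix H (rows = check nodes, columns = variable nodes).
H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 0, 0, 1, 1]], dtype=np.uint8)

def check_node_update(H, var_to_check):
    """One min-sum check-node update.
    var_to_check[c, v] holds the message from variable node v to check node c
    (only meaningful where H[c, v] == 1)."""
    check_to_var = np.zeros_like(var_to_check)
    for c, row in enumerate(H):
        idx = np.flatnonzero(row)                # variable nodes connected to check c
        msgs = var_to_check[c, idx]
        for k, v in enumerate(idx):
            others = np.delete(msgs, k)          # exclude the target edge
            sign = np.prod(np.sign(others))
            check_to_var[c, v] = sign * np.min(np.abs(others))
    return check_to_var

# Channel log-likelihood ratios initialise the variable-to-check messages.
llr = np.array([-0.8, 1.2, 0.3, -2.1, 0.9, 1.7])
v2c = H * llr                                    # broadcast LLRs onto the edges of H
c2v = check_node_update(H, v2c)
print(c2v)
```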
A novel wide-pipeline low-density parity-check (LDPC) decoder approach for the worldwide interoperability for microwave access (WiMAX) standard (802.16e) is proposed for execution on field-programmable gate arrays (FPGAs), using a high-level synthesis tool that generates a wide-pipeline architecture to reduce the development effort and design-validation time. Optimised open computing language (OpenCL)-based kernels are developed, and the integration of distinct configurations of single instruction, multiple data and compute units to increase the level of parallelism is analysed. The decoding throughput surpasses the minimal requirement of 75 Mbit/s, a key figure of merit that ranks the design alongside other very large-scale integration-based approaches. Furthermore, extra precision is deployed with 8-bit fixed-point arithmetic, delivering superior bit error rate performance and lower error floor regions.
2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
In medical ultrasound, synthetic aperture (SA) imaging is regarded as a novel image-formation technique that achieves resolution superior to that offered by existing scanners. However, its intensive processing load is known to be a challenging factor. To address such a computational demand, this paper proposes a new parallel approach based on the design of OpenCL signal processing kernels that can compute SA image formation in real time. We demonstrate how these kernels can be ported onto different classes of parallel processors, namely multi-core CPUs and GPUs, whose multi-thread computing resources are able to process more than 250 fps. Moreover, they have strong potential to support the development of more complex algorithms, thus increasing the depth range of the inspected human volume and the final image resolution observed by the medical practitioner.
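Although the paper's kernels are written in OpenCL, the per-pixel work they parallelize can be summarized by a delay-and-sum beamformer; the simplified sketch below (single emission, illustrative geometry, hypothetical names such as delay_and_sum) shows the idea of summing receive-channel samples at the round-trip delay of each image point.

```python
import numpy as np

def delay_and_sum(rf, elem_x, fs, c, grid_x, grid_z, src=(0.0, 0.0)):
    """Very simplified delay-and-sum for one synthetic-aperture emission.
    rf[e, n] are RF samples from receive element e, elem_x their lateral
    positions, (grid_x, grid_z) the image grid, src the emission origin."""
    img = np.zeros((grid_z.size, grid_x.size))
    for iz, z in enumerate(grid_z):
        for ix, x in enumerate(grid_x):
            # Transmit path from the emission origin to the pixel.
            d_tx = np.hypot(x - src[0], z - src[1])
            for e, ex in enumerate(elem_x):
                d_rx = np.hypot(x - ex, z)               # receive path
                n = int(round((d_tx + d_rx) / c * fs))   # round-trip sample index
                if n < rf.shape[1]:
                    img[iz, ix] += rf[e, n]
    return img

# Toy data: 8 receive elements, 1024 samples each, 40 MHz sampling, 1540 m/s.
rf = np.random.randn(8, 1024)
elem_x = np.linspace(-3e-3, 3e-3, 8)
img = delay_and_sum(rf, elem_x, fs=40e6, c=1540.0,
                    grid_x=np.linspace(-5e-3, 5e-3, 32),
                    grid_z=np.linspace(5e-3, 25e-3, 32))
print(img.shape)
```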
2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2012
Low-Density Parity-Check (LDPC) codes are powerful error-correcting codes used today in communication standards such as DVB-S2 and WiMAX to transmit data inside noisy channels with high error probability. LDPC decoding is computationally demanding and requires irregular accesses to memory, which makes it suitable for parallelization. The recent introduction of the many-core Single-chip Cloud Computer (SCC) from Intel Research Labs...
2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors, 2014
Power and flexibility are important constraints in the design of new chips. The efficiency extracted from a design is increasingly becoming a dominant question, and several techniques and technological advances can be used to optimize efficiency in its energy and functionality domains. These two characteristics are critical in digital communication systems that must work with multiple communication standards under different power, throughput and latency requirements. In this work, we focus on the physical-layer Forward Error Correcting (FEC) system, due to the tight throughput and latency constraints it is required to meet, and develop specialized processing engines for decoding Low-Density Parity-Check (LDPC) codes, a class of widely standardized codes. The engines were developed for execution on Field-Programmable Gate Array (FPGA) devices by exploring dataflow and wide-pipeline design approaches, and have the design flexibility to target different LDPC codes, since they were implemented using recent High-Level Synthesis (HLS) tools. The generated engines and architectures achieve highly efficient decoders with decoding throughputs ranging from 16 Mbit/s to 1.2 Gbit/s at energy efficiencies of 42 to 908 Mbit/Joule/iteration, while the achieved clock frequencies of operation vary from 80 to 300 MHz. Furthermore, our bandwidth analysis shows that workload boundaries do not impose limitations on a system bus.