Hardware Architecture
See recent articles
- [1] arXiv:2407.21184 [pdf, other]
-
Title: Optical Computing for Deep Neural Network Acceleration: Foundations, Recent Developments, and Emerging DirectionsSubjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Emerging artificial intelligence applications across the domains of computer vision, natural language processing, graph processing, and sequence prediction increasingly rely on deep neural networks (DNNs). These DNNs require significant compute and memory resources for training and inference. Traditional computing platforms such as CPUs, GPUs, and TPUs are struggling to keep up with the demands of the increasingly complex and diverse DNNs. Optical computing represents an exciting new paradigm for light-speed acceleration of DNN workloads. In this article, we discuss the fundamentals and state-of-the-art developments in optical computing, with an emphasis on DNN acceleration. Various promising approaches are described for engineering optical devices, enhancing optical circuits, and designing architectures that can adapt optical computing to a variety of DNN workloads. Novel techniques for hardware/software co-design that can intelligently tune and map DNN models to improve performance and energy-efficiency on optical computing platforms across high performance and resource constrained embedded, edge, and IoT platforms are also discussed. Lastly, several open problems and future directions for research in this domain are highlighted.
- [2] arXiv:2407.21192 [pdf, html, other]
-
Title: Functional ISS-Driven Verification of Superscalar RISC-V ProcessorsComments: Accepted for lecture presentation at 2024 31st IEEE International Conference on Electronics, Circuits and Systems (ICECS), Nancy, France, Nov. 18-20, 2024Subjects: Hardware Architecture (cs.AR)
A time-efficient and comprehensive verification is a fundamental part of the design process for modern computing platforms, and it becomes ever more important and critical to optimize as the latter get ever more complex. SupeRFIVe is a methodology for the functional verification of superscalar processors that leverages an instruction set simulator to validate their correctness according to a simulation-based approach, interfacing a testbench for the design under test with the instruction set simulator by means of socket communication. We demonstrate the effectiveness of the SupeRFIVe methodology by applying it to verify the functional correctness of a RISC-V dual-issue superscalar CPU, leveraging the state-of-the-art RISC-V instruction set simulator Spike and executing a set of benchmark applications from the open literature.
- [3] arXiv:2407.21325 [pdf, other]
-
Title: EdgeLLM: A Highly Efficient CPU-FPGA Heterogeneous Edge Accelerator for Large Language ModelsSubjects: Hardware Architecture (cs.AR)
The rapid advancements in artificial intelligence (AI), particularly the Large Language Models (LLMs), have profoundly affected our daily work and communication forms. However, the colossal scale of LLM presents significant operational challenges, particularly when attempting to deploy them on resource-constrained edge devices such as smartphones, robots, and embedded systems. In this work, we proposed EdgeLLM, an efficient CPU-FPGA heterogeneous acceleration framework, to markedly enhance the computational efficiency of LLMs on edge. We first analyzed the whole operators within AI models and developed a universal data parallelism scheme, which is generic and can be adapted to any type of AI algorithm. Then, we developed fully-customized hardware operators according to the designated data formats. A multitude of optimization techniques have been integrated in the design, such as approximate FP16*INT4 and FP16*FP16 computation engines, group vector systolic arrays, log-scale structured sparsity, asynchronous between data transfer and processing. Finally, we proposed an end-to-end compilation scheme that can dynamically compile all of the operators and map the whole model on CPU-FPGA heterogeneous system. The design has been deployed on AMD Xilinx VCU128 FPGA, our accelerator achieves 1.67x higher throughput and 7.4x higher energy efficiency than the commercial GPU (NVIDIA A100-SXM4-80G) on ChatGLM2-6B, and shows 10%~20% better performance than state-of-the-art FPGA accelerator of FlightLLM in terms of HBM bandwidth utilization and LLM throughput.
- [4] arXiv:2407.21367 [pdf, html, other]
-
Title: Blink: Fast Automated Design of Run-Time Power Monitors on FPGA-Based Computing PlatformsComments: Accepted for lecture presentation at 2024 31st IEEE International Conference on Electronics, Circuits and Systems (ICECS), Nancy, France, Nov. 18-20, 2024Subjects: Hardware Architecture (cs.AR)
The current over-provisioned heterogeneous multi-cores require effective run-time optimization strategies, and the run-time power monitoring subsystem is paramount for their success. Several state-of-the-art methodologies address the design of a run-time power monitoring infrastructure for generic computing platforms. However, the power model's training requires time-consuming gate-level simulations that, coupled with the ever-increasing complexity of the modern heterogeneous platforms, dramatically hinder the usability of such solutions. This paper introduces Blink, a scalable framework for the fast and automated design of run-time power monitoring infrastructures targeting computing platforms implemented on FPGA. Blink optimizes the time-to-solution to deliver the run-time power monitoring infrastructure by replacing traditional methodologies' gate-level simulations and power trace computations with behavioral simulations and direct power trace measurements. Applying Blink to multiple designs mixing a set of HLS-generated accelerators from a state-of-the-art benchmark suite demonstrates an average time-to-solution speedup of 18 times without affecting the quality of the run-time power estimates.
- [5] arXiv:2407.21661 [pdf, other]
-
Title: Towards Error Correction for Computing in Racetrack MemoryPreston Brazzle, Benjamin F. Morris III, Evan McKinney, Peipei Zhou, Jingtong Hu, Asif Ali Khan, Alex K. JonesComments: 4 pages, 6 figures, to be submitted to IEEE CALSubjects: Hardware Architecture (cs.AR)
Computing-in-memory (CIM) promises to alleviate the Von Neumann bottleneck and accelerate data-intensive applications. Depending on the underlying technology and configuration, CIM enables implementing compute primitives in place, such as multiplication, search operations, and bulk bitwise logic operations. Emerging nonvolatile memory technologies such as spintronic Racetrack memory (RTM) promise not only unprecedented density but also significant parallelism through CIM. However, most CIM designs, including those based on RTM, exhibit high fault rates. Existing error correction codes (ECC) are not homomorphic over bitwise operations such as AND and OR, and hence cannot protect against CIM faults. This paper proposes CIRM-ECC, a technique to protect spintronic RTMs against CIM faults. At the core of CIRM-ECC, we use a recently proposed RTM-based CIM approach and leverage its peripheral circuitry to our implement our novel ECC codes. We show that CIRM-ECC can be applied to single-bit Hamming codes as well as multi-bit BCH codes.
New submissions for Thursday, 1 August 2024 (showing 5 of 5 entries )
- [6] arXiv:2407.21508 (cross-list from eess.SP) [pdf, html, other]
-
Title: Machine Learning In-Sensors: Computation-enabled Intelligent Sensors For Next Generation of IoTSubjects: Signal Processing (eess.SP); Hardware Architecture (cs.AR); Systems and Control (eess.SY)
Smart sensors are an emerging technology that allows combining the data acquisition with the elaboration directly on the Edge device, very close to the sensors. To push this concept to the extreme, technology companies are proposing a new generation of sensors allowing to move the intelligence from the edge host device, typically a microcontroller, directly to the ultra-low-power sensor itself, in order to further reduce the miniaturization, cost and energy efficiency. This paper evaluates the capabilities of a novel and promising solution from STMicroelectronics. The presence of a floating point unit and an accelerator for binary neural networks provide capabilities for in-sensor feature extraction and machine learning. We propose a comparison of full-precision and binary neural networks for activity recognition with accelerometer data generated by the sensor itself. Experimental results have demonstrated that the sensor can achieve an inference performance of 10.7 cycles/MAC, comparable to a Cortex-M4-based microcontroller, with full-precision networks, and up to 1.5 cycles/MAC with large binary models for low latency inference, with an average energy consumption of only 90 $\mu$J/inference with the core running at 5 MHz.
Cross submissions for Thursday, 1 August 2024 (showing 1 of 1 entries )
- [7] arXiv:2407.18499 (replaced) [pdf, html, other]
-
Title: Non-Overlapping Placement of Macro Cells based on Reinforcement Learning in Chip DesignSubjects: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Due to the increasing complexity of chip design, existing placement methods still have many shortcomings in dealing with macro cells coverage and optimization efficiency. Aiming at the problems of layout overlap, inferior performance, and low optimization efficiency in existing chip design methods, this paper proposes an end-to-end placement method, SRLPlacer, based on reinforcement learning. First, the placement problem is transformed into a Markov decision process by establishing the coupling relationship graph model between macro cells to learn the strategy for optimizing layouts. Secondly, the whole placement process is optimized after integrating the standard cell layout. By assessing on the public benchmark ISPD2005, the proposed SRLPlacer can effectively solve the overlap problem between macro cells while considering routing congestion and shortening the total wire length to ensure routability.