Hardware Architecture
See recent articles
- [1] arXiv:2407.03497 [pdf, other]
-
Title: A 95.5Gb/s 29.6ns worst-case latency ORBGRAND decoder for 6G xURLLCComments: Submitted to IEEE TVLSISubjects: Hardware Architecture (cs.AR)
Ultra-Reliable Low-Latency Communications (URLLC) in both 5G and 6G demand high throughput and short latency with low error rates. Guessing Random Additive Noise Decoding (GRAND) and Ordered Reliability Bits GRAND (ORBGRAND) are powerful universal decoding algorithms that work well with short, high-rate codes. As short forward error correcting codes can help limiting latency, and code unification in 6G calls for flexible, possibly code-agnostic decoders, GRAND and ORBGRAND are well suited to tackle 6G URLLC. This work proposes a ultra-high, constant speed ORBGRAND decoder architecture with very low worst-case and average latency. Compared to a baseline architecture, through out-of-order output, aggressive clock gating, and selective programmability, the decoder reduces area, power, and average latency by 15.5%, 19.4%, and 56%, respectively. In 3nm FinFET technology, it achieves a constant throughput of 95.49Gb/s, with 29.59ns worst-case latency and 13.02ns on average.
- [2] arXiv:2407.03711 [pdf, html, other]
-
Title: Decoupled Access-Execute enabled DVFS for tinyML deployments on STM32 microcontrollersElisavet Lydia Alvanaki, Manolis Katsaragakis, Dimosthenis Masouros, Sotirios Xydis, Dimitrios SoudrisComments: 6 pages, 6 figures, 1 listing, presented in IEEE DATE 2024Journal-ref: 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE) (pp. 1-6). IEEESubjects: Hardware Architecture (cs.AR)
Over the last years the rapid growth Machine Learning (ML) inference applications deployed on the Edge is rapidly increasing. Recent Internet of Things (IoT) devices and microcontrollers (MCUs), become more and more mainstream in everyday activities. In this work we focus on the family of STM32 MCUs. We propose a novel methodology for CNN deployment on the STM32 family, focusing on power optimization through effective clocking exploration and configuration and decoupled access-execute convolution kernel execution. Our approach is enhanced with optimization of the power consumption through Dynamic Voltage and Frequency Scaling (DVFS) under various latency constraints, composing an NP-complete optimization problem. We compare our approach against the state-of-the-art TinyEngine inference engine, as well as TinyEngine coupled with power-saving modes of the STM32 MCUs, indicating that we can achieve up to 25.2% less energy consumption for varying QoS levels.
- [3] arXiv:2407.03962 [pdf, html, other]
-
Title: Multiplier Design Addressing Area-Delay Trade-offs by using DSP and Logic resources on FPGAsComments: Accepted at ASAP 2024Subjects: Hardware Architecture (cs.AR)
The major challenge when designing multipliers for FPGAs is to address several trade-offs: On the one hand at the performance level and on the other hand at the resource level utilizing DSP blocks or look-up tables (LUTs). With DSPs being a relatively limited resource, the problem of under- or over-utilization of DSPs has previously been addressed by the concept of multiplier tiling, by assembling multipliers from DSPs and small supplemental LUT multipliers. But there had always been an efficiency gap between tiling-based multipliers and radix-4 Booth-Arrays. While the monolithic Booth-Array was shown to be considerably more efficient in terms of LUT-resources on many modern FPGA-architectures, it typically possess a significantly higher critically path delay (or latency when pipelined) compared to multipliers designed by tiling. This work proposes and analyzes the use of smaller Booth-Arrays as sub-multipliers that are integrated into existing tiling-based methods, such that better trade-off points between area and delay can be reached while utilizing a user-specified number of DSP blocks. It is shown by synthesis experiments, that the critical path delay compared to large Booth-Arrays can be reduced, while achieving significant reductions in LUT-resources compared to previous tiling.
- [4] arXiv:2407.04182 [pdf, html, other]
-
Title: Towards Generalized On-Chip Communication for Programmable Accelerators in Heterogeneous ArchitecturesJoseph Zuckerman, John-David Wellman, Ajay Vanamali, Manish Shankar, Gabriele Tombesi, Karthik Swaminathan, Kevin Lee, Mohit Kapur, Robert Philhower, Pradip Bose, Luca P. CarloniComments: Appeared in the Sixth International Workshop on Domain Specific System Architecture (DOSSA-6)Subjects: Hardware Architecture (cs.AR)
We present several enhancements to the open-source ESP platform to support flexible and efficient on-chip communication for programmable accelerators in heterogeneous SoCs. These enhancements include 1) a flexible point-to-point communication mechanism between accelerators, 2) a multicast NoC that supports data forwarding to multiple accelerators simultaneously, 3) accelerator synchronization leveraging the SoC's coherence protocol, 4) an accelerator interface that offers fine-grained control over the communication mode used, and 5) an example ISA extension to support our enhancements. Our solution adds negligible area to the SoC architecture and requires minimal changes to the accelerators themselves. We have validated most of these features in complex FPGA prototypes and plan to include them in the open-source release of ESP in the coming months.
- [5] arXiv:2407.04292 [pdf, html, other]
-
Title: Corki: Enabling Real-time Embodied AI Robots via Algorithm-Architecture Co-DesignYiyang Huang, Yuhui Hao, Bo Yu, Feng Yan, Yuxin Yang, Feng Min, Yinhe Han, Lin Ma, Shaoshan Liu, Qiang Liu, Yiming GanSubjects: Hardware Architecture (cs.AR); Robotics (cs.RO)
Embodied AI robots have the potential to fundamentally improve the way human beings live and manufacture. Continued progress in the burgeoning field of using large language models to control robots depends critically on an efficient computing substrate. In particular, today's computing systems for embodied AI robots are designed purely based on the interest of algorithm developers, where robot actions are divided into a discrete frame-basis. Such an execution pipeline creates high latency and energy consumption. This paper proposes Corki, an algorithm-architecture co-design framework for real-time embodied AI robot control. Our idea is to decouple LLM inference, robotic control and data communication in the embodied AI robots compute pipeline. Instead of predicting action for one single frame, Corki predicts the trajectory for the near future to reduce the frequency of LLM inference. The algorithm is coupled with a hardware that accelerates transforming trajectory into actual torque signals used to control robots and an execution pipeline that parallels data communication with computation. Corki largely reduces LLM inference frequency by up to 8.0x, resulting in up to 3.6x speed up. The success rate improvement can be up to 17.3%. Code is provided for re-implementation. this https URL
- [6] arXiv:2407.04404 [pdf, other]
-
Title: Multi-Antenna Technology for 6G Integrated Sensing and CommunicationYong Zeng, Zhenjun Dong, Huizhi Wang, Lipeng Zhu, Ziyao Hong, Qingji Jiang, Dongming Wang, Shi Jin, Rui ZhangComments: in Chinese languageSubjects: Hardware Architecture (cs.AR)
By deploying antenna arrays at the transmitter/receiver to provide additional spatial-domain degrees of freedom (DoFs), multi-antenna technology greatly improves the reliability and efficiency of wireless communication. Meanwhile, the application of multi-antenna technology in the radar field has achieved spatial angle resolution and improved sensing DoF, thus significantly enhancing wireless sensing performance. However, wireless communication and radar sensing have undergone independent development over the past few decades. As a result, although multi-antenna technology has dramatically advanced in these two fields separately, it has not been deeply integrated by exploiting their synergy. A new opportunity to fill up this gap arises as the integration of sensing and communication has been identified as one of the typical usage scenarios of the 6G communication network. Motivated by the above, this article aims to explore the multi-antenna technology for 6G ISAC, with the focus on its future development trends such as continuous expansion of antenna array scale, more diverse array architectures, and more flexible antenna designs. First, we introduce several new and promising antenna architectures, including the centralized antenna architectures based on traditional compact arrays or emerging sparse arrays, the distributed antenna architectures exemplified by the cell-free massive MIMO, and the movable/fluid antennas with flexible positions and/or orientations in a given 3D space. Next, for each antenna architecture mentioned above, we present the corresponding far-field/near-field channel models and analyze the communication and sensing performance. Finally, we summarize the characteristics of different antenna architectures and look forward to new ideas for solving the difficulties in acquiring CSI caused by the continuous expansion of antenna array scale and flexible antenna designs.
New submissions for Monday, 8 July 2024 (showing 6 of 6 entries )
- [7] arXiv:2407.03852 (cross-list from quant-ph) [pdf, html, other]
-
Title: Low-latency machine learning FPGA accelerator for multi-qubit state discriminationPradeep Kumar Gautam, Shantharam Kalipatnapu, Shankaranarayanan H, Ujjawal Singhal, Benjamin Lienhard, Vibhor Singh, Chetan Singh ThakurComments: 10 pages, 6 figuresSubjects: Quantum Physics (quant-ph); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Measuring a qubit is a fundamental yet error prone operation in quantum computing. These errors can stem from various sources such as crosstalk, spontaneous state-transitions, and excitation caused by the readout pulse. In this work, we utilize an integrated approach to deploy neural networks (NN) on to field programmable gate arrays (FPGA). We demonstrate that it is practical to design and implement a fully connected neural network accelerator for frequency-multiplexed readout balancing computational complexity with low latency requirements without significant loss in accuracy. The neural network is implemented by quantization of weights, activation functions, and inputs. The hardware accelerator performs frequency-multiplexed readout of 5 superconducting qubits in less than 50 ns on RFSoC ZCU111 FPGA which is first of its kind in the literature. These modules can be implemented and integrated in existing Quantum control and readout platforms using a RFSoC ZCU111 ready for experimental deployment.
Cross submissions for Monday, 8 July 2024 (showing 1 of 1 entries )
- [8] arXiv:2311.12359 (replaced) [pdf, html, other]
-
Title: Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAsShivam Aggarwal, Hans Jakob Damsgaard, Alessandro Pappalardo, Giuseppe Franco, Thomas B. Preußer, Michaela Blott, Tulika MitraComments: Accepted in FPL (International Conference on Field-Programmable Logic and Applications) 2024 conference. Revised with updated resultsSubjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG); Performance (cs.PF)
Post-training quantization (PTQ) is a powerful technique for model compression, reducing the numerical precision in neural networks without additional training overhead. Recent works have investigated adopting 8-bit floating-point formats(FP8) in the context of PTQ for model inference. However, floating-point formats smaller than 8 bits and their relative comparison in terms of accuracy-hardware cost with integers remains unexplored on FPGAs. In this work, we present minifloats, which are reduced-precision floating-point formats capable of further reducing the memory footprint, latency, and energy cost of a model while approaching full-precision model accuracy. We implement a custom FPGA-based multiply-accumulate operator library and explore the vast design space, comparing minifloat and integer representations across 3 to 8 bits for both weights and activations. We also examine the applicability of various integerbased quantization techniques to minifloats. Our experiments show that minifloats offer a promising alternative for emerging workloads such as vision transformers.
- [9] arXiv:2401.02669 (replaced) [pdf, html, other]
-
Title: Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCacheBin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei LinSubjects: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR)
Large Language Models (LLMs) demonstrate substantial potential across a diverse array of domains via request serving. However, as trends continue to push for expanding context sizes, the autoregressive nature of LLMs results in highly dynamic behavior of the attention layers, showcasing significant differences in computational characteristics and memory requirements from the non-attention layers. This presents substantial challenges for resource management and performance optimization in service systems. Existing static model parallelism and resource allocation strategies fall short when dealing with this dynamicity. To address the issue, we propose Infinite-LLM, a novel LLM serving system designed to effectively handle dynamic context lengths. Infinite-LLM disaggregates attention layers from an LLM's inference process, facilitating flexible and independent resource scheduling that optimizes computational performance and enhances memory utilization jointly. By leveraging a pooled GPU memory strategy across a cluster, Infinite-LLM not only significantly boosts system throughput but also supports extensive context lengths. Evaluated on a dataset with context lengths ranging from a few to 2000K tokens across a cluster with 32 A100 GPUs, Infinite-LLM demonstrates throughput improvement of 1.35-3.4x compared to state-of-the-art methods, enabling efficient and elastic LLM deployment.
- [10] arXiv:2407.02353 (replaced) [pdf, html, other]
-
Title: Roadmap to Neuromorphic Computing with Emerging TechnologiesAdnan Mehonic, Daniele Ielmini, Kaushik Roy, Onur Mutlu, Shahar Kvatinsky, Teresa Serrano-Gotarredona, Bernabe Linares-Barranco, Sabina Spiga, Sergey Savelev, Alexander G Balanov, Nitin Chawla, Giuseppe Desoli, Gerardo Malavena, Christian Monzio Compagnoni, Zhongrui Wang, J Joshua Yang, Ghazi Sarwat Syed, Abu Sebastian, Thomas Mikolajick, Beatriz Noheda, Stefan Slesazeck, Bernard Dieny, Tuo-Hung (Alex)Hou, Akhil Varri, Frank Bruckerhoff-Pluckelmann, Wolfram Pernice, Xixiang Zhang, Sebastian Pazos, Mario Lanza, Stefan Wiefels, Regina Dittmann, Wing H Ng, Mark Buckwell, Horatio RJ Cox, Daniel J Mannion, Anthony J Kenyon, Yingming Lu, Yuchao Yang, Damien Querlioz, Louis Hutin, Elisa Vianello, Sayeed Shafayet Chowdhury, Piergiulio Mannocci, Yimao Cai, Zhong Sun, Giacomo Pedretti, John Paul Strachan, Dmitri Strukov, Manuel Le Gallo, Stefano Ambrogio, Ilia Valov, Rainer WaserComments: 90 pages, 22 figures, roadmap, neuromorphicSubjects: Signal Processing (eess.SP); Hardware Architecture (cs.AR); Systems and Control (eess.SY)
The roadmap is organized into several thematic sections, outlining current computing challenges, discussing the neuromorphic computing approach, analyzing mature and currently utilized technologies, providing an overview of emerging technologies, addressing material challenges, exploring novel computing concepts, and finally examining the maturity level of emerging technologies while determining the next essential steps for their advancement.