Search | arXiv e-print repository

Enhancing Layout Hotspot Detection Efficiency with YOLOv8 and PCA-Guided Augmentation

Authors: Dongyang Wu, Siyang Wang, Mehdi Kamal, Massoud Pedram

Abstract: In this paper, we present a YOLO-based framework for layout hotspot detection, aiming to enhance the efficiency and performance of the design rule checking (DRC) process. Our approach leverages the YOLOv8 vision model to detect multiple hotspots within each layout image, even when dealing with large layout image sizes. Additionally, to enhance pattern-matching effectiveness, we introduce a novel approach to augment the layout image using information extracted through Principal Component Analysis (PCA). The core of our proposed method is an algorithm that utilizes PCA to extract valuable auxiliary information from the layout image. This extracted information is then incorporated into the layout image as an additional color channel. This augmentation significantly improves the accuracy of multi-hotspot detection while reducing the false alarm rate of the object detection algorithm. We evaluate the effectiveness of our framework using four datasets generated from layouts found in the ICCAD-2019 benchmark dataset. The results demonstrate that our framework achieves a precision (recall) of approximately 83% (86%) while maintaining a false alarm rate of less than 7.4%. Also, the studies show that the proposed augmentation approach could improve the detection ability of never-seen-before (NSB) hotspots by about 10%.

Submitted 19 July, 2024; originally announced July 2024.

arXiv:2407.12736 [pdf, other]

CHOSEN: Compilation to Hardware Optimization Stack for Efficient Vision Transformer Inference

Authors: Mohammad Erfan Sadeghi, Arash Fayyazi, Suhas Somashekar, Massoud Pedram

Abstract: Vision Transformers (ViTs) represent a groundbreaking shift in machine learning approaches to computer vision. Unlike traditional approaches, ViTs employ the self-attention mechanism, which has been widely used in natural language processing, to analyze image patches. Despite their advantages in modeling visual tasks, deploying ViTs on hardware platforms, notably Field-Programmable Gate Arrays (FPGAs), introduces considerable challenges. These challenges stem primarily from the non-linear calculations and high computational and memory demands of ViTs. This paper introduces CHOSEN, a software-hardware co-design framework that addresses these challenges and offers an automated framework for ViT deployment on FPGAs in order to maximize performance. Our framework is built upon three fundamental contributions: a multi-kernel design to maximize the bandwidth, mainly targeting the benefits of multiple DDR memory banks; approximate non-linear functions that exhibit minimal accuracy degradation and make efficient use of available logic blocks on the FPGA; and an efficient compiler that maximizes the performance and memory efficiency of the computing kernels by presenting a novel algorithm for design space exploration to find the hardware configuration that achieves optimal throughput and latency. Compared to the state-of-the-art ViT accelerators, CHOSEN achieves a 1.5x and 1.42x improvement in the throughput on the DeiT-S and DeiT-B models.

Submitted 24 July, 2024; v1 submitted 17 July, 2024; originally announced July 2024.

arXiv:2407.08192 [pdf, other]

ARCO: Adaptive Multi-Agent Reinforcement Learning-Based Hardware/Software Co-Optimization Compiler for Improved Performance in DNN Accelerator Design

Authors: Arya Fayyazi, Mehdi Kamal, Massoud Pedram

Abstract: This paper presents ARCO, an adaptive Multi-Agent Reinforcement Learning (MARL)-based co-optimizing compilation framework designed to enhance the efficiency of mapping machine learning (ML) models - such as Deep Neural Networks (DNNs) - onto diverse hardware platforms. The framework incorporates three specialized actor-critic agents within MARL, each dedicated to a distinct aspect of compilation/optimization at an abstract level: one agent focuses on hardware, while two agents focus on software optimizations. This integration results in a collaborative hardware/software co-optimization strategy that improves the precision and speed of DNN deployments. Concentrating on high-confidence configurations simplifies the search space and delivers superior performance compared to current optimization methods. The ARCO framework surpasses existing leading frameworks, achieving a throughput increase of up to 37.95% while reducing the optimization time by up to 42.2% across various DNNs.

Submitted 22 July, 2024; v1 submitted 11 July, 2024; originally announced July 2024.

Comments: Under review

arXiv:2406.14854 [pdf, other]

PEANO-ViT: Power-Efficient Approximations of Non-Linearities in Vision Transformers

Authors: Mohammad Erfan Sadeghi, Arash Fayyazi, Seyedarmin Azizi, Massoud Pedram

Abstract: The deployment of Vision Transformers (ViTs) on hardware platforms, especially Field-Programmable Gate Arrays (FPGAs), presents many challenges, which are mainly due to the substantial computational and power requirements of their non-linear functions, notably layer normalization, softmax, and Gaussian Error Linear Unit (GELU). These critical functions pose significant obstacles to efficient hardware implementation due to their complex mathematical operations and the inherent resource count and architectural limitations of FPGAs. PEANO-ViT offers a novel approach to streamlining the implementation of the layer normalization layer by introducing a division-free technique that simultaneously approximates the division and square root function. Additionally, PEANO-ViT provides a multi-scale division strategy to eliminate division operations in the softmax layer, aided by a Pade-based approximation for the exponential function. Finally, PEANO-ViT introduces a piece-wise linear approximation for the GELU function, carefully designed to bypass the computationally intensive operations associated with GELU. In our comprehensive evaluations, PEANO-ViT exhibits minimal accuracy degradation (<= 0.5% for DeiT-B) while significantly enhancing power efficiency, achieving improvements of 1.91x, 1.39x, 8.01x for layer normalization, softmax, and GELU, respectively. This improvement is achieved through substantial reductions in DSP, LUT, and register counts for these non-linear operations. Consequently, PEANO-ViT enables efficient deployment of Vision Transformers on resource- and power-constrained FPGA platforms.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.12832 [pdf, other]

LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation

Authors: Seyedarmin Azizi, Souvik Kundu, Massoud Pedram

Abstract: Low-rank adaptation (LoRA) has become the default approach to fine-tune large language models (LLMs) due to its significant reduction in trainable parameters. However, trainable parameter demand for LoRA increases with increasing model embedding dimensions, leading to high compute costs. Additionally, its backward updates require storing high-dimensional intermediate activations and optimizer states, demanding high peak GPU memory. In this paper, we introduce large model fine-tuning via spectrally decomposed low-dimensional adaptation (LaMDA), a novel approach to fine-tuning large language models, which leverages low-dimensional adaptation to achieve significant reductions in trainable parameters and peak GPU memory footprint. LaMDA freezes a first projection matrix (PMA) in the adaptation path while introducing a low-dimensional trainable square matrix, resulting in substantial reductions in trainable parameters and peak GPU memory usage. LaMDA gradually freezes a second projection matrix (PMB) during the early fine-tuning stages, reducing the compute cost associated with weight updates to enhance parameter efficiency further. We also present an enhancement, LaMDA++, incorporating a "lite-weight" adaptive rank allocation for the LoRA path via normalized spectrum analysis of pre-trained model weights. We evaluate LaMDA/LaMDA++ across various tasks, including natural language understanding with the GLUE benchmark, text summarization, natural language generation, and complex reasoning on different LLMs. Results show that LaMDA matches or surpasses the performance of existing alternatives while requiring up to 17.7x fewer parameter updates and up to 1.32x lower peak GPU memory usage during fine-tuning. Code will be publicly available.

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.08871 [pdf, other]

Superconductor bistable vortex memory for data storage and in-memory computing

Authors: Mustafa Altay Karamuftuoglu, Beyza Zeynep Ucpinar, Sasan Razmkhah, Massoud Pedram

Abstract: Superconductor electronics (SCE) is a promising complementary and beyond-CMOS technology. However, despite its practical benefits, the realization of SCE logic faces a significant challenge due to the absence of dense and scalable nonvolatile memory designs. While various nonvolatile memory technologies, including non-destructive readout, vortex transitional memory (VTM), and magnetic memory, have been explored, achieving a superconductor random-access memory (RAM) crossbar array remains challenging. This paper introduces a novel, nonvolatile, high-density, and scalable VTM cell design for SCE applications. Our proposed design addresses scaling issues while boasting zero static power consumption characteristics. Our design leverages current summation, enabling analog multiply-accumulate operations, an essential feature for many in-memory computational tasks. We demonstrate the efficacy of our approach with a 32 x 32 superconductor memory array operating at 20 GHz. This design effectively addresses scaling issues and utilizes current summation that can be used for analog multiply-accumulate operations. Additionally, we showcase the accumulation property of the memory through analog simulations conducted on an 8 x 8 superconductor crossbar array.

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2402.16384 [pdf, other]

Scalable Superconductor Neuron with Ternary Synaptic Connections for Ultra-Fast SNN Hardware

Authors: Mustafa Altay Karamuftuoglu, Beyza Zeynep Ucpinar, Arash Fayyazi, Sasan Razmkhah, Mehdi Kamal, Massoud Pedram

Abstract: A novel high-fan-in differential superconductor neuron structure designed for ultra-high-performance Spiking Neural Network (SNN) accelerators is presented. Utilizing a high-fan-in neuron structure allows us to design SNN accelerators with more synaptic connections, enhancing the overall network capabilities. The proposed neuron design is based on superconductor electronics fabric, incorporating multiple superconducting loops, each with two Josephson Junctions. This arrangement enables each input data branch to have positive and negative inductive coupling, supporting excitatory and inhibitory synaptic data. Compatibility with synaptic devices and thresholding operation is achieved using a single flux quantum (SFQ) pulse-based logic style. The neuron design, along with ternary synaptic connections, forms the foundation for a superconductor-based SNN inference. To demonstrate the capabilities of our design, we train the SNN using snnTorch, augmenting the PyTorch framework. After pruning, the demonstrated SNN inference achieves an impressive 96.1% accuracy on MNIST images. Notably, the network exhibits a remarkable throughput of 8.92 GHz while consuming only 1.5 nJ per inference, including the energy consumption associated with cooling to 4K. These results underscore the potential of superconductor electronics in developing high-performance and ultra-energy-efficient neural network accelerator architectures.

Submitted 27 February, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

Comments: 9 pages, 5 figures, 2 tables

arXiv:2402.13027 [pdf, other]

Solving the decision-making differential equations from eye fixation data in Unity software by using Hermite Long-Short-Term Memory neural network

Authors: Kourosh Parand, Saeed Setayeshi, Mir Mohsen Pedram, Ali Yoonesi, Aida Pakniyat

Abstract: Cognitive decision-making processes are crucial aspects of human behavior, influencing various personal and professional domains. This research delves into the application of differential equations in analyzing decision-making accuracy by leveraging eye-tracking data within a virtual industrial town setting. The study unveils a systematic approach to transforming raw data into a differential equation, essential for deciphering the relationship between eye movements during decision-making processes. Mathematical relationship extraction and variable-parameter definition pave the way for deriving a differential equation that encapsulates the growth of fixations on characters. The key factors in this equation encompass the fixation rate (λ) and separation rate (μ), reflecting user interaction dynamics and their impact on decision-making complexities tied to user engagement with virtual characters. For a comprehensive grasp of decision dynamics, solving this differential equation requires initial fixation counts, fixation rate, and separation rate. The formulation of differential equations incorporates various considerations such as engagement duration, character-player distance, relative speed, and character attributes, enabling the representation of fixation changes, speed dynamics, distance variations, and the effects of character attributes. This comprehensive analysis not only enhances our comprehension of decision-making processes but also provides a foundational framework for predictive modeling and data-driven insights for future research and applications in cognitive science and virtual reality environments.

Submitted 23 February, 2024; v1 submitted 20 February, 2024; originally announced February 2024.

arXiv:2402.06004 [pdf, other]

Memory-Efficient Vision Transformers: An Activation-Aware Mixed-Rank Compression Strategy

Authors: Seyedarmin Azizi, Mahdi Nazemi, Massoud Pedram

Abstract: As Vision Transformers (ViTs) increasingly set new benchmarks in computer vision, their practical deployment on inference engines is often hindered by their significant memory bandwidth and (on-chip) memory footprint requirements. This paper addresses this memory limitation by introducing an activation-aware model compression methodology that uses selective low-rank weight tensor approximations of different layers to reduce the parameter count of ViTs. The key idea is to decompose the weight tensors into a sum of two parameter-efficient tensors while minimizing the error between the product of the input activations with the original weight tensor and the product of the input activations with the approximate tensor sum. This approximation is further refined by adopting an efficient layer-wise error compensation technique that uses the gradient of the layer's output loss. The combination of these techniques achieves excellent results while it avoids being trapped in a shallow local minimum early in the optimization process and strikes a good balance between the model compression and output accuracy. Notably, the presented method significantly reduces the parameter count of DeiT-B by 60% with less than 1% accuracy drop on the ImageNet dataset, overcoming the usual accuracy degradation seen in low-rank approximations. In addition to this, the presented compression technique can compress large DeiT/ViT models to have about the same model size as smaller DeiT/ViT variants while yielding up to 1.8% accuracy gain. These results highlight the efficacy of our approach, presenting a viable solution for embedding ViTs in memory-constrained environments without compromising their performance.

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2312.02210 [pdf, other]

Low-Precision Mixed-Computation Models for Inference on Edge

Authors: Seyedarmin Azizi, Mahdi Nazemi, Mehdi Kamal, Massoud Pedram

Abstract: This paper presents a mixed-computation neural network processing approach for edge applications that incorporates low-precision (low-width) Posit and low-precision fixed point (FixP) number systems. This mixed-computation approach employs 4-bit Posit (Posit4), which has higher precision around zero, for representing weights with high sensitivity, while it uses 4-bit FixP (FixP4) for representing other weights. A heuristic for analyzing the importance and the quantization error of the weights is presented to assign the proper number system to different weights. Additionally, a gradient approximation for Posit representation is introduced to improve the quality of weight updates in the backpropagation process. Due to the high energy consumption of the fully Posit-based computations, neural network operations are carried out in FixP or Posit/FixP. An efficient hardware implementation of a MAC operation with a first Posit operand and FixP for a second operand and accumulator is presented. The efficacy of the proposed low-precision mixed-computation approach is extensively assessed on vision and language models. The results show that, on average, the accuracy of the mixed-computation is about 1.5% higher than that of FixP with a cost of 0.19% energy overhead.

Submitted 2 December, 2023; originally announced December 2023.

arXiv:2311.09238 [pdf, other]

doi 10.1109/TMC.2018.2843373

Toward Ultra-Low-Power Remote Health Monitoring: An Optimal and Adaptive Compressed Sensing Framework for Activity Recognition

Authors: J. Pagan, R. Fallahzadeh, M. Pedram, José L. Risco-Martín, J. M. Moya, J. L. Ayala, H. Ghasemzadeh

Abstract: Activity recognition, as an important component of behavioral monitoring and intervention, has attracted enormous attention, especially in Mobile Cloud Computing (MCC) and Remote Health Monitoring (RHM) paradigms. While resource-constrained wearable devices have recently been gaining popularity, their battery life is limited and constrained by the frequent wireless transmission of data to more computationally powerful back-ends. This paper proposes an ultra-low power activity recognition system using a novel adaptive compressed sensing technique that aims to minimize transmission costs. Coarse-grained on-body sensor localization and unsupervised clustering modules are devised to autonomously reconfigure the compressed sensing module for further power saving. We perform a thorough heuristic optimization using Grammatical Evolution (GE) to ensure minimal computation overhead of the proposed methodology. Our evaluation on a real-world dataset and a low power wearable sensing node demonstrates that our approach can reduce the energy consumption of the wireless data transmission by up to 81.2% and 61.5%, with up to 60.6% and 35.0% overall power savings in comparison with the baseline and a naive state-of-the-art approach, respectively. These solutions lead to an average activity recognition accuracy of 89.0% (only 4.8% less than the baseline accuracy) while having a negligible energy overhead of on-node computation.

Submitted 2 November, 2023; originally announced November 2023.

Journal ref: IEEE Transactions on Mobile Computing, 18(3), pp. 658-673, 2019

arXiv:2310.13857 [pdf, other]

Superconductor Logic Implementation with All-JJ Inductor-Free Cell Library

Authors: Haolin Cong, Sasan Razmkhah, Mustafa Altay Karamuftuoglu, Massoud Pedram

Abstract: Single flux quantum (SFQ) technology has garnered significant attention due to its low switching power and high operational speed. Researchers have been actively pursuing more advanced devices and technologies to further reduce the reliance on inductors, bias, and dynamic power. Recently, innovative magnetic Josephson junction devices have emerged, enhancing the field of superconductor electronics (SCE) logic. This paper introduces a novel cell library design that relies entirely on Josephson junctions (JJs), showing promising potential for eliminating the need for inductors in conventional SFQ cells. This results in a 55% reduction in cell size and an 80% decrease in both static and dynamic power consumption. The proposed library implements a half flux quantum (HFQ) logic, where each pulse duration is half that of a single flux quantum pulse. The paper presents the schematics of the basic cells, emphasizing critical circuit parameters and their margins. Additionally, it examines layout blueprints, showcasing the advantageous area-saving characteristics of the proposed design.

Submitted 20 October, 2023; originally announced October 2023.

Comments: 9 pages, 28 figures, 13 tables

arXiv:2310.07824 [pdf, other]

doi 10.1109/TASC.2024.3359164

An On-Chip Trainable Neuron Circuit for SFQ-Based Spiking Neural Networks

Authors: Beyza Zeynep Ucpinar, Mustafa Altay Karamuftuoglu, Sasan Razmkhah, Massoud Pedram

Abstract: We present an on-chip trainable neuron circuit. Our proposed circuit suits bio-inspired spike-based time-dependent data computation for training spiking neural networks (SNN). The thresholds of neurons can be increased or decreased depending on the desired application-specific spike generation rate. This mechanism provides us with a flexible design and scalable circuit structure. We demonstrate the trainable neuron structure under different operating scenarios. The circuits are designed and optimized for the MIT LL SFQ5ee fabrication process. Margin values for all parameters are above 25% with a 3 GHz throughput for a 16-input neuron.

Submitted 11 October, 2023; originally announced October 2023.

Comments: 5 pages, 8 figures. The work was presented in EUCAS 2023

Journal ref: in IEEE Transactions on Applied Superconductivity, vol. 34, no. 3, pp. 1-6, May 2024, Art no. 1300506

arXiv:2310.03918 [pdf, other]

doi 10.1109/TASC.2024.3367618

Unsupervised SFQ-Based Spiking Neural Network

Authors: Mustafa Altay Karamuftuoglu, Beyza Zeynep Ucpinar, Sasan Razmkhah, Mehdi Kamal, Massoud Pedram

Abstract: Single Flux Quantum (SFQ) technology represents a groundbreaking advancement in computational efficiency and ultra-high-speed neuromorphic processing. The key features of SFQ technology, particularly data representation, transmission, and processing through SFQ pulses, closely mirror fundamental aspects of biological neural structures. Consequently, SFQ-based circuits emerge as an ideal candidate for realizing Spiking Neural Networks (SNNs). This study presents a proof-of-concept demonstration of an SFQ-based SNN architecture, showcasing its capacity for ultra-fast switching at remarkably low energy consumption per output activity. Notably, our work introduces innovative approaches: (i) We introduce a novel spike-timing-dependent plasticity mechanism to update synapses and to trace spike-activity by incorporating a leaky non-destructive readout circuit. (ii) We propose a novel method to dynamically regulate the threshold behavior of leaky integrate and fire superconductor neurons, enhancing the adaptability of our SNN architecture. (iii) Our research incorporates a novel winner-take-all mechanism, aligning with practical strategies for SNN development and enabling effective decision-making processes. The effectiveness of these proposed structural enhancements is evaluated by integrating high-level models into the BindsNET framework. By leveraging BindsNET, we model the online training of an SNN, integrating the novel structures into the learning process. To ensure the robustness and functionality of our circuits, we employ JoSIM for circuit parameter extraction and functional verification through simulation.

Submitted 5 October, 2023; originally announced October 2023.

Journal ref: IEEE Transactions on Applied Superconductivity, vol. 34, no. 3, pp. 1-8, May 2024, Art no. 1300708

arXiv:2309.14613 [pdf, other]

Design of a Superconducting Multiflux Non-Destructive Readout Memory Unit

Authors: Beyza Zeynep Ucpinar, Yasemin Kopur, Mustafa Altay Karamuftuoglu, Sasan Razmkhah, Massoud Pedram

Abstract: Due to low power consumption and high-speed performance, superconductor circuit technology has emerged as an attractive and compelling post-CMOS technology candidate. However, the design of dense memory circuits presents a significant challenge, especially for tasks that demand substantial memory resources. While superconductor memory cells offer impressive speed, their limited density is the primary yet-to-be-solved challenge. This study tackles this challenge head-on by introducing a novel design for a Non-Destructive Readout (NDRO) memory unit with single or multi-fluxon storage capabilities within the same circuit architecture. Notably, single storage demonstrates a critical margin exceeding 20%, and multi-fluxon storage demonstrates 64%, ensuring reliable and robust operation even in the face of process variations. These memory units exhibit high clock frequencies of 10 GHz. The proposed circuits offer compelling characteristics, including rapid data propagation and minimal data refreshment requirements, while effectively addressing the density concerns associated with superconductor memory, doubling the memory capacity while maintaining the high throughput speed.

Submitted 12 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

Comments: 6 pages, 11 figures

arXiv:2309.03407 [pdf, other]

doi 10.1103/PhysRevB.109.014511

A Josephson Parametric Oscillator-Based Ising Machine

Authors: Sasan Razmkhah, Mehdi Kamal, Nobuyuki Yoshikawa, Massoud Pedram

Abstract: Ising machines have emerged as a promising solution for rapidly solving NP-complete combinatorial optimization problems, surpassing the capabilities of traditional computing methods. By efficiently determining the ground state of the Hamiltonian during the annealing process, Ising machines can effectively complement CPUs in tackling optimization challenges. To realize these Ising machines, a bi-stable oscillator is essential to emulate the atomic spins and interactions of the Ising model. This study introduces a Josephson parametric oscillator (JPO)-based tile structure, serving as a fundamental unit for scalable superconductor-based Ising machines. Leveraging the bi-stable nature of JPOs, which are superconductor-based oscillators, the proposed machine can operate at frequencies of 7.5GHz while consuming significantly less power (by three orders of magnitude) than CMOS-based systems. Furthermore, the compatibility of the proposed tile structure with the Lechner-Hauke-Zoller (LHZ) architecture ensures its viability for large-scale integration. We conducted simulations of the tile in a noisy environment to validate its functionality. We verified its operational characteristics by comparing the results with the analytical solution of its Hamiltonian model. This verification demonstrates the feasibility and effectiveness of the JPO-based tile in implementing Ising machines, opening new avenues for efficient and scalable combinatorial optimization in quantum computing.

Submitted 12 December, 2023; v1 submitted 6 September, 2023; originally announced September 2023.

Comments: 9 pages, 10 figures, 31 references. Accepted by PRB

Journal ref: Phys. Rev. B, vol. 109, p. 014511, Jan 2024

arXiv:2308.06422 [pdf, other]

Sensitivity-Aware Mixed-Precision Quantization and Width Optimization of Deep Neural Networks Through Cluster-Based Tree-Structured Parzen Estimation

Authors: Seyedarmin Azizi, Mahdi Nazemi, Arash Fayyazi, Massoud Pedram

Abstract: As the complexity and computational demands of deep learning models rise, the need for effective optimization methods for neural network designs becomes paramount. This work introduces an innovative search mechanism for automatically selecting the best bit-width and layer-width for individual neural network layers. This leads to a marked enhancement in deep neural network efficiency. The search domain is strategically reduced by leveraging Hessian-based pruning, ensuring the removal of non-crucial parameters. Subsequently, we detail the development of surrogate models for favorable and unfavorable outcomes by employing a cluster-based tree-structured Parzen estimator. This strategy allows for a streamlined exploration of architectural possibilities and swift pinpointing of top-performing designs. Through rigorous testing on well-known datasets, our method proves its distinct advantage over existing methods. Compared to leading compression strategies, our approach records an impressive 20% decrease in model size without compromising accuracy. Additionally, our method boasts a 12x reduction in search time relative to the best search-focused strategies currently available. As a result, our proposed method represents a leap forward in neural network design optimization, paving the way for quick model design and implementation in settings with limited resources, thereby propelling the potential of scalable deep learning solutions.

Submitted 16 August, 2023; v1 submitted 11 August, 2023; originally announced August 2023.

arXiv:2307.12216 [pdf, other]

A Life-Cycle Energy and Inventory Analysis of Adiabatic Quantum-Flux-Parametron Circuits

Authors: Masoud Zabihi, Yanyue Xie, Zhengang Li, Peiyan Dong, Geng Yuan, Olivia Chen, Massoud Pedram, Yanzhi Wang

Abstract: The production process of superconductive integrated circuits is complex and consumes significant amounts of resources and energy. Therefore, it is crucial to evaluate the environmental impact of this emerging technology. An attractive option for the next generation of superconductive technology is Adiabatic Quantum-Flux-Parametron (AQFP) devices. This study is the first to present a comprehensive process-based life-cycle assessment (LCA) and inventory analysis of AQFP integrated circuits. To generate relevant outcomes, we conduct a comparative LCA that includes the bulk CMOS technology. The inventory analysis considers the manufacturing, assembly, and use phases of the circuits. To ensure a fair assessment, we choose the 32-bit AQFP RISC-V single-core processor as the reference functional unit and compare its performance with that of a CMOS counterpart. Our findings reveal that the AQFP processor consumes several orders of magnitude less energy during the use phase than its CMOS counterpart. Consequently, the total life cycle energy (which encompasses manufacturing and assembly energies) of AQFP integrated circuits improves by at least two orders of magnitude.

Submitted 22 July, 2023; originally announced July 2023.

arXiv:2307.07503 [pdf]

Brain Tumor Detection using Convolutional Neural Networks with Skip Connections

Authors: Aupam Hamran, Marzieh Vaeztourshizi, Amirhossein Esmaili, Massoud Pedram

Abstract: In this paper, we present different architectures of Convolutional Neural Networks (CNN) to analyze and classify brain tumors into benign and malignant types using the Magnetic Resonance Imaging (MRI) technique. Different CNN architecture optimization techniques, such as widening and deepening of the network and adding skip connections, are applied to improve the accuracy of the network. Results show that a subset of these techniques can judiciously be used to outperform a baseline CNN model used for the same purpose.

Submitted 14 July, 2023; originally announced July 2023.

arXiv:2307.03784 [pdf, other]

NeuroBlend: Towards Low-Power yet Accurate Neural Network-Based Inference Engine Blending Binary and Fixed-Point Convolutions

Authors: Arash Fayyazi, Mahdi Nazemi, Arya Fayyazi, Massoud Pedram

Abstract: This paper introduces NeuroBlend, a novel neural network architecture featuring a unique building block known as the Blend module. This module incorporates binary and fixed-point convolutions in its main and skip paths, respectively. There is a judicious deployment of batch normalizations on both main and skip paths inside the Blend module and in between consecutive Blend modules. Additionally, we present a compiler and hardware architecture designed to map NeuroBlend models onto FPGA devices, aiming to minimize inference latency while maintaining high accuracy. Our NeuroBlend-20 (NeuroBlend-18) model, derived from ResNet-20 (ResNet-18) trained on CIFAR-10 (CIFAR-100), achieves 88.0% (73.73%) classification accuracy, outperforming state-of-the-art binary neural networks by 0.8% (1.33%), with an inference time of 0.38 ms per image, 1.4x faster than previous FPGA implementation for BNNs. Similarly, our BlendMixer model for CIFAR-10 attains 90.6% accuracy (1.59% less than full precision MLPMixer), with a 3.5x reduction in model size compared to full precision MLPMixer. Furthermore, leveraging DSP blocks for 48-bit bitwise logic operations enables low-power FPGA implementation, yielding a 2.5x reduction in power consumption.

Submitted 1 May, 2024; v1 submitted 7 July, 2023; originally announced July 2023.

Comments: 6 pages. In Proceedings of GLSVLSI 2024

arXiv:2305.04526 [pdf, other]

CrAFT: Compression-Aware Fine-Tuning for Efficient Visual Task Adaptation

Authors: Jung Hwan Heo, Seyedarmin Azizi, Arash Fayyazi, Massoud Pedram

Abstract: Transfer learning has become a popular task adaptation method in the era of foundation models. However, many foundation models require large storage and computing resources, which makes off-the-shelf deployment impractical. Post-training compression techniques such as pruning and quantization can help lower deployment costs. Unfortunately, the resulting performance degradation limits the usability and benefits of such techniques. To close this performance gap, we propose CrAFT, a simple fine-tuning framework that enables effective post-training network compression. In CrAFT, users simply employ the default fine-tuning schedule along with a sharpness minimization objective, simultaneously facilitating task adaptation and compression-friendliness. Contrary to the conventional sharpness minimization techniques, which are applied during pretraining, the CrAFT approach adds negligible training overhead as fine-tuning is done in under a couple of minutes or hours with a single GPU. The effectiveness of CrAFT, which is a general-purpose tool that can significantly boost one-shot pruning and post-training quantization, is demonstrated on both convolution-based and attention-based vision foundation models on a variety of target tasks. The code will be made publicly available.

Submitted 8 July, 2023; v1 submitted 8 May, 2023; originally announced May 2023.

Comments: Preprint

arXiv:2304.06299 [pdf, other]

Algorithms and Hardware for Efficient Processing of Logic-based Neural Networks

Authors: Jingkai Hong, Arash Fayyazi, Amirhossein Esmaili, Mahdi Nazemi, Massoud Pedram

Abstract: Recent efforts to improve the performance of neural network (NN) accelerators that meet today's application requirements have given rise to a new trend of logic-based NN inference relying on fixed-function combinational logic (FFCL). This paper presents an innovative optimization methodology for compiling and mapping NNs utilizing FFCL into a logic processor. The presented method maps FFCL blocks to a set of Boolean functions where Boolean operations in each function are mapped to high-performance, low-latency, parallelized processing elements. Graph partitioning and scheduling algorithms are presented to handle FFCL blocks that cannot straightforwardly fit the logic processor. Our experimental evaluations across several datasets and NNs demonstrate the superior performance of our framework in terms of the inference throughput compared to prior art NN accelerators. We achieve 25x higher throughput compared with the XNOR-based accelerator for the VGG16 model, which can be amplified 5x by deploying the graph partitioning and merging algorithms.

Submitted 13 April, 2023; originally announced April 2023.

arXiv:2304.05237 [pdf]

TREBUCHET: Fully Homomorphic Encryption Accelerator for Deep Computation

Authors: David Bruce Cousins, Yuriy Polyakov, Ahmad Al Badawi, Matthew French, Andrew Schmidt, Ajey Jacob, Benedict Reynwar, Kellie Canida, Akhilesh Jaiswal, Clynn Mathew, Homer Gamil, Negar Neda, Deepraj Soni, Michail Maniatakos, Brandon Reagen, Naifeng Zhang, Franz Franchetti, Patrick Brinich, Jeremy Johnson, Patrick Broderick, Mike Franusich, Bo Zhang, Zeming Cheng, Massoud Pedram

Abstract: Secure computation is of critical importance to not only the DoD, but across financial institutions, healthcare, and anywhere personally identifiable information (PII) is accessed. Traditional security techniques require data to be decrypted before performing any computation. When processed on untrusted systems the decrypted data is vulnerable to attacks to extract the sensitive information. To address these vulnerabilities Fully Homomorphic Encryption (FHE) keeps the data encrypted during computation and secures the results, even in these untrusted environments. However, FHE requires a significant amount of computation to perform equivalent unencrypted operations. To be useful, FHE must significantly close the computation gap (within 10x) to make encrypted processing practical. To accomplish this ambitious goal the TREBUCHET project is leading research and development in FHE processing hardware to accelerate deep computations on encrypted data, as part of the DARPA MTO Data Privacy for Virtual Environments (DPRIVE) program. We accelerate the major secure standardized FHE schemes (BGV, BFV, CKKS, FHEW, etc.) at >=128-bit security while integrating with the open-source PALISADE and OpenFHE libraries currently used in the DoD and in industry. We utilize a novel tile-based chip design with highly parallel ALUs optimized for vectorized 128b modulo arithmetic.
The TREBUCHET coprocessor design provides a highly modular, flexible, and extensible FHE accelerator for easy reconfiguration, deployment, integration and application on other hardware form factors, such as System-on-Chip or alternate chip areas.

Submitted 18 April, 2023; v1 submitted 11 April, 2023; originally announced April 2023.

Comments: 6 pages, 5 figures and 2 tables

arXiv:2303.17118 [pdf, other]

RPU: The Ring Processing Unit

Authors: Deepraj Soni, Negar Neda, Naifeng Zhang, Benedict Reynwar, Homer Gamil, Benjamin Heyman, Mohammed Nabeel, Ahmad Al Badawi, Yuriy Polyakov, Kellie Canida, Massoud Pedram, Michail Maniatakos, David Bruce Cousins, Franz Franchetti, Matthew French, Andrew Schmidt, Brandon Reagen

Abstract: Ring-Learning-with-Errors (RLWE) has emerged as the foundation of many important techniques for improving security and privacy, including homomorphic encryption and post-quantum cryptography. While promising, these techniques have received limited use due to their extreme overheads of running on general-purpose machines. In this paper, we present a novel vector Instruction Set Architecture (ISA) and microarchitecture for accelerating the ring-based computations of RLWE. The ISA, named B512, is developed to meet the needs of ring processing workloads while balancing high-performance and general-purpose programming support. Having an ISA rather than fixed hardware facilitates continued software improvement post-fabrication and the ability to support the evolving workloads. We then propose the ring processing unit (RPU), a high-performance, modular implementation of B512. The RPU has native large word modular arithmetic support, capabilities for very wide parallel processing, and a large capacity high-bandwidth scratchpad to meet the needs of ring processing. We address the challenges of programming the RPU using a newly developed SPIRAL backend. A configurable simulator is built to characterize design tradeoffs and quantify performance. The best performing design was implemented in RTL and used to validate simulator performance. In addition to our characterization, we show that an RPU using 20.5 mm² of GF 12 nm can provide a speedup of 1485x over a CPU running a 64k, 128-bit NTT, a core RLWE workload.

Submitted 13 April, 2023; v1 submitted 29 March, 2023; originally announced March 2023.

arXiv:2303.02331 [pdf, other]

Training-Free Acceleration of ViTs with Delayed Spatial Merging

Authors: Jung Hwan Heo, Seyedarmin Azizi, Arash Fayyazi, Massoud Pedram

Abstract: Token merging has emerged as a new paradigm that can accelerate the inference of Vision Transformers (ViTs) without any retraining or fine-tuning. To push the frontier of training-free acceleration in ViTs, we improve token merging by adding the perspectives of 1) activation outliers and 2) hierarchical representations. Through a careful analysis of the attention behavior in ViTs, we characterize a delayed onset of the convergent attention phenomenon, which makes token merging undesirable in the bottom blocks of ViTs. Moreover, we augment token merging with a hierarchical processing scheme to capture multi-scale redundancy between visual tokens. Combining these two insights, we build a unified inference framework called DSM: Delayed Spatial Merging. We extensively evaluate DSM on various ViT model scales (Tiny to Huge) and tasks (ImageNet-1k and transfer learning), achieving up to 1.8x FLOP reduction and 1.6x throughput speedup at a negligible loss while being two orders of magnitude faster than existing methods.

Submitted 1 July, 2024; v1 submitted 4 March, 2023; originally announced March 2023.

Comments: ICML 2024 ES-FoMo Workshop

arXiv:2208.13850 [pdf]

AMR-MUL: An Approximate Maximally Redundant Signed Digit Multiplier

Authors: Saba Amanollahi, Mehdi Kamal, Ali Afzali-Kusha, Massoud Pedram

Abstract: In this paper, we present an energy-efficient, yet high-speed approximate maximally redundant signed digit (MRSD) multiplier (called AMR-MUL) based on a parallel structure. For the reduction stage, we suggest several approximate Full-Adder (FA) reduction cells with average positive and negative errors obtained by simplifying the structure of an exact FA cell. The optimum selection of these cells for each partial product reduction stage provides the lowest possible error, turning this task into a design space exploration problem. We also provide a branch-and-bound design space exploration algorithm to find the optimal assignment of reduction cells based on a predefined constraint (i.e., the width of the approximate part) by the user. The effectiveness of the proposed (Radix-16) multiplier design is assessed under different digit counts and approximate border column. The results show that the energy consumption of the MRSD multiplier is reduced by 7x at the cost of a 1.6% accuracy loss.

Submitted 29 August, 2022; originally announced August 2022.

arXiv:2208.08547 [pdf, other]

Better Than Worst-Case Decoding for Quantum Error Correction

Authors: Gokul Subramanian Ravi, Jonathan M. Baker, Arash Fayyazi, Sophia Fuhui Lin, Ali Javadi-Abhari, Massoud Pedram, Frederic T. Chong

Abstract: The overheads of classical decoding for quantum error correction on superconducting quantum systems grow rapidly with the number of logical qubits and their correction code distance. Decoding at room temperature is bottle-necked by refrigerator I/O bandwidth while cryogenic on-chip decoding is limited by area/power/thermal budget. To overcome these overheads, we are motivated by the observation that in the common case, error signatures are fairly trivial with high redundancy/sparsity, since the error correction codes are over-provisioned to correct for uncommon worst-case complex scenarios (to ensure substantially low logical error rates). If suitably exploited, these trivial signatures can be decoded and corrected with insignificant overhead, thereby alleviating the bottlenecks described above, while still handling the worst-case complex signatures by state-of-the-art means. Our proposal, targeting Surface Codes, consists of: 1) Clique: A lightweight decoder for decoding and correcting trivial common-case errors, designed for the cryogenic domain. The decoder is implemented for SFQ logic. 2) A statistical confidence-based technique for off-chip decoding bandwidth allocation, to efficiently handle rare complex decodes which are not covered by the on-chip decoder. 3) A method for stalling circuit execution, for the worst-case scenarios in which the provisioned off-chip bandwidth is insufficient to complete all requested off-chip decodes.
In all, our proposal enables 70-99+% off-chip bandwidth elimination across a range of logical and physical error rates, without significantly sacrificing the accuracy of state-of-the-art off-chip decoding. By doing so, it achieves 10-10000x bandwidth reduction over prior off-chip bandwidth reduction techniques. Furthermore, it achieves a 15-37x resource overhead reduction compared to prior on-chip-only decoding.

Submitted 25 October, 2022; v1 submitted 17 August, 2022; originally announced August 2022.

Comments: To appear at the 28th Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2023)

arXiv:2208.00302 [pdf]

Efficient Compilation and Mapping of Fixed Function Combinational Logic onto Digital Signal Processors Targeting Neural Network Inference and Utilizing High-level Synthesis

Authors: Soheil Nazar Shahsavani, Arash Fayyazi, Mahdi Nazemi, Massoud Pedram

Abstract: Recent efforts for improving the performance of neural network (NN) accelerators that meet today's application requirements have given rise to a new trend of logic-based NN inference relying on fixed function combinational logic. Mapping such large Boolean functions with many input variables and product terms to digital signal processors (DSPs) on Field-programmable gate arrays (FPGAs) needs a novel framework considering the structure and the reconfigurability of DSP blocks during this process. The proposed methodology in this paper maps the fixed function combinational logic blocks to a set of Boolean functions where Boolean operations corresponding to each function are mapped to DSP devices rather than look-up tables (LUTs) on the FPGAs to take advantage of the high performance, low latency, and parallelism of DSP blocks. This paper also presents an innovative design and optimization methodology for compilation and mapping of NNs, utilizing fixed function combinational logic to DSPs on FPGAs employing high-level synthesis flow. Our experimental evaluations across several datasets and selected NNs demonstrate the comparable performance of our framework in terms of the inference latency and output accuracy compared to prior art FPGA-based NN accelerators employing DSPs.

Submitted 30 July, 2022; originally announced August 2022.

Comments: 25 pages, 10 figures. Under review

arXiv:2207.00068 [pdf, other]

doi 10.1145/3531437.3539715

Sparse Periodic Systolic Dataflow for Lowering Latency and Power Dissipation of Convolutional Neural Network Accelerators

Authors: Jung Hwan Heo, Arash Fayyazi, Amirhossein Esmaili, Massoud Pedram

Abstract: This paper introduces the sparse periodic systolic (SPS) dataflow, which advances the state-of-the-art hardware accelerator for supporting lightweight neural networks. Specifically, the SPS dataflow enables a novel hardware design approach unlocked by an emergent pruning scheme, periodic pattern-based sparsity (PPS). By exploiting the regularity of PPS, our sparsity-aware compiler optimally reorders the weights and uses a simple indexing unit in hardware to create matches between the weights and activations. Through the compiler-hardware codesign, SPS dataflow enjoys higher degrees of parallelism while being free of the high indexing overhead and without model accuracy loss. Evaluated on popular benchmarks such as VGG and ResNet, the SPS dataflow and accompanying neural network compiler outperform prior work in convolutional neural network (CNN) accelerator designs targeting FPGA devices. Against other sparsity-supporting weight storage formats, SPS results in 4.49x energy efficiency gain while lowering storage requirements by 3.67x for total weight storage (non-pruned weights plus indexing) and 22,044x for indexing memory.

Submitted 30 June, 2022; originally announced July 2022.

Comments: 6 pages, Published in ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED) 2022

arXiv:2204.00426 [pdf, other]

A Fast and Efficient Conditional Learning for Tunable Trade-Off between Accuracy and Robustness

Authors: Souvik Kundu, Sairam Sundaresan, Massoud Pedram, Peter A. Beerel

Abstract: Existing models that achieve state-of-the-art (SOTA) performance on both clean and adversarially-perturbed images rely on convolution operations conditioned with feature-wise linear modulation (FiLM) layers. These layers require many new parameters and are hyperparameter sensitive. They significantly increase training time, memory cost, and potential latency which can prove costly for resource-limited or real-time applications. In this paper, we present a fast learnable once-for-all adversarial training (FLOAT) algorithm, which instead of the existing FiLM-based conditioning, presents a unique weight conditioned learning that requires no additional layer, thereby incurring no significant increase in parameter count, training time, or network latency compared to standard adversarial training. In particular, we add configurable scaled noise to the weight tensors that enables a trade-off between clean and adversarial performance. Extensive experiments show that FLOAT can yield SOTA performance improving both clean and perturbed image classification by up to ~6% and ~10%, respectively. Moreover, real hardware measurement shows that FLOAT can reduce the training time by up to 1.43x with up to 1.47x fewer model parameters on iso-hyperparameter settings compared to the FiLM-based alternatives.
Additionally, to further improve memory efficiency we introduce FLOAT sparse (FLOATS), a form of non-iterative model pruning, and provide detailed empirical analysis to provide a three-way accuracy-robustness-complexity trade-off for this new class of pruned conditionally trained models.

Submitted 28 March, 2022; originally announced April 2022.

Comments: 14 pages, 10 figures, 1 table

arXiv:2112.13843 [pdf, other]

BMPQ: Bit-Gradient Sensitivity Driven Mixed-Precision Quantization of DNNs from Scratch

Authors: Souvik Kundu, Shikai Wang, Qirui Sun, Peter A. Beerel, Massoud Pedram

Abstract: Large DNNs with mixed-precision quantization can achieve ultra-high compression while retaining high classification performance. However, because of the challenges in finding an accurate metric that can guide the optimization process, these methods either sacrifice significant performance compared to the 32-bit floating-point (FP-32) baseline or rely on a compute-expensive, iterative training policy that requires the availability of a pre-trained baseline. To address this issue, this paper presents BMPQ, a training method that uses bit gradients to analyze layer sensitivities and yield mixed-precision quantized models. BMPQ requires a single training iteration but does not need a pre-trained baseline. It uses an integer linear program (ILP) to dynamically adjust the precision of layers during training, subject to a fixed hardware budget. To evaluate the efficacy of BMPQ, we conduct extensive experiments with VGG16 and ResNet18 on CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. Compared to the baseline FP-32 models, BMPQ can yield models that have 15.4x fewer parameter bits with a negligible drop in accuracy. Compared to the SOTA "during training", mixed-precision training scheme, our models are 2.1x, 2.2x, and 2.9x smaller, on CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively, with an improved accuracy of up to 14.54%.

Submitted 23 December, 2021; originally announced December 2021.

Comments: 4 pages, 2 figures, 2 tables

arXiv:2110.11417 [pdf, other]

HIRE-SNN: Harnessing the Inherent Robustness of Energy-Efficient Deep Spiking Neural Networks by Training with Crafted Input Noise

Authors: Souvik Kundu, Massoud Pedram, Peter A. Beerel

Abstract: Low-latency deep spiking neural networks (SNNs) have become a promising alternative to conventional artificial neural networks (ANNs) because of their potential for increased energy efficiency on event-driven neuromorphic hardware. Neural networks, including SNNs, however, are subject to various adversarial attacks and must be trained to remain resilient against such attacks for many applications. Nevertheless, due to prohibitively high training costs associated with SNNs, the analysis and optimization of deep SNNs under various adversarial attacks have been largely overlooked. In this paper, we first present a detailed analysis of the inherent robustness of low-latency SNNs against popular gradient-based attacks, namely fast gradient sign method (FGSM) and projected gradient descent (PGD). Motivated by this analysis, to harness the model robustness against these attacks we present an SNN training algorithm that uses crafted input noise and incurs no additional training time. To evaluate the merits of our algorithm, we conducted extensive experiments with variants of VGG and ResNet on both CIFAR-10 and CIFAR-100 datasets. Compared to standard trained direct input SNNs, our trained models yield improved classification accuracy of up to 13.7% and 10.1% on FGSM and PGD attack-generated images, respectively, with negligible loss in clean image accuracy.
Our models also outperform inherently robust SNNs trained on rate-coded inputs with improved or similar classification performance on attack-generated images while having up to 25x and 4.6x lower latency and computation energy, respectively.

Submitted 6 October, 2021; originally announced October 2021.

Comments: 10 pages, 11 figures, 7 tables, International Conference on Computer Vision

arXiv:2109.09351 [pdf, other]

An Enhanced Differential Evolution Algorithm Using a Novel Clustering-based Mutation Operator

Authors: Seyed Jalaleddin Mousavirad, Gerald Schaefer, Iakov Korovin, Mahshid Helali Moghadam, Mehrdad Saadatmand, Mahdi Pedram

Abstract: Differential evolution (DE) is an effective population-based metaheuristic algorithm for solving complex optimisation problems. However, the performance of DE is sensitive to the mutation operator. In this paper, we propose a novel DE algorithm, Clu-DE, that improves the efficacy of DE using a novel clustering-based mutation operator. First, we find, using a clustering algorithm, a winner cluster in search space and select the best candidate solution in this cluster as the base vector in the mutation operator. Then, an updating scheme is introduced to include new candidate solutions in the current population. Experimental results on CEC-2017 benchmark functions with dimensionalities of 30, 50 and 100 confirm that Clu-DE yields improved performance compared to DE.

Submitted 20 September, 2021; originally announced September 2021.

Comments: 6 pages, IEEE International Conference on Systems, Man, and Cybernetics (SMC 2021)

arXiv:2107.12445 [pdf, other]

Towards Low-Latency Energy-Efficient Deep SNNs via Attention-Guided Compression

Authors: Souvik Kundu, Gourav Datta, Massoud Pedram, Peter A. Beerel

Abstract: Deep spiking neural networks (SNNs) have emerged as a potential alternative to traditional deep learning frameworks, due to their promise to provide increased compute efficiency on event-driven neuromorphic hardware. However, to perform well on complex vision applications, most SNN training frameworks yield large inference latency which translates to increased spike activity and reduced energy efficiency. Hence, minimizing average spike activity while preserving accuracy in deep SNNs remains a significant challenge and opportunity. This paper presents a non-iterative SNN training technique that achieves ultra-high compression with reduced spiking activity while maintaining high inference accuracy. In particular, our framework first uses the attention-maps of an uncompressed meta-model to yield compressed ANNs. This step can be tuned to support both irregular and structured channel pruning to leverage computational benefits over a broad range of platforms. The framework then performs sparse-learning-based supervised SNN training using direct inputs. During the training, it jointly optimizes the SNN weight, threshold, and leak parameters to drastically minimize the number of time steps required while retaining compression. To evaluate the merits of our approach, we performed experiments with variants of VGG and ResNet, on both CIFAR-10 and CIFAR-100, and VGG16 on Tiny-ImageNet. The SNN models generated through the proposed technique yield SOTA compression ratios of up to 33.4x with no significant drops in accuracy compared to baseline unpruned counterparts. Compared to existing SNN pruning methods, we achieve up to 8.3x higher compression with improved accuracy.

Submitted 16 July, 2021; originally announced July 2021.

Comments: 10 Pages, 8 Figures, 5 Tables

arXiv:2104.05421 [pdf, other]

NullaNet Tiny: Ultra-low-latency DNN Inference Through Fixed-function Combinational Logic

Authors: Mahdi Nazemi, Arash Fayyazi, Amirhossein Esmaili, Atharva Khare, Soheil Nazar Shahsavani, Massoud Pedram

Abstract: While there is a large body of research on efficient processing of deep neural networks (DNNs), ultra-low-latency realization of these models for applications with stringent, sub-microsecond latency requirements continues to be an unresolved, challenging problem. Field-programmable gate array (FPGA)-based DNN accelerators are gaining traction as a serious contender to replace graphics processing unit/central processing unit-based platforms considering their performance, flexibility, and energy efficiency. This paper presents NullaNet Tiny, an across-the-stack design and optimization framework for constructing resource and energy-efficient, ultra-low-latency FPGA-based neural network accelerators. The key idea is to replace expensive operations required to compute various filter/neuron functions in a DNN with Boolean logic expressions that are mapped to the native look-up tables (LUTs) of the FPGA device (examples of such operations are multiply-and-accumulate and batch normalization). At about the same level of classification accuracy, compared to Xilinx's LogicNets, our design achieves 2.36$\times$ lower latency and 24.42$\times$ lower LUT utilization.

Submitted 6 April, 2021; originally announced April 2021.

arXiv:2102.11651 [pdf]

A Novel Deep Learning Method for Textual Sentiment Analysis

Authors: Hossein Sadr, Mozhdeh Nazari Solimandarabi, Mir Mohsen Pedram, Mohammad Teshnehlab

Abstract: Sentiment analysis is known as one of the most crucial tasks in the field of natural language processing, and the Convolutional Neural Network (CNN) is one of the prominent models commonly used for this aim. Although convolutional neural networks have obtained remarkable results in recent years, they are still confronted with some limitations. Firstly, they consider that all words in a sentence contribute equally to the sentence meaning representation and are not able to extract informative words. Secondly, they require a large amount of training data to obtain considerable results while they have many parameters that must be accurately adjusted. To this end, a convolutional neural network integrated with a hierarchical attention layer is proposed which is able to extract informative words and assign them higher weight. Moreover, the effect of transfer learning that transfers knowledge learned in the source domain to the target domain with the aim of improving the performance is also explored. Based on the empirical results, the proposed model not only achieves higher classification accuracy and extracts informative words, but its classification performance is also significantly enhanced by applying incremental transfer learning.

Submitted 23 February, 2021; originally announced February 2021.

arXiv:2101.09693 [pdf]

doi 10.1109/TNNLS.2022.3148818

A2P-MANN: Adaptive Attention Inference Hops Pruned Memory-Augmented Neural Networks

Authors: Mohsen Ahmadzadeh, Mehdi Kamal, Ali Afzali-Kusha, Massoud Pedram

Abstract: In this work, to limit the number of required attention inference hops in memory-augmented neural networks, we propose an online adaptive approach called A2P-MANN. By exploiting a small neural network classifier, an adequate number of attention inference hops for the input query is determined. The technique results in elimination of a large number of unnecessary computations in extracting the correct answer. In addition, to further lower computations in A2P-MANN, we suggest pruning weights of the final FC (fully-connected) layers. To this end, two pruning approaches, one with negligible accuracy loss and the other with controllable loss on the final accuracy, are developed. The efficacy of the technique is assessed by using the twenty question-answering (QA) tasks of bAbI dataset. The analytical assessment reveals, on average, more than 42% fewer computations compared to the baseline MANN at the cost of less than 1% accuracy loss. In addition, when used along with the previously published zero-skipping technique, a computation count reduction of up to 68% is achieved. Finally, when the proposed approach (without zero-skipping) is implemented on the CPU and GPU platforms, up to 43% runtime reduction is achieved.

Submitted 23 February, 2022; v1 submitted 24 January, 2021; originally announced January 2021.

Comments: 12 pages, 12 figures, 5 tables

arXiv:2101.02667 [pdf]

BRDS: An FPGA-based LSTM Accelerator with Row-Balanced Dual-Ratio Sparsification

Authors: Seyed Abolfazl Ghasemzadeh, Erfan Bank Tavakoli, Mehdi Kamal, Ali Afzali-Kusha, Massoud Pedram

Abstract: In this paper, first, a hardware-friendly pruning algorithm for reducing energy consumption and improving the speed of Long Short-Term Memory (LSTM) neural network accelerators is presented. Next, an FPGA-based platform for efficient execution of the pruned networks based on the proposed algorithm is introduced. By considering the sensitivity of two weight matrices of the LSTM models in pruning, different sparsity ratios (i.e., dual-ratio sparsity) are applied to these weight matrices. To reduce memory accesses, a row-wise sparsity pattern is adopted. The proposed hardware architecture makes use of computation overlapping and pipelining to achieve low power and high speed. The effectiveness of the proposed pruning algorithm and accelerator is assessed under some benchmarks for natural language processing, binary sentiment classification, and speech recognition. Results show that, e.g., compared to a recently published work in this field, the proposed accelerator could provide up to 272% higher effective GOPS/W and the perplexity error is reduced by up to 1.4% for the PTB dataset.

Submitted 7 January, 2021; originally announced January 2021.

Comments: 8 pages, 9 figures, 2 tables

arXiv:2011.03083 [pdf, other]

A Tunable Robust Pruning Framework Through Dynamic Network Rewiring of DNNs

Authors: Souvik Kundu, Mahdi Nazemi, Peter A. Beerel, Massoud Pedram

Abstract: This paper presents a dynamic network rewiring (DNR) method to generate pruned deep neural network (DNN) models that are robust against adversarial attacks yet maintain high accuracy on clean images. In particular, the disclosed DNR method is based on a unified constrained optimization formulation using a hybrid loss function that merges ultra-high model compression with robust adversarial training. This training strategy dynamically adjusts inter-layer connectivity based on per-layer normalized momentum computed from the hybrid loss function. In contrast to existing robust pruning frameworks that require multiple training iterations, the proposed learning strategy achieves an overall target pruning ratio with only a single training iteration and can be tuned to support both irregular and structured channel pruning. To evaluate the merits of DNR, experiments were performed with two widely accepted models, namely VGG16 and ResNet-18, on CIFAR-10, CIFAR-100 as well as with VGG16 on Tiny-ImageNet. Compared to the baseline uncompressed models, DNR provides over 20x compression on all the datasets with no significant drop in either clean or adversarial classification accuracy. Moreover, our experiments show that DNR consistently finds compressed models with better clean and adversarial image classification performance than what is achievable through state-of-the-art alternatives.

Submitted 24 November, 2020; v1 submitted 3 November, 2020; originally announced November 2020.

Comments: 8 pages, 4 figures, conference paper

arXiv:2007.15222 [pdf, other]

SynergicLearning: Neural Network-Based Feature Extraction for Highly-Accurate Hyperdimensional Learning

Authors: Mahdi Nazemi, Amirhossein Esmaili, Arash Fayyazi, Massoud Pedram

Abstract: Machine learning models differ in terms of accuracy, computational/memory complexity, training time, and adaptability among other characteristics. For example, neural networks (NNs) are well-known for their high accuracy due to the quality of their automatic feature extraction while brain-inspired hyperdimensional (HD) learning models are famous for their quick training, computational efficiency, and adaptability. This work presents a hybrid, synergic machine learning model that excels at all the said characteristics and is suitable for incremental, on-line learning on a chip. The proposed model comprises an NN and a classifier. The NN acts as a feature extractor and is specifically trained to work well with the classifier that employs the HD computing framework. This work also presents a parameterized hardware implementation of the said feature extraction and classification components while introducing a compiler that maps any arbitrary NN and/or classifier to the aforementioned hardware. The proposed hybrid machine learning model has the same level of accuracy (i.e. $\pm$1%) as NNs while achieving at least 10% improvement in accuracy compared to HD learning models. Additionally, the end-to-end hardware realization of the hybrid model improves power efficiency by 1.60x compared to state-of-the-art, high-performance HD learning implementations while improving latency by 2.13x. These results have profound implications for the application of such synergic models in challenging cognitive tasks.

Submitted 4 August, 2020; v1 submitted 30 July, 2020; originally announced July 2020.

arXiv:2007.01465 [pdf, other]

doi 10.1145/3370748.3406555

Deep-PowerX: A Deep Learning-Based Framework for Low-Power Approximate Logic Synthesis

Authors: Ghasem Pasandi, Mackenzie Peterson, Moises Herrera, Shahin Nazarian, Massoud Pedram

Abstract: This paper aims at integrating three powerful techniques, namely Deep Learning, Approximate Computing, and Low Power Design, into a strategy to optimize logic at the synthesis level. We utilize advances in deep learning to guide an approximate logic synthesis engine to minimize the dynamic power consumption of a given digital CMOS circuit, subject to a predetermined error rate at the primary outputs. Our framework, Deep-PowerX, focuses on replacing or removing gates on a technology-mapped network and uses a Deep Neural Network (DNN) to predict error rates at primary outputs of the circuit when a specific part of the netlist is approximated. The primary goal of Deep-PowerX is to reduce the dynamic power whereas area reduction serves as a secondary objective. Using the said DNN, Deep-PowerX is able to reduce the exponential time complexity of standard approximate logic synthesis to linear time. Experiments are done on numerous open source benchmark circuits. Results show significant reduction in power and area by up to 1.47 times and 1.43 times compared to exact solutions and by up to 22% and 27% compared to state-of-the-art approximate logic synthesis tools while having orders of magnitude lower run-time.

Submitted 2 July, 2020; originally announced July 2020.

arXiv:2006.03269 [pdf, other]

HIPE-MAGIC: A Technology-Aware Synthesis and Mapping Flow for HIghly Parallel Execution of Memristor-Aided LoGIC

Authors: Arash Fayyazi, Amirhossein Esmaili, Massoud Pedram

Abstract: Recent efforts for finding novel computing paradigms that meet today's design requirements have given rise to a new trend of processing-in-memory relying on non-volatile memories. In this paper, we present HIPE-MAGIC, a technology-aware synthesis and mapping flow for highly parallel execution of the memristor-based logic. Our framework is built upon two fundamental contributions: balancing techniques during the logic synthesis, mainly targeting benefits of the parallelism offered by memristive crossbar arrays (MCAs), and an efficient technology mapping framework to maximize the performance and area-efficiency of the memristor-based logic. Our experimental evaluations across several benchmark suites demonstrate the superior performance of HIPE-MAGIC in terms of throughput and energy efficiency compared to recently developed synthesis and mapping flows targeting MCAs, as well as the conventional CPU computing.

Submitted 5 June, 2020; originally announced June 2020.

arXiv:2005.13735 [pdf]

Logic Verification of Ultra-Deep Pipelined Beyond-CMOS Technologies

Authors: Arash Fayyazi, Shahin Nazarian, Massoud Pedram

Abstract: Traditional logical equivalence checking (LEC), which plays a major role in the entire chip design process, faces challenges in meeting the requirements demanded by the many emerging technologies that are based on logic models different from standard complementary metal oxide semiconductor (CMOS). In this paper, we propose an LEC framework to be employed in the verification process of beyond-CMOS circuits. Our LEC framework is compatible with existing CMOS technologies, but is also able to check features and capabilities that are unique to beyond-CMOS technologies. For instance, the performance of some emerging technologies benefits from ultra-deep pipelining, and verification of such circuits requires new models and algorithms. We, therefore, present the Multi-Cycle Input Dependency (MCID) circuit model, which is a novel model representation of design to explicitly capture the dependency of primary outputs of the circuit on sequences of internal signals and inputs. Embedding the proposed circuit model and several structural checking modules, the process of verification can be independent of the underlying technology and signaling. We benchmark the proposed framework on post-synthesis rapid single-flux-quantum (RSFQ) netlists. Results show a comparative verification time of RSFQ circuit benchmark including 32-bit Kogge-Stone adder, 16-bit integer divider, and ISCAS'85 circuits with respect to ABC tool for similar CMOS circuits.

Submitted 27 May, 2020; originally announced May 2020.

Comments: 10 pages, 8 figures, 3 tables

arXiv:2002.05292 [pdf, other]

NN-PARS: A Parallelized Neural Network Based Circuit Simulation Framework

Authors: Mohammad Saeed Abrishami, Hao Ge, Justin F. Calderon, Massoud Pedram, Shahin Nazarian

Abstract: The shrinking of transistor geometries as well as the increasing complexity of integrated circuits significantly aggravate nonlinear design behavior. This demands accurate and fast circuit simulation to meet the design quality and time-to-market constraints. The existing circuit simulators which utilize lookup tables and/or closed-form expressions are either slow or inaccurate in analyzing the nonlinear behavior of designs with billions of transistors. To address these shortcomings, we present NN-PARS, a neural network (NN) based and parallelized circuit simulation framework with optimized event-driven scheduling of simulation tasks to maximize concurrency, according to the underlying GPU parallel processing capabilities. NN-PARS replaces the required memory queries in traditional techniques with parallelized NN-based computation tasks. Experimental results show that compared to a state-of-the-art current-based simulation method, NN-PARS reduces the simulation time by over two orders of magnitude in large circuits. NN-PARS also provides high accuracy levels in signal waveform calculations, with less than $2\%$ error compared to HSPICE.

Submitted 12 February, 2020; originally announced February 2020.

arXiv:2002.05291 [pdf, other]

CSM-NN: Current Source Model Based Logic Circuit Simulation -- A Neural Network Approach

Authors: Mohammad Saeed Abrishami, Massoud Pedram, Shahin Nazarian

Abstract: The miniaturization of transistors down to 5nm and beyond, plus the increasing complexity of integrated circuits, significantly aggravate short channel effects, and demand analysis and optimization of more design corners and modes. Simulators need to model output variables related to circuit timing, power, noise, etc., which exhibit nonlinear behavior. The existing simulation and sign-off tools, based on a combination of closed-form expressions and lookup tables are either inaccurate or slow, when dealing with circuits with more than billions of transistors. In this work, we present CSM-NN, a scalable simulation framework with optimized neural network structures and processing algorithms. CSM-NN is aimed at optimizing the simulation time by accounting for the latency of the required memory query and computation, given the underlying CPU and GPU parallel processing capabilities. Experimental results show that CSM-NN reduces the simulation time by up to $6\times$ compared to a state-of-the-art current source model based simulator running on a CPU. This speedup improves by up to $15\times$ when running on a GPU. CSM-NN also provides high accuracy levels, with less than $2\%$ error, compared to HSPICE.

Submitted 12 February, 2020; originally announced February 2020.

Comments: 37th IEEE International Conference on Computer Design (ICCD), 2019

arXiv:2002.04776 [pdf, other]

Efficient Training of Deep Convolutional Neural Networks by Augmentation in Embedding Space

Authors: Mohammad Saeed Abrishami, Amir Erfan Eshratifar, David Eigen, Yanzhi Wang, Shahin Nazarian, Massoud Pedram

Abstract: Recent advances in the field of artificial intelligence have been made possible by deep neural networks. In applications where data are scarce, transfer learning and data augmentation techniques are commonly used to improve the generalization of deep learning models. However, fine-tuning a transfer model with data augmentation in the raw input space has a high computational cost to run the full network for every augmented input. This is particularly critical when large models are implemented on embedded devices with limited computational and energy resources. In this work, we propose a method that replaces the augmentation in the raw input space with an approximate one that acts purely in the embedding space. Our experimental results show that the proposed method drastically reduces the computation, while the accuracy of models is negligibly compromised.

Submitted 11 February, 2020; originally announced February 2020.

arXiv:2001.10715 [pdf, other]

qBSA: Logic Design of a 32-bit Block-Skewed RSFQ Arithmetic Logic Unit

Authors: Souvik Kundu, Gourav Datta, Peter A. Beerel, Massoud Pedram

Abstract: Single flux quantum (SFQ) circuits are an attractive beyond-CMOS technology because they promise two orders of magnitude lower power at clock frequencies exceeding 25 GHz. However, every SFQ gate is clocked, creating very deep gate-level pipelines that are difficult to keep full, particularly for sequences that include data-dependent operations. This paper proposes to increase the throughput of SFQ pipelines by re-designing the datapath to accept and operate on least-significant bits (LSBs) clock cycles earlier than more significant bits. This skewed datapath approach reduces the latency of the LSB side, which can be fed back earlier for use in subsequent data-dependent operations, increasing their throughput. In particular, we propose to group the bits into 4-bit blocks that are operated on concurrently and create block-skewed datapath units for 32-bit operation. This skewed approach allows a subsequent data-dependent operation to start evaluating as soon as the first 4-bit block completes. Using this general approach, we develop a block-skewed MIPS-compatible 32-bit ALU. Our gate-level Verilog design improves the throughput of 32-bit data-dependent operations by 2x and 1.5x compared to previously proposed 4-bit bit-slice and 32-bit Ladner-Fischer ALUs, respectively.

Submitted 29 January, 2020; originally announced January 2020.

Comments: 3 pages, 3 figures

arXiv:2001.10710 [pdf, other]

Pre-defined Sparsity for Low-Complexity Convolutional Neural Networks

Authors: Souvik Kundu, Mahdi Nazemi, Massoud Pedram, Keith M. Chugg, Peter A. Beerel

Abstract: The high energy cost of processing deep convolutional neural networks impedes their ubiquitous deployment in energy-constrained platforms such as embedded systems and IoT devices. This work introduces convolutional layers with pre-defined sparse 2D kernels that have support sets that repeat periodically within and across filters. Due to the efficient storage of our periodic sparse kernels, the parameter savings can translate into considerable improvements in energy efficiency due to reduced DRAM accesses, thus promising significant improvements in the trade-off between energy consumption and accuracy for both training and inference. To evaluate this approach, we performed experiments with two widely accepted datasets, CIFAR-10 and Tiny ImageNet, in sparse variants of the ResNet18 and VGG16 architectures. Compared to baseline models, our proposed sparse variants require up to 82% fewer model parameters with 5.6 times fewer FLOPs with negligible loss in accuracy for ResNet18 on CIFAR-10. For VGG16 trained on Tiny ImageNet, our approach requires 5.8 times fewer FLOPs and up to 83.3% fewer model parameters with a drop in top-5 (top-1) accuracy of only 1.2% (2.1%). We also compared the performance of our proposed architectures with that of ShuffleNet and MobileNetV2. Using similar hyperparameters and FLOPs, our ResNet18 variants yield an average accuracy improvement of 2.8%.

Submitted 4 February, 2020; v1 submitted 29 January, 2020; originally announced January 2020.

Comments: 14 pages, 13 figures

arXiv:2001.05870 [pdf, other]

Runtime Deep Model Multiplexing for Reduced Latency and Energy Consumption Inference

Authors: Amir Erfan Eshratifar, Massoud Pedram

Abstract: We propose a learning algorithm to design a light-weight neural multiplexer that, given the input and computational resource requirements, calls the model that will consume the minimum compute resources for a successful inference. Mobile devices can use the proposed algorithm to offload the hard inputs to the cloud while inferring the easy ones locally. Besides, in large-scale cloud-based intelligent applications, instead of replicating the most-accurate model, a range of small and large models can be multiplexed depending on the input's complexity, which will save the cloud's computational resources. The input complexity or hardness is determined by the number of models that can predict the correct label. For example, if no model can predict the label correctly, then the input is considered the hardest. The proposed algorithm allows the mobile device to detect the inputs that can be processed locally and the ones that require a larger model and should be sent to a cloud server. Therefore, the mobile user benefits not only from the local processing but also from an accurate model hosted on a cloud server. Our experimental results show that the proposed algorithm improves the mobile model's accuracy by 8.52%, which is because of those inputs that are properly selected and offloaded to the cloud server. In addition, it saves the cloud providers' compute resources by a factor of 2.85x as small models are chosen for easier inputs.

Submitted 17 September, 2020; v1 submitted 14 January, 2020; originally announced January 2020.

arXiv:1912.05160 [pdf, other]

Energy-aware Scheduling of Jobs in Heterogeneous Cluster Systems Using Deep Reinforcement Learning

Authors: Amirhossein Esmaili, Massoud Pedram

Abstract: Energy consumption is one of the most critical concerns in designing computing devices, ranging from portable embedded systems to computer cluster systems. Furthermore, in the past decade, cluster systems have increasingly risen as popular platforms to run computing-intensive real-time applications in which the performance is of great importance. However, due to different characteristics of real-time workloads, developing general job scheduling solutions that efficiently address both energy consumption and performance in real-time cluster systems is a challenging problem. In this paper, inspired by recent advances in applying deep reinforcement learning for resource management problems, we present the Deep-EAS scheduler that learns efficient energy-aware scheduling strategies for workloads with different characteristics without initially knowing anything about the scheduling task at hand. Results show that Deep-EAS converges quickly, and performs better compared to standard manually-tuned heuristics, especially in heavy load conditions.

Submitted 11 December, 2019; originally announced December 2019.

Comments: Accepted in International Symposium on Quality Electronic Design (ISQED), 2020

Showing 1–50 of 87 results for author: Pedram, M