
Nick Iliev

This article presents a novel low-latency CMOS hardware accelerator for fully connected (FC) layers in deep neural networks (DNNs). The accelerator, FC-Accel, is based on 128 8x8 or 16x16 processing elements (PEs) for matrix-vector multiplication, and 128 multiply-accumulate (MAC) units integrated with 16 High Bandwidth Memory (HBM) stack units for storing the pre-trained weights. A dedicated non-blocking crossbar switch is not used in our low-latency page-bus demultiplexer-based interconnect between the 16 HBMs and the 128-PE array. We show near-linear speedup and reductions in space and time complexity with respect to traditional parallel matrix-vector multiplication with a checkerboard block decomposition algorithm, using a novel matched HBM2 memory subsystem for weight and input-feature storage. We perform 16-bit fixed-point computation on the key kernel for DNN FC-layer computation: an FC kernel with KxM tiles which can be scaled for different FC layer sizes. We have designed a flexible processing element (PE), which implements the scalable kernel in a 1D array of PEs to conserve resources. PE reconfiguration can be done as required by the layer being processed (FC6, FC7, or FC8 in AlexNet or VGG16, for example). Micro-architectural details for CMOS ASIC implementations are presented, and simulated performance is compared to recent hardware accelerators for DNNs on AlexNet and VGG16. When comparing simulated processing latency for the FC8 layer, FC-Accel achieves 108 GOPS (non-pipelined, with a 100 MHz clock) and 1048 GOPS (pipelined, with a 662 MHz clock), which improves on a recent EIE accelerator quoted at 102 GOPS with an 800 MHz clock and using compression for the same FC8 layer. When compared to Tensaurus, a recent accelerator for sparse-dense tensor computations, FC-Accel (clocked at 662 MHz) delivers a 2.5× increase in throughput over Tensaurus (clocked at 2 GHz) for VGG16 FC8.
The Xilinx Versal-ACAP VC1902 FPGA has an FC8 inferencing latency of 158 usec at 1.33 GHz, which is much slower than FC-Accel's FC8 latency of 8.5 usec. When compared with an NVIDIA Jetson AGX Xavier GPU running inference on VGG-16 FC8, FC-Accel reduces FC8 inferencing latency from the GPU's average of 120 usec to 8.5 usec. Intel's Arria-10 DLA FPGA achieves 26 usec for the VGG16 FC8 layer which is 3 times the latency of the proposed solution.
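The checkerboard tiling described above can be sketched in software: each tile of the weight matrix maps to one PE, which multiplies its block by the matching slice of the input vector, and the partial results are accumulated per output row. This is an illustrative model of the tiling scheme (function name and tile size are ours), not the ASIC datapath.

```python
import numpy as np

def tiled_matvec(W, x, tile=8):
    """Matrix-vector product computed tile-by-tile, mimicking an array of
    tile x tile PEs, each owning one block of the weight matrix.
    Hypothetical sketch; tile=8 matches the 8x8 PE option."""
    M, K = W.shape
    y = np.zeros(M, dtype=W.dtype)
    for i in range(0, M, tile):          # PE row index
        for j in range(0, K, tile):      # PE column index
            # each PE multiplies its weight tile by the matching input slice
            y[i:i + tile] += W[i:i + tile, j:j + tile] @ x[j:j + tile]
    return y
```

Because each tile touches disjoint slices of `y` within a row band, the per-tile products can run in parallel across PEs and be reduced per output row, which is where the near-linear speedup comes from.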
This work discusses a novel low-power digital CMOS architecture for speaker identification (SI) that combines k-means clustering with Gaussian mixture model (GMM) scoring. We show that k-means clustering at the front end reduces the dimensionality of speech features to minimize downstream processing without affecting SI accuracy. Implementation of the cluster generator is discussed, with novel distance-computing and online centroid-update datapaths that minimize the overhead of the clustering layer (CL). The integrated design achieves 6× lower energy than a conventional implementation for SI among ten speakers.
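The pipeline above can be sketched at the algorithm level: a tiny k-means front end compresses the feature frames to a few centroids, and only those centroids are scored against each speaker's GMM. This is a software illustration of the flow (all names, the diagonal-covariance GMM, and the equal mixture weights are our assumptions), not the paper's datapath.

```python
import numpy as np

def kmeans(frames, k=4, iters=10, seed=0):
    """Tiny k-means front end: compress T feature frames to k centroids.
    Illustrative model of the clustering layer (CL)."""
    rng = np.random.default_rng(seed)
    cent = frames[rng.choice(len(frames), k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest centroid (the distance datapath)
        labels = np.argmin(((frames[:, None] - cent[None]) ** 2).sum(-1), axis=1)
        for j in range(k):                      # centroid update step
            if np.any(labels == j):
                cent[j] = frames[labels == j].mean(axis=0)
    return cent

def gmm_loglik(x, means, var):
    """Log-likelihood of points x under a diagonal-covariance GMM with
    equal mixture weights and shared scalar variance (a simplification)."""
    d = x.shape[1]
    sq = ((x[:, None] - means[None]) ** 2).sum(-1)          # (points, mixtures)
    logp = -0.5 * (sq / var + d * np.log(2 * np.pi * var))
    m = logp.max(axis=1, keepdims=True)                     # logsumexp
    return (m.squeeze(1) + np.log(np.exp(logp - m).sum(axis=1))).sum()

def identify(frames, speaker_models, k=4):
    """Score only k centroids instead of all frames, then pick the speaker
    whose GMM explains them best."""
    cent = kmeans(frames, k)
    scores = [gmm_loglik(cent, mu, var) for mu, var in speaker_models]
    return int(np.argmax(scores))
```

The energy saving in the paper comes from the same structural idea: GMM scoring cost scales with the number of vectors scored, so scoring k centroids instead of T frames cuts the downstream work by roughly T/k.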
This work proposes a low power digital block for spatial localization of sensors within the limited area/power budget of sensor nodes. We show a novel digital architecture for a Linear Program (LP) solver based on a recurrent (non-linear feedback) Neural Network (RNN). Training data is not needed in our approach. We solve the primal and dual optimization problems for spatial localization with a single multi-functional datapath which does not require matrix inversions. FPGA and ASIC implementations are presented which target a sensor (microrobot) localization problem in 2D using angle-of-arrival (AOA) measurements. The results show that the estimated locations (2D coordinates) are very close to the ground-truth values in all tested scenarios. The proposed RNN has input and output layers, and a hidden layer with four neurons. An FPGA implementation of the localizer in a 180 nm process dissipates 180 mW of power at 1.5 V and 31.25 MHz. When scaled to 128 neurons, the performance is 13 Mop/sec/W for the same FPGA technology. The design has also been simulated in an ASIC 45 nm (PDK45, 1 V VDD) standard-cell technology. With 128 neurons in the hidden layer, the ASIC consumes 196.9 mW of power at 516 MHz, which is equivalent to 677.165 Mop/sec/W. The proposed design significantly improves computing efficiency over recently published results for similar implementations.
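The idea of an RNN that settles on an LP solution without matrix inversion can be illustrated with a generic discrete-time primal-dual gradient network. This is a sketch, not the paper's architecture: the ReLU-like projections stand in for the non-linear feedback, and we add a small quadratic regularizer (eps) purely to make the simulated dynamics converge cleanly; all names and constants are our assumptions.

```python
import numpy as np

def rnn_lp_solver(c, A, b, steps=2000, alpha=0.2, eps=0.5):
    """Simulate a recurrent (feedback) network whose state settles on the
    solution of  min c^T x  s.t.  A x >= b, x >= 0.
    Primal and dual variables evolve together; no matrix inversion is used."""
    x = np.zeros(len(c))            # primal neurons
    lam = np.zeros(len(b))          # dual neurons (Lagrange multipliers)
    for _ in range(steps):
        # primal update: gradient descent with a ReLU-like projection onto x >= 0
        x = np.maximum(0.0, x - alpha * (c + eps * x - A.T @ lam))
        # dual update: ascent on constraint violation, clipped at zero
        lam = np.maximum(0.0, lam + alpha * (b - A @ x))
    return x
```

Both updates are a handful of MACs and a clamp per neuron, which is why this class of solver maps naturally onto a single multi-functional datapath.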
Finding accurate positions of mobile devices based on visual information involves searching for query-matching images in a very large dataset, typically containing millions of images. Although the main problem is designing a reliable image retrieval engine, accurate localization also depends on a good fusion algorithm between the GPS data (geo-tags) of each query-matching image and the query image. This paper proposes a new method for reliable estimation of the actual query camera position (geo-tag) by applying structure from motion (SFM) with bundle adjustment for sparse 3D camera position reconstruction, and a linear rigid transformation between two different 3D Cartesian coordinate systems. Experimental results on more than 170 query images show that the proposed algorithm returns accurate results for a high percentage of the samples. The error range of the estimated query geo-tag is compared with other related research and indicates an average error of less than 5 meters, which improves on some of the published works.
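The rigid transformation between the two 3D Cartesian frames can be estimated in closed form with the classic Kabsch/Umeyama least-squares alignment. The sketch below shows that step only (mapping, say, SFM camera positions onto geo-tagged positions); it is a generic formulation, not necessarily the exact variant used in the paper.

```python
import numpy as np

def rigid_transform(P, Q):
    """Least-squares rigid transform (R, t) mapping point set P onto Q,
    so Q_i ≈ R @ P_i + t (Kabsch/Umeyama, no scale).
    P and Q are (n, 3) arrays of corresponding points."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                    # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0] * (P.shape[1] - 1) + [d]) @ U.T
    t = cq - R @ cp
    return R, t
```

Given three or more corresponding camera positions in the SFM frame and the geo-tag frame, applying the recovered (R, t) to the reconstructed query camera position yields its geo-tag estimate.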
Rounding the result of a multiplication is used in many fields such as signal processing. To reduce power dissipation it is better to perform the rounding directly inside the multiplier by using truncated multipliers, but truncated multipliers require adding a correction. Two methods already exist: variable correction and constant correction. This paper introduces a new method for truncated multiplication which utilizes a combination of the two previous methods. It also examines the amount of power reduction, presents how the corrections are calculated, and focuses on how these truncated multipliers can be implemented. Simulations show that the hybrid method has the lowest average and mean-square error.
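A bit-level software model makes the truncation idea concrete: the k least-significant partial-product columns are never formed (saving the AND gates and adders for those columns), and a constant equal to the expected value of the discarded bits is added to bound the error. The sketch below shows constant correction only, not the paper's hybrid scheme; the function and its parameters are ours.

```python
def truncated_multiply(a, b, n=8, k=6):
    """Bit-level model of an n x n truncated multiplier: partial-product
    columns below column k are discarded, and a constant correction (the
    expected value of the dropped bits, each bit being 1 with prob. 1/4)
    is added instead."""
    acc = 0
    for i in range(n):
        for j in range(n):
            if i + j >= k:                      # keep only columns >= k
                acc += ((a >> i) & 1) * ((b >> j) & 1) << (i + j)
    # column j (< k) holds j+1 partial-product bits of weight 2^j
    corr = sum((j + 1) * (1 << j) for j in range(min(k, n))) // 4
    return acc + corr
```

With n=8 and k=6, the discarded columns can contribute at most 321 to the true product, so the constant-corrected result stays within a few LSBs of column k of the exact product; the hybrid method in the paper tightens this further by making part of the correction data-dependent.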
... circuits Nick Iliev and James Stine ECE Dept. ... 7. M. Perkowski, T. Luba, R. Lisanke, N. Iliev, P. Burkey, R. Malvi, Z. Wang, and S. Zhou, “Unified Approach to Functional Decomposition of Switching Functions”, Portland State University, Technical report CS95, 1995. ...
Abstract—Spatial localization (colocation) of nodes in wireless sensor networks (WSNs) is an active area of research, with many applications in sensing from distributed systems, such as microaerial vehicles, smart dust sensors, and mobile robotics. This paper provides a comprehensive review and comparison of recent implementations (commercial and academic) of physical measurement techniques used in sensor localization, and of the localization algorithms that use these measurement techniques. Physical methods for measuring distances and angles between WSN nodes are reviewed, followed by a comprehensive comparison of localization accuracy, applicable ranges, node dimensions, and power consumption of the different implementations. A summary of the advantages and disadvantages of each measurement technique is provided, along with a comparison of colocalization methods in WSNs across multiple algorithms and distance ranges. A discussion of possible improvements to the accuracy, range, and power consumption of selected self-localization methods is included in the concluding discussion. Although the preferred implementation depends on the application, required accuracy, and range, passive optical triangulation is reported as the most energy-efficient localization method for low-cost/low-power miniature sensor nodes. It is capable of providing micrometer-level resolution; however, the applicable range (internode distance) is limited to a few centimeters.
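The geometry behind angle-based triangulation is simple enough to state in a few lines: two anchors at known positions each measure a bearing to the same node, and the node lies at the intersection of the two rays. The sketch below is a generic 2D version (names and conventions are ours), included only to make the measurement model concrete.

```python
import math

def triangulate_2d(p1, theta1, p2, theta2):
    """2D angle-of-arrival triangulation: anchors at p1 and p2 each measure
    a bearing (radians from the +x axis) to the same node; return the ray
    intersection. Degenerate (parallel-ray) geometry raises ValueError."""
    d1 = (math.cos(theta1), math.sin(theta1))
    d2 = (math.cos(theta2), math.sin(theta2))
    # solve p1 + s*d1 = p2 + t*d2 for s (2x2 linear system, Cramer's rule)
    det = d1[0] * (-d2[1]) - (-d2[0]) * d1[1]
    if abs(det) < 1e-12:
        raise ValueError("rays are parallel; bearings give no position fix")
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    s = (rx * (-d2[1]) - (-d2[0]) * ry) / det
    return (p1[0] + s * d1[0], p1[1] + s * d1[1])
```

The position error of such a fix grows with range and with the angular error of the bearing sensors, which is consistent with the review's observation that optical triangulation achieves fine resolution only over short internode distances.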
Inversion of a finite field element is the most time-consuming of all field arithmetic operations, which is why it is avoided as much as possible in elliptic curve cryptosystem implementations. Unfortunately, there exist only two methods for performing inversion: the Euclidean algorithm and inversion through multiplication based on Fermat's theorem. VLSI implementations of these methods are examined in detail using TSMC SCN6M 0.18 μm technology in GF(2^163) using a polynomial basis representation. Observations are made comparing the variants of each method, and strategies are presented to improve VLSI implementations.
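Both inversion methods can be demonstrated in a few lines of polynomial-basis GF(2^m) arithmetic. The sketch below uses the small AES field GF(2^8) so the result is easy to check; the same algorithms apply in GF(2^163) with the appropriate field polynomial. All function names are ours.

```python
def gf_mul(a, b, mod, m):
    """Carry-less multiply in GF(2^m), reduced mod the field polynomial
    'mod' (bit i of mod = coefficient of x^i, bit m included)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:              # reduce as soon as degree reaches m
            a ^= mod
    return r

def inv_fermat(a, mod, m):
    """Inversion via Fermat: a^(2^m - 2), by square-and-multiply."""
    e, r = (1 << m) - 2, 1
    while e:
        if e & 1:
            r = gf_mul(r, a, mod, m)
        a = gf_mul(a, a, mod, m)
        e >>= 1
    return r

def inv_euclid(a, mod, m):
    """Inversion via the extended Euclidean algorithm on GF(2) polynomials."""
    u, v, g1, g2 = a, mod, 1, 0
    while u != 1:
        j = u.bit_length() - v.bit_length()
        if j < 0:
            u, v, g1, g2 = v, u, g2, g1
            j = -j
        u ^= v << j             # polynomial division step (XOR = GF(2) subtract)
        g1 ^= g2 << j
    return g1
```

The hardware trade-off the paper examines is visible even here: Fermat inversion is a fixed schedule of m-1 squarings plus multiplications (regular, easy to pipeline), while the Euclidean route needs degree comparisons and variable shifts but far fewer multiplications.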
We present a novel low-latency CMOS hardware accelerator for fully connected (FC) layers in deep neural networks (DNNs). The FC accelerator, FC-ACCL, is based on 128 8x8 or 16x16 processing elements (PEs) for matrix-vector multiplication, and 128 multiply-accumulate (MAC) units integrated with 16 High Bandwidth Memory (HBM) stack units for storing the pre-trained weights. A dedicated non-blocking crossbar switch is not used in our low-latency interconnect between the 16 HBMs and the 128-PE array. Micro-architectural details for CMOS ASIC implementations are presented, and simulated performance is compared to recent hardware accelerators for DNNs on AlexNet and VGG16. When comparing simulated processing latency for a 4096-1000 FC8 layer, our FC-ACCL achieves 108 GOPS (with a 100 MHz clock), which improves on a recent EIE accelerator quoted at 102 GOPS for the same FC8 layer with an 800 MHz clock and using compression. We have achieved this considerable improvement by fully utilizing the HBM units for storing and reading out column-specific FC-layer weights in 1 cycle with a novel column-row-column schedule, and by implementing a maximally parallel datapath that processes these weights with the corresponding MAC and PE units. When up-scaled to 128 16x16 PEs, for 16x16 tiles of weights, the design reduces latency for the large FC6 layer by 60% in AlexNet and by 3% in VGG16 when compared to an alternative EIE solution which uses compression.
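The column-wise schedule can be modeled as a loop ordering: each "cycle" streams one column of weights (as if read from an HBM page in a single access) and updates all output accumulators at once. This is a software model of the scheduling idea only (names are ours), not the RTL.

```python
import numpy as np

def matvec_column_schedule(W, x):
    """Column-major matrix-vector product: stream one weight column per
    'cycle' and update all M accumulators in parallel, mimicking one HBM
    column read feeding all MAC units simultaneously."""
    M, K = W.shape
    acc = np.zeros(M, dtype=W.dtype)    # one accumulator per MAC unit
    for k in range(K):                  # one column read per cycle
        acc += W[:, k] * x[k]           # all MACs update in the same cycle
    return acc
```

With one column read per cycle and M MACs working in parallel, the latency of an MxK layer is K cycles plus pipeline fill, which is the source of the latency figures quoted above.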
Failure isolation in linear stochastic systems: the associative recall approach - Kalman filter residuals from specific failure modes of state-space systems are modelled as ARMA processes; an ARMA coefficients library is built up and used for online classification with associative memory as one possible implementation.
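The classification step can be sketched as: fit autoregressive coefficients to a residual sequence, then recall the nearest stored coefficient vector from the failure-mode library. The sketch uses a pure AR least-squares fit in place of full ARMA modelling, and nearest-neighbor distance in place of an associative memory; both substitutions, and all names, are our simplifications.

```python
import numpy as np

def ar_coeffs(r, p=2):
    """Least-squares AR(p) fit to a residual sequence r:
    r[t] ≈ a1*r[t-1] + ... + ap*r[t-p].  Returns [a1, ..., ap]."""
    X = np.column_stack([r[p - i - 1:len(r) - i - 1] for i in range(p)])
    y = r[p:]
    return np.linalg.lstsq(X, y, rcond=None)[0]

def classify(residual, library, p=2):
    """Associative-recall stand-in: the stored coefficient vector nearest
    to the fitted one identifies the failure mode."""
    c = ar_coeffs(residual, p)
    return int(np.argmin([np.linalg.norm(c - ref) for ref in library]))
```

Each failure mode shapes the Kalman filter residuals into a characteristic spectrum, so its fitted coefficients cluster around a library entry and the nearest-match rule isolates the fault online.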
Recent advances in CMOS VLSI technology have enabled the tremendous growth of devices at the edge of the cloud and in indoor environments: IoT indoor appliances, mobile indoor medical assistants, mobile indoor manufacturing platforms, indoor drone assistants, and others. As anticipated, this growth (in edge-device numbers and capabilities) is generating large communications and data-processing workloads for the servers in the cloud. One approach to help manage this trend is to make the edge nodes more intelligent and able to process more data onboard (within the edge node) before communicating with the servers. This thesis proposes hardware accelerator solutions to three types of onboard (within-platform) processing: Spatial Self-Localization (SSL), which localizes the platform in space; Speaker Recognition (SpkrRec), which allows human voice control of the platform and authentication of the human speaker; and Fully-Connected layer evaluation in Neural Networks (FC-NN) for accelerated neural network processing within the platform. Onboard processing is assumed to include a multi-core SoC (CPU/GPU), conventional SRAM and DRAM memory as well as high-bandwidth memory (HBM or 3D-DRAM), and communication and sensing subsystems. The SSL, SpkrRec, and FC-NN accelerators can be integrated with the SoC's peripheral bus structures such as AXI-Stream, AXI-Lite, AXI-HBM, JESD235A, JESD235B, GPMC, DMA, and similar high-speed processor interfaces.