
Nick Iliev

This article presents a novel low-latency CMOS hardware accelerator for fully connected (FC) layers in deep neural networks (DNNs). The accelerator, FC-Accel, is based on 128 8x8 or 16x16 processing elements (PEs) for matrix-vector multiplication, and 128 multiply-accumulate (MAC) units integrated with 16 High Bandwidth Memory (HBM) stack units for storing the pre-trained weights. A dedicated non-blocking crossbar switch is not used in our low-latency page-bus demultiplexer-based interconnect between the 16 HBMs and the 128-PE array. We show near-linear speedup and reductions in space and time complexity with respect to traditional parallel matrix-vector multiplication with a checkerboard block decomposition algorithm, using a novel matched HBM2 memory subsystem for weight and input-feature storage. We perform 16-bit fixed-point computation on the key kernel for DNN FC-layer computation: an FC kernel with KxM tiles which can be scaled for different FC layer sizes. We have designed a flexible processing element (PE), which implements the scalable kernel in a 1D array of PEs to conserve resources. PE reconfiguration can be done as required by the layer being processed (FC6, FC7, or FC8 in AlexNet or VGG16, for example). Micro-architectural details for CMOS ASIC implementations are presented, and simulated performance is compared to recent hardware accelerators for DNNs on AlexNet and VGG16. When comparing simulated processing latency for the FC8 layer, FC-Accel achieves 108 GOPS (non-pipelined, with a 100 MHz clock) and 1048 GOPS (pipelined, with a 662 MHz clock), which improves on a recent EIE accelerator quoted at 102 GOPS with an 800 MHz clock and using compression for the same FC8 layer. When compared to Tensaurus, a recent accelerator for sparse-dense tensor computations, FC-Accel (clocked at 662 MHz) delivers a 2.5× increase in throughput over Tensaurus (clocked at 2 GHz) for VGG16 FC8.
The Xilinx Versal-ACAP VC1902 FPGA has an FC8 inferencing latency of 158 usec at 1.33 GHz, which is much slower than FC-Accel's FC8 latency of 8.5 usec. When compared with an NVIDIA Jetson AGX Xavier GPU running inference on VGG-16 FC8, FC-Accel reduces FC8 inferencing latency from the GPU's average of 120 usec to 8.5 usec. Intel's Arria-10 DLA FPGA achieves 26 usec for the VGG16 FC8 layer which is 3 times the latency of the proposed solution.
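The checkerboard tiling described above can be sketched in software: each tile of the weight matrix maps to one PE, which multiplies its block by the matching slice of the input vector, and the partial results are accumulated per output row. This is an illustrative model of the tiling scheme (function name and tile size are ours), not the ASIC datapath.

```python
import numpy as np

def tiled_matvec(W, x, tile=8):
    """Matrix-vector product computed tile-by-tile, mimicking an array of
    tile x tile PEs, each owning one block of the weight matrix.
    Hypothetical sketch; tile=8 matches the 8x8 PE option."""
    M, K = W.shape
    y = np.zeros(M, dtype=W.dtype)
    for i in range(0, M, tile):          # PE row index
        for j in range(0, K, tile):      # PE column index
            # each PE multiplies its weight tile by the matching input slice
            y[i:i + tile] += W[i:i + tile, j:j + tile] @ x[j:j + tile]
    return y
```

Because each tile touches disjoint slices of `y` within a row band, the per-tile products can run in parallel across PEs and be reduced per output row, which is where the near-linear speedup comes from.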
This work discusses a novel low-power digital CMOS architecture for speaker identification (SI) that combines k-means clustering with Gaussian mixture model (GMM) scoring. We show that k-means clustering at the front end reduces the dimensionality of speech features to minimize downstream processing without affecting SI accuracy. Implementation of the cluster generator is discussed, with novel distance-computing and online centroid-update datapaths that minimize the overhead of the clustering layer (CL). The integrated design achieves 6× lower energy than a conventional implementation for SI among ten speakers.
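The pipeline above can be sketched at the algorithm level: a tiny k-means front end compresses the feature frames to a few centroids, and only those centroids are scored against each speaker's GMM. This is a software illustration of the flow (all names, the diagonal-covariance GMM, and the equal mixture weights are our assumptions), not the paper's datapath.

```python
import numpy as np

def kmeans(frames, k=4, iters=10, seed=0):
    """Tiny k-means front end: compress T feature frames to k centroids.
    Illustrative model of the clustering layer (CL)."""
    rng = np.random.default_rng(seed)
    cent = frames[rng.choice(len(frames), k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest centroid (the distance datapath)
        labels = np.argmin(((frames[:, None] - cent[None]) ** 2).sum(-1), axis=1)
        for j in range(k):                      # centroid update step
            if np.any(labels == j):
                cent[j] = frames[labels == j].mean(axis=0)
    return cent

def gmm_loglik(x, means, var):
    """Log-likelihood of points x under a diagonal-covariance GMM with
    equal mixture weights and shared scalar variance (a simplification)."""
    d = x.shape[1]
    sq = ((x[:, None] - means[None]) ** 2).sum(-1)          # (points, mixtures)
    logp = -0.5 * (sq / var + d * np.log(2 * np.pi * var))
    m = logp.max(axis=1, keepdims=True)                     # logsumexp
    return (m.squeeze(1) + np.log(np.exp(logp - m).sum(axis=1))).sum()

def identify(frames, speaker_models, k=4):
    """Score only k centroids instead of all frames, then pick the speaker
    whose GMM explains them best."""
    cent = kmeans(frames, k)
    scores = [gmm_loglik(cent, mu, var) for mu, var in speaker_models]
    return int(np.argmax(scores))
```

The energy saving in the paper comes from the same structural idea: GMM scoring cost scales with the number of vectors scored, so scoring k centroids instead of T frames cuts the downstream work by roughly T/k.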
This work proposes a low power digital block for spatial localization of sensors within the limited area/power budget of sensor nodes. We show a novel digital architecture for a Linear Program (LP) solver based on a recurrent (non-linear feedback) Neural Network (RNN). Training data is not needed in our approach. We solve the primal and dual optimization problems for spatial localization with a single multi-functional datapath which does not require matrix inversions. FPGA and ASIC implementations are presented which target a sensor (microrobot) localization problem in 2D using angle-of-arrival (AOA) measurements. The results show that the estimated locations (2D coordinates) are very close to the ground-truth values in all tested scenarios. The proposed RNN has input and output layers, and a hidden layer with four neurons. An FPGA implementation of the localizer in a 180 nm process dissipates 180 mW of power at 1.5 V and 31.25 MHz. When scaled to 128 neurons, the performance is 13 Mop/sec/W for the same FPGA technology. The design has also been simulated in an ASIC 45 nm (PDK45, 1 V VDD) standard-cell technology. With 128 neurons in the hidden layer, the ASIC consumes 196.9 mW of power at 516 MHz, which is equivalent to 677.165 Mop/sec/W. The proposed design significantly improves computing efficiency over recently published results for similar implementations.
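The idea of an RNN that settles on an LP solution without matrix inversion can be illustrated with a generic discrete-time primal-dual gradient network. This is a sketch, not the paper's architecture: the ReLU-like projections stand in for the non-linear feedback, and we add a small quadratic regularizer (eps) purely to make the simulated dynamics converge cleanly; all names and constants are our assumptions.

```python
import numpy as np

def rnn_lp_solver(c, A, b, steps=2000, alpha=0.2, eps=0.5):
    """Simulate a recurrent (feedback) network whose state settles on the
    solution of  min c^T x  s.t.  A x >= b, x >= 0.
    Primal and dual variables evolve together; no matrix inversion is used."""
    x = np.zeros(len(c))            # primal neurons
    lam = np.zeros(len(b))          # dual neurons (Lagrange multipliers)
    for _ in range(steps):
        # primal update: gradient descent with a ReLU-like projection onto x >= 0
        x = np.maximum(0.0, x - alpha * (c + eps * x - A.T @ lam))
        # dual update: ascent on constraint violation, clipped at zero
        lam = np.maximum(0.0, lam + alpha * (b - A @ x))
    return x
```

Both updates are a handful of MACs and a clamp per neuron, which is why this class of solver maps naturally onto a single multi-functional datapath.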
Finding accurate positions of mobile devices based on visual information involves searching for query-matching images in a very large dataset, typically containing millions of images. Although the main problem is designing a reliable image retrieval engine, accurate localization also depends on a good fusion algorithm between the GPS data (geo-tags) of each query-matching image and the query image. This paper proposes a new method for reliable estimation of the actual query camera position (geo-tag) by applying structure from motion (SFM) with bundle adjustment for sparse 3D camera position reconstruction, and a linear rigid transformation between two different 3D Cartesian coordinate systems. Experimental results on more than 170 query images show that the proposed algorithm returns accurate results for a high percentage of the samples. The error range of the estimated query geo-tag is compared with other related research and indicates an average error of less than 5 meters, which improves on some of the published works.
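The rigid transformation between the two 3D Cartesian frames can be estimated in closed form with the classic Kabsch/Umeyama least-squares alignment. The sketch below shows that step only (mapping, say, SFM camera positions onto geo-tagged positions); it is a generic formulation, not necessarily the exact variant used in the paper.

```python
import numpy as np

def rigid_transform(P, Q):
    """Least-squares rigid transform (R, t) mapping point set P onto Q,
    so Q_i ≈ R @ P_i + t (Kabsch/Umeyama, no scale).
    P and Q are (n, 3) arrays of corresponding points."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                    # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0] * (P.shape[1] - 1) + [d]) @ U.T
    t = cq - R @ cp
    return R, t
```

Given three or more corresponding camera positions in the SFM frame and the geo-tag frame, applying the recovered (R, t) to the reconstructed query camera position yields its geo-tag estimate.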
Rounding the result of a multiplication is used in many fields such as signal processing. To reduce power dissipation it is better to perform the rounding directly inside the multiplier by using truncated multipliers, but truncated multipliers require adding a correction. Two methods already exist: variable correction and constant correction. This paper introduces a new method for truncated multiplication which utilizes a combination of the two previous methods. It also examines the amount of power reduction, presents how the corrections are calculated, and focuses on how these truncated multipliers can be implemented. Simulations show that the hybrid method has the lowest average and mean-square error.
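A bit-level software model makes the truncation idea concrete: the k least-significant partial-product columns are never formed (saving the AND gates and adders for those columns), and a constant equal to the expected value of the discarded bits is added to bound the error. The sketch below shows constant correction only, not the paper's hybrid scheme; the function and its parameters are ours.

```python
def truncated_multiply(a, b, n=8, k=6):
    """Bit-level model of an n x n truncated multiplier: partial-product
    columns below column k are discarded, and a constant correction (the
    expected value of the dropped bits, each bit being 1 with prob. 1/4)
    is added instead."""
    acc = 0
    for i in range(n):
        for j in range(n):
            if i + j >= k:                      # keep only columns >= k
                acc += ((a >> i) & 1) * ((b >> j) & 1) << (i + j)
    # column j (< k) holds j+1 partial-product bits of weight 2^j
    corr = sum((j + 1) * (1 << j) for j in range(min(k, n))) // 4
    return acc + corr
```

With n=8 and k=6, the discarded columns can contribute at most 321 to the true product, so the constant-corrected result stays within a few LSBs of column k of the exact product; the hybrid method in the paper tightens this further by making part of the correction data-dependent.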
... circuits Nick Iliev and James Stine ECE Dept. ... 7. M. Perkowski, T. Luba, R. Lisanke, N. Iliev, P. Burkey, R. Malvi, Z. Wang, and S. Zhou, “Unified Approach to Functional Decomposition of Switching Functions”, Portland State University, Technical report CS95, 1995. ...
Abstract—Spatial localization (colocation) of nodes in wireless sensor networks (WSNs) is an active area of research, with many applications in sensing from distributed systems, such as microaerial vehicles, smart dust sensors, and mobile robotics. This paper provides a comprehensive review and comparison of recent implementations (commercial and academic) of physical measurement techniques used in sensor localization, and of the localization algorithms that use these measurement techniques. Physical methods for measuring distances and angles between WSN nodes are reviewed, followed by a comprehensive comparison of localization accuracy, applicable ranges, node dimensions, and power consumption of the different implementations. A summary of the advantages and disadvantages of each measurement technique is provided, along with a comparison of colocalization methods in WSNs across multiple algorithms and distance ranges. A discussion of possible improvements to the accuracy, range, and power consumption of selected self-localization methods is included in the concluding discussion. Although the preferred implementation depends on the application, required accuracy, and range, passive optical triangulation is reported as the most energy-efficient localization method for low-cost/low-power miniature sensor nodes. It is capable of providing micrometer-level resolution; however, the applicable range (internode distance) is limited to a few centimeters.
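The geometry behind angle-based triangulation is simple enough to state in a few lines: two anchors at known positions each measure a bearing to the same node, and the node lies at the intersection of the two rays. The sketch below is a generic 2D version (names and conventions are ours), included only to make the measurement model concrete.

```python
import math

def triangulate_2d(p1, theta1, p2, theta2):
    """2D angle-of-arrival triangulation: anchors at p1 and p2 each measure
    a bearing (radians from the +x axis) to the same node; return the ray
    intersection. Degenerate (parallel-ray) geometry raises ValueError."""
    d1 = (math.cos(theta1), math.sin(theta1))
    d2 = (math.cos(theta2), math.sin(theta2))
    # solve p1 + s*d1 = p2 + t*d2 for s (2x2 linear system, Cramer's rule)
    det = d1[0] * (-d2[1]) - (-d2[0]) * d1[1]
    if abs(det) < 1e-12:
        raise ValueError("rays are parallel; bearings give no position fix")
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    s = (rx * (-d2[1]) - (-d2[0]) * ry) / det
    return (p1[0] + s * d1[0], p1[1] + s * d1[1])
```

The position error of such a fix grows with range and with the angular error of the bearing sensors, which is consistent with the review's observation that optical triangulation achieves fine resolution only over short internode distances.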
Inversion of a finite field element is the most time-consuming of all field arithmetic operations, which is why it is avoided as much as possible in elliptic curve cryptosystem implementations. Unfortunately, there exist only two methods for performing inversion: the Euclidean algorithm and inversion through multiplication based on Fermat's theorem. VLSI implementations of these methods are examined in detail using TSMC SCN6M 0.18 μm technology in GF(2^163) using a polynomial basis representation. Observations are made comparing the variants of each method, and strategies are presented to improve VLSI implementations.
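Both inversion methods can be demonstrated in a few lines of polynomial-basis GF(2^m) arithmetic. The sketch below uses the small AES field GF(2^8) so the result is easy to check; the same algorithms apply in GF(2^163) with the appropriate field polynomial. All function names are ours.

```python
def gf_mul(a, b, mod, m):
    """Carry-less multiply in GF(2^m), reduced mod the field polynomial
    'mod' (bit i of mod = coefficient of x^i, bit m included)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:              # reduce as soon as degree reaches m
            a ^= mod
    return r

def inv_fermat(a, mod, m):
    """Inversion via Fermat: a^(2^m - 2), by square-and-multiply."""
    e, r = (1 << m) - 2, 1
    while e:
        if e & 1:
            r = gf_mul(r, a, mod, m)
        a = gf_mul(a, a, mod, m)
        e >>= 1
    return r

def inv_euclid(a, mod, m):
    """Inversion via the extended Euclidean algorithm on GF(2) polynomials."""
    u, v, g1, g2 = a, mod, 1, 0
    while u != 1:
        j = u.bit_length() - v.bit_length()
        if j < 0:
            u, v, g1, g2 = v, u, g2, g1
            j = -j
        u ^= v << j             # polynomial division step (XOR = GF(2) subtract)
        g1 ^= g2 << j
    return g1
```

The hardware trade-off the paper examines is visible even here: Fermat inversion is a fixed schedule of m-1 squarings plus multiplications (regular, easy to pipeline), while the Euclidean route needs degree comparisons and variable shifts but far fewer multiplications.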
We present a novel low-latency CMOS hardware accelerator for fully connected (FC) layers in deep neural networks (DNNs). The FC accelerator, FC-ACCL, is based on 128 8x8 or 16x16 processing elements (PEs) for matrix-vector multiplication, and 128 multiply-accumulate (MAC) units integrated with 16 High Bandwidth Memory (HBM) stack units for storing the pre-trained weights. A dedicated non-blocking crossbar switch is not used in our low-latency interconnect between the 16 HBMs and the 128-PE array. Micro-architectural details for CMOS ASIC implementations are presented, and simulated performance is compared to recent hardware accelerators for DNNs on AlexNet and VGG16. When comparing simulated processing latency for a 4096-1000 FC8 layer, our FC-ACCL achieves 108 GOPS (with a 100 MHz clock), which improves on a recent EIE accelerator quoted at 102 GOPS for the same FC8 layer with an 800 MHz clock and using compression. We have achieved this considerable improvement by fully utilizing the HBM units for storing and reading out column-specific FC-layer weights in 1 cycle with a novel column-row-column schedule, and by implementing a maximally parallel datapath that processes these weights with the corresponding MAC and PE units. When up-scaled to 128 16x16 PEs, for 16x16 tiles of weights, the design reduces latency for the large FC6 layer by 60% in AlexNet and by 3% in VGG16 when compared to an alternative EIE solution which uses compression.
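The column-wise schedule can be modeled as a loop ordering: each "cycle" streams one column of weights (as if read from an HBM page in a single access) and updates all output accumulators at once. This is a software model of the scheduling idea only (names are ours), not the RTL.

```python
import numpy as np

def matvec_column_schedule(W, x):
    """Column-major matrix-vector product: stream one weight column per
    'cycle' and update all M accumulators in parallel, mimicking one HBM
    column read feeding all MAC units simultaneously."""
    M, K = W.shape
    acc = np.zeros(M, dtype=W.dtype)    # one accumulator per MAC unit
    for k in range(K):                  # one column read per cycle
        acc += W[:, k] * x[k]           # all MACs update in the same cycle
    return acc
```

With one column read per cycle and M MACs working in parallel, the latency of an MxK layer is K cycles plus pipeline fill, which is the source of the latency figures quoted above.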
Failure isolation in linear stochastic systems: the associative recall approach - Kalman filter residuals from specific failure modes of state-space systems are modelled as ARMA processes; an ARMA coefficients library is built up and used for online classification with associative memory as one possible implementation.
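The classification step can be sketched as: fit autoregressive coefficients to a residual sequence, then recall the nearest stored coefficient vector from the failure-mode library. The sketch uses a pure AR least-squares fit in place of full ARMA modelling, and nearest-neighbor distance in place of an associative memory; both substitutions, and all names, are our simplifications.

```python
import numpy as np

def ar_coeffs(r, p=2):
    """Least-squares AR(p) fit to a residual sequence r:
    r[t] ≈ a1*r[t-1] + ... + ap*r[t-p].  Returns [a1, ..., ap]."""
    X = np.column_stack([r[p - i - 1:len(r) - i - 1] for i in range(p)])
    y = r[p:]
    return np.linalg.lstsq(X, y, rcond=None)[0]

def classify(residual, library, p=2):
    """Associative-recall stand-in: the stored coefficient vector nearest
    to the fitted one identifies the failure mode."""
    c = ar_coeffs(residual, p)
    return int(np.argmin([np.linalg.norm(c - ref) for ref in library]))
```

Each failure mode shapes the Kalman filter residuals into a characteristic spectrum, so its fitted coefficients cluster around a library entry and the nearest-match rule isolates the fault online.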
Recent advances in CMOS VLSI technology have enabled the tremendous growth of devices at the edge of the cloud and in indoor environments: IoT indoor appliances, mobile indoor medical assistants, mobile indoor manufacturing platforms, indoor drone assistants, and others. As anticipated, this growth (in edge-device numbers and capabilities) is generating large communications and data-processing workloads for the servers in the cloud. One approach to help manage this trend is to make the edge nodes more intelligent and able to process more data onboard (within the edge node) before communicating with the servers. This thesis proposes hardware accelerator solutions to three types of onboard (within-platform) processing: Spatial Self-Localization (SSL), which localizes the platform in space; Speaker Recognition (SpkrRec), which allows human voice control of the platform and authentication of the human speaker; and Fully-Connected layer evaluation in Neural Networks (FC-NN) for accelerated neural network processing within the platform. Onboard processing is assumed to include a multi-core SoC (CPU/GPU), conventional SRAM and DRAM memory as well as high-bandwidth memory (HBM or 3D-DRAM), and communication and sensing subsystems. The SSL, SpkrRec, and FC-NN accelerators can be integrated with the SoC's peripheral bus structures such as AXI-Stream, AXI-Lite, AXI-HBM, JESD235A, JESD235B, GPMC, DMA, and similar high-speed processor interfaces.