-
UDC: Unified DNAS for Compressible TinyML Models
Authors:
Igor Fedorov,
Ramon Matas,
Hokchhay Tann,
Chuteng Zhou,
Matthew Mattina,
Paul Whatmough
Abstract:
Deploying TinyML models on low-cost IoT hardware is very challenging, due to limited device memory capacity. Neural processing unit (NPU) hardware addresses the memory challenge by using model compression to exploit weight quantization and sparsity to fit more parameters in the same footprint. However, designing compressible neural networks (NNs) is challenging, as it expands the design space across which we must make balanced trade-offs. This paper demonstrates Unified DNAS for Compressible (UDC) NNs, which explores a large search space to generate state-of-the-art compressible NNs for NPUs. ImageNet results show UDC networks are up to $3.35\times$ smaller (iso-accuracy) or 6.25% more accurate (iso-model size) than previous work.
Submitted 5 January, 2023; v1 submitted 15 January, 2022;
originally announced January 2022.
-
A Resource-Efficient Embedded Iris Recognition System Using Fully Convolutional Networks
Authors:
Hokchhay Tann,
Heng Zhao,
Sherief Reda
Abstract:
Applications of Fully Convolutional Networks (FCN) in iris segmentation have shown promising advances. For mobile and embedded systems, a significant challenge is that the proposed FCN architectures are extremely computationally demanding. In this article, we propose a resource-efficient, end-to-end iris recognition flow, which consists of FCN-based segmentation and contour fitting, followed by Daugman normalization and encoding. To attain accurate and efficient FCN models, we propose a three-step SW/HW co-design methodology consisting of FCN architectural exploration, precision quantization, and hardware acceleration. In our exploration, we propose multiple FCN models, and in comparison to previous works, our best-performing model requires 50X fewer FLOPs per inference while achieving a new state-of-the-art segmentation accuracy. Next, we select the most efficient set of models and further reduce their computational complexity through weight and activation quantization using an 8-bit dynamic fixed-point (DFP) format. Each model is then incorporated into an end-to-end flow for true recognition performance evaluation. A few of our end-to-end pipelines outperform the previous state-of-the-art on the two datasets evaluated. Finally, we propose a novel DFP accelerator and fully demonstrate the SW/HW co-design realization of our flow on an embedded FPGA platform. In comparison with the embedded CPU, our hardware acceleration achieves up to 8.3X speedup for the overall pipeline while using less than 15% of the available FPGA resources. We also provide comparisons between the FPGA system and an embedded GPU, showing different benefits and drawbacks for the two platforms.
Submitted 8 September, 2019;
originally announced September 2019.
-
Principles of Information Storage in Small-Molecule Mixtures
Authors:
Jacob K. Rosenstein,
Christopher Rose,
Sherief Reda,
Peter M. Weber,
Eunsuk Kim,
Jason Sello,
Joseph Geiser,
Eamonn Kennedy,
Christopher Arcadia,
Amanda Dombroski,
Kady Oakley,
Shui Ling Chen,
Hokchhay Tann,
Brenda M. Rubenstein
Abstract:
Molecular data systems have the potential to store information at dramatically higher density than existing electronic media. Some of the first experimental demonstrations of this idea have used DNA, but nature also uses a wide diversity of smaller non-polymeric molecules to preserve, process, and transmit information. In this paper, we present a general framework for quantifying chemical memory, which is not limited to polymers and extends to mixtures of molecules of all types. We show that the theoretical limit for molecular information is two orders of magnitude denser by mass than DNA, although this comes with different practical constraints on total capacity. We experimentally demonstrate kilobyte-scale information storage in mixtures of small synthetic molecules, and we consider some of the new perspectives that will be necessary to harness the information capacity available from the vast non-genomic chemical space.
Submitted 6 May, 2019;
originally announced May 2019.
-
Parallelized Linear Classification with Volumetric Chemical Perceptrons
Authors:
Christopher E. Arcadia,
Hokchhay Tann,
Amanda Dombroski,
Kady Ferguson,
Shui Ling Chen,
Eunsuk Kim,
Christopher Rose,
Brenda M. Rubenstein,
Sherief Reda,
Jacob K. Rosenstein
Abstract:
In this work, we introduce a new type of linear classifier that is implemented in a chemical form. We propose a novel encoding technique which simultaneously represents multiple datasets in an array of microliter-scale chemical mixtures. Parallel computations on these datasets are performed as robotic liquid handling sequences, whose outputs are analyzed by high-performance liquid chromatography. As a proof of concept, we chemically encode several MNIST images of handwritten digits and demonstrate successful chemical-domain classification of the digits using volumetric perceptrons. We additionally quantify the performance of our method with a larger dataset of binary vectors and compare the experimental measurements against predicted results. Paired with appropriate chemical analysis tools, our approach can work on increasingly parallel datasets. We anticipate that related approaches will be scalable to multilayer neural networks and other more complex algorithms. Much like recent demonstrations of archival data storage in DNA, this work blurs the line between chemical and electrical information systems, and offers early insight into the computational efficiency and massive parallelism which may come with computing in chemical domains.
Submitted 11 October, 2018;
originally announced October 2018.
-
BLASYS: Approximate Logic Synthesis Using Boolean Matrix Factorization
Authors:
Soheil Hashemi,
Hokchhay Tann,
Sherief Reda
Abstract:
Approximate computing is an emerging paradigm where design accuracy can be traded off for benefits in design metrics such as design area, power consumption, or circuit complexity. In this work, we present a novel paradigm to synthesize approximate circuits using Boolean matrix factorization (BMF). In our methodology, the truth table of a subcircuit of the design is approximated using BMF to a controllable approximation degree, and the results of the factorization are used to synthesize a less complex subcircuit. To scale our technique to large circuits, we devise a circuit decomposition method and a subcircuit design-space exploration technique to identify the best order for subcircuit approximations. Our method leads to a smooth trade-off between accuracy and full circuit complexity as measured by design area and power consumption. Using an industrial-strength design flow, we extensively evaluate our methodology on a number of test cases, where we demonstrate that the proposed methodology can achieve up to 63% in power savings, while introducing an average relative error of 5%. We also compare our work to previous works in Boolean circuit synthesis and demonstrate significant improvements in design metrics for the same accuracy targets.
Submitted 15 May, 2018;
originally announced May 2018.
-
Flexible Deep Neural Network Processing
Authors:
Hokchhay Tann,
Soheil Hashemi,
Sherief Reda
Abstract:
The recent success of Deep Neural Networks (DNNs) has drastically improved the state of the art for many application domains. While achieving high accuracy, deploying state-of-the-art DNNs is a challenge since they typically require billions of expensive arithmetic computations. In addition, DNNs are typically deployed in ensembles to boost accuracy, which further exacerbates the system requirements. This computational overhead is an issue for many platforms, e.g. data centers and embedded systems, with tight latency and energy budgets. In this article, we introduce a flexible DNN ensemble processing technique, which achieves a large reduction in average inference latency while incurring a small to negligible accuracy drop. Our technique is flexible in that it allows for dynamic adaptation between quality of results (QoR) and execution runtime. We demonstrate the effectiveness of the technique on AlexNet and ResNet-50 using the ImageNet dataset. This technique can also easily handle other types of networks.
Submitted 22 January, 2018;
originally announced January 2018.
-
Hardware-Software Codesign of Accurate, Multiplier-free Deep Neural Networks
Authors:
Hokchhay Tann,
Soheil Hashemi,
Iris Bahar,
Sherief Reda
Abstract:
While Deep Neural Networks (DNNs) push the state-of-the-art in many machine learning applications, they often require millions of expensive floating-point operations for each input classification. This computation overhead limits the applicability of DNNs to low-power, embedded platforms and incurs high cost in data centers. This motivates recent interest in designing low-power, low-latency DNNs based on fixed-point, ternary, or even binary data precision. While recent works in this area offer promising results, they often lead to large accuracy drops when compared to the floating-point networks. We propose a novel approach to map floating-point based DNNs to 8-bit dynamic fixed-point networks with integer power-of-two weights with no change in network architecture. Our dynamic fixed-point DNNs allow different radix points between layers. During inference, power-of-two weights allow multiplications to be replaced with arithmetic shifts, while the 8-bit fixed-point representation simplifies both the buffer and adder design. In addition, we propose a hardware accelerator design to achieve low-power, low-latency inference with insignificant degradation in accuracy. Using our custom accelerator design with the CIFAR-10 and ImageNet datasets, we show that our method achieves significant power and energy savings while increasing the classification accuracy.
Submitted 11 May, 2017;
originally announced May 2017.
-
Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks
Authors:
Soheil Hashemi,
Nicholas Anthony,
Hokchhay Tann,
R. Iris Bahar,
Sherief Reda
Abstract:
Deep neural networks are gaining in popularity as they are used to generate state-of-the-art results for a variety of computer vision and machine learning applications. At the same time, these networks have grown in depth and complexity in order to solve harder problems. Given the limitations in power budgets dedicated to these networks, the importance of low-power, low-memory solutions has been stressed in recent years. While a large number of dedicated hardware designs using different precisions have recently been proposed, there exists no comprehensive study of different bit precisions and arithmetic in both inputs and network parameters. In this work, we address this issue and perform a study of different bit precisions in neural networks (from floating-point to fixed-point, powers of two, and binary). In our evaluation, we consider and analyze the effect of precision scaling on both network accuracy and hardware metrics including memory footprint, power and energy consumption, and design area. We also investigate training-time methodologies to compensate for the reduction in accuracy due to limited bit precision and demonstrate that in most cases, precision scaling can deliver significant benefits in design metrics at the cost of very modest decreases in network accuracy. In addition, we propose that a small portion of the benefits achieved when using lower precisions can be forfeited to increase the network size and therefore the accuracy. We evaluate our approach using three well-recognized networks and datasets to show its generality. We investigate the trade-offs and highlight the benefits of using lower precisions in terms of energy and memory footprint.
Submitted 12 December, 2016;
originally announced December 2016.
-
Runtime Configurable Deep Neural Networks for Energy-Accuracy Trade-off
Authors:
Hokchhay Tann,
Soheil Hashemi,
R. Iris Bahar,
Sherief Reda
Abstract:
We present a novel dynamic configuration technique for deep neural networks that permits step-wise energy-accuracy trade-offs during runtime. Our configuration technique adjusts the number of channels in the network dynamically depending on response time, power, and accuracy targets. To enable this dynamic configuration technique, we co-design a new training algorithm, where the network is incrementally trained such that the weights in channels trained in earlier steps are fixed. Our technique provides the flexibility of multiple networks while storing and utilizing one set of weights. We evaluate our techniques using both an ASIC-based hardware accelerator and a low-power embedded GPGPU and show that our approach leads to only a small or negligible loss in the final network accuracy. We analyze the performance of our proposed methodology using three well-known networks for the MNIST, CIFAR-10, and SVHN datasets, and we show that we are able to achieve up to 95% energy reduction with less than 1% accuracy loss across the three benchmarks. In addition, compared to prior work on dynamic network reconfiguration, we show that our approach leads to approximately 50% savings in storage requirements, while achieving similar accuracy.
Submitted 20 July, 2016; v1 submitted 19 July, 2016;
originally announced July 2016.