I am a PhD candidate in the Department of Computer Science and Engineering at the University of California, San Diego (UCSD). I am a member of the System Energy Efficiency Lab (SeeLab), and my research focuses on approximate computing, embedded systems, and computer architecture. I explore alternative architectures that address the memory bottleneck and computation cost, including approximate computing, neuromorphic computing, and memory-centric computing. This includes work at both the architectural and circuit levels to make computation more energy efficient while delivering acceptable quality of service.
Memorization is an essential functionality that enables today's machine learning algorithms to provide a high quality of learning and reasoning for each prediction. Memorization gives algorithms prior knowledge to keep the context and define confidence for their decisions. Unfortunately, existing deep learning algorithms have a weak and nontransparent notion of memorization. Brain-inspired HyperDimensional Computing (HDC) is introduced as a model of human memory: it mimics several important functionalities of brain memory by operating with vectors that are computationally tractable and mathematically rigorous in describing human cognition. In this manuscript, we introduce a brain-inspired system that represents HDC memorization capability over a graph of relations. We propose GrapHD, hyperdimensional memorization that represents graph-based information in high-dimensional space. GrapHD defines an encoding method representing complex graph structure while suppor...
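For intuition only, the core HDC memorization primitive that GrapHD builds on can be sketched in a few lines: assign each node a random hypervector, bind (element-wise multiply) the endpoints of each edge, and bundle (add) the results into a single graph memory; queries then become similarity checks against that memory. The dimensionality, bipolar encoding, and example graph below are illustrative assumptions, not the paper's exact encoder.

```python
import numpy as np

D = 10000  # hypervector dimensionality (illustrative choice)
rng = np.random.default_rng(0)

def random_hv():
    # Random bipolar hypervector; i.i.d. components give quasi-orthogonality.
    return rng.choice([-1, 1], size=D)

# Assign each node a random hypervector.
nodes = {n: random_hv() for n in ["A", "B", "C", "D"]}
edges = [("A", "B"), ("B", "C"), ("C", "D")]

# Encode the graph: bind (element-wise multiply) the endpoints of each edge,
# then bundle (sum) all edge hypervectors into one graph memory.
graph_hv = np.sum([nodes[u] * nodes[v] for u, v in edges], axis=0)

def edge_score(u, v):
    # Similarity between a candidate edge and the graph memory.
    probe = nodes[u] * nodes[v]
    return float(probe @ graph_hv) / D

print(edge_score("A", "B"))  # high: edge is stored in the memory
print(edge_score("A", "C"))  # near zero: edge is absent
```

Because random hypervectors are quasi-orthogonal, stored edges score close to one while absent edges score near zero, which is what makes this style of memorization both tractable and checkable.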
IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2020
Hyperdimensional computing (HDC) is a brain-inspired computing paradigm that works with high-dimensional vectors, hypervectors, instead of numbers. HDC replaces several complex learning computations with bitwise and simpler arithmetic operations, resulting in a faster and more energy-efficient learning algorithm. However, it comes at the cost of an increased amount of data to process due to mapping the data into high-dimensional space. While some datasets may nearly fit in memory, the resulting hypervectors more often than not cannot be stored in memory, resulting in long data transfers from storage. In this paper, we propose THRIFTY, an in-storage computing (ISC) solution that performs HDC encoding and training across the flash hierarchy. To hide the latency of training and enable efficient computation, we introduce the concept of batching in HDC. It allows us to split HDC training into sub-components and process them independently. We also present, for the first time, on-ch...
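As a rough software analogue of the batching idea (not the in-storage implementation), HDC training can be split into independent sub-components because class hypervectors are built by simple accumulation: each batch produces a partial model, and partial models merge by addition. The random-projection encoder, dimensions, and data below are assumed for illustration.

```python
import numpy as np

D, F = 4096, 64          # hypervector dimension and feature count (illustrative)
rng = np.random.default_rng(1)
proj = rng.choice([-1, 1], size=(D, F))   # static random projection used for encoding

def encode(x):
    # Map a feature vector into high-dimensional bipolar space.
    return np.sign(proj @ x)

def train_batch(samples, labels, num_classes):
    # Train on one batch independently: bundle encoded samples per class.
    model = np.zeros((num_classes, D))
    for x, y in zip(samples, labels):
        model[y] += encode(x)
    return model

# Batches can be processed independently (e.g., near different flash channels)
# and merged afterwards by simple addition.
X = rng.normal(size=(100, F)); y = rng.integers(0, 3, size=100)
partial_models = [train_batch(X[i:i + 25], y[i:i + 25], 3) for i in range(0, 100, 25)]
model = np.sum(partial_models, axis=0)

def classify(x):
    # Pick the class hypervector most similar to the encoded query.
    return int(np.argmax(model @ encode(x)))

print(classify(X[0]), y[0])
```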
IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2020
Deep neural networks (DNN) have demonstrated effectiveness for various applications such as image processing, video segmentation, and speech recognition. Running state-of-the-art DNNs on current systems mostly relies on either general-purpose processors, ASIC designs, or FPGA accelerators, all of which suffer from data movement due to the limited on-chip memory and data transfer bandwidth. In this work, we propose a novel framework, called RAPIDNN, which performs neuron-to-memory transformation in order to accelerate DNNs in a highly parallel architecture. RAPIDNN reinterprets a DNN model and maps it into a specialized accelerator, which is designed using non-volatile memory blocks that model four fundamental DNN operations, i.e., multiplication, addition, activation functions, and pooling. The framework extracts representative operands of a DNN model, e.g., weights and input values, using clustering methods to optimize the model for in-memory processing. Then, it maps the extracted operands and their pre-computed results into the accelerator memory blocks. At runtime, the accelerator identifies computation results based on an efficient in-memory search capability, which also provides tunability of approximation to further improve computation efficiency. Our evaluation shows that RAPIDNN achieves 68.4× and 49.5× energy efficiency improvement and 48.1× and 10.9× speedup as compared to ISAAC and PipeLayer, the state-of-the-art DNN accelerators, while ensuring less than 0.5% quality loss.
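The operand-clustering step can be illustrated with a small, hedged sketch: cluster weights and inputs into codebooks, precompute all pairwise products once, and turn each multiplication into a table lookup keyed by the nearest codebook entries (the role RAPIDNN assigns to in-memory search). The codebook sizes and the simple 1-D clustering loop below are illustrative, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(2)

def build_codebook(values, k):
    # Cluster 1-D operand values into k representatives (simple k-means loop).
    centers = np.quantile(values, np.linspace(0, 1, k))
    for _ in range(10):
        idx = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        centers = np.array([values[idx == j].mean() if np.any(idx == j) else centers[j]
                            for j in range(k)])
    return centers

weights = rng.normal(size=1000)
inputs = rng.normal(size=1000)
w_book = build_codebook(weights, 16)   # representative weights
x_book = build_codebook(inputs, 16)    # representative inputs

# Precompute all pairwise products once; a multiplication then becomes a table
# lookup indexed by the nearest codebook entries.
product_table = np.outer(w_book, x_book)

def approx_mul(w, x):
    wi = np.argmin(np.abs(w_book - w))
    xi = np.argmin(np.abs(x_book - x))
    return product_table[wi, xi]

print(approx_mul(0.37, -1.2), 0.37 * -1.2)  # approximate vs. exact product
```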
Running Internet of Things applications on general-purpose processors results in a large energy and performance overhead due to the high cost of data movement. Processing in-memory is a promising solution to reduce the data movement cost by processing the data locally inside the memory. In this paper, we design a MultiPurpose In-Memory Processing (MPIM) system, which can be used both as main memory and for processing. MPIM consists of multiple crossbar memories with the capability of efficient in-memory computations. Instead of transferring the large dataset to the processors, MPIM provides three important in-memory processing capabilities: i) addition/multiplication, ii) data search for the nearest neighbor, and iii) bitwise operations including OR, AND, and XOR, enabled by new analog sense amplifiers. The experimental results show that, for applications using addition/multiplication, MPIM can achieve 9.2× energy efficiency improvement and 50.3× speedup as compared to a recent AMD GPU architecture. Similarly, MPIM can provide up to 5.5× energy savings and 19× speedup for the search operations. For bitwise vector processing, we show 11000× energy improvement with 62× speedup over SIMD-based computation, while outperforming other state-of-the-art in-memory processing techniques.
The nearest neighbor (NN) algorithm has been used in a broad range of applications including pattern recognition, classification, computer vision, and databases. The NN algorithm tests data points to find the nearest data to a query data point. With the Internet of Things, the amount of data to search through grows exponentially, motivating more efficient NN designs. Running NN on multicore processors or on general-purpose GPUs incurs significant energy and performance overhead due to small cache sizes, which forces large amounts of data to move from memory over limited-bandwidth buses. In this paper, we propose a nearest neighbor accelerator, called NNgine, consisting of ternary content addressable memory (TCAM) blocks which enable near-data computing. The proposed NNgine overcomes the energy and performance bottlenecks of traditional computing systems by utilizing multiple non-volatile TCAMs which search for nearest neighbor data in parallel. We evaluate the efficiency of our NNgine design by comparing it to existing processor-based approaches. Our results show that NNgine can achieve 5590x higher energy efficiency and 510x speedup compared to state-of-the-art techniques with a negligible accuracy loss of 0.5%. Keywords—Processing in-memory, Non-volatile memory, Content addressable memory, K-nearest neighbor search
Vertical Nanowire-FET (VNFET) is a promising candidate for mainstream industry adoption due to its superior suppression of short-channel effects and its area efficiency. However, CMOS is not an appropriate solution for designing logic gates due to its process incompatibility with VNFET, which creates a technical challenge for mass production. In this work, we propose a novel VNFET-based logic design, called VnanoCML (Vertical Nanowire Transistor-based Current Mode Logic), which addresses the process issue while significantly improving the power and performance of diverse logic designs. Unlike CMOS-based logic, our design exploits current mode logic to overcome the fabrication issue. Furthermore, we reduce the drain-to-source resistance of VnanoCML, which results in higher performance without compromising the subthreshold swing. To show the impact of the proposed VnanoCML, we present key logic designs, namely SRAM, a full adder, and a multiplier, and also evaluate the application-level effectiveness of digital designs for image processing and mathematical computation. Our proposed design improves the fundamental circuit characteristics including output swing, delay time, and power consumption compared to conventional planar MOSFET (PFET)-based circuits. Consequently, our architecture-level results show that VnanoCML can enhance performance and power by 16.4× and 1.15×, respectively. Furthermore, we show that VnanoCML improves the energy-delay product by 38.5× on average compared to PFET-based designs.
Today's computing systems use a huge amount of energy and time to process basic database queries. A large part of this is spent on data movement between the memory and processing cores, owing to the limited cache capacity and memory bandwidth of traditional computers. In this paper, we propose a non-volatile memory-based query accelerator, called NVQuery, which performs several basic query functions in memory, including aggregation, prediction, bit-wise operations, as well as exact and nearest-distance search queries. NVQuery is implemented on a content addressable memory (CAM) and exploits the analog characteristics of non-volatile memory in order to enable in-memory processing. To implement nearest-distance search in memory, we introduce a novel bitline driving scheme that gives weights to the indices of the bits during the search operation. Our experimental evaluation shows that NVQuery can provide 49.3× performance speedup and 32.9× energy savings as compared to running the same queries on a traditional processor. In addition, compared to state-of-the-art query accelerators, NVQuery can achieve 26.2× energy-delay product improvement while providing similar accuracy.
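A minimal sketch of the weighted nearest-distance idea, assuming a plain binary encoding and power-of-two weights (the actual bitline driving scheme is analog): a mismatch on a more significant bit contributes more to the distance, so the search returns the numerically closest stored value rather than the value closest in plain Hamming distance.

```python
import numpy as np

BITS = 8
# Weight each bit position by its significance, mimicking the idea of driving
# bitlines with different strengths so that a mismatch on a high-order bit
# contributes more to the measured distance (weights are illustrative).
weights = 2.0 ** np.arange(BITS - 1, -1, -1)

def to_bits(v):
    # Unpack an unsigned integer into its BITS binary digits, MSB first.
    return np.array([(v >> i) & 1 for i in range(BITS - 1, -1, -1)])

def weighted_nearest(stored, query):
    # Return the stored value whose weighted bit distance to the query is smallest.
    qb = to_bits(query)
    dists = [float(weights @ (to_bits(s) ^ qb)) for s in stored]
    return stored[int(np.argmin(dists))]

table = [3, 17, 42, 100, 200]
print(weighted_nearest(table, 45))   # picks 42, the numerically closest entry
```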
Neural networks are machine learning models that have been successfully used in many applications. Due to the high computational complexity of neural networks, deploying such models on embedded devices with severe power/resource constraints is troublesome. Neural networks are inherently approximate and can be simplified. We propose LookNN, a methodology to replace floating-point multiplications with look-up table searches. First, we devise an algorithmic solution to adapt conventional neural networks to LookNN such that the model's accuracy is minimally affected. We provide experimental results and theoretical analysis demonstrating the applicability of the method. Next, we design enhanced general-purpose processors for searching look-up tables: each processing element of our GPU has access to a small associative memory, enabling it to bypass redundant computations. Our evaluations on the AMD Southern Islands GPU architecture show that LookNN results in 2.2× energy savings and 2.5× speedup running four different neural network applications with zero additive error. For the same four applications, if we tolerate an additive error of less than 0.2%, LookNN can achieve an average of 3× energy improvement and 2.6× speedup compared to the traditional GPU architecture.
Recently, neural networks have been demonstrated to be effective models for image processing, video segmentation, speech recognition, computer vision, and gaming. However, high-energy computation and low performance are the primary bottlenecks of running neural networks. In this paper, we propose an energy/performance-efficient network acceleration technique on General Purpose GPU (GPGPU) architectures which utilizes specialized resistive nearest content addressable memory blocks, called NNCAM, by exploiting the computation locality of learning algorithms. NNCAM stores highly frequent patterns corresponding to neural network operations and searches for the most similar patterns to reuse the computation results. To improve NNCAM computation efficiency and accuracy, we propose layer-based associative update and selective approximation techniques. The layer-based update improves the data locality of NNCAM blocks by filling NNCAM values based on the frequent computation patterns of each neural network layer. To guarantee an appropriate level of computation accuracy while providing maximum energy savings, our design adaptively allocates neural network operations to either NNCAM or GPGPU floating point units (FPUs). The selective approximation relaxes computation on neural network layers by considering its impact on accuracy. In our evaluation, we integrate NNCAM blocks with the modern AMD Southern Islands GPU architecture. Our experimental evaluation shows that the enhanced GPGPU can achieve 68% energy savings and 40% speedup running four popular convolutional neural networks (CNN), while ensuring an acceptable quality loss of less than 2%.
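A hedged software analogue of the reuse mechanism (the class name, threshold, and the tanh "neuron" are illustrative): frequent operand patterns are stored with their precomputed results, an incoming operation first searches for the nearest stored pattern, and the stored result is reused when the match is close enough; otherwise the exact path runs and the table is updated.

```python
import numpy as np

class ReuseCAM:
    """Toy software analogue of an associative reuse table: it stores frequent
    operand patterns with their precomputed results and returns the nearest
    stored result when the match is close enough (threshold is illustrative)."""

    def __init__(self, threshold=0.05):
        self.patterns = []   # stored operand vectors
        self.results = []    # corresponding precomputed results
        self.threshold = threshold

    def insert(self, pattern, result):
        self.patterns.append(np.asarray(pattern, dtype=float))
        self.results.append(result)

    def lookup(self, pattern):
        if not self.patterns:
            return None
        d = [np.abs(p - pattern).mean() for p in self.patterns]
        i = int(np.argmin(d))
        return self.results[i] if d[i] <= self.threshold else None

def neuron(x, w, cam):
    hit = cam.lookup(np.concatenate([x, w]))
    if hit is not None:
        return hit                       # reuse: skip the exact (FPU) computation
    y = float(np.tanh(x @ w))            # exact path
    cam.insert(np.concatenate([x, w]), y)
    return y

cam = ReuseCAM()
x, w = np.array([0.5, -0.2]), np.array([0.1, 0.3])
print(neuron(x, w, cam))                 # computed exactly and cached
print(neuron(x + 0.01, w, cam))          # similar input: stored result reused
```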
IEEE International Symposium on Quality Electronic Design (ISQED)
The Internet of Things is capable of generating huge amounts of data, causing high overhead in terms of energy and performance if run on traditional CPUs and GPUs. This inefficiency comes from the limited cache size and memory bandwidth, which result in large amounts of data movement through the memory hierarchy. In this paper, we propose a configurable associative processor, called CAP, which accelerates computation using multiple parallel memory-based cores capable of approximate or exact matching. CAP is integrated next to the main memory, so it fetches data directly from DRAM. To exploit data locality, the CAMs adaptively split into highly and less frequent components and update at runtime. To further improve CAP efficiency, we integrate a novel signature-based associative memory (SIGAM) beside each processing core to store highly frequent patterns and retrieve them at runtime in exact or approximate modes. Our experimental evaluations show that CAP in approximate (exact) mode can achieve 9.4x and 5.3x (7.2x and 4.2x) energy improvement, and 4.1x and 1.3x speedup compared to AMD GPU and ASIC CMOS-based designs, while providing acceptable quality of service.
Brain-inspired hyperdimensional (HD) computing emulates cognition tasks by computing with hypervectors as an alternative to computing with numbers. At its very core, HD computing is about manipulating and comparing large patterns, stored in memory as hypervectors: the input symbols are mapped to a hypervector and an associative search is performed for reasoning and classification. For every classification event, an associative memory is in charge of finding the closest match between a set of learned hypervectors and a query hypervector by using a distance metric. Hypervectors with i.i.d. components qualify a memory-centric architecture to tolerate a massive number of errors, which eases the cooperation of various methodological design approaches for boosting energy efficiency and scalability. This paper proposes architectural designs for hyperdimensional associative memory (HAM) to facilitate energy-efficient, fast, and scalable search operations using three widely used design approaches. These HAM designs search for the nearest Hamming distance, scale linearly with the number of dimensions in the hypervectors, and explore a large design space with orders of magnitude higher efficiency. First, we propose a digital CMOS-based HAM (D-HAM) that modularly scales to any dimension. Second, we propose a resistive HAM (R-HAM) that exploits the timing discharge characteristic of nonvolatile resistive elements to approximately compute Hamming distances at a lower cost. Finally, we combine such resistive characteristics with a current-based search method to design an analog HAM (A-HAM) that results in a faster and denser alternative. Our experimental results show that R-HAM and A-HAM improve the energy-delay product by 9.6× and 1347× compared to D-HAM while maintaining a moderate accuracy of 94% in language recognition.
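Functionally, all three HAM designs implement the same operation, sketched below in plain software (the dimension and data are illustrative): given a set of learned binary hypervectors and a query, return the row with the minimum Hamming distance.

```python
import numpy as np

D = 10000
rng = np.random.default_rng(3)
classes = rng.integers(0, 2, size=(5, D), dtype=np.uint8)   # learned hypervectors

def nearest_hamming(query):
    # Hamming distance = number of differing bits; the associative memory
    # returns the row index with the minimum distance to the query.
    dists = np.count_nonzero(classes ^ query, axis=1)
    return int(np.argmin(dists)), int(dists.min())

# Query: a noisy copy of class 2 with about 10% of its bits flipped.
query = classes[2].copy()
flip = rng.choice(D, size=D // 10, replace=False)
query[flip] ^= 1
print(nearest_hamming(query))   # expected to return class index 2
```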
Running Internet of Things applications on general-purpose processors results in a large energy and performance overhead due to the high cost of data movement. Processing in-memory is a promising solution to reduce the data movement cost by processing the data locally inside the memory. In this paper, we design a MultiPurpose In-Memory Processing (MPIM) system, which can be used both as main memory and for processing. MPIM consists of multiple crossbar memories with the capability of efficient in-memory computations. Instead of transferring the large dataset to the processors, MPIM provides two important in-memory processing capabilities: i) data search for the nearest neighbor, and ii) bitwise operations including OR, AND, and XOR using small analog sense amplifiers. The experimental results show that MPIM can achieve up to 5.5x energy savings and 19x speedup for the search operations as compared to an AMD GPU-based implementation. For bitwise vector processing, we show 11000x energy improvement with 62x speedup over SIMD-based computation, while outperforming other state-of-the-art in-memory processing techniques.
The Internet of Things significantly increases the amount of data generated, straining the processing capability of current computing systems. Approximate computing can accelerate computation and dramatically reduce energy consumption with controllable accuracy loss. In this paper, we propose a Resistive Associative Unit, called RAU, which approximates computation alongside processing cores. RAU exploits data locality with associative memory: it finds the row which has the closest distance to the input pattern while considering the impact of each bit index on the computation accuracy. Our evaluation shows that RAU can accelerate GPGPU computation by 1.15× and improve energy efficiency by 36% at only 10% accuracy loss.
The Internet of Things (IoT) dramatically increases the size of the input dataset for many applications, including multimedia. Unlike traditional computing environments, the workload of IoT varies significantly over time. Thus, to enable memory-based computation, the precomputed data should be profiled at runtime. In this paper, we propose an approximate computing technique using a low-cost adaptive associative memory, named ACAM, which utilizes runtime learning and profiling for diverse multimedia datasets. To recognize the temporal locality of data in real-world applications, our design exploits a reinforcement learning algorithm with an LRU strategy to select images to be profiled; the profiler is implemented using an approximate concurrent state machine. Since the selected images represent the observed input dataset, we can reduce a large amount of computational energy with a high hit rate in the associative memory. We evaluate ACAM on the recent AMD Southern Islands GPU architecture, and the experimental results show that the proposed design achieves 34.7% energy savings for image processing applications with an acceptable quality of service (PSNR>30dB).
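A minimal sketch of the LRU-profiled reuse idea, with illustrative names and a toy kernel; the reinforcement-learning selection and the approximate concurrent state machine of ACAM are not modeled here. A bounded table memorizes recently useful inputs and their results, hits skip recomputation, and the least recently used entry is evicted when the table is full.

```python
from collections import OrderedDict

class LRUReuseTable:
    """Bounded reuse table with least-recently-used eviction (illustrative
    stand-in for a runtime-profiled associative memory)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.table = OrderedDict()

    def get(self, key):
        if key in self.table:
            self.table.move_to_end(key)     # mark as recently used
            return self.table[key]
        return None

    def put(self, key, value):
        self.table[key] = value
        self.table.move_to_end(key)
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)  # evict the least recently used entry

def expensive_filter(pixel_block):
    # Placeholder for an image-processing kernel worth memoizing.
    return tuple(p // 2 for p in pixel_block)

cache = LRUReuseTable(capacity=2)
for block in [(10, 20), (30, 40), (10, 20), (50, 60), (30, 40)]:
    out = cache.get(block)
    if out is None:
        out = expensive_filter(block)       # miss: compute and remember
        cache.put(block, out)
    print(block, "->", out)
```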
Memory-based computing using associative memory is a promising way to reduce the energy consumption of important classes of streaming applications by avoiding redundant computations. A set of frequent patterns that represent basic functions are pre-stored in Ternary Content Addressable Memory (TCAM) and reused. The primary limitation to using associative memory in modern parallel processors is the large search energy required by TCAMs. In conventional TCAMs, all rows except hit rows precharge and discharge for every search operation, resulting in high energy consumption. In this paper, we propose a new Multiple-Access Single-Charge (MASC) TCAM architecture which is capable of searching TCAM contents multiple times with only a single precharge cycle. In contrast to previous designs, the MASC TCAM keeps the match-line voltage of all miss rows high and uses their charge for the next search operation, while only the hit rows discharge. We use periodic refresh to control the accuracy of the search. We also implement a new type of approximate associative memory by setting longer refresh times for MASC TCAMs, which yields search results within 1-2 bit Hamming distances of the exact value. To further decrease the energy consumption of MASC TCAM and reduce its area, we implement MASC with crossbar TCAMs. Our evaluation on the AMD Southern Islands GPU shows that using MASC (crossbar MASC) associative memory can improve the average floating point unit energy efficiency by 33.4%, 38.1%, and 36.7% (37.7%, 42.6%, and 43.1%) for exact matching, selective 1-HD, and 2-HD approximations respectively, while providing an acceptable quality of service (PSNR>30dB and average relative error<10%). This shows that MASC (crossbar MASC) can achieve 1.77X (1.93X) higher energy savings compared to the state-of-the-art implementation of GPGPU that uses voltage overscaling on TCAM.
STT-RAMs are good candidates to replace conventional SRAM caches and DRAM main memory, but their applicability in digital logic circuits is unclear. Our experiments explore the power benefit of utilizing non-volatile buffers in digital circuits, with a case study of Ripple Carry Adder and Carry-Skip Adder circuits. We design a low-overhead 2T1MTJ buffer and place it in the intermediate non-critical paths that hold data for long periods. Using NVMs in these paths allows us to turn off parts of the adder to save power. Our simulations show 22.4% and 10.8% power savings in 512-bit RCA and SCA structures respectively, with minimal area overheads.
Modern caches are designed to hold 64-bit-wide data; however, a significant proportion of data in caches continues to be narrow-width. In this paper, we propose a new cache architecture which increases the effective cache capacity by up to 2X for systems with narrow-width values, while also improving power efficiency, bandwidth, and reliability. The proposed double capacity cache (DCC) architecture uses fast and efficient peripheral circuitry to store two narrow-width values in a single wordline. To minimize the latency overhead in workloads without narrow-width data, flag bits are added to the tag store. The proposed DCC architecture decreases cache miss rate by 50%, which results in 27% performance improvement and 30% higher dynamic energy efficiency. To improve reliability, DCC modifies the data distribution on individual bits, which results in 20% and 25% average static-noise margin (SNM) improvement in L1 and L2 caches respectively.
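The packing rule can be sketched as follows, assuming 64-bit words split into 32-bit halves (the field widths are illustrative): a value is narrow if its upper half is pure sign extension, and two narrow values then share one physical line, with a flag bit (held in the tag store in DCC) recording the packed layout.

```python
def is_narrow(value, width=64, half=32):
    """A value is narrow if it fits in the lower half after sign extension,
    i.e. its upper bits are a pure sign extension of bit half-1."""
    v = value & ((1 << width) - 1)
    upper = v >> half
    sign = (v >> (half - 1)) & 1
    return upper == (0 if sign == 0 else (1 << (width - half)) - 1)

def pack(a, b, half=32):
    # Store two narrow values in one 64-bit line; the returned flag marks
    # the line as packed (kept alongside the tag in DCC).
    assert is_narrow(a) and is_narrow(b)
    mask = (1 << half) - 1
    return ((b & mask) << half) | (a & mask), True

def unpack(line, half=32):
    mask = (1 << half) - 1
    def sext(x):  # restore the sign of each half-width field
        return x - (1 << half) if x >> (half - 1) else x
    return sext(line & mask), sext((line >> half) & mask)

line, packed = pack(1234, -56)
print(packed, unpack(line))   # -> True (1234, -56)
```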
The Internet of Things (IoT) significantly increases the volume of computation and the number of applications running on processors, from mobiles to servers. Big data computation requires massive parallel processing and acceleration. In parallel processing, associative memories represent a promising solution to improve energy efficiency by eliminating redundant computations. However, the tradeoff between memory size and search energy consumption limits their applications. In this paper, we propose a novel low-energy Resistive Multi-stage Associative Memory (ReMAM) architecture, which significantly reduces the search energy consumption by employing selective row activation and in-advance precharging techniques. ReMAM splits the search in the Ternary Content Addressable Memory (TCAM) into a number of shorter searches in consecutive stages. It then selectively activates TCAM rows at each stage based on the hits of previous stages, thus enabling energy saving. The proposed in-advance precharging technique mitigates the delay of the sequential TCAM search and limits the number of precharges to two low-cost steps. Our experimental evaluation on AMD Southern Islands GPUs shows that ReMAM reduces energy consumption by 38.2% on average, which is 1.62X more than a GPGPU using conventional single-stage associative memory.
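A simplified software model of the multi-stage search (the stage width and data are illustrative, and precharging is not modeled): each stage compares only a slice of the bits, and only rows that matched every previous stage stay active for the next one, so most rows are examined over a fraction of their bits.

```python
def multistage_search(rows, query, stage_bits=8):
    """Search for exact matches by comparing stage_bits at a time and only
    keeping (activating) rows that matched all previous stages."""
    width = max(len(r) for r in rows)
    active = list(range(len(rows)))      # initially every row is a candidate
    compared_bits = 0
    for start in range(0, width, stage_bits):
        seg = slice(start, start + stage_bits)
        compared_bits += len(active) * stage_bits
        active = [i for i in active if rows[i][seg] == query[seg]]
        if not active:
            break
    return active, compared_bits

rows = ["1100101011110000", "1100101011111111", "0000111100001111"]
query = "1100101011111111"
hits, work = multistage_search(rows, query)
full_work = len(rows) * len(query)       # bits a single-stage search would compare
print(hits, work, full_work)             # fewer bit comparisons than single-stage
```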
Modern computing machines are increasingly characterized by large-scale parallelism in hardware (such as GP-GPUs) and the advent of large-scale and innovative memory blocks. Parallelism enables expanded performance tradeoffs, whereas memories enable reuse of computational work. To be effective, however, one needs to ensure energy efficiency with minimal reuse overheads. In this paper, we describe a resistive configurable associative memory (ReCAM) that enables selective approximation and asymmetric voltage overscaling to manage delivered efficiency. The ReCAM structure matches an input pattern with pre-stored ones by applying an approximate search on selected bit indices (bitline-configurable) or on selected pre-stored patterns (row-configurable). To further reduce energy, we explore proper ReCAM sizing, various configurable search operations with low-overhead voltage overscaling, and different ReCAM update policies. Experimental results on the AMD Southern Islands GPUs for eight applications show that bitline-configurable and row-configurable ReCAM achieve on average 43.6% and 44.5% energy savings with an acceptable quality loss of 10%.
Static Random Access Memories (SRAMs) occupy a large area of today's microprocessors and are a prime source of leakage power in highly scaled technologies. Low-leakage and high-density Spin-Transfer Torque RAMs (STT-RAMs) are ideal candidates for a power-efficient memory. However, STT-RAM suffers from high write energy and latency, especially when writing 'one' data. In this paper, we propose a novel data-aware hybrid STT-RAM/SRAM cache architecture which stores data in the two partitions based on their bit counts. To exploit the resulting data distribution in the SRAM partition, we employ an asymmetric low-power 5T-SRAM structure which has high reliability for majority-'one' data. The proposed design significantly reduces the number of writes, and hence the dynamic energy, in both the STT-RAM and SRAM partitions. We employ a write cache policy and a small swap memory to control data migration between cache partitions. Our evaluation on the UltraSPARC-III processor shows that utilizing STT-RAM/6T-SRAM and STT-RAM/5T-SRAM architectures for the L2 cache results in 42% and 53% energy efficiency, 9.3% and 9.1% performance improvement, and 16.9% and 20.3% area efficiency respectively, with respect to an SRAM-based cache running SPEC CPU 2006 benchmarks.
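A hedged sketch of the placement rule (the threshold and word width are illustrative assumptions): count the ones in each cache line and route one-heavy lines to the 5T-SRAM partition, where 'one'-dominated data is cheap and reliable, while zero-heavy lines go to STT-RAM, avoiding its expensive 'one' writes.

```python
def popcount(word):
    # Number of set bits in one word of the cache line.
    return bin(word).count("1")

def choose_partition(line_words, word_bits=64, one_heavy_threshold=0.5):
    """Route a cache line to the STT-RAM or SRAM partition based on its bit
    content (illustrative rule: one-heavy lines go to the 5T-SRAM partition,
    since writing ones is expensive in STT-RAM)."""
    total_bits = len(line_words) * word_bits
    ones = sum(popcount(w) for w in line_words)
    return "SRAM" if ones / total_bits >= one_heavy_threshold else "STT-RAM"

print(choose_partition([0xFFFFFFFFFFFFFFFF, 0xFFFF0000FFFF0000]))  # one-heavy  -> SRAM
print(choose_partition([0x0000000000000001, 0x0000000000000010]))  # zero-heavy -> STT-RAM
```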