Memory-Centric Reconfigurable Accelerator for Classification and Machine Learning Applications

Published: 01 May 2017

Abstract

Big Data refers to the growing challenge of turning massive, often unstructured datasets into meaningful, organized, and actionable data. As datasets grow from petabytes to exabytes and beyond, it becomes increasingly difficult to run advanced analytics, especially Machine Learning (ML) applications, in a reasonable time and on a practical power budget using traditional architectures. Previous work has focused on accelerating analytics readily implemented as SQL queries on data-parallel platforms, generally using off-the-shelf CPUs and General Purpose Graphics Processing Units (GPGPUs) for computation or acceleration. However, these systems are general-purpose and still require a vast amount of data transfer between the storage devices and computing elements, thus limiting the system efficiency. As an alternative, this article presents a reconfigurable memory-centric advanced analytics accelerator that operates at the last level of memory and dramatically reduces energy required for data transfer. We functionally validate the framework using an FPGA-based hardware emulation platform and three representative applications: Naïve Bayesian Classification, Convolutional Neural Networks, and k-Means Clustering. Results are compared with implementations on a modern CPU and workstation GPGPU. Finally, the use of in-memory dataset decompression to further reduce data transfer volume is investigated. With these techniques, the system achieves an average energy efficiency improvement of 74× and 212× over GPU and single-threaded CPU, respectively, while dataset compression is shown to improve overall efficiency by an additional 1.8× on average.
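
Of the three validation workloads, k-Means Clustering is the simplest to sketch. The Python reference below shows the kernel (Lloyd's algorithm) that such an accelerator would offload; it is an illustrative host-side baseline only, not the authors' accelerator mapping, and the dataset shape, cluster count, and iteration budget are arbitrary assumptions.

    # Minimal host-side reference of the k-means kernel (Lloyd's algorithm).
    # Illustrative sketch only: it does not reproduce the article's accelerator
    # mapping, and the data set and parameters below are assumptions.
    import numpy as np

    def kmeans(points, k, iters=20, seed=0):
        rng = np.random.default_rng(seed)
        # Initialize centroids by sampling k distinct input points.
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iters):
            # Assignment step: label each point with its nearest centroid.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: move each centroid to the mean of its members.
            for c in range(k):
                members = points[labels == c]
                if len(members):
                    centroids[c] = members.mean(axis=0)
        return centroids, labels

    if __name__ == "__main__":
        data = np.random.default_rng(1).normal(size=(1000, 2))
        centers, assignment = kmeans(data, k=4)
        print(centers)

The assignment step is embarrassingly parallel across points, which is what makes the kernel attractive for near-memory execution: each distance computation touches only one data point plus the small centroid table.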


Published In

ACM Journal on Emerging Technologies in Computing Systems, Volume 13, Issue 3
Special Issue on Hardware and Algorithms for Learning On-a-chip and Special Issue on Alternative Computing Systems
July 2017
418 pages
ISSN:1550-4832
EISSN:1550-4840
DOI:10.1145/3051701
Editor: Yuan Xie
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Publication History

Published: 01 May 2017
Accepted: 01 September 2016
Revised: 01 July 2016
Received: 01 March 2016
Published in JETC Volume 13, Issue 3

Author Tags

  1. Reconfigurable architectures
  2. energy-efficiency
  3. hardware accelerators
  4. machine learning
  5. memory-centric
  6. parallel processing

Qualifiers

  • Research-article
  • Research
  • Refereed

Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months): 20
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 03 Oct 2024

Citations

Cited By

  • (2023) Analysis of Verilog-based improvements to the memory transfer. Journal of Physics: Conference Series 2649:1 (012056). DOI: 10.1088/1742-6596/2649/1/012056. Online publication date: 1-Nov-2023.
  • (2020) Analog Memristive CAMs for Area- and Energy-Efficient Reconfigurable Computing. IEEE Transactions on Circuits and Systems II: Express Briefs 67:5 (856-860). DOI: 10.1109/TCSII.2020.2983005. Online publication date: May-2020.
  • (2020) Research on Power System Performance Evaluation Based on Machine Learning Technology. IOP Conference Series: Materials Science and Engineering 782 (032011). DOI: 10.1088/1757-899X/782/3/032011. Online publication date: 15-Apr-2020.
  • (2019) Survey on memory management techniques in heterogeneous computing systems. IET Computers & Digital Techniques. DOI: 10.1049/iet-cdt.2019.0092. Online publication date: 19-Dec-2019.
