Memory-Centric Reconfigurable Accelerator for Classification and Machine Learning Applications

Published: 01 May 2017

Abstract

Big Data refers to the growing challenge of turning massive, often unstructured datasets into meaningful, organized, and actionable data. As datasets grow from petabytes to exabytes and beyond, it becomes increasingly difficult to run advanced analytics, especially Machine Learning (ML) applications, in a reasonable time and on a practical power budget using traditional architectures. Previous work has focused on accelerating analytics readily implemented as SQL queries on data-parallel platforms, generally using off-the-shelf CPUs and General Purpose Graphics Processing Units (GPGPUs) for computation or acceleration. However, these systems are general-purpose and still require a vast amount of data transfer between the storage devices and computing elements, thus limiting the system efficiency. As an alternative, this article presents a reconfigurable memory-centric advanced analytics accelerator that operates at the last level of memory and dramatically reduces energy required for data transfer. We functionally validate the framework using an FPGA-based hardware emulation platform and three representative applications: Naïve Bayesian Classification, Convolutional Neural Networks, and k-Means Clustering. Results are compared with implementations on a modern CPU and workstation GPGPU. Finally, the use of in-memory dataset decompression to further reduce data transfer volume is investigated. With these techniques, the system achieves an average energy efficiency improvement of 74× and 212× over GPU and single-threaded CPU, respectively, while dataset compression is shown to improve overall efficiency by an additional 1.8× on average.
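
Of the three validation workloads, k-Means Clustering is the simplest to sketch. The Python reference below shows the kernel (Lloyd's algorithm) that such an accelerator would offload; it is an illustrative host-side baseline only, not the authors' accelerator mapping, and the dataset shape, cluster count, and iteration budget are arbitrary assumptions.

    # Minimal host-side reference of the k-means kernel (Lloyd's algorithm).
    # Illustrative sketch only: it does not reproduce the article's accelerator
    # mapping, and the data set and parameters below are assumptions.
    import numpy as np

    def kmeans(points, k, iters=20, seed=0):
        rng = np.random.default_rng(seed)
        # Initialize centroids by sampling k distinct input points.
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iters):
            # Assignment step: label each point with its nearest centroid.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: move each centroid to the mean of its members.
            for c in range(k):
                members = points[labels == c]
                if len(members):
                    centroids[c] = members.mean(axis=0)
        return centroids, labels

    if __name__ == "__main__":
        data = np.random.default_rng(1).normal(size=(1000, 2))
        centers, assignment = kmeans(data, k=4)
        print(centers)

The assignment step is embarrassingly parallel across points, which is what makes the kernel attractive for near-memory execution: each distance computation touches only one data point plus the small centroid table.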


Published In

ACM Journal on Emerging Technologies in Computing Systems, Volume 13, Issue 3
Special Issue on Hardware and Algorithms for Learning On-a-chip and Special Issue on Alternative Computing Systems
July 2017
418 pages
ISSN:1550-4832
EISSN:1550-4840
DOI:10.1145/3051701
Editor: Yuan Xie
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Publication History

Published: 01 May 2017
Accepted: 01 September 2016
Revised: 01 July 2016
Received: 01 March 2016
Published in JETC Volume 13, Issue 3

Author Tags

  1. Reconfigurable architectures
  2. energy-efficiency
  3. hardware accelerators
  4. machine learning
  5. memory-centric
  6. parallel processing

Qualifiers

  • Research-article
  • Research
  • Refereed

Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months): 20
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 03 Oct 2024

Citations

Cited By

  • (2023) Analysis of Verilog-based improvements to the memory transfer. Journal of Physics: Conference Series 2649:1 (012056). DOI: 10.1088/1742-6596/2649/1/012056. Online publication date: 1-Nov-2023.
  • (2020) Analog Memristive CAMs for Area- and Energy-Efficient Reconfigurable Computing. IEEE Transactions on Circuits and Systems II: Express Briefs 67:5 (856-860). DOI: 10.1109/TCSII.2020.2983005. Online publication date: May-2020.
  • (2020) Research on Power System Performance Evaluation Based on Machine Learning Technology. IOP Conference Series: Materials Science and Engineering 782 (032011). DOI: 10.1088/1757-899X/782/3/032011. Online publication date: 15-Apr-2020.
  • (2019) Survey on memory management techniques in heterogeneous computing systems. IET Computers & Digital Techniques. DOI: 10.1049/iet-cdt.2019.0092. Online publication date: 19-Dec-2019.
