DOI: 10.1145/2830772.2830788
Enabling portable energy efficiency with memory accelerated library

Published: 05 December 2015

Abstract

Over the last decade, the looming power wall has spurred a flurry of interest in heterogeneous systems with hardware accelerators. The questions, then, are which accelerators to build, how to design them, and what software support they require. Our accelerator design approach stems from the observation that many efficient and portable software implementations rely on high-performance software libraries with well-established application programming interfaces (APIs). We propose integrating hardware accelerators on 3D-stacked memory that explicitly target the memory-bound operations within high-performance libraries. The fixed APIs with limited configurability simplify the design of the accelerators while ensuring that the accelerators have wide applicability. With our software support that automatically converts library API calls into accelerator invocations, an additional advantage of our approach is that library-based legacy code gains the benefit of memory-side accelerators without requiring reimplementation. On average, legacy code using our proposed MEmory Accelerated Library (MEALib) improves performance and energy efficiency for individual operations in Intel's Math Kernel Library (MKL) by 38x and 75x, respectively. For a real-world signal processing application that employs Intel MKL, MEALib attains more than 10x better energy efficiency.
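The transparent-dispatch idea described above can be sketched as a drop-in wrapper that keeps the library's fixed API while deciding, per call, whether to invoke a memory-side accelerator or fall back to the host routine. This is a minimal illustration only: the names (`ACCEL_AVAILABLE`, `OFFLOAD_THRESHOLD`, the `daxpy` stand-ins) are hypothetical and not taken from the paper, which operates on real MKL entry points rather than Python functions.

```python
# Hypothetical sketch of MEALib-style dispatch: the wrapper exposes the same
# API as the library routine, so legacy callers need no source changes.
# All names and the threshold heuristic below are illustrative assumptions.

ACCEL_AVAILABLE = True        # in a real system, probed at library load time
OFFLOAD_THRESHOLD = 1 << 12   # offload only large, memory-bound operations

def accel_daxpy(alpha, x, y):
    # Stand-in for an accelerator invocation on 3D-stacked memory.
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def host_daxpy(alpha, x, y):
    # Stand-in for the original host library routine (e.g., MKL's daxpy).
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def daxpy(alpha, x, y):
    """Drop-in replacement with the library's API: y <- alpha*x + y.

    Small calls stay on the host, where offload overhead would dominate;
    large memory-bound calls are routed to the memory-side accelerator.
    """
    if ACCEL_AVAILABLE and len(x) >= OFFLOAD_THRESHOLD:
        return accel_daxpy(alpha, x, y)
    return host_daxpy(alpha, x, y)

small = daxpy(2.0, [1.0, 2.0], [3.0, 4.0])   # below threshold: host path
```

In a compiled setting the same effect is typically achieved by symbol interposition at link or load time, which is what lets unmodified binaries pick up the accelerated path.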


Published In

MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture
December 2015
787 pages
ISBN:9781450340342
DOI:10.1145/2830772

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. 3D DRAM
  2. accelerator
  3. energy efficiency
  4. library

Qualifiers

  • Research-article

Conference

MICRO-48
Acceptance Rates

MICRO-48 paper acceptance rate: 61 of 283 submissions, 22%.
Overall acceptance rate: 484 of 2,242 submissions, 22%.

Cited By

  • (2022) "SAPIVe: Simple AVX to PIM Vectorizer," 2022 XII Brazilian Symposium on Computing Systems Engineering (SBESC), pp. 1-8, 21 Nov 2022. DOI: 10.1109/SBESC56799.2022.9964539
  • (2020) "Decentralized Offload-based Execution on Memory-centric Compute Cores," Proceedings of the International Symposium on Memory Systems, pp. 61-76, 28 Sep 2020. DOI: 10.1145/3422575.3422778
  • (2019) "Near-memory computing," Microprocessors & Microsystems, vol. 71, 1 Nov 2019. DOI: 10.1016/j.micpro.2019.102868
  • (2017) "DRISA," Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 288-301, 14 Oct 2017. DOI: 10.1145/3123939.3123977
  • (2016) "Making the Case for Feature-Rich Memory Systems: The March Toward Specialized Systems," IEEE Solid-State Circuits Magazine, 8(2):57-65, Sep 2017. DOI: 10.1109/MSSC.2016.2546198
