DOI: 10.1145/2830772.2830788
Enabling portable energy efficiency with memory accelerated library

Published: 05 December 2015

Abstract

Over the last decade, the looming power wall has spurred a flurry of interest in heterogeneous systems with hardware accelerators. The questions, then, are which accelerators to build, how to design them, and what software support they require. Our accelerator design approach stems from the observation that many efficient and portable software implementations rely on high-performance software libraries with well-established application programming interfaces (APIs). We propose integrating hardware accelerators on 3D-stacked memory that explicitly target the memory-bound operations within high-performance libraries. The fixed APIs with limited configurability simplify the design of the accelerators while ensuring that the accelerators have wide applicability. With our software support that automatically converts library API calls into accelerator invocations, an additional advantage of our approach is that library-based legacy code gains the benefit of memory-side accelerators without requiring reimplementation. On average, legacy code using our proposed MEmory Accelerated Library (MEALib) improves performance and energy efficiency for individual operations in Intel's Math Kernel Library (MKL) by 38x and 75x, respectively. For a real-world signal processing application that employs Intel MKL, MEALib attains more than 10x better energy efficiency.
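The transparent-dispatch idea described above can be sketched as a drop-in wrapper that keeps the library's fixed API while deciding, per call, whether to invoke a memory-side accelerator or fall back to the host routine. This is a minimal illustration only: the names (`ACCEL_AVAILABLE`, `OFFLOAD_THRESHOLD`, the `daxpy` stand-ins) are hypothetical and not taken from the paper, which operates on real MKL entry points rather than Python functions.

```python
# Hypothetical sketch of MEALib-style dispatch: the wrapper exposes the same
# API as the library routine, so legacy callers need no source changes.
# All names and the threshold heuristic below are illustrative assumptions.

ACCEL_AVAILABLE = True        # in a real system, probed at library load time
OFFLOAD_THRESHOLD = 1 << 12   # offload only large, memory-bound operations

def accel_daxpy(alpha, x, y):
    # Stand-in for an accelerator invocation on 3D-stacked memory.
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def host_daxpy(alpha, x, y):
    # Stand-in for the original host library routine (e.g., MKL's daxpy).
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def daxpy(alpha, x, y):
    """Drop-in replacement with the library's API: y <- alpha*x + y.

    Small calls stay on the host, where offload overhead would dominate;
    large memory-bound calls are routed to the memory-side accelerator.
    """
    if ACCEL_AVAILABLE and len(x) >= OFFLOAD_THRESHOLD:
        return accel_daxpy(alpha, x, y)
    return host_daxpy(alpha, x, y)

small = daxpy(2.0, [1.0, 2.0], [3.0, 4.0])   # below threshold: host path
```

In a compiled setting the same effect is typically achieved by symbol interposition at link or load time, which is what lets unmodified binaries pick up the accelerated path.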


Published In

MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture
December 2015
787 pages
ISBN:9781450340342
DOI:10.1145/2830772

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. 3D DRAM
  2. accelerator
  3. energy efficiency
  4. library

Qualifiers

  • Research-article

Conference

MICRO-48
Acceptance Rates

MICRO-48 paper acceptance rate: 61 of 283 submissions, 22%.
Overall acceptance rate: 484 of 2,242 submissions, 22%.

Cited By

  • (2022) "SAPIVe: Simple AVX to PIM Vectorizer," 2022 XII Brazilian Symposium on Computing Systems Engineering (SBESC), pp. 1-8, 21 Nov 2022. DOI: 10.1109/SBESC56799.2022.9964539
  • (2020) "Decentralized Offload-based Execution on Memory-centric Compute Cores," Proceedings of the International Symposium on Memory Systems, pp. 61-76, 28 Sep 2020. DOI: 10.1145/3422575.3422778
  • (2019) "Near-memory computing," Microprocessors & Microsystems, vol. 71, 1 Nov 2019. DOI: 10.1016/j.micpro.2019.102868
  • (2017) "DRISA," Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 288-301, 14 Oct 2017. DOI: 10.1145/3123939.3123977
  • (2016) "Making the Case for Feature-Rich Memory Systems: The March Toward Specialized Systems," IEEE Solid-State Circuits Magazine, 8(2):57-65, Sep 2017. DOI: 10.1109/MSSC.2016.2546198
