research-article

Data reorganization in memory using 3D-stacked DRAM

Authors:

Berkin Akin,

Franz Franchetti,

James C. Hoe

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

Pages 131–143

https://doi.org/10.1145/2749469.2750397

Published: 13 June 2015

Abstract

In this paper we focus on common data reorganization operations such as shuffle, pack/unpack, swap, transpose, and layout transformations. Although these operations simply relocate data in memory, they are costly on conventional systems, mainly because of inefficient access patterns, limited data reuse, and round-trip data traversal through the memory hierarchy. This paper presents a two-pronged approach to efficient data reorganization that combines (i) a DRAM-aware reshape accelerator integrated within 3D-stacked DRAM and (ii) a mathematical framework used to represent and optimize the reorganization operations.
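The operations listed above move elements without changing their values, so each one can be written as an index permutation. The Python sketch below illustrates this for two of them: a matrix transpose (a stride permutation) and a row-major to blocked (tiled) layout transform. It is a generic illustration under stated assumptions, not the paper's mathematical framework or accelerator; the function names `stride_permutation`, `row_major_to_blocked`, and `apply_permutation` are hypothetical, and the block size is assumed to divide both matrix dimensions.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def stride_permutation(m: int, n: int) -> List[int]:
    """Gather indices that transpose an m x n row-major matrix.

    out[k] = inp[perm[k]] yields the n x m transpose, also stored row-major.
    """
    perm = []
    for j in range(n):          # row j of the transposed matrix
        for i in range(m):      # column i of the transposed matrix
            perm.append(i * n + j)
    return perm


def row_major_to_blocked(m: int, n: int, b: int) -> List[int]:
    """Gather indices that store an m x n row-major matrix as b x b tiles.

    Tiles are laid out in row-major order, and so are the elements inside
    each tile. Assumes (hypothetically) that b divides both m and n.
    """
    if m % b or n % b:
        raise ValueError("block size must divide both dimensions")
    perm = []
    for bi in range(m // b):            # tile row
        for bj in range(n // b):        # tile column
            for i in range(b):          # row inside the tile
                for j in range(b):      # column inside the tile
                    perm.append((bi * b + i) * n + (bj * b + j))
    return perm


def apply_permutation(data: Sequence[T], perm: Sequence[int]) -> List[T]:
    """Pure data movement: each output slot copies exactly one input slot."""
    return [data[p] for p in perm]


a = list(range(6))  # a 2 x 3 matrix in row-major order
print(apply_permutation(a, stride_permutation(2, 3)))  # [0, 3, 1, 4, 2, 5]
print(apply_permutation(list(range(16)), row_major_to_blocked(4, 4, 2)))
# [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

Since `apply_permutation` performs no arithmetic on the elements, its cost on real hardware is dominated by the access pattern that `perm` induces (for the transpose, a stride-`n` gather), which is the kind of inefficient access pattern the abstract identifies.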

We evaluate the proposed system through two major use cases. First, we use the reshape accelerator to perform a physical address remapping via a data layout transform, so that general-purpose workloads exploit the internal parallelism and locality of the 3D-stacked DRAM structure more efficiently. Second, we offload and accelerate commonly used data reorganization routines selected from the Intel Math Kernel Library (MKL). We evaluate the energy and performance benefits of our approach by comparing it against existing optimized implementations on state-of-the-art GPUs and CPUs. Across the test cases, in-memory data reorganization improves performance and energy efficiency by orders of magnitude using low-overhead hardware.
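A physical address mapping decides which DRAM vault, bank, row, and column each physical address lands in. The sketch below assumes a hypothetical geometry (the bit widths `COL_BITS`, `VAULT_BITS`, `BANK_BITS`, `ROW_BITS` are not taken from the paper) and a hypothetical baseline mapping, and uses an XOR-based remap (permutation-based interleaving) as a stand-in; the abstract does not state which remapping the paper applies. It shows how an access stream with a large power-of-two stride, which the baseline mapping concentrates in a single vault, is spread across all vaults after remapping.

```python
from typing import Iterable, List, NamedTuple, Tuple

# Hypothetical 3D-stacked DRAM geometry (field widths in bits), for illustration only.
COL_BITS, VAULT_BITS, BANK_BITS, ROW_BITS = 8, 4, 3, 14


class DramCoord(NamedTuple):
    vault: int
    bank: int
    row: int
    col: int


def _field(addr: int, shift: int, bits: int) -> int:
    """Extract `bits` bits of `addr` starting at bit position `shift`."""
    return (addr >> shift) & ((1 << bits) - 1)


def decode(addr: int) -> DramCoord:
    """Baseline mapping, from low to high address bits: col | vault | bank | row."""
    col = _field(addr, 0, COL_BITS)
    vault = _field(addr, COL_BITS, VAULT_BITS)
    bank = _field(addr, COL_BITS + VAULT_BITS, BANK_BITS)
    row = _field(addr, COL_BITS + VAULT_BITS + BANK_BITS, ROW_BITS)
    return DramCoord(vault, bank, row, col)


def decode_remapped(addr: int) -> DramCoord:
    """XOR remap: vault index XOR the low row bits.

    The row field is unchanged, so the XOR can be undone and the mapping
    stays one-to-one over the modeled address space.
    """
    c = decode(addr)
    return c._replace(vault=c.vault ^ (c.row & ((1 << VAULT_BITS) - 1)))


def relocation_plan(addrs: Iterable[int]) -> List[Tuple[DramCoord, DramCoord]]:
    """Hypothetical helper: old and new DRAM location of each address's data."""
    return [(decode(a), decode_remapped(a)) for a in addrs]


# Stride of one full row across all vaults and banks.
ROW_STRIDE = 1 << (COL_BITS + VAULT_BITS + BANK_BITS)
addrs = [k * ROW_STRIDE for k in range(16)]
print(len({decode(a).vault for a in addrs}))           # 1: every access hits one vault
print(len({decode_remapped(a).vault for a in addrs}))  # 16: accesses spread over all vaults
```

Because a remap changes where existing addresses live, data already in memory has to be moved so that every address keeps its contents; per the abstract, performing that relocation in memory via a data layout transform is the job of the reshape accelerator. The pairs returned by `relocation_plan` list those moves for this toy mapping.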

Cited By

L. Steiner, T. Lehnigk-Emden, M. Fehrenz, and N. Wehn, "A Mapping of Triangular Block Interleavers to DRAM for Optical Satellite Communication," in 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2024, pp. 1-2. https://doi.org/10.23919/DATE58400.2024.10546787
N. Akbarzadeh, S. Darabi, A. Gheibi-Fetrat, A. Mirzaei, M. Sadrosadati, and H. Sarbazi-Azad, "H3DM: A High-bandwidth High-capacity Hybrid 3D Memory Design for GPUs," Proceedings of the ACM on Measurement and Analysis of Computing Systems, vol. 8, no. 1, pp. 1-28, 2024. https://dl.acm.org/doi/10.1145/3639038
B. Schwedock and N. Beckmann, "Leviathan: A Unified System for General-Purpose Near-Data Computing," in 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024, pp. 1278-1294. https://doi.org/10.1109/MICRO61859.2024.00095

Index Terms

Data reorganization in memory using 3D-stacked DRAM
1. Hardware
  1. Hardware validation
  2. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Information

Published In

ACM Conferences

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

June 2015

768 pages

ISBN:9781450334020

DOI:10.1145/2749469

General Chair: Debbie Marr, Intel
Program Chair: David Albonesi, Cornell

ACM SIGARCH Computer Architecture News Volume 43, Issue 3S
ISCA'15
June 2015
745 pages
ISSN:0163-5964
DOI:10.1145/2872887
Editor: Doug DeGroot

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Defense Advanced Research Projects Agency

Conference

ISCA '15

Sponsor:

IEEE TCCA
SIGARCH

ISCA '15: The 42nd Annual International Symposium on Computer Architecture

June 13 - 17, 2015

Portland, Oregon

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%
