DOI: 10.1145/2749469.2750397
research-article

Data reorganization in memory using 3D-stacked DRAM

Published: 13 June 2015

Abstract

In this paper we focus on common data reorganization operations such as shuffle, pack/unpack, swap, transpose, and layout transformations. Although these operations simply relocate data in memory, they are costly on conventional systems, mainly due to inefficient access patterns, limited data reuse, and round-trip data traversal throughout the memory hierarchy. This paper presents a two-pronged approach for efficient data reorganization, which combines (i) a proposed DRAM-aware reshape accelerator integrated within 3D-stacked DRAM, and (ii) a mathematical framework that is used to represent and optimize the reorganization operations.
We evaluate our proposed system through two major use cases. First, we demonstrate the reshape accelerator performing a physical address remapping via a data layout transform, so that general-purpose workloads use the internal parallelism and locality of the 3D-stacked DRAM structure more efficiently. Then, we focus on offloading and accelerating commonly used data reorganization routines selected from the Intel Math Kernel Library package. We evaluate the energy and performance benefits of our approach by comparing it against existing optimized implementations on state-of-the-art GPUs and CPUs. For the various test cases, in-memory data reorganization provides orders-of-magnitude improvements in performance and energy efficiency via low-overhead hardware.
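As a rough illustration of why these operations "simply relocate" data, the sketch below treats a transpose and a row-major-to-tiled layout transform as index permutations over a flat array. It is not the paper's mathematical framework or accelerator design; the function names (`transpose_index`, `tiled_index`, `reorganize`) and the 4x4 matrix with 2x2 tiles are assumptions chosen only for this example.

```python
def transpose_index(i: int, rows: int, cols: int) -> int:
    """Destination of flat index i when a row-major rows x cols matrix is transposed."""
    r, c = divmod(i, cols)
    return c * rows + r


def tiled_index(i: int, rows: int, cols: int, tile: int) -> int:
    """Destination of flat index i when a row-major matrix is stored tile by tile.

    Assumes tile divides both rows and cols. Tiles are ordered row-major,
    and elements inside each tile are row-major as well.
    """
    r, c = divmod(i, cols)
    tiles_per_row = cols // tile
    tile_id = (r // tile) * tiles_per_row + (c // tile)
    offset = (r % tile) * tile + (c % tile)
    return tile_id * tile * tile + offset


def reorganize(data, index_map):
    """Apply a permutation: the element at source index i moves to index_map(i)."""
    out = [None] * len(data)
    for i, value in enumerate(data):
        out[index_map(i)] = value
    return out


rows, cols, tile = 4, 4, 2
matrix = list(range(rows * cols))  # row-major 4x4 matrix holding 0..15

transposed = reorganize(matrix, lambda i: transpose_index(i, rows, cols))
tiled = reorganize(matrix, lambda i: tiled_index(i, rows, cols, tile))

print(transposed)
print(tiled)
```

No values are computed or changed; every element is only moved. In the transpose, consecutive output elements come from source positions `cols` apart, which is the kind of strided access pattern the abstract describes as inefficient on conventional memory hierarchies.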

      Published In

      ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
      June 2015
      768 pages
      ISBN:9781450334020
      DOI:10.1145/2749469
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Research-article

      Conference

      ISCA '15
      Acceptance Rates

      Overall Acceptance Rate 543 of 3,203 submissions, 17%

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (last 12 months): 127
      • Downloads (last 6 weeks): 11
      Reflects downloads up to 10 Nov 2024

      Citations

      Cited By

      • (2024) A Mapping of Triangular Block Interleavers to DRAM for Optical Satellite Communication. 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1-2. DOI: 10.23919/DATE58400.2024.10546787. Online publication date: 25-Mar-2024.
      • (2024) H3DM: A High-bandwidth High-capacity Hybrid 3D Memory Design for GPUs. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 8:1, pp. 1-28. DOI: 10.1145/3639038. Online publication date: 21-Feb-2024.
      • (2023) HIE-DRAM: High Performance Efficient In-DRAM Computing Architecture for SIMD. 2023 24th International Symposium on Quality Electronic Design (ISQED), pp. 1-7. DOI: 10.1109/ISQED57927.2023.10129370. Online publication date: 5-Apr-2023.
      • (2023) Evaluating Machine Learning Workloads on Memory-Centric Computing Systems. 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 35-49. DOI: 10.1109/ISPASS57527.2023.00013. Online publication date: Apr-2023.
      • (2023) Casper: Accelerating Stencil Computations Using Near-Cache Processing. IEEE Access, 11, pp. 22136-22154. DOI: 10.1109/ACCESS.2023.3252002. Online publication date: 2023.
      • (2022) Software Systems Implementation and Domain-Specific Architectures towards Graph Analytics. Intelligent Computing, 2022. DOI: 10.34133/2022/9806758. Online publication date: 29-Oct-2022.
      • (2022) PiDRAM: A Holistic End-to-end FPGA-based Framework for Processing-in-DRAM. ACM Transactions on Architecture and Code Optimization. DOI: 10.1145/3563697. Online publication date: 14-Sep-2022.
      • (2022) Accelerating Weather Prediction using Near-Memory Reconfigurable Fabric. ACM Transactions on Reconfigurable Technology and Systems. DOI: 10.1145/3501804. Online publication date: 9-Feb-2022.
      • (2022) täkō. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 42-58. DOI: 10.1145/3470496.3527379. Online publication date: 18-Jun-2022.
      • (2022) ADC-PIM: Accelerating Convolution on the GPU via In-Memory Approximate Data Comparison. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 12:2, pp. 458-471. DOI: 10.1109/JETCAS.2022.3167391. Online publication date: Jun-2022.
