research-article

Towards a scatter-gather architecture: hardware and software issues

Authors:

Arun Rodrigues,

Gwendolyn Voskuilen

MEMSYS '19: Proceedings of the International Symposium on Memory Systems

Pages 261 - 271

https://doi.org/10.1145/3357526.3357571

Published: 30 September 2019

Abstract

The on-node performance of high-performance computing (HPC) applications is traditionally dominated by memory operations. Put simply, memory is what these applications "do." Unfortunately, they don't do it well. Caches, our first line of attack in the battle for memory performance, often throw away most of the data they fetch before using it. Processor cores, one of our most expensive resources, spend an inordinate amount of time performing simple address computations. Addressing these issues will require new approaches to how on-chip memory is organized and how memory operations are performed. Under Project 38, a joint Department of Energy / Department of Defense architectural research project, we have focused on exploring what a flexible in-memory scatter-gather architecture could look like in the context of several important HPC applications.

Index Terms

Hardware
  Emerging technologies
    Memory and dense storage

Published In

MEMSYS '19: Proceedings of the International Symposium on Memory Systems

September 2019

517 pages

ISBN: 9781450372060

DOI: 10.1145/3357526

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

MEMSYS '19

MEMSYS '19: The International Symposium on Memory Systems

September 30 - October 3, 2019

Washington, District of Columbia, USA
