DOI: 10.1145/3357526.3357571
research-article

Towards a scatter-gather architecture: hardware and software issues

Published: 30 September 2019

Abstract

The on-node performance of high-performance computing (HPC) applications is traditionally dominated by memory operations. Put simply, memory is what these applications "do." Unfortunately, they don't do it well. Caches, our first line of attack in the battle for memory performance, often throw away most of the data they fetch before using it. Processor cores, one of our most expensive resources, spend an inordinate amount of time performing simple address computations. Addressing these issues will require new approaches to how on-chip memory is organized and how memory operations are performed. Under Project 38, a joint Department of Energy / Department of Defense architectural research project, we have focused on exploring what a flexible in-memory scatter-gather architecture could look like in the context of several important HPC applications.

    Published In

    MEMSYS '19: Proceedings of the International Symposium on Memory Systems
    September 2019
    517 pages
    ISBN:9781450372060
    DOI:10.1145/3357526
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. memory acceleration
    2. scatter gather

    Qualifiers

    • Research-article

    Conference

    MEMSYS '19
    MEMSYS '19: The International Symposium on Memory Systems
    September 30 - October 3, 2019
    Washington, District of Columbia, USA
