DOI: 10.1145/1274971.1275004
Article

Active memory operations

Published: 17 June 2007
  • Abstract

    The performance of modern microprocessors is increasingly limited by their inability to hide main memory latency. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose the use of Active Memory Operations (AMOs), in which select operations can be sent to and executed on the home memory controller of the data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips.
    In this paper we present architectural and programming models for AMOs, and compare their performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50X faster barriers, 12X faster spinlocks, 8.5X-15X faster stream/array operations, and 3X faster database queries. Based on a standard cell implementation, we predict that the circuitry required to support AMOs is less than 1% of the typical chip area of a high-performance microprocessor.
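
    The abstract's claim that executing an operation at the home memory controller can eliminate coherence messages can be illustrated with a toy message-count model. This sketch is our own simplification, not from the paper: the per-increment message counts (three for a conventional read-exclusive line transfer, one for a fire-and-forget AMO command) are assumptions, and real protocols vary.

    ```python
    # Toy model (assumed costs, not the paper's protocol): count
    # interconnect messages for n_procs processors each atomically
    # incrementing a single shared counter that lives on a remote node.

    def conventional_messages(n_procs, incs_per_proc):
        # Each increment pulls the cache line exclusive to the requester:
        # one read-exclusive request, one data reply, and one invalidation
        # of the previous owner's copy (assumed cost of 3 messages).
        return n_procs * incs_per_proc * 3

    def amo_messages(n_procs, incs_per_proc):
        # Each increment is a single command sent to the home memory
        # controller, which applies it in place; assume no reply is
        # needed for a fire-and-forget add (cost of 1 message).
        return n_procs * incs_per_proc * 1

    for n in (4, 16, 64):
        print(n, conventional_messages(n, 1000), amo_messages(n, 1000))
    ```

    Under these assumptions the AMO scheme sends one third of the messages regardless of contention; the real benefit the paper measures also includes avoiding the serialization of the cache line bouncing between nodes.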




    Published In

    ICS '07: Proceedings of the 21st Annual International Conference on Supercomputing
    June 2007
    315 pages
    ISBN: 9781595937681
    DOI: 10.1145/1274971

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. DRAM
    2. cache coherence
    3. distributed shared memory
    4. memory performance
    5. stream processing
    6. thread synchronization


    Conference

    ICS '07: International Conference on Supercomputing
    June 17-21, 2007
    Seattle, Washington, USA

    Acceptance Rates

    Overall acceptance rate: 629 of 2,180 submissions (29%)


    Cited By

    • (2023) DynAMO: Improving Parallelism Through Dynamic Placement of Atomic Memory Operations. In Proceedings of the 50th Annual International Symposium on Computer Architecture, pp. 1-13. DOI: 10.1145/3579371.3589065. Published 17 Jun 2023.
    • (2021) Design space for scaling-in general purpose computing within the DDR DRAM hierarchy for map-reduce workloads. In Proceedings of the 18th ACM International Conference on Computing Frontiers, pp. 113-123. DOI: 10.1145/3457388.3458661. Published 11 May 2021.
    • (2018) StaleLearn: Learning Acceleration with Asynchronous Synchronization Between Model Replicas on PIM. IEEE Transactions on Computers, 67(6):861-873. DOI: 10.1109/TC.2017.2780237. Published 1 Jun 2018.
    • (2018) Architectural Support for Task Dependence Management with Flexible Software Scheduling. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 283-295. DOI: 10.1109/HPCA.2018.00033. Published Feb 2018.
    • (2017) Excavating the Hidden Parallelism Inside DRAM Architectures With Buffered Compares. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 25(6):1793-1806. DOI: 10.1109/TVLSI.2017.2655722. Published Jun 2017.
    • (2017) Shared-Memory Parallelism Can Be Simple, Fast, and Scalable. Published 9 Jun 2017.
    • (2016) Buffered compares. In Proceedings of the 2016 Conference on Design, Automation & Test in Europe, pp. 1243-1248. DOI: 10.5555/2971808.2972099. Published 14 Mar 2016.
    • (2016) Data-Centric Computing Frontiers. In Proceedings of the Second International Symposium on Memory Systems, pp. 295-308. DOI: 10.1145/2989081.2989087. Published 3 Oct 2016.
    • (2016) Accelerating Linked-list Traversal Through Near-Data Processing. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, pp. 113-124. DOI: 10.1145/2967938.2967958. Published 11 Sep 2016.
    • (2016) Prefetching Techniques for Near-memory Throughput Processors. In Proceedings of the 2016 International Conference on Supercomputing, pp. 1-14. DOI: 10.1145/2925426.2926282. Published 1 Jun 2016.
