Profile-guided post-link stride prefetching

DOI: 10.1145/514191.514217
Published: 22 June 2002

    Abstract

    Data prefetching is an effective approach to addressing the memory latency problem. While a few processors have implemented hardware-based data prefetching, the majority of modern processors support data-prefetch instructions and rely on compilers to automatically insert prefetches. However, most prefetching schemes in commercial compilers suffer from two limitations: (1) the source code must be available before prefetching can be applied, and (2) these prefetching schemes target only loops with statically-known strided accesses. In this study, we broaden the scope of software-controlled prefetching by addressing the above two limitations. We use profiling to discover strided accesses that frequently occur during program execution but are not determinable by the compiler. We then use the strides discovered to insert prefetches into the executable directly, without the need for re-compilation. Performance evaluation was done on an Alpha 21264-based system with a 64KB data cache and an 8MB secondary cache. We find that even with such large caches, our technique offers speedups ranging from 3% to 56% in 11 out of the 26 SPEC2000 benchmarks. Our technique has been incorporated into Pixie and Spike, two products in Compaq's Tru64 Unix.
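The abstract's two steps (profile to find frequent strides, then insert prefetches at a fixed offset ahead of the load) can be illustrated with a minimal sketch. This is not the paper's algorithm or the Pixie/Spike implementation. The function names, the 70% dominance threshold, and the prefetch distance of 4 iterations are all assumptions chosen for illustration, and the per-load address traces are made-up data.

```python
from collections import Counter


def dominant_stride(addresses, min_fraction=0.7):
    """Return the most frequent address delta for one load instruction,
    provided it is nonzero and accounts for at least min_fraction of all
    deltas; otherwise return None. The threshold is an assumed value."""
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    if not deltas:
        return None
    stride, count = Counter(deltas).most_common(1)[0]
    if stride == 0 or count / len(deltas) < min_fraction:
        return None
    return stride


def plan_prefetches(trace_by_load, min_fraction=0.7, distance=4):
    """Map each load (keyed by a hypothetical program counter) with a
    dominant stride to the byte offset a prefetch would use:
    stride * distance, i.e. `distance` iterations ahead of the load."""
    plan = {}
    for load_pc, addresses in trace_by_load.items():
        stride = dominant_stride(addresses, min_fraction)
        if stride is not None:
            plan[load_pc] = stride * distance
    return plan


# Hypothetical profile: addresses observed at two load instructions.
profile = {
    0x1000: [0, 48, 96, 144, 200, 248, 296],  # mostly 48-byte steps
    0x2000: [10, 500, 20, 7000, 3],           # no regular pattern
}
plan = plan_prefetches(profile)
```

In this sketch only the load at `0x1000` receives a prefetch, because five of its six observed deltas equal 48 bytes. A post-link tool would then emit a prefetch of `address + plan[load_pc]` next to that load in the executable, which is the step the sketch leaves out.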




    Published In

    ICS '02: Proceedings of the 16th international conference on Supercomputing
    June 2002
    338 pages
    ISBN:1581134835
    DOI:10.1145/514191

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. address strides
    2. data prefetching
    3. memory latency
    4. post-link optimizations
    5. profiling

    Qualifiers

    • Article

    Conference

    ICS02: International Conference on Supercomputing
    June 22 - 26, 2002
    New York, New York, USA

    Acceptance Rates

    ICS '02 Paper Acceptance Rate 31 of 144 submissions, 22%;
    Overall Acceptance Rate 629 of 2,180 submissions, 29%


    Cited By

    • (2024) Limoncello: Prefetchers for Scale. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pp. 577-590. DOI: 10.1145/3620666.3651373. Online publication date: 27-Apr-2024
    • (2024) RPG2: Robust Profile-Guided Runtime Prefetch Generation. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 999-1013. DOI: 10.1145/3620665.3640396. Online publication date: 27-Apr-2024
    • (2023) Propeller: A Profile Guided, Relinking Optimizer for Warehouse-Scale Applications. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 617-631. DOI: 10.1145/3575693.3575727. Online publication date: 27-Jan-2023
    • (2022) SPAMeR: Speculative Push for Anticipated Message Requests in Multi-Core Systems. Proceedings of the 51st International Conference on Parallel Processing, pp. 1-12. DOI: 10.1145/3545008.3545044. Online publication date: 29-Aug-2022
    • (2022) APT-GET. Proceedings of the Seventeenth European Conference on Computer Systems, pp. 747-764. DOI: 10.1145/3492321.3519583. Online publication date: 28-Mar-2022
    • (2022) Adaptive Page Migration Policy With Huge Pages in Tiered Memory Systems. IEEE Transactions on Computers 71:1, pp. 53-68. DOI: 10.1109/TC.2020.3036686. Online publication date: 1-Jan-2022
    • (2022) CSPM: A Coordinated Software Prefetching Mechanism For Multi-Level Caches. 2022 7th International Conference on Computer and Communication Systems (ICCCS), pp. 86-91. DOI: 10.1109/ICCCS55155.2022.9846079. Online publication date: 22-Apr-2022
    • (2020) Classifying Memory Access Patterns for Prefetching. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 513-526. DOI: 10.1145/3373376.3378498. Online publication date: 9-Mar-2020
    • (2018) A Post-link Prefetching Based on Event Sampling. Advanced Computer Architecture, pp. 53-65. DOI: 10.1007/978-981-13-2423-9_5. Online publication date: 13-Sep-2018
    • (2015) Profile-guided meta-programming. ACM SIGPLAN Notices 50:6, pp. 403-412. DOI: 10.1145/2813885.2737990. Online publication date: 3-Jun-2015
