Fast data-locality profiling of native execution

Published: 06 June 2005

    Abstract

    Performance tools based on hardware counters can efficiently profile the cache behavior of an application and help software developers improve its cache utilization. Simulator-based tools can potentially provide more insight and flexibility and can model many different cache configurations, but have the drawback of large run-time overhead.

    We present StatCache, a performance tool based on a statistical cache model. It has a small run-time overhead while providing much of the flexibility of simulator-based tools. A monitor process running in the background collects sparse memory access statistics about the analyzed application running natively on a host computer. Generic locality information is derived and presented in a code-centric and/or data-centric view.

    We evaluate the accuracy and performance of the tool using ten SPEC CPU2000 benchmarks. We also exemplify how the flexibility of the tool can be used to better understand the characteristics of cache-related performance problems.
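
    The abstract does not spell out the statistical model, so the following is only an illustrative sketch of how sparse access samples could be turned into a miss-ratio estimate for an arbitrary cache size. It assumes a fully associative cache with random replacement, measures reuse distance as the number of accesses between two touches of the same cache line, and counts samples that are never reused as misses. The function names (sample_reuse_distances, estimate_miss_ratio) and the fixed-point iteration are assumptions made for this example, not the tool's actual implementation.

```python
import random


def sample_reuse_distances(trace, sample_rate, rng):
    """Sparsely sample a trace of cache-line ids.

    Returns (distances, dangling): the reuse distance of every sampled
    access that was touched again, and the number of samples whose line
    was never touched again before the trace ended.
    """
    watched = {}      # cache line -> index of the sampled access
    distances = []
    for i, line in enumerate(trace):
        if line in watched:
            # Number of accesses strictly between the two touches.
            distances.append(i - watched.pop(line) - 1)
        if rng.random() < sample_rate:
            watched[line] = i
    return distances, len(watched)


def estimate_miss_ratio(distances, dangling, cache_lines, iterations=200):
    """Estimate the miss ratio R for a cache holding `cache_lines` lines.

    Assumed model: with random replacement, a line survives one miss with
    probability (1 - 1/L). During a reuse distance d about d*R misses
    occur, so the reuse misses with probability 1 - (1 - 1/L)**(d*R).
    R is the fixed point of the average of that probability over all
    samples, with dangling samples counted as misses.
    """
    total = len(distances) + dangling
    if total == 0:
        return 0.0
    keep = 1.0 - 1.0 / cache_lines
    r = 1.0  # start from the pessimistic end; the iteration only decreases
    for _ in range(iterations):
        misses = sum(1.0 - keep ** (d * r) for d in distances) + dangling
        r = misses / total
    return r


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical workload: repeatedly sweep over 4096 cache lines.
    trace = [i % 4096 for i in range(5 * 4096)]
    distances, dangling = sample_reuse_distances(trace, 0.05, rng)
    for size in (64, 1024, 16384):
        ratio = estimate_miss_ratio(distances, dangling, size)
        print(size, round(ratio, 3))
```

    Because the cache size enters only in the final estimation step, one set of samples can be reused to evaluate many cache configurations, which is the kind of flexibility the abstract attributes to simulator-based tools.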

    Published In

    ACM SIGMETRICS Performance Evaluation Review, Volume 33, Issue 1
    June 2005
    417 pages
    ISSN: 0163-5999
    DOI: 10.1145/1071690

    SIGMETRICS '05: Proceedings of the 2005 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems
    June 2005
    428 pages
    ISBN: 1595930221
    DOI: 10.1145/1064212
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published in SIGMETRICS Volume 33, Issue 1

    Author Tags

    1. cache behavior
    2. profiling tool

    Qualifiers

    • Article
