
Source-Code-Correlated Cache Coherence Characterization of OpenMP Benchmarks

Published: 01 June 2007

Abstract

Cache coherence in shared-memory multiprocessor systems has been studied mostly from an architecture viewpoint, often by means of aggregating metrics. In many cases, aggregate events provide insufficient information for programmers to understand and optimize the coherence behavior of their applications. A better understanding would be given by source code correlations of not only aggregate events, but also finer granularity metrics directly linked to high-level source code constructs, such as source lines and data structures. In this paper, we explore a novel application-centric approach to studying coherence traffic. We develop a coherence analysis framework based on incremental coherence simulation of actual reference traces. We provide tool support to extract these reference traces and synchronization information from OpenMP threads at runtime using dynamic binary rewriting of the application executable. These traces are fed to ccSIM, our cache-coherence simulator. The novelty of ccSIM lies in its ability to relate low-level cache coherence metrics (such as coherence misses and their causative invalidations) to high-level source code constructs including source code locations and data structures. We explore the degree of freedom in interleaving data traces from different processors and assess simulation accuracy in comparison to metrics obtained from hardware performance counters. Our quantitative results show that: 1) Cache coherence traffic can be simulated with a considerable degree of accuracy for SPMD programs, as the invalidation traffic closely matches the corresponding hardware performance counters. 2) Detailed, high-level coherence statistics are very useful in detecting, isolating, and understanding coherence bottlenecks. We use ccSIM with several well-known benchmarks and find coherence optimization opportunities leading to significant reductions in coherence traffic and savings in wall-clock execution time.
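The abstract's core idea, replaying per-thread reference traces through a coherence protocol and charging each coherence miss back to a source location, can be illustrated with a toy sketch. Everything below (the `MSISim` class, the trace records, the `a.c:NN` source lines, the cache-line size) is hypothetical and is not ccSIM's actual interface; it assumes a minimal invalidation-based MSI protocol and a trace that has already been interleaved across processors.

```python
# Toy sketch, NOT ccSIM: replay an interleaved per-thread reference trace
# through a minimal invalidation-based MSI protocol and charge each
# coherence miss (a miss caused by a remote write's invalidation) to the
# source line that issued the access. All names and trace records are
# hypothetical.
from collections import defaultdict

LINE = 64  # assumed cache-line size in bytes

class MSISim:
    def __init__(self, nprocs):
        self.nprocs = nprocs
        self.state = defaultdict(lambda: ["I"] * nprocs)   # per-line MSI states
        self.invalidator = {}           # (line, proc) -> src of causative write
        self.coherence_misses = defaultdict(int)  # reader src line -> count

    def access(self, proc, addr, is_write, src):
        line = addr // LINE
        st = self.state[line]
        if st[proc] == "I" and (line, proc) in self.invalidator:
            # The line was cached earlier but a remote write invalidated it:
            # this re-fetch is a coherence miss, attributed to 'src'.
            self.coherence_misses[src] += 1
            del self.invalidator[(line, proc)]
        if is_write:
            for p in range(self.nprocs):        # invalidate remote copies
                if p != proc and st[p] != "I":
                    st[p] = "I"
                    self.invalidator[(line, p)] = src
            st[proc] = "M"
        elif st[proc] == "I":
            for p in range(self.nprocs):        # remote read downgrades M -> S
                if st[p] == "M":
                    st[p] = "S"
            st[proc] = "S"

# Interleaved trace records: (proc, addr, is_write, source_location)
trace = [(0, 0x1000, True,  "a.c:10"),   # P0 writes the line
         (1, 0x1000, False, "a.c:20"),   # P1's cold miss (not coherence)
         (0, 0x1000, True,  "a.c:10"),   # P0's write invalidates P1's copy
         (1, 0x1000, False, "a.c:20")]   # P1 re-fetches: coherence miss

sim = MSISim(2)
for rec in trace:
    sim.access(*rec)
print(dict(sim.coherence_misses))  # {'a.c:20': 1}
```

In the paper's terms, `invalidator` retains the source location of the causative invalidation, so both ends of an invalidation/miss pair can be reported; a real interleaving would be derived from the per-thread traces together with the recorded synchronization order rather than hand-written as above.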


Cited By

  • "Initial Results on Computational Performance of Intel Many Integrated Core, Sandy Bridge, and Graphical Processing Unit Architectures," Concurrency and Computation: Practice & Experience, vol. 27, no. 3, pp. 581-593, Mar. 2015. doi:10.1002/cpe.3248
  • "Experimenting with Low-Overhead OpenMP Runtime on IBM Blue Gene/Q," IBM Journal of Research and Development, vol. 57, no. 1, pp. 91-98, Jan. 2013. doi:10.1147/JRD.2012.2228769
  • "Memory Trace Compression and Replay for SPMD Systems Using Extended PRSDs," ACM SIGMETRICS Performance Evaluation Review, vol. 38, no. 4, pp. 30-36, Mar. 2011. doi:10.1145/1964218.1964224
  • "A Methodology to Characterize Critical Section Bottlenecks in DSM Multiprocessors," Proc. 15th Int'l Euro-Par Conf. Parallel Processing, pp. 149-161, Aug. 2009. doi:10.1007/978-3-642-03869-3_17


Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 18, Issue 6 (June 2007), 142 pages.
Publisher: IEEE Press


      Author Tags

      1. Cache memories
      2. SMPs
3. coherence protocols
      4. dynamic binary rewriting
      5. program instrumentation
      6. simulation


