DOI: 10.1145/1088149.1088153

A hybrid hardware/software approach to efficiently determine cache coherence bottlenecks

Published: 20 June 2005

Abstract

High-end computing increasingly relies on shared-memory multiprocessors (SMPs), such as clusters of SMPs, nodes of chip multiprocessors (CMPs), or large-scale single-system image (SSI) SMPs. In such systems, performance is often affected by the sharing pattern of data within applications and its impact on cache coherence. Sharing patterns that result in frequent invalidations followed by coherence misses create cache coherence bottlenecks with significant performance penalties. Past work on identifying coherence bottlenecks based on tracing memory accesses incurs considerable runtime overhead and does not scale well with increasing problem sizes, which makes it infeasible to use with real-world programs.

In this paper, we introduce a novel low-cost, hardware-assisted approach to determine coherence bottlenecks in shared-memory OpenMP applications. We assess the merits of our approach on a contemporary SMP platform. Specifically, we assess the feasibility of lossy tracing to pinpoint coherence problems in applications. We evaluate the qualitative and quantitative trade-offs between tracing overhead and accuracy of the generated coherence traffic metrics, correlated to memory access points at the program source level.

Our lossy tracing mechanism closely approximates the accuracy of full traces in determining coherence misses for most of the benchmarks we study, while reducing run-time execution overhead and trace sizes by one to two orders of magnitude. To the best of our knowledge, this novel method significantly outperforms any of the prior approaches and, for the first time, makes cache coherence analysis feasible for long-running applications.
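
To make the idea of invalidations followed by coherence misses concrete, the following is a minimal sketch and not the paper's implementation. It replays a memory trace through a simplified write-invalidate model and attributes each coherence miss to a source location, then repeats the analysis on a lossy trace that keeps only every third access. Everything named here is an assumption made for illustration: the 64-byte line size, the infinitely large caches, the periodic sampling (a stand-in for whatever the hardware-assisted mechanism records), the function names, and the `a.c:10` / `a.c:20` source labels.

```python
from collections import defaultdict

LINE_SIZE = 64  # assumed cache-line size in bytes (not stated in the abstract)


def count_coherence_misses(trace, line_size=LINE_SIZE):
    """Count coherence misses per source location.

    trace: iterable of (cpu, op, addr, src) tuples with op in {"R", "W"}.
    Simplifications: caches are infinitely large (no capacity or conflict
    misses) and every write invalidates all other copies of the line.
    """
    holders = defaultdict(set)   # line -> CPUs holding a valid copy
    invalidated = set()          # (cpu, line) pairs whose copy was invalidated
    misses = defaultdict(int)    # source location -> coherence misses
    for cpu, op, addr, src in trace:
        line = addr // line_size
        if cpu not in holders[line]:
            # A miss is a coherence miss only if this CPU once held the
            # line and lost it to another CPU's write; otherwise it is cold.
            if (cpu, line) in invalidated:
                misses[src] += 1
                invalidated.discard((cpu, line))
            holders[line].add(cpu)
        if op == "W":
            for other in holders[line] - {cpu}:
                invalidated.add((other, line))
            holders[line] = {cpu}
    return dict(misses)


def lossy(trace, period):
    """Hypothetical lossy trace: keep every `period`-th access."""
    return trace[::period]


# Two CPUs write alternately to different words of the same cache line
# (false sharing), so each write invalidates the other CPU's copy.
trace = []
for _ in range(1000):
    trace.append((0, "W", 0x1000, "a.c:10"))
    trace.append((1, "W", 0x1008, "a.c:20"))

full = count_coherence_misses(trace)
sampled = count_coherence_misses(lossy(trace, 3))
estimate = {src: n * 3 for src, n in sampled.items()}  # scale by the period
print(full)
print(estimate)
```

In this deliberately regular access pattern, the scaled estimate from the lossy trace stays close to the full-trace count while processing a third of the accesses. An irregular sharing pattern could deviate more, which is the kind of overhead-versus-accuracy trade-off the abstract describes.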

Published In

ICS '05: Proceedings of the 19th annual international conference on Supercomputing
June 2005
414 pages
ISBN: 1595931678
DOI: 10.1145/1088149

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. SMPs
  2. cache analysis
  3. coherence protocols
  4. dynamic binary rewriting
  5. hardware performance monitoring
  6. program instrumentation

Qualifiers

  • Article

Conference

ICS05: International Conference on Supercomputing 2005
June 20 - 22, 2005
Cambridge, Massachusetts

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Cited By

  • (2017) Trusted Performance Analysis on Systems With a Shared Memory. IEEE Systems Journal 11:1 (272-282). DOI: 10.1109/JSYST.2014.2365234. Online publication date: Mar-2017
  • (2013) Elastic and scalable tracing and accurate replay of non-deterministic events. Proceedings of the 27th international ACM conference on International conference on supercomputing (59-68). DOI: 10.1145/2464996.2465001. Online publication date: 10-Jun-2013
  • (2013) Using Traditional Data Analysis Algorithms to Detect Access Patterns for Massive Data Processing. 2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (1097-1104). DOI: 10.1109/HPCC.and.EUC.2013.155. Online publication date: Nov-2013
  • (2011) Memory Trace Compression and Replay for SPMD Systems using Extended PRSDs. ACM SIGMETRICS Performance Evaluation Review 38:4 (30-36). DOI: 10.1145/1964218.1964224. Online publication date: 29-Mar-2011
  • (2010) Feedback-directed page placement for ccNUMA via hardware-generated memory traces. Journal of Parallel and Distributed Computing 70:12 (1204-1219). DOI: 10.1016/j.jpdc.2010.08.015. Online publication date: 1-Dec-2010
  • (2008) Application Performance Tuning for Clusters with ccNUMA Nodes. Proceedings of the 2008 11th IEEE International Conference on Computational Science and Engineering (245-252). DOI: 10.1109/CSE.2008.46. Online publication date: 16-Jul-2008
  • (2008) Guided Prefetching Based on Runtime Access Patterns. Computational Science – ICCS 2008 (268-275). DOI: 10.1007/978-3-540-69389-5_31. Online publication date: 2008
  • (2007) METRIC. ACM Transactions on Programming Languages and Systems 29:2 (12-es). DOI: 10.1145/1216374.1216380. Online publication date: 1-Apr-2007
  • (2007) Source-Code-Correlated Cache Coherence Characterization of OpenMP Benchmarks. IEEE Transactions on Parallel and Distributed Systems 18:6 (818-834). DOI: 10.1109/TPDS.2007.1058. Online publication date: 1-Jun-2007
  • (2006) Hardware profile-guided automatic page placement for ccNUMA systems. Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming (90-99). DOI: 10.1145/1122971.1122987. Online publication date: 29-Mar-2006
