Integrating performance monitoring and communication in parallel computers

Published: 15 May 1996

Abstract

A large and increasing gap exists between processor and memory speeds in scalable cache-coherent multiprocessors. To cope with this situation, programmers and compiler writers must increasingly be aware of the memory hierarchy as they implement software. Tools to support memory performance tuning have, however, been hobbled by the fact that it is difficult to observe the caching behavior of a running program. Little hardware support exists specifically for observing caching behavior; furthermore, what support does exist is often difficult to use for making fine-grained observations about program memory behavior.

Our work observes that in a multiprocessor, the actions required for memory performance monitoring are similar to those required for enforcing cache coherence. In fact, we argue that on several machines, the coherence/communication system itself can be used as machine support for performance monitoring. We have demonstrated this idea by implementing the FlashPoint memory performance monitoring tool. FlashPoint is implemented as a special performance-monitoring coherence protocol for the Stanford FLASH Multiprocessor. By embedding performance monitoring into a cache-coherence scheme based on a programmable controller, we can gather detailed, per-data-structure memory statistics with less than a 10% slowdown compared to unmonitored program executions. We present results on the accuracy of the data collected, and on how FlashPoint performance scales with the number of processors.
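
The core idea of the abstract, counting misses per data structure at the point where a coherence handler already services them, can be illustrated with a small software model. The sketch below is an illustration only and is not FlashPoint's actual protocol code: the class `MissMonitor`, its methods, the address ranges, and the structure names are all hypothetical, and it assumes registered address ranges do not overlap.

```python
from bisect import bisect_right
from collections import Counter


class MissMonitor:
    """Toy model: attribute cache misses to data structures inside a miss handler."""

    def __init__(self):
        self._starts = []        # sorted start addresses of registered ranges
        self._ranges = []        # (start, end, name), kept parallel to _starts
        self.counts = Counter()  # (structure name, miss kind) -> number of misses

    def register(self, name, start, size):
        """Record that addresses [start, start + size) belong to `name`."""
        idx = bisect_right(self._starts, start)
        self._starts.insert(idx, start)
        self._ranges.insert(idx, (start, start + size, name))

    def lookup(self, addr):
        """Return the name of the structure containing addr, or '<unknown>'."""
        idx = bisect_right(self._starts, addr) - 1
        if idx >= 0:
            start, end, name = self._ranges[idx]
            if start <= addr < end:
                return name
        return "<unknown>"

    def on_miss(self, addr, kind):
        """Hypothetical hook, called where a handler would service a miss.

        `kind` is 'read' or 'write'. A real handler would go on to perform
        the normal coherence actions after updating the counter.
        """
        self.counts[(self.lookup(addr), kind)] += 1


# Example: two hypothetical arrays and a short stream of miss events.
mon = MissMonitor()
mon.register("matrix_a", 0x1000, 0x400)
mon.register("matrix_b", 0x2000, 0x400)
for addr, kind in [(0x1010, "read"), (0x2008, "write"),
                   (0x1020, "read"), (0x9000, "read")]:
    mon.on_miss(addr, kind)

for (name, kind), n in sorted(mon.counts.items()):
    print(f"{name:10s} {kind:5s} {n}")
```

Because the bookkeeping runs only on misses, which the handler must process anyway, the extra work per event is a range lookup and a counter increment; this is one plausible reading of why the overhead reported in the abstract can stay low.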

Published In

ACM SIGMETRICS Performance Evaluation Review, Volume 24, Issue 1
May 1996
273 pages
ISSN: 0163-5999
DOI: 10.1145/233008
  • SIGMETRICS '96: Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
    May 1996
    279 pages
    ISBN: 0897917936
    DOI: 10.1145/233013
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Article

Article Metrics

  • Downloads (Last 12 months): 53
  • Downloads (Last 6 weeks): 14
Reflects downloads up to 09 Sep 2024

Citations

Cited By

  • (2010) Locating cache performance bottlenecks using data profiling. Proceedings of the 5th European conference on Computer systems, pages 335-348. DOI: 10.1145/1755913.1755947. Online publication date: 13-Apr-2010.
  • (2002) SIP: Performance Tuning through Source Code Interdependence. Euro-Par 2002 Parallel Processing, pages 177-186. DOI: 10.1007/3-540-45706-2_22. Online publication date: 20-Aug-2002.
  • (2011) Kismet. ACM SIGPLAN Notices 46:10, pages 519-536. DOI: 10.1145/2076021.2048108. Online publication date: 22-Oct-2011.
  • (2011) Kismet. Proceedings of the 2011 ACM international conference on Object oriented programming systems languages and applications, pages 519-536. DOI: 10.1145/2048066.2048108. Online publication date: 22-Oct-2011.
  • (2009) ECMon. ACM SIGARCH Computer Architecture News 37:3, pages 349-360. DOI: 10.1145/1555815.1555798. Online publication date: 20-Jun-2009.
  • (2009) ECMon. Proceedings of the 36th annual international symposium on Computer architecture, pages 349-360. DOI: 10.1145/1555754.1555798. Online publication date: 20-Jun-2009.
  • (2009) Core monitors. Proceedings of the 6th ACM conference on Computing frontiers, pages 31-40. DOI: 10.1145/1531743.1531751. Online publication date: 18-May-2009.
  • (2006) NoC Monitoring Hardware Support for Fast NoC Design Space Exploration and Potential NoC Partial Dynamic Reconfiguration. 2006 International Symposium on Industrial Embedded Systems, pages 1-10. DOI: 10.1109/IES.2006.357481. Online publication date: Oct-2006.
  • (2006) Exploiting spatial and temporal locality of accesses: A new hardware-based monitoring approach for DSM systems. Euro-Par'98 Parallel Processing, pages 206-215. DOI: 10.1007/BFb0057854. Online publication date: 30-Jun-2006.
  • (2005) TAPE. Proceedings of the 19th annual international conference on Supercomputing, pages 199-208. DOI: 10.1145/1088149.1088176. Online publication date: 20-Jun-2005.
