DOI: 10.1145/166955.167023

Effectiveness of trace sampling for performance debugging tools

Published: 01 June 1993

Abstract

Recently there has been a surge of interest in developing performance debugging tools to help programmers tune their applications for better memory performance [2, 4, 10]. These tools vary both in the detail of feedback provided to the user and in the run-time overhead of using them. MemSpy [10] is a simulation-based tool that gives programmers detailed statistics on the memory system behavior of applications. It provides information on the frequency and causes of cache misses, and presents it in terms of source-level data and code objects with which the programmer is familiar. However, using MemSpy increases a program's execution time by roughly 10 to 40 fold. This overhead is generally acceptable for applications with execution times of several minutes or less, but it can be inconvenient when tuning applications with very long execution times.

This paper examines the use of trace sampling techniques to reduce the execution time overhead of tools like MemSpy. When simulating one tenth of the references, we find that MemSpy's execution time overhead is improved by a factor of 4 to 6. That is, the execution time when using MemSpy is generally within a factor of 3 to 8 times the normal execution time. With this improved performance, we observe only small errors in the performance statistics reported by MemSpy. On moderate sized caches of 16KB to 128KB, simulating as few as one tenth of the references (in samples of 0.5M references each) allows us to estimate the program's actual cache miss rate with an absolute error no greater than 0.3% on our five benchmarks. These errors are quite tolerable within the context of performance debugging. With larger caches we can also obtain good accuracy by using longer sample lengths. We conclude that, used with care, trace sampling is a powerful technique that makes possible performance debugging tools which provide both detailed memory statistics and low execution time overheads.
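
As a rough illustration of the sampling idea in the abstract, the following Python sketch runs a toy direct-mapped cache over either the full address trace or only periodic samples of it. Everything here is an assumption made for illustration and not MemSpy's implementation: the `DirectMappedCache` class, the 32-byte line size, the function names, and the choice to carry cache contents over from one sample to the next.

```python
from typing import List, Optional, Sequence


class DirectMappedCache:
    """Toy direct-mapped cache (hypothetical stand-in for a real simulator)."""

    def __init__(self, size_bytes: int, line_bytes: int = 32) -> None:
        self.line_bytes = line_bytes
        self.num_lines = size_bytes // line_bytes
        self.tags: List[Optional[int]] = [None] * self.num_lines

    def access(self, addr: int) -> bool:
        """Simulate one reference; return True if it misses."""
        block = addr // self.line_bytes
        index = block % self.num_lines
        miss = self.tags[index] != block
        self.tags[index] = block  # install the referenced line
        return miss


def full_miss_rate(trace: Sequence[int], size_bytes: int) -> float:
    """Miss rate when every reference in the trace is simulated."""
    cache = DirectMappedCache(size_bytes)
    misses = sum(cache.access(addr) for addr in trace)
    return misses / len(trace)


def sampled_miss_rate(trace: Sequence[int], size_bytes: int,
                      sample_len: int, period: int) -> float:
    """Miss rate estimated from the first `sample_len` references of
    every `period` references; the rest are skipped, not simulated.

    Assumption: the cache keeps whatever (possibly stale) contents it
    had at the end of the previous sample.
    """
    cache = DirectMappedCache(size_bytes)
    simulated = 0
    misses = 0
    for i, addr in enumerate(trace):
        if i % period < sample_len:
            simulated += 1
            misses += cache.access(addr)
    return misses / simulated


if __name__ == "__main__":
    # Synthetic trace: a sequential sweep over 4-byte words.
    trace = [4 * i for i in range(100_000)]
    print("full:   ", full_miss_rate(trace, 16 * 1024))
    print("sampled:", sampled_miss_rate(trace, 16 * 1024, 1_000, 10_000))
```

Setting `sample_len` to 500,000 and `period` to 5,000,000 would correspond to the one-tenth sampling with 0.5M-reference samples mentioned in the abstract; the demo uses smaller values only so that it runs quickly.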

References

[1] T. E. Anderson and E. D. Lazowska. Quartz: A Tool for Tuning Parallel Program Performance. In Proc. ACM SIGMETRICS Conf. on the Measurement and Modeling of Computer Systems, pages 115-125, May 1990.
[2] J. Dongarra, O. Brewer, J. A. Kohl, and S. Fineberg. A Tool to Aid in the Design, Implementation, and Understanding of Matrix Algorithms for Parallel Processors. Journal of Parallel and Distributed Computing, 9:185-202, June 1990.
[3] A. J. Goldberg and J. Hennessy. MTOOL: A Method for Isolating Memory Bottlenecks in Shared Memory Multiprocessor Programs. In Proc. Intl. Conf. on Parallel Processing, pages 251-257, Aug. 1991.
[4] A. J. Goldberg and J. Hennessy. Performance Debugging Shared Memory Multiprocessor Programs with MTOOL. In Proc. Supercomputing, pages 481-490, Nov. 1991.
[5] S. L. Graham, P. B. Kessler, and M. K. McKusick. An Execution Profiler for Modular Programs. Software Practice and Experience, 13:671-685, Aug. 1983.
[6] J. Hennessy and N. Jouppi. Computer Technology and Architecture: An Evolving Interaction. IEEE Computer, pages 18-29, Sept. 1991.
[7] R. E. Kessler, M. D. Hill, and D. A. Wood. A Comparison of Trace-Sampling Techniques for Multi-Megabyte Caches. Technical Report 1048, Univ. of Wisconsin Computer Sciences Department, Sept. 1991.
[8] S. Laha, J. H. Patel, and R. K. Iyer. Accurate Low-Cost Methods for Performance Evaluation of Cache Memory Systems. IEEE Trans. on Computers, pages 1325-1336, Nov. 1988.
[9] M. Lam, E. Rothberg, and M. Wolf. The Cache Performance and Optimizations of Blocked Algorithms. In Proc. Fourth Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 63-74, Apr. 1991.
[10] M. Martonosi, A. Gupta, and T. Anderson. MemSpy: Analyzing Memory System Bottlenecks in Programs. In Proc. ACM SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, pages 1-12, June 1992.
[11] E. Rothberg and A. Gupta. Parallel ICCG on a Hierarchical Memory Multiprocessor: Addressing the Triangular Solve Bottleneck. Parallel Computing, 18(7):719-741, July 1992.
[12] J. P. Singh, W.-D. Weber, and A. Gupta. SPLASH: Stanford Parallel Applications for Shared-Memory. Computer Architecture News, 20(1):5-44, March 1992.
[13] SPEC benchmark suite release 1.0, Oct. 1989.
[14] H. S. Stone. High-Performance Computer Architecture. Addison-Wesley, Reading, MA, second edition, 1990.
[15] D. A. Wood, M. D. Hill, and R. E. Kessler. A Model for Estimating Trace-Sample Miss Ratios. In Proc. ACM SIGMETRICS Conf. on the Measurement and Modeling of Computer Systems, pages 79-89, June 1991.

Reviews

Alan Cole

Cache simulation tools can provide valuable information for use in tuning applications for better memory performance. The overhead of a complete cache simulation can be substantial enough to inhibit its frequent use, however. The authors describe a technique to improve the performance of these tools. They do this by simulating only subsections of the complete reference trace. By sampling one-tenth of the full trace, they improve the performance of their MemSpy tool by a factor of four to six on five benchmark applications. They perform experiments showing how the accuracy of the simulation depends on several parameters, including cache size, sample length, and number of samples. The use of sampling is most accurate on programs with high cache miss rates, exactly the situation where a performance tool is most needed. Because the state of the cache is unknown at the beginning of each sample, larger caches lead to higher uncertainty in estimating cache miss ratios. In their benchmarks, using cache sizes of 16KB to 128KB, the sampling technique leads to estimates of cache miss ratios with an absolute error of no more than 0.3 percent. These measurements show that the sampling technique is most useful for programs with high cache miss rates or many references, and lead to a set of recommended strategies for memory performance tuning depending on the characteristics of the application.
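
The reviewer's point about unknown cache state at the start of a sample can be made concrete with a small Python sketch. It is a generic illustration, not the method used in the paper: the function name `cold_start_bounds`, the direct-mapped organization, and the 32-byte line size are all assumptions. The first reference to each cache frame within a sample is counted as "unknown", because whether it hits depends on what the frame held before the sample began; treating all unknowns as hits gives a lower bound on the miss rate, and treating them all as misses gives an upper bound.

```python
from typing import List, Optional, Sequence, Tuple


def cold_start_bounds(sample: Sequence[int], size_bytes: int,
                      line_bytes: int = 32) -> Tuple[float, float]:
    """Return (low, high) miss-rate bounds for one sample simulated
    from an empty direct-mapped cache (hypothetical helper)."""
    num_lines = size_bytes // line_bytes
    tags: List[Optional[int]] = [None] * num_lines
    known_misses = 0
    unknown = 0
    for addr in sample:
        block = addr // line_bytes
        index = block % num_lines
        if tags[index] is None:
            # First touch of this frame in the sample: the outcome
            # depends on the unknown pre-sample cache contents.
            unknown += 1
        elif tags[index] != block:
            # The frame holds a different block loaded during the
            # sample, so this is a miss whatever the initial state was.
            known_misses += 1
        tags[index] = block
    n = len(sample)
    return known_misses / n, (known_misses + unknown) / n


if __name__ == "__main__":
    # Two passes over an 8KB region, one reference per 32-byte line.
    sample = [32 * i for i in range(256)] * 2
    for size_kb in (4, 16):
        low, high = cold_start_bounds(sample, size_kb * 1024)
        print(f"{size_kb}KB cache: miss rate between {low} and {high}")
```

For a fixed sample, a larger cache has more frames whose first reference is unknown, so the gap between the two bounds widens. This matches the reviewer's remark that larger caches lead to higher uncertainty, and it is consistent with the abstract's suggestion to use longer samples for larger caches.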

Published In

SIGMETRICS '93: Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
June 1993
286 pages
ISBN: 0897915801
DOI: 10.1145/166955
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Article

Conference

SIGMETRICS '93

Acceptance Rates

Overall Acceptance Rate 459 of 2,691 submissions, 17%

Article Metrics

  • Downloads (Last 12 months): 202
  • Downloads (Last 6 weeks): 12
Reflects downloads up to 10 Oct 2024

Cited By

  • (2020) FirePerf. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 715-731. DOI: 10.1145/3373376.3378455. Online publication date: 9-Mar-2020
  • (2008) Sampled Processor Simulation: A Survey. Advances in COMPUTERS - High Performance Computing, pages 173-224. DOI: 10.1016/S0065-2458(08)00004-1. Online publication date: 2008
  • (2006) Yet shorter warmup by combining no-state-loss and MRRL for sampled LRU cache simulation. Journal of Systems and Software, 79:5, pages 645-652. DOI: 10.1016/j.jss.2005.06.016. Online publication date: 1-May-2006
  • (2006) Discovery of locality-improving refactorings by reuse path analysis. Proceedings of the Second international conference on High Performance Computing and Communications, pages 220-229. DOI: 10.1007/11847366_23. Online publication date: 13-Sep-2006
  • (2005) Using Dynamic Tracing Sampling to Measure Long Running Programs. Proceedings of the 2005 ACM/IEEE conference on Supercomputing. DOI: 10.1109/SC.2005.77. Online publication date: 12-Nov-2005
  • (2005) Optimal sample length for efficient cache simulation. Journal of Systems Architecture: the EUROMICRO Journal, 51:9, pages 513-525. DOI: 10.1016/j.sysarc.2004.12.004. Online publication date: 1-Sep-2005
  • (2004) Cluster miss prediction with prefetch on miss for embedded CPU instruction caches. Proceedings of the 2004 international conference on Compilers, architecture, and synthesis for embedded systems, pages 24-34. DOI: 10.1145/1023833.1023839. Online publication date: 22-Sep-2004
  • (2004) Data Centric Cache Measurement on the Intel Itanium 2 Processor. Proceedings of the 2004 ACM/IEEE conference on Supercomputing. DOI: 10.1109/SC.2004.21. Online publication date: 6-Nov-2004
  • (2004) Efficient simulation of trace samples on parallel machines. Parallel Computing, 30:3, pages 317-335. DOI: 10.1016/j.parco.2004.02.003. Online publication date: 1-Mar-2004
  • (2003) DiST. ACM SIGMETRICS Performance Evaluation Review, 31:1, pages 1-12. DOI: 10.1145/885651.781029. Online publication date: 10-Jun-2003