Recent architecture and technology trends have led to a significant gap between processor and main memory speeds. To bridge this gap, architects place cache memories between processors and main memory to mask high latencies. When cache misses are common, however, memory stalls can still significantly degrade execution time. To help identify and fix such memory bottlenecks, this work presents techniques to efficiently collect detailed information about program memory performance and to organize the collected data effectively. These techniques guide programmers or compilers to memory bottlenecks. They apply to both sequential and parallel applications and are embodied in the MemSpy performance monitoring system.
Experience tuning the performance of several programs has driven this research, leading to the following conclusions. First, this thesis contends that the natural interrelationship between program memory bottlenecks and program data structures mandates the use of data-oriented statistics, a novel approach that associates program performance information with application data structures. Data-oriented statistics, viewed alone or paired with traditional code-oriented statistics, offer a powerful new dimension for performance analysis. The dissertation develops techniques for aggregating statistics on similarly used data structures and for extracting intuitive source-code names for statistics.
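The core idea of data-oriented statistics can be illustrated with a small sketch: instead of charging a cache miss only to the code location that caused it, each miss is attributed to the application data structure whose address range contains the referenced address. The structure names, address ranges, and addresses below are hypothetical, not MemSpy's actual bookkeeping.

```python
# Sketch of data-oriented statistics: attribute each cache miss to the
# data structure owning the referenced address. Regions and addresses
# are illustrative only.
from bisect import bisect_right

# (start address, size in bytes, source-level name) per allocated structure
regions = [
    (0x1000, 0x800, "matrix_A"),
    (0x2000, 0x800, "matrix_B"),
    (0x3000, 0x100, "work_queue"),
]
starts = [r[0] for r in regions]

def owner(addr):
    """Return the source-level name of the structure containing addr."""
    i = bisect_right(starts, addr) - 1
    if i >= 0:
        start, size, name = regions[i]
        if start <= addr < start + size:
            return name
    return "<unknown>"

# Accumulate per-structure miss counts from addresses that missed.
miss_counts = {}
for addr in [0x1010, 0x2404, 0x1010, 0x30f0]:
    miss_counts[owner(addr)] = miss_counts.get(owner(addr), 0) + 1

print(miss_counts)  # {'matrix_A': 2, 'matrix_B': 1, 'work_queue': 1}
```

A profile keyed this way directly answers "which data structure is missing?", the question a code-oriented profile can only answer indirectly.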
Second, this thesis argues that detailed statistics on the frequency and causes of cache misses are crucial to understanding memory bottlenecks. Common memory performance bugs are most easily distinguished by noting the causes of their resulting cache misses. By offering such information, MemSpy's performance profiles have been invaluable in analyzing memory bottlenecks in several applications.
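The distinction among miss causes can be sketched with a toy direct-mapped cache that labels each miss as "cold" (the line was never referenced before) or "replacement" (the line was evicted earlier); this is an illustrative classifier, not MemSpy's simulator, and a multiprocessor version would add an "invalidation" category for coherence misses.

```python
# Toy direct-mapped cache that classifies each miss by cause.
# Parameters are illustrative.
LINE_SIZE = 32
NUM_SETS = 4

cache = {}          # set index -> tag of the resident line
seen_lines = set()  # lines referenced at least once, to detect cold misses

def access(addr):
    line = addr // LINE_SIZE
    idx = line % NUM_SETS
    if cache.get(idx) == line:
        return "hit"
    kind = "cold" if line not in seen_lines else "replacement"
    seen_lines.add(line)
    cache[idx] = line   # fill (and possibly evict the previous line)
    return kind

trace = [0x00, 0x20, 0x100, 0x00, 0x100]
results = [access(a) for a in trace]
print(results)  # ['cold', 'cold', 'cold', 'replacement', 'replacement']
```

Here 0x00 and 0x100 map to the same set and repeatedly evict each other, so the tail of the trace shows replacement misses, the signature of a conflict-style bottleneck.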
Third, since collecting such detailed information seems, at first glance, to require large execution-time slowdowns, this dissertation also evaluates techniques to improve the performance of MemSpy's simulation-based monitoring. The first optimization, hit bypassing, improves simulation performance by specializing the processing of cache hits. The second optimization, reference trace sampling, improves performance by simulating only sampled portions of the full reference trace. Together, these optimizations reduce simulation time by nearly an order of magnitude. Overall, experience using MemSpy to tune several applications demonstrates that it generates effective memory performance profiles at speeds competitive with previous, less detailed approaches.
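Reference trace sampling can be sketched as follows: only periodic windows of the trace are fed to the expensive detailed simulation, and statistics from those windows estimate whole-trace behavior. The window sizes and the stand-in `simulate()` below are illustrative assumptions, not MemSpy's actual parameters or simulator (and hit bypassing is omitted here).

```python
# Sketch of reference trace sampling: simulate only periodic windows of
# the trace and scale the sampled statistics up to the whole run.
SAMPLE_LEN = 100       # references simulated per sample window
SAMPLE_PERIOD = 1000   # distance between the starts of sample windows

def simulate(window):
    """Stand-in for a detailed cache simulation of one window:
    here, every 16th address counts as a miss."""
    misses = sum(1 for addr in window if addr % 16 == 0)
    return misses, len(window)

def estimate_miss_ratio(trace):
    misses = refs = 0
    for start in range(0, len(trace), SAMPLE_PERIOD):
        m, r = simulate(trace[start:start + SAMPLE_LEN])
        misses += m
        refs += r
    return misses / refs

trace = list(range(10000))
estimate = estimate_miss_ratio(trace)   # detailed simulation of only 10% of refs
exact = sum(1 for a in trace if a % 16 == 0) / len(trace)
print(estimate, exact)  # 0.065 vs 0.0625: close, at a tenth of the work
```

The trade-off is between sampling rate and accuracy: longer or more frequent windows tighten the estimate but reclaim less of the simulation-time savings.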