Recent architecture and technology trends have led to a significant gap between processor and main memory speeds. To bridge this gap, architects place cache memories between processors and main memory to mask high latencies. When cache misses are common, however, memory stalls can still significantly degrade execution time. To help identify and fix such memory bottlenecks, this work presents techniques to efficiently collect detailed information about program memory performance and to organize the collected data effectively. These techniques guide programmers or compilers to memory bottlenecks. They apply to both sequential and parallel applications and are embodied in the MemSpy performance monitoring system.
Experience tuning the performance of several programs has driven this research, leading to the following conclusions. First, this thesis contends that the natural interrelationship between program memory bottlenecks and program data structures mandates the use of data-oriented statistics, a novel approach that associates program performance information with application data structures. Data-oriented statistics, viewed alone or paired with traditional code-oriented statistics, offer a powerful new dimension for performance analysis. The dissertation develops techniques for aggregating statistics on similarly used data structures and for extracting intuitive source-code names for statistics.
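The core idea of data-oriented statistics can be illustrated with a small sketch: instead of charging a cache miss only to the code location that caused it, each miss is attributed to the application data structure whose address range contains the referenced address. The structure names, address ranges, and addresses below are hypothetical, not MemSpy's actual bookkeeping.

```python
# Sketch of data-oriented statistics: attribute each cache miss to the
# data structure owning the referenced address. Regions and addresses
# are illustrative only.
from bisect import bisect_right

# (start address, size in bytes, source-level name) per allocated structure
regions = [
    (0x1000, 0x800, "matrix_A"),
    (0x2000, 0x800, "matrix_B"),
    (0x3000, 0x100, "work_queue"),
]
starts = [r[0] for r in regions]

def owner(addr):
    """Return the source-level name of the structure containing addr."""
    i = bisect_right(starts, addr) - 1
    if i >= 0:
        start, size, name = regions[i]
        if start <= addr < start + size:
            return name
    return "<unknown>"

# Accumulate per-structure miss counts from addresses that missed.
miss_counts = {}
for addr in [0x1010, 0x2404, 0x1010, 0x30f0]:
    miss_counts[owner(addr)] = miss_counts.get(owner(addr), 0) + 1

print(miss_counts)  # {'matrix_A': 2, 'matrix_B': 1, 'work_queue': 1}
```

A profile keyed this way directly answers "which data structure is missing?", the question a code-oriented profile can only answer indirectly.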
Second, this thesis argues that detailed statistics on the frequency and causes of cache misses are crucial to understanding memory bottlenecks. Common memory performance bugs are most easily distinguished by noting the causes of their resulting cache misses. By offering such information, MemSpy's performance profiles have been invaluable in analyzing memory bottlenecks in several applications.
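The distinction among miss causes can be sketched with a toy direct-mapped cache that labels each miss as "cold" (the line was never referenced before) or "replacement" (the line was evicted earlier); this is an illustrative classifier, not MemSpy's simulator, and a multiprocessor version would add an "invalidation" category for coherence misses.

```python
# Toy direct-mapped cache that classifies each miss by cause.
# Parameters are illustrative.
LINE_SIZE = 32
NUM_SETS = 4

cache = {}          # set index -> tag of the resident line
seen_lines = set()  # lines referenced at least once, to detect cold misses

def access(addr):
    line = addr // LINE_SIZE
    idx = line % NUM_SETS
    if cache.get(idx) == line:
        return "hit"
    kind = "cold" if line not in seen_lines else "replacement"
    seen_lines.add(line)
    cache[idx] = line   # fill (and possibly evict the previous line)
    return kind

trace = [0x00, 0x20, 0x100, 0x00, 0x100]
results = [access(a) for a in trace]
print(results)  # ['cold', 'cold', 'cold', 'replacement', 'replacement']
```

Here 0x00 and 0x100 map to the same set and repeatedly evict each other, so the tail of the trace shows replacement misses, the signature of a conflict-style bottleneck.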
Third, since collecting such detailed information seems, at first glance, to require large execution-time slowdowns, this dissertation also evaluates techniques to improve the performance of MemSpy's simulation-based monitoring. The first optimization, hit bypassing, improves simulation performance by specializing the processing of cache hits. The second optimization, reference trace sampling, improves performance by simulating only sampled portions of the full reference trace. Together, these optimizations reduce simulation time by nearly an order of magnitude. Overall, experience using MemSpy to tune several applications demonstrates that it generates effective memory performance profiles at speeds competitive with previous, less detailed approaches.
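Reference trace sampling can be sketched as follows: only periodic windows of the trace are fed to the expensive detailed simulation, and statistics from those windows estimate whole-trace behavior. The window sizes and the stand-in `simulate()` below are illustrative assumptions, not MemSpy's actual parameters or simulator (and hit bypassing is omitted here).

```python
# Sketch of reference trace sampling: simulate only periodic windows of
# the trace and scale the sampled statistics up to the whole run.
SAMPLE_LEN = 100       # references simulated per sample window
SAMPLE_PERIOD = 1000   # distance between the starts of sample windows

def simulate(window):
    """Stand-in for a detailed cache simulation of one window:
    here, every 16th address counts as a miss."""
    misses = sum(1 for addr in window if addr % 16 == 0)
    return misses, len(window)

def estimate_miss_ratio(trace):
    misses = refs = 0
    for start in range(0, len(trace), SAMPLE_PERIOD):
        m, r = simulate(trace[start:start + SAMPLE_LEN])
        misses += m
        refs += r
    return misses / refs

trace = list(range(10000))
estimate = estimate_miss_ratio(trace)   # detailed simulation of only 10% of refs
exact = sum(1 for a in trace if a % 16 == 0) / len(trace)
print(estimate, exact)  # 0.065 vs 0.0625: close, at a tenth of the work
```

The trade-off is between sampling rate and accuracy: longer or more frequent windows tighten the estimate but reclaim less of the simulation-time savings.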