Abstract
Compiler-parallelized applications are increasing in importance as moderate-scale multiprocessors become common. This paper evaluates how features of advanced memory systems (e.g., longer cache lines) affect memory system behavior for applications amenable to compiler parallelization. Using full-sized input data sets and applications taken from the SPEC, NAS, PERFECT, and RICEPS benchmark suites, we measure statistics such as speedups, memory costs, causes of cache misses, cache line utilization, and data traffic. This exploration allows us to draw several conclusions. First, we find that larger-granularity parallelism often correlates with good memory system behavior, good overall performance, and high speedup in these applications. Second, we show that when long (512-byte) cache lines are used, many of these applications suffer from false sharing and low cache line utilization. Third, we identify some of the common artifacts in compiler-parallelized codes that can lead to false sharing or other types of poor memory system performance, and we suggest methods for improving them. Overall, this study offers both an important snapshot of the behavior of applications compiled by state-of-the-art compilers and an increased understanding of the interplay between cache line size, program granularity, and memory performance in moderate-scale multiprocessors.
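As a rough illustration of why long cache lines can provoke false sharing, the following sketch counts the cache lines that would be written by more than one processor when the iterations of a parallel loop over a one-dimensional array are divided among processors. It is not taken from the paper: the function name, its parameters, the 8-byte element size, the line-aligned array, and the two scheduling schemes are all assumptions chosen for illustration. Because each element is written by exactly one processor, any line with several writers is falsely shared.

```python
def falsely_shared_lines(n_elems, n_procs, line_bytes, elem_bytes=8, schedule="block"):
    """Return (lines written by >1 processor, total lines touched).

    Hypothetical model: the array starts on a cache line boundary and
    element i is written only by its owning processor.
    """
    elems_per_line = line_bytes // elem_bytes
    chunk = -(-n_elems // n_procs)  # ceiling division: elements per block
    writers = {}  # cache line index -> set of processors writing to it
    for i in range(n_elems):
        if schedule == "block":
            owner = i // chunk      # contiguous blocks of iterations
        else:
            owner = i % n_procs     # cyclic (round-robin) iterations
        writers.setdefault(i // elems_per_line, set()).add(owner)
    shared = sum(1 for procs in writers.values() if len(procs) > 1)
    return shared, len(writers)


if __name__ == "__main__":
    for line_bytes in (32, 512):
        for schedule in ("block", "cyclic"):
            shared, total = falsely_shared_lines(1000, 8, line_bytes, schedule=schedule)
            print(f"{line_bytes:3d}-byte lines, {schedule:6s}: {shared} of {total} lines shared")
```

Under these assumptions, a block distribution confines sharing to the lines that straddle block boundaries, so a longer line makes each shared line a larger fraction of the array, while a cyclic distribution places several processors on every line regardless of line size.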
Cite this article
Torrie, E., Martonosi, M., Hall, M.W. et al. Memory Referencing Behavior in Compiler-Parallelized Applications. Int J Parallel Prog 24, 349–376 (1996). https://doi.org/10.1007/BF03356754