In this dissertation we present a compile-time performance prediction environment for Fortran scientific programs. The performance data are expressed as symbolic expressions, with variables for program constructs, input data size, and machine parameters. We focus on modeling the processor and its memory hierarchy. The results of the static estimation can be used to drive optimizations or can be displayed using performance visualization tools. The integration of our model within the Delphi system allows the user to do performance tuning and scalability analysis faster and more easily than with instrumentation.
The main contribution of this work is cache behavior estimation using the stack distances algorithm. We have designed and implemented an algorithm that computes the stack histogram at compile time. We use the stack histogram to predict program performance statically with very good accuracy. Experimental results are presented for two processor/memory architectures, the MIPS R10000 and the UltraSPARC IIi. The most interesting feature of the stack algorithm is that once the histogram is computed, the number of cache misses can be estimated for any cache size.
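To illustrate why a single histogram serves every cache size, the following is a minimal sketch of the classical trace-driven LRU stack algorithm, not the compile-time algorithm developed in the dissertation. It assumes a fully associative LRU cache and a trace of block identifiers; the function names `stack_distance_histogram` and `estimated_misses` and the sample trace are hypothetical and chosen only for this example.

```python
from collections import Counter


def stack_distance_histogram(trace):
    """Build an LRU stack distance histogram from a trace of block ids.

    Returns (hist, cold), where hist[d] counts re-references found at
    depth d of the LRU stack (1 = most recently used) and cold counts
    first-time references, which miss in a cache of any size.
    """
    stack = []          # least recently used first, most recently used last
    hist = Counter()
    cold = 0
    for block in trace:
        if block in stack:
            pos = stack.index(block)
            hist[len(stack) - pos] += 1   # depth measured from the top
            stack.pop(pos)
        else:
            cold += 1
        stack.append(block)               # block becomes most recently used
    return hist, cold


def estimated_misses(hist, cold, cache_blocks):
    """Misses of a fully associative LRU cache holding cache_blocks blocks.

    A re-reference hits exactly when its stack distance is at most the
    cache capacity, so one histogram answers the query for any size.
    """
    return cold + sum(n for d, n in hist.items() if d > cache_blocks)


# Hypothetical trace of block ids, used only to exercise the sketch.
trace = ["a", "b", "a", "c", "b", "a", "a"]
hist, cold = stack_distance_histogram(trace)
for size in (1, 2, 3):
    print(size, estimated_misses(hist, cold, size))
```

The histogram is computed once and then queried for each capacity, which is the property the abstract highlights. The linear stack search keeps the sketch short; it is not meant to reflect the faster stack processing discussed in the dissertation.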
We use stack distances to quantify locality, and we show that the average locality computed using stack distances is a very reliable metric. We also present a new algorithm for stack processing that is 30% faster than the best known algorithm on the suite of programs traced.