Fast address translation techniques for distributed shared memory compilers

F Cantonnet, TA El-Ghazawi, P Lorenz… - 19th IEEE International …, 2005 - ieeexplore.ieee.org
19th IEEE International Parallel and Distributed Processing Symposium, 2005
The distributed shared memory (DSM) model is designed to leverage the ease of programming of the shared memory paradigm while enabling high performance by expressing locality, as in the message-passing model. Experience, however, has shown that DSM programming languages, such as UPC, may be unable to deliver the expected high level of performance. Initial investigations have shown that one of the major reasons is the overhead of translating from the UPC memory model to the virtual address space of the target architecture, which can be very costly. Experimental measurements have shown this overhead increasing execution time by up to three orders of magnitude. Previous work has also shown that some of this overhead can be avoided by hand-tuning, which, however, can significantly reduce UPC's ease of use. In addition, such tuning can only improve the performance of local shared accesses, not remote shared accesses. Therefore, a new technique that resembles translation lookaside buffers (TLBs) is proposed here. This technique, called the memory model translation buffer (MMTB), has been implemented in the GCC-UPC compiler using two alternative strategies, full-table (FT) and reduced-table (RT). It is shown that the MMTB strategies can lead to a performance boost of up to 700%, enabling ease of programming while performing similarly to hand-tuned UPC and MPI codes.
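
To make the translation cost concrete, the following is a minimal C sketch of the kind of work a shared-array access implies, and of how a lookup table can shorten it. It is an illustration under stated assumptions, not the MMTB design from the paper: the abstract does not describe the internals of the FT or RT strategies. The names `shared_loc`, `xlate_table`, `locate`, and `translate` are hypothetical and are not part of GCC-UPC. The sketch assumes the standard UPC block-cyclic layout and assumes that each thread's portion of the array is addressable from the calling thread at a known base address (as on a shared-memory node).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types and names for illustration; not the GCC-UPC runtime. */
typedef struct {
    size_t thread;  /* thread that owns the element (its affinity) */
    size_t offset;  /* element offset inside that thread's local portion */
} shared_loc;

/* Standard UPC block-cyclic layout: element i of an array with block size
   `block`, distributed round-robin over `threads` threads. */
static shared_loc locate(size_t i, size_t block, size_t threads)
{
    shared_loc loc;
    size_t blk = i / block;                 /* global block number */
    loc.thread = blk % threads;             /* blocks are dealt round-robin */
    loc.offset = (blk / threads) * block    /* full local blocks before it */
               + (i % block);               /* position inside the block */
    return loc;
}

/* Assumed table: one base virtual address per thread, filled in once. */
typedef struct {
    size_t threads;
    size_t elem_size;
    char **base;    /* base[t]: where thread t's portion is mapped */
} xlate_table;

/* Translate a global element index to a virtual address with one table
   lookup instead of a runtime call per access. */
static void *translate(const xlate_table *tb, size_t i, size_t block)
{
    shared_loc loc = locate(i, block, tb->threads);
    assert(loc.thread < tb->threads);
    return tb->base[loc.thread] + loc.offset * tb->elem_size;
}
```

With two threads and a block size of 2, for example, elements 0, 1, 4, 5 live on thread 0 and elements 2, 3, 6, 7 on thread 1, so `locate(5, 2, 2)` yields thread 0, offset 3. A table with one entry per thread is only one possible organization; how the paper's full-table and reduced-table variants size and index their tables is not stated in the abstract.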