Address translation optimization for Unified Parallel C multi-dimensional arrays

O Serres, A Anbar, SG Merchant, A Kayi… - … on Parallel and …, 2011 - ieeexplore.ieee.org
2011 IEEE International Symposium on Parallel and Distributed …, 2011
Partitioned Global Address Space (PGAS) languages offer significant programmability advantages through their global memory view abstraction, one-sided communication constructs, and data locality awareness. These attributes place PGAS languages at the forefront of possible solutions to the exploding programming complexity of many-core architectures. To enable the shared address space abstraction, PGAS languages use an address translation mechanism that converts shared addresses to physical addresses on each shared memory access. This mechanism is already expensive in distributed memory environments, but it becomes a major bottleneck on machines with shared memory support, where access latencies are significantly lower. Multi- and many-core processors exhibit even lower latencies for shared data because of on-chip cache utilization. Efficient handling of address translation therefore becomes even more crucial, as this overhead can easily become the dominant factor in overall data access time on such architectures. To alleviate address translation overhead, this paper introduces a new mechanism targeting multi-dimensional arrays, which are used in most scientific and image processing applications. Relative costs and implementation details for UPC are evaluated with different workloads (matrix multiplication, the Random Access benchmark, and Sobel edge detection) on two platforms: a many-core system, the TILE64 (a 64-core processor), and a dual-socket, quad-core Intel Nehalem system (up to 16 threads). Our optimization provides substantial performance improvements, up to 40x. In addition, the proposed mechanism can easily be integrated into compilers, hiding it from programmers. This improves UPC productivity by reducing the manual optimization effort required to minimize address translation overhead.
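To make the translation step concrete, the following is a minimal sketch in plain C of the generic block-cyclic index arithmetic that a UPC array declared as `shared [block] T a[N]` implies: blocks of `block` elements are dealt round-robin to threads. It is not the optimization proposed in the paper, and the names `shared_loc`, `translate`, and `translate_2d` are hypothetical, introduced here only for illustration. The row-major linearization of the 2D index is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration type; not taken from the paper or any UPC runtime. */
typedef struct {
    size_t thread;  /* thread that has affinity to the element */
    size_t offset;  /* element index within that thread's local portion */
} shared_loc;

/* Map global element index i of a block-cyclic array to (thread, local offset).
 * Blocks of `block` elements are distributed round-robin over `threads`. */
static shared_loc translate(size_t i, size_t block, size_t threads) {
    assert(block > 0 && threads > 0);
    shared_loc loc;
    size_t block_id = i / block;  /* which block the element falls in */
    size_t phase = i % block;     /* position inside that block */
    loc.thread = block_id % threads;
    loc.offset = (block_id / threads) * block + phase;
    return loc;
}

/* Assumed row-major 2D access: linearize (row, col) first, then translate. */
static shared_loc translate_2d(size_t row, size_t col, size_t ncols,
                               size_t block, size_t threads) {
    return translate(row * ncols + col, block, threads);
}
```

For example, with 8 columns, a block size of 4, and 4 threads, element (2, 3) has linear index 19, which falls in block 4 at phase 3 and therefore maps to thread 0 at local offset 7. Each access in this naive form pays for a multiplication plus several integer divisions and modulo operations, which suggests why the overhead can rival the memory access itself when latencies are low.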