Optimizing indirect memory references with milk
Proceedings of the 2016 International Conference on Parallel Architectures …, 2016
Modern applications such as graph and data analytics, when operating on real-world data, have working sets much larger than cache capacity and are bottlenecked by DRAM. To make matters worse, DRAM bandwidth is growing much more slowly than per-CPU core counts, while DRAM latency has been virtually stagnant. Parallel applications that are bound by memory bandwidth fail to scale, while applications bound by memory latency draw only a small fraction of the much-needed bandwidth. While expert programmers may be able to tune important applications by hand through heroic effort, traditional compiler cache optimizations have not been sufficiently aggressive to overcome the growing DRAM gap.
In this paper, we introduce milk, a C/C++ language extension that lets programmers annotate memory-bound loops concisely. Using optimized intermediate data structures, the compiler transforms random indirect memory references into batches of efficient sequential DRAM accesses. A simple semantic model enhances programmer productivity for efficient parallelization with OpenMP.
We evaluate the milk compiler on parallel implementations of traditional graph applications, demonstrating performance gains of up to 3×.
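To make the idea of batching indirect references concrete, the following is a minimal hand-written C++ sketch. It is not milk syntax and not the compiler's actual output; the function names, the `partition_size` parameter, and the use of `std::vector` bins as the intermediate data structure are all assumptions chosen for illustration. The baseline loop updates `count[idx[i]]` at effectively random addresses. The batched variant first appends each index to a per-partition buffer (sequential writes), then drains one partition at a time so that the random updates stay within a small, cache-sized slice of `count`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Baseline: every iteration touches count[target], a random location
// in an array assumed to be much larger than the cache.
void histogram_direct(const std::vector<std::uint32_t>& idx,
                      std::vector<std::uint32_t>& count) {
    for (std::uint32_t target : idx) {
        ++count[target];
    }
}

// Hand-batched variant (illustrative assumption, not milk's implementation).
// Precondition, as in the baseline: every value in idx is < count.size().
void histogram_batched(const std::vector<std::uint32_t>& idx,
                       std::vector<std::uint32_t>& count,
                       std::size_t partition_size) {
    if (count.empty() || partition_size == 0) {
        return;
    }
    const std::size_t num_partitions =
        (count.size() + partition_size - 1) / partition_size;

    // Intermediate data structure: one append-only buffer per partition.
    std::vector<std::vector<std::uint32_t>> bins(num_partitions);

    // Phase 1 (distribution): route each index to the buffer of the
    // partition it falls in. Writes to each buffer are sequential.
    for (std::uint32_t target : idx) {
        bins[target / partition_size].push_back(target);
    }

    // Phase 2 (delivery): drain one buffer at a time. All updates from a
    // buffer land within partition_size consecutive elements of count.
    for (const auto& bin : bins) {
        for (std::uint32_t target : bin) {
            ++count[target];
        }
    }
}
```

Both functions produce the same result for commutative updates such as this increment; the batched version trades extra sequential traffic to the buffers for better locality in the update phase. In practice `partition_size` would be chosen so that one partition of `count` fits in cache, which is a tuning assumption rather than something the abstract specifies.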