Hybrid memory cube performance characterization on data-centric workloads

M Gokhale, S Lloyd, C Macaraeg - Proceedings of the 5th Workshop on Irregular Applications: Architectures and …, 2015 - dl.acm.org
The Hybrid Memory Cube (HMC) is an early commercial product embodying attributes of future stacked DRAM architectures, namely large capacity, high bandwidth, an on-package memory controller, and a high-speed serial interface. We study the performance and energy of a Gen2 HMC on data-centric workloads through a combination of emulation and execution on an HMC FPGA board. An in-house FPGA emulator has been used to obtain memory traces for a small collection of data-centric benchmarks. Our FPGA emulator is based on a 32-bit ARM processor and non-intrusively captures complete memory access traces at only a 20x slowdown from real time. We have developed tools to run combined trace fragments from multiple benchmarks on the HMC board, giving a unique capability to characterize HMC performance and power usage under a data-parallel workload. We find that the HMC's separate read and write channels are not well exploited by read-dominated data-centric workloads. Our benchmarks achieve between 66% and 80% of peak bandwidth (80 GB/s for 32-byte packets with a 50/50 read/write mix) on the HMC, suggesting that combined read/write channels might show higher utilization on these access patterns. Bandwidth scales linearly up to saturation with increased demand on highly concurrent application workloads with many independent memory requests. There is a corresponding increase in latency, ranging from 80 ns under an extremely light load to 130 ns at high bandwidth.
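The figures quoted in the abstract can be turned into a rough back-of-the-envelope estimate. The sketch below is not from the paper: it converts the 66% to 80% utilization range into absolute bandwidth and then applies Little's law to estimate how many 32-byte requests would have to be in flight to sustain that bandwidth. The function names are hypothetical, and two things are assumptions made here for illustration: that each request moves exactly 32 bytes with packet header overhead ignored, and that the 130 ns latency applies across the whole utilization range.

```python
# Back-of-the-envelope estimate from the numbers quoted in the abstract.
# Assumptions (not from the paper): 32 bytes moved per request, header
# overhead ignored, and 130 ns latency at both ends of the range.

PEAK_GBPS = 80.0      # peak bandwidth quoted in the abstract, in GB/s
PACKET_BYTES = 32     # packet size quoted in the abstract, in bytes


def achieved_bandwidth(fraction, peak_gbps=PEAK_GBPS):
    """Absolute bandwidth in GB/s for a given fraction of peak."""
    return fraction * peak_gbps


def outstanding_requests(bandwidth_gbps, latency_ns, packet_bytes=PACKET_BYTES):
    """Little's law: requests in flight = throughput * latency.

    GB/s multiplied by ns gives bytes (1e9 * 1e-9 = 1), so the product
    is the number of bytes in flight; dividing by the packet size gives
    the number of concurrent requests.
    """
    return bandwidth_gbps * latency_ns / packet_bytes


if __name__ == "__main__":
    for fraction in (0.66, 0.80):
        bw = achieved_bandwidth(fraction)
        in_flight = outstanding_requests(bw, latency_ns=130.0)
        print(f"{fraction:.0%} of peak: {bw:.1f} GB/s, "
              f"about {in_flight:.0f} requests in flight")
```

Under these assumptions the estimate lands in the low hundreds of concurrent requests, which is consistent with the abstract's remark that bandwidth scales on workloads with many independent memory requests.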