Efficient execution of memory access phases using dataflow specialization

CH Ho, SJ Kim, K Sankaralingam - Proceedings of the 42nd annual …, 2015 - dl.acm.org
Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015 (dl.acm.org)
This paper identifies a new opportunity for improving the efficiency of a processor core: memory access phases of programs. These are dynamic regions of programs where most of the instructions are devoted to memory access or address computation. Such phases occur naturally because of workload properties, and they are also induced when an in-core accelerator is employed, since the code left to execute on the core is then access code. We observe that such code requires an out-of-order (OOO) core's dataflow and dynamism to run fast and does not execute well on an in-order processor. However, an OOO core consumes considerable power, which increases energy consumption and reduces the energy efficiency of in-core accelerators.
We develop an execution model called memory access dataflow (MAD) that encodes dataflow computation, event-condition-action rules, and explicit actions. Using it, we build a specialized engine that provides an OOO core's performance at a fraction of the power. Such an engine can serve as a general way for any accelerator to execute its respective induced phase, thus providing a common interface and implementation for current and future accelerators. We have designed and implemented MAD in RTL, and we demonstrate its generality and flexibility by integrating it with four diverse accelerators (SSE, DySER, NPU, and C-Cores). Our quantitative results show that, relative to in-order, 2-wide OOO, and 4-wide OOO cores, MAD provides 2.4×, 1.4×, and equivalent performance, respectively. Against the same three baselines, it provides 0.8×, 0.6×, and 0.4× lower energy.
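
To make the idea of event-condition-action rules driving address computation more concrete, here is a minimal Python sketch of a toy rule engine that walks a strided array and forwards each loaded value to an accelerator input queue. It is not the paper's MAD design or its RTL: the `Rule` class, the `run` loop, the event names (`ready`, `load_done`), and the state fields (`base`, `stride`, `accel_in`) are all assumptions made up for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """Hypothetical event-condition-action rule (illustrative only)."""
    event: str                          # event name that triggers the rule
    condition: Callable[[dict], bool]   # predicate over the engine state
    action: Callable[[dict], list]      # mutates state, returns new events


def run(rules, state, initial_events):
    """Toy event loop: fire matching rules until no events remain."""
    pending = deque(initial_events)
    while pending:
        event = pending.popleft()
        for rule in rules:
            if rule.event == event and rule.condition(state):
                pending.extend(rule.action(state))
    return state


def issue_load(state):
    # Dataflow-style address computation: base + index * stride.
    addr = state["base"] + state["i"] * state["stride"]
    state["loaded"].append(state["memory"][addr])
    state["i"] += 1
    return ["load_done"]


def send_to_accelerator(state):
    # Explicit action: forward the loaded value to an assumed accelerator port.
    state["accel_in"].append(state["loaded"].pop())
    return ["ready"]


rules = [
    Rule("ready", lambda s: s["i"] < s["n"], issue_load),
    Rule("load_done", lambda s: True, send_to_accelerator),
]

state = {
    "memory": list(range(100, 132)),    # made-up memory contents
    "base": 0, "stride": 4, "n": 4, "i": 0,
    "loaded": [], "accel_in": [],
}

run(rules, state, ["ready"])
print(state["accel_in"])
```

In this sketch the `ready` rule stops firing once `n` elements have been issued, so the loop drains on its own. A hardware engine would evaluate rules concurrently rather than from a software queue; the sequential loop is only a convenient way to show the rule structure.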