poster

Acceleration of bulk memory operations in a heterogeneous multicore architecture

Authors:

Kyeong-An Kwon

PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques

Pages 423 - 424

https://doi.org/10.1145/2370816.2370877

Published: 19 September 2012

Abstract

In this paper, we present a novel approach that uses the integrated GPU to accelerate bulk memory operations, such as memcpy and memset, which are conventionally performed by the CPU. Offloading bulk memory operations to the GPU has several advantages: i) the throughput-driven GPU outperforms the CPU on bulk memory operations; ii) for an on-die GPU with a cache unified between the GPU and the CPU, the GPU's private caches can be used by the CPU to store moved data, reducing the CPU cache bottleneck; iii) with additional lightweight hardware, asynchronous offload can also be supported; and iv) unlike prior work that relies on dedicated hardware copy engines (e.g., DMA), our approach reuses existing GPU hardware resources as much as possible. Our performance results show that offloaded bulk memory operations outperform the CPU by up to 4.3 times in micro-benchmarks while using fewer resources. Using eight real-world applications and a cycle-based full-system simulation environment, we observed a 30% speedup for five of the eight applications and a speedup of more than 20% for two of them.
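
The offload model in points i) and iii) can be illustrated with a small functional sketch. It is not the authors' design, which uses the integrated GPU plus lightweight hardware; it assumes a hypothetical `BulkMemOffloader` in which a Python thread pool stands in for the GPU, each `CHUNK`-sized piece of a copy or fill plays the role of one GPU work-item, and the CPU submits `memcpy_async` or `memset_async`, continues with other work, and later calls `fence` to wait for completion. All of these names and the chunk size are illustrative assumptions. The sketch models only the asynchronous semantics, not performance (Python threads share the GIL).

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Iterator, List, Tuple

# Bytes handled by one simulated GPU "work-item" (assumed value, not from the paper).
CHUNK = 4096


class BulkMemOffloader:
    """Hypothetical functional model of an asynchronous bulk-memory offload engine.

    A thread pool stands in for the integrated GPU; each chunk is one work-item.
    """

    def __init__(self, workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)

    @staticmethod
    def _spans(n: int) -> Iterator[Tuple[int, int]]:
        # Split [0, n) into non-overlapping CHUNK-sized ranges.
        for lo in range(0, n, CHUNK):
            yield lo, min(lo + CHUNK, n)

    def memcpy_async(self, dst: bytearray, src: bytes, n: int) -> List[Future]:
        """Start copying src[0:n] into dst[0:n]; return immediately."""
        if n < 0 or n > len(dst) or n > len(src):
            raise ValueError("copy size out of bounds")

        def copy(lo: int, hi: int) -> None:
            # Equal-length slice assignment, so dst is never resized.
            dst[lo:hi] = src[lo:hi]

        return [self._pool.submit(copy, lo, hi) for lo, hi in self._spans(n)]

    def memset_async(self, dst: bytearray, value: int, n: int) -> List[Future]:
        """Start filling dst[0:n] with the byte `value`; return immediately."""
        if not 0 <= value <= 255 or n < 0 or n > len(dst):
            raise ValueError("invalid fill value or size")

        def fill(lo: int, hi: int) -> None:
            dst[lo:hi] = bytes([value]) * (hi - lo)

        return [self._pool.submit(fill, lo, hi) for lo, hi in self._spans(n)]

    @staticmethod
    def fence(pending: List[Future]) -> None:
        """Block until every chunk finishes, re-raising any chunk error."""
        for fut in pending:
            fut.result()

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


if __name__ == "__main__":
    engine = BulkMemOffloader()
    src = bytes(range(256)) * 64  # 16 KiB source buffer
    dst = bytearray(len(src))
    pending = engine.memcpy_async(dst, src, len(src))
    # The CPU could do unrelated work here before synchronizing.
    engine.fence(pending)
    assert dst == src
    engine.fence(engine.memset_async(dst, 0, 8192))
    assert dst[:8192] == bytes(8192) and dst[8192:] == src[8192:]
    engine.shutdown()
    print("ok")
```

Returning per-chunk futures instead of blocking is meant to mirror the asynchronous offload in point iii): the caller decides when to synchronize. A real GPU-based implementation would also have to keep the CPU and GPU caches coherent, which this model ignores.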

Cited By

Zhuang S, Zhao J, Li J, Yu P, Zhang Y, Guan H (2021). HAVS: Hardware-accelerated Shared-memory-based VPP Network Stack. In IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, pp. 1-10. Online publication date: 10-May-2021.
https://doi.org/10.1109/INFOCOM42981.2021.9488808

Lee J, Shi W, Gil J (2018). Accelerated bulk memory operations on heterogeneous multi-core systems. The Journal of Supercomputing 74(12), 6898-6922. Online publication date: 1-Dec-2018.
https://dl.acm.org/doi/10.1007/s11227-018-2589-x

Index Terms

Acceleration of bulk memory operations in a heterogeneous multicore architecture
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems

Recommendations

Accelerated bulk memory operations on heterogeneous multi-core systems

A traditional fixed-function graphics accelerator has evolved into a programmable general-purpose graphics processing unit over the past few years, the general-purpose computing on GPU (GPGPU). Recently, revolutionary measures have been taken along this ...

Improving performance of GPU code using novel features of the NVIDIA kepler architecture

Graphics processing unit GPU computing is a popular approach to simulating complex models and performing massive calculations. GPUs have attracted a great deal of interest because they offer both high performance and energy efficiency. Efficient General-...

Heterogeneous acceleration of volumetric JPEG 2000 using OpenCL

This paper discusses an OpenCL version of a volumetric JPEG 2000 codec that runs on GPUs, multi-core processors or a combination of both. Since the performance critical part consists of a fine-grained discrete wavelet transform and coarse-grained ...

Information

Published In

PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques

September 2012

512 pages

ISBN:9781450311823

DOI:10.1145/2370816

General Chairs:
Pen-Chung Yew, University of Minnesota
Sangyeun Cho, University of Pittsburgh

Program Chairs:
Luiz DeRose, Cray, Inc.
David J. Lilja, University of Minnesota

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 September 2012

Qualifiers

Poster

Conference

PACT '12

Sponsor:

IFIP WG 10.3
SIGARCH
IEEE CS TCPP
IEEE CS TCAA

PACT '12: International Conference on Parallel Architectures and Compilation Techniques

September 19-23, 2012

Minneapolis, Minnesota, USA

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Bibliometrics

Article Metrics

Total Citations: 2
Total Downloads: 194
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 0

Reflects downloads up to 15 Oct 2024
