Fine-grain task aggregation and coordination on GPUs

Authors:

Bradford M. Beckmann,

Steven K. Reinhardt,

David A. Wood

ISCA '14: Proceedings of the 41st Annual International Symposium on Computer Architecture

Pages 181 - 192

Published: 14 June 2014

Abstract

In general-purpose graphics processing unit (GPGPU) computing, data is processed by concurrent threads executing the same function. This model, dubbed single-instruction/multiple-thread (SIMT), requires programmers to coordinate the synchronous execution of similar operations across thousands of data elements. To alleviate this programmer burden, Gaster and Howes outlined the channel abstraction, which facilitates dynamically aggregating asynchronously produced fine-grain work into coarser-grain tasks. However, no practical implementation has been proposed.
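
As a rough illustration of the aggregation idea described above, the following Python sketch models a channel in software. It is not the paper's implementation: the `Channel` class, its `send` and `flush` methods, and the `width` parameter (standing in for one wavefront's worth of work) are hypothetical names chosen for this example.

```python
from collections import deque


class Channel:
    """Hypothetical software model of a channel (names are assumptions).

    Producers send fine-grain items one at a time; the channel hands them
    to the consumer only in batches of up to `width` items.
    """

    def __init__(self, consumer, width=4):
        self.consumer = consumer      # function called with a list of items
        self.width = width            # assumed batch size (e.g. wavefront width)
        self.pending = deque()        # items produced but not yet consumed
        self.launches = []            # sizes of the batches launched so far

    def send(self, item):
        """Enqueue one item; launch the consumer once a full batch exists."""
        self.pending.append(item)
        if len(self.pending) >= self.width:
            self._launch(self.width)

    def flush(self):
        """Drain leftover items in (possibly partial) batches."""
        while self.pending:
            self._launch(min(self.width, len(self.pending)))

    def _launch(self, count):
        batch = [self.pending.popleft() for _ in range(count)]
        self.launches.append(len(batch))
        self.consumer(batch)


results = []


def square_all(batch):
    # Every item in the batch runs the same function (data-parallel step).
    results.extend(x * x for x in batch)


ch = Channel(square_all, width=4)
for i in range(10):
    ch.send(i)        # items arrive one by one
ch.flush()
```

In this run the ten individually sent items reach the consumer as batches of 4, 4, and 2, so the consumer only ever sees coarser-grain groups of work.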

To this end, we propose and evaluate the first channel implementation. To demonstrate the utility of channels, we present a case study that maps the fine-grain, recursive task spawning in the Cilk programming language to channels by representing it as a flow graph. To support data-parallel recursion in bounded memory, we propose a hardware mechanism that allows wavefronts to yield their execution resources. Through channels and wavefront yield, we implement four Cilk benchmarks. We show that Cilk can scale with the GPU architecture, achieving speedups of as much as 4.3x on eight compute units.
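
To make the flow-graph mapping more concrete, the sketch below evaluates a Cilk-style recursive Fibonacci using two queues that play the role of channels: one for spawned calls and one for values returned to a join. It is a simplified sequential model written for illustration, with hypothetical names (`fib_via_channels`, `spawn`, `sums`, `joins`, `width`); it does not model wavefront yield or any hardware mechanism from the paper.

```python
from collections import deque


def fib_via_channels(n, width=4):
    """Compute fib(n) by routing spawn and sync steps through two queues.

    Assumption: `width` stands in for the number of items one batch
    (e.g. one wavefront) processes at a time.
    """
    spawn = deque([(n, None)])   # "spawn" channel: (argument, parent join id)
    sums = deque()               # "sync" channel: (join id, returned value)
    joins = []                   # join records: [children pending, total, parent id]
    result = None

    while spawn or sums:
        # Process one batch of spawned calls.
        for _ in range(min(width, len(spawn))):
            k, parent = spawn.popleft()
            if k < 2:
                sums.append((parent, k))          # base case returns k
            else:
                joins.append([2, 0, parent])      # wait for two children
                j = len(joins) - 1
                spawn.append((k - 1, j))
                spawn.append((k - 2, j))

        # Process one batch of returned values.
        for _ in range(min(width, len(sums))):
            j, value = sums.popleft()
            if j is None:
                result = value                    # value returned by the root call
                continue
            joins[j][1] += value
            joins[j][0] -= 1
            if joins[j][0] == 0:                  # both children have returned
                sums.append((joins[j][2], joins[j][1]))

    return result


fib_10 = fib_via_channels(10)
```

Each pass of the loop consumes at most `width` items from each queue, which mirrors how aggregated work would be consumed in batches rather than one recursive call at a time.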

References

[1]

G. E. Moore, "Cramming More Components onto Integrated Circuits," Proc. IEEE, vol. 86, no. 1, pp. 82--85, Jan. 1998.

[2]

S. R. Gutta, D. Foley, A. Naini, R. Wasmuth, and D. Cherepacha, "A Low-Power Integrated x86-64 and Graphics Processor for Mobile Computing Devices," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011, pp. 270--272.

[3]

M. Yuffe, E. Knoll, M. Mehalel, J. Shor, and T. Kurts, "A Fully Integrated Multi-CPU, GPU and Memory Controller 32nm Processor," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011, pp. 264--266.

[4]

"Bringing High-End Graphics to Handheld Devices," NVIDIA, 2011.

[5]

G. Kyriazis, "Heterogeneous System Architecture: A Technical Review," AMD, Aug. 2012.

[6]

"CUDA C Programming Guide." [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/.

[7]

"OpenCL 2.0 Reference Pages." [Online]. Available: http://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/.

[8]

B. R. Gaster and L. Howes, "Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck?" Computer, vol. 45, no. 8, pp. 42--52, Aug. 2012.

[9]

"Intel Threading Building Blocks." [Online]. Available: http://www.threadingbuildingblocks.org/.

[10]

S. Min, C. Iancu, and K. Yelick, "Hierarchical work stealing on manycore clusters," in Fifth Conference on Partitioned Global Address Space Programming Models, 2011.

[11]

J. Valois, "Implementing Lock-Free Queues," in Proceedings of the Seventh International Conference on Parallel and Distributed Computing Systems, 1994, pp. 64--69.

[12]

C. Gong and J. M. Wing, "A Library of Concurrent Objects and Their Proofs of Correctness," Carnegie Mellon University, Technical Report, 1990.

[13]

A. Gottlieb, B. D. Lubachevsky, and L. Rudolph, "Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors," ACM Trans Program Lang Syst, vol. 5, no. 2, pp. 164--189, Apr. 1983.

[14]

M. M. Michael and M. L. Scott, "Nonblocking Algorithms and Preemption-safe Locking on Multiprogrammed Shared Memory Multiprocessors," J Parallel Distrib Comput, vol. 51, no. 1, pp. 1--26, May 1998.

[15]

E. Ladan-Mozes and N. Shavit, "An Optimistic Approach to Lock-Free FIFO Queues," in Proceedings of the 18th International Symposium on Distributed Computing, 2004, pp. 117--131.

[16]

"HSA Programmer's Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer's Guide, and Object Format (BRIG)," HSA Foundation, Spring 2013.

[17]

M. Harris, "Optimizing Parallel Reduction in CUDA," NVIDIA. [Online]. Available: http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf.

[18]

J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Commun ACM, vol. 51, no. 1, pp. 107--113, Jan. 2008.

[19]

W. Thies, M. Karczmarek, and S. P. Amarasinghe, "StreamIt: A Language for Streaming Applications," in Proceedings of the 11th International Conference on Compiler Construction, London, UK, 2002, pp. 179--196.

[20]

J. Sugerman, K. Fatahalian, S. Boulos, K. Akeley, and P. Hanrahan, "GRAMPS: A Programming Model for Graphics Pipelines," ACM Trans Graph, vol. 28, no. 1, pp. 4:1--4:11, Feb. 2009.

[21]

M. Frigo, C. E. Leiserson, and K. H. Randall, "The Implementation of the Cilk-5 Multithreaded Language," in Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, New York, N.Y., USA, 1998, pp. 212--223.

[22]

G. Diamos and S. Yalamanchili, "Speculative Execution on Multi-GPU Systems," in 2010 IEEE International Symposium on Parallel Distributed Processing (IPDPS), 2010, pp. 1--12.

[23]

Y. Guo, R. Barik, R. Raman, and V. Sarkar, "Work-First and Help-First Scheduling Policies for Async-Finish Task Parallelism," in IEEE International Symposium on Parallel Distributed Processing, 2009. IPDPS 2009, 2009, pp. 1--12.

[24]

N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 Simulator," SIGARCH Comput Arch. News, vol. 39, no. 2, pp. 1--7, Aug. 2011.

[25]

P. Conway and B. Hughes, "The AMD Opteron Northbridge Architecture," IEEE Micro, vol. 27, no. 2, pp. 10--21, Mar. 2007.

[26]

B. A. Hechtman and D. J. Sorin, "Exploring Memory Consistency for Massively-Threaded Throughput-Oriented Processors," in Proceedings of the 40th Annual International Symposium on Computer Architecture, New York, N.Y., USA, 2013, pp. 201--212.

[27]

B. A. Hechtman, S. Che, D. R. Hower, Y. Tian, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "QuickRelease: A Throughput-oriented Approach to Release Consistency on GPUs," presented at the 20th IEEE International Symposium On High Performance Computer Architecture (HPCA-2014).

[28]

V. Strassen, "The Asymptotic Spectrum of Tensors and the Exponent of Matrix Multiplication," in Proceedings of the 27th Annual Symposium on Foundations of Computer Science, Washington, D.C., USA, 1986, pp. 49--54.

[29]

AMD Corporation, "AMD Accelerated Parallel Processing SDK." [Online]. Available: http://developer.amd.com/tools-and-sdks/.

[30]

A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in IEEE International Symposium on Performance Analysis of Systems and Software, 2009. ISPASS 2009, 2009, pp. 163--174.

[31]

T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-Conscious Thread Scheduling for Massively Multithreaded Processors," IEEE Micro, vol. 33, no. 3, pp. 78--85, May 2013.

[32]

M. Steffen and J. Zambreno, "Improving SIMT Efficiency of Global Rendering Algorithms with Architectural Support for Dynamic Micro-Kernels," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Washington, D.C., USA, 2010, pp. 237--248.

[33]

J. Hoberock, V. Lu, Y. Jia, and J. C. Hart, "Stream Compaction for Deferred Shading," in Proceedings of the Conference on High Performance Graphics 2009, New York, N.Y., USA, 2009, pp. 173--180.

[34]

M. Steinberger, B. Kainz, B. Kerbl, S. Hauswiesner, M. Kenzel, and D. Schmalstieg, "Softshell: Dynamic Scheduling on GPUs," ACM Trans Graph, vol. 31, no. 6, pp. 161:1--161:11, Nov. 2012.

[35]

T. Aila and S. Laine, "Understanding the Efficiency of Ray Traversal on GPUs," in Proceedings of the Conference on High Performance Graphics 2009, New York, N.Y., USA, 2009, pp. 145--149.

[36]

S. Tzeng, A. Patney, and J. D. Owens, "Task Management for Irregular-Parallel Workloads on the GPU," in Proceedings of the Conference on High Performance Graphics, Aire-la-Ville, Switzerland, 2010, pp. 29--37.

[37]

W. W. L. Fung, I. Singh, A. Brownsword, and T. M. Aamodt, "Hardware Transactional Memory for GPU Architectures," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, New York, N.Y., USA, 2011, pp. 296--307.

[38]

A. Parashar, M. Pellauer, M. Adler, B. Ahsan, N. Crago, D. Lustig, V. Pavlov, A. Zhai, M. Gambhir, A. Jaleel, R. Allmon, R. Rayess, S. Maresh, and J. Emer, "Triggered Instructions: A Control Paradigm for Spatially-Programmed Architectures," in Proceedings of the 40th Annual International Symposium on Computer Architecture, New York, N.Y., USA, 2013, pp. 142--153.

Information

Published In

ISCA '14: Proceedings of the 41st Annual International Symposium on Computer Architecture

June 2014

566 pages

ISBN: 9781479943944

General Chairs: Pen-Chung Yew (University of Minnesota), Antonia Zhai (University of Minnesota)

Program Chair: Steve Keckler (NVIDIA/University of Texas at Austin)

ACM SIGARCH Computer Architecture News, Volume 42, Issue 3
ISCA '14
June 2014
552 pages
ISSN: 0163-5964
DOI: 10.1145/2678373
Editor: Doug DeGroot

Sponsors

IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

IEEE Press

Qualifiers

Research-article

Conference

ISCA'14

Sponsor:

IEEE TCCA
SIGARCH

ISCA'14: The 41st Annual International Symposium on Computer Architecture

June 14 - 18, 2014

Minneapolis, Minnesota, USA

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics

Total Citations: 26
Total Downloads: 415
Downloads (Last 12 months): 20
Downloads (Last 6 weeks): 3

Reflects downloads up to 23 Dec 2024

Cited By

LeBeane M, Hamidouche K, Benton B, Breternitz M, Reinhardt S, John L, Evripidou S, Stenström P, O'Boyle M (2018). ComP-net. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pp. 1-13. Online publication date: 1-Nov-2018.
https://dl.acm.org/doi/10.1145/3243176.3243179
Li A, Liu W, Wang L, Barker K, Song S (2018). Warp-Consolidation. Proceedings of the 2018 International Conference on Supercomputing, pp. 53-64. Online publication date: 12-Jun-2018.
https://dl.acm.org/doi/10.1145/3205289.3205294
Zhang J, Aji A, Chu M, Wang H, Feng W, Kaeli D, Pericàs M (2018). Taming irregular applications via advanced dynamic parallelism on GPUs. Proceedings of the 15th ACM International Conference on Computing Frontiers, pp. 146-154. Online publication date: 8-May-2018.
https://dl.acm.org/doi/10.1145/3203217.3203243
Belviranli M, Lee S, Vetter J, Bhuyan L (2018). Juggler. ACM SIGPLAN Notices 53:1, pp. 54-67. Online publication date: 10-Feb-2018.
https://dl.acm.org/doi/10.1145/3200691.3178492
Belviranli M, Lee S, Vetter J, Bhuyan L, Krall A, Gross T (2018). Juggler. Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 54-67. Online publication date: 10-Feb-2018.
https://dl.acm.org/doi/10.1145/3178487.3178492
Margerm S, Sharifian A, Guha A, Shriraman A, Pokam G, Oskin M, Inoue K (2018). TAPAS. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 245-257. Online publication date: 20-Oct-2018.
https://dl.acm.org/doi/10.1109/MICRO.2018.00028
Huang L, Lü Y, Shen L, Wang Z (2017). Improving the Efficiency of GPGPU Work-Queue Through Data Awareness. ACM Transactions on Architecture and Code Optimization 14:4, pp. 1-22. Online publication date: 5-Dec-2017.
https://dl.acm.org/doi/10.1145/3151035
Orr M, Che S, Beckmann B, Oskin M, Reinhardt S, Wood D, Mohr B, Raghavan P (2017). Gravel. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. Online publication date: 12-Nov-2017.
https://dl.acm.org/doi/10.1145/3126908.3126914
Wang J, Rubin N, Sidelnik A, Yalamanchili S (2016). LaPerm. ACM SIGARCH Computer Architecture News 44:3, pp. 583-595. Online publication date: 18-Jun-2016.
https://dl.acm.org/doi/10.1145/3007787.3001199
Momeni A, Tabkhi H, Ukidave Y, Schirner G, Kaeli D (2016). Exploring the Efficiency of the OpenCL Pipe Semantic on an FPGA. ACM SIGARCH Computer Architecture News 43:4, pp. 52-57. Online publication date: 22-Apr-2016.
https://dl.acm.org/doi/10.1145/2927964.2927974
