DOI: 10.5555/2665671.2665701

Fine-grain task aggregation and coordination on GPUs

Published: 14 June 2014


Abstract

In general-purpose graphics processing unit (GPGPU) computing, data is processed by concurrent threads executing the same function. This model, dubbed single-instruction/multiple-thread (SIMT), requires programmers to coordinate the synchronous execution of similar operations across thousands of data elements. To alleviate this programmer burden, Gaster and Howes outlined the channel abstraction, which facilitates dynamically aggregating asynchronously produced fine-grain work into coarser-grain tasks. However, no practical implementation has been proposed.
To this end, we propose and evaluate the first channel implementation. To demonstrate the utility of channels, we present a case study that maps the fine-grain, recursive task spawning in the Cilk programming language to channels by representing it as a flow graph. To support data-parallel recursion in bounded memory, we propose a hardware mechanism that allows wavefronts to yield their execution resources. Through channels and wavefront yield, we implement four Cilk benchmarks. We show that Cilk can scale with the GPU architecture, achieving speedups of as much as 4.3x on eight compute units.
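The aggregation idea in the abstract can be illustrated with a small CPU-side simulation. This is only a sketch under stated assumptions, not the paper's implementation: the `Channel` class, `fib_via_channel`, and `WAVEFRONT_SIZE = 4` are hypothetical names and values chosen for illustration, and the example sums the leaves of the recursive Fibonacci spawn tree rather than modelling Cilk's sync/continuation step or the proposed wavefront yield hardware.

```python
from collections import deque

# Hypothetical batch width; real GPU wavefronts are wider.
WAVEFRONT_SIZE = 4


class Channel:
    """Toy channel: buffers fine-grain work items and hands them out in batches."""

    def __init__(self, batch_size=WAVEFRONT_SIZE):
        self.batch_size = batch_size
        self.pending = deque()

    def send(self, item):
        # Producers enqueue one fine-grain item at a time, asynchronously.
        self.pending.append(item)

    def receive_batch(self):
        # The consumer takes up to batch_size items as one coarser-grain task.
        n = min(self.batch_size, len(self.pending))
        return [self.pending.popleft() for _ in range(n)]


def fib_via_channel(n, batch_size=WAVEFRONT_SIZE):
    """Return (fib(n), number of batched launches) using channel-fed recursion."""
    chan = Channel(batch_size)
    chan.send(n)
    total = 0
    launches = 0
    while chan.pending:
        batch = chan.receive_batch()
        launches += 1
        for k in batch:  # each k stands in for one lane of a wavefront
            if k < 2:
                total += k          # base case: fib(0) = 0, fib(1) = 1
            else:
                chan.send(k - 1)    # "spawn" the two recursive children
                chan.send(k - 2)
    return total, launches


if __name__ == "__main__":
    value, launches = fib_via_channel(10)
    print(value, launches)
```

Because each launch consumes up to `WAVEFRONT_SIZE` spawned items at once, the number of launches is smaller than the number of recursive calls, which is the effect that aggregation aims for.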


References

G. E. Moore, "Cramming More Components onto Integrated Circuits," Proc. IEEE, vol. 86, no. 1, pp. 82--85, Jan. 1998.
S. R. Gutta, D. Foley, A. Naini, R. Wasmuth, and D. Cherepacha, "A Low-Power Integrated x86--64 and Graphics Processor for Mobile Computing Devices," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011, pp. 270--272.
M. Yuffe, E. Knoll, M. Mehalel, J. Shor, and T. Kurts, "A Fully Integrated Multi-CPU, GPU and Memory Controller 32nm Processor," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011, pp. 264--266.
"Bringing High-End Graphics to Handheld Devices," NVIDIA, 2011.
G. Kyriazis, "Heterogeneous System Architecture: A Technical Review," AMD, Aug. 2012.
"CUDA C Programming Guide." [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/.
"OpenCL 2.0 Reference Pages." [Online]. Available: http://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/.
B. R. Gaster and L. Howes, "Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck?" Computer, vol. 45, no. 8, pp. 42--52, Aug. 2012.
"Intel Threading Building Blocks." [Online]. Available: http://www.threadingbuildingblocks.org/.
S. Min, C. Iancu, and K. Yelick, "Hierarchical work stealing on manycore clusters," in Fifth Conference on Partitioned Global Address Space Programming Models, 2011.
J. Valois, "Implementing Lock-Free Queues," in Proceedings of the Seventh International Conference on Parallel and Distributed Computing Systems, 1994, pp. 64--69.
C. Gong and J. M. Wing, "A Library of Concurrent Objects and Their Proofs of Correctness," Carnegie Mellon University, Technical Report, 1990.
A. Gottlieb, B. D. Lubachevsky, and L. Rudolph, "Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors," ACM Trans Program Lang Syst, vol. 5, no. 2, pp. 164--189, Apr. 1983.
M. M. Michael and M. L. Scott, "Nonblocking Algorithms and Preemption-safe Locking on Multiprogrammed Shared Memory Multiprocessors," J Parallel Distrib Comput, vol. 51, no. 1, pp. 1--26, May 1998.
E. Ladan-Mozes and N. Shavit, "An Optimistic Approach to Lock-Free FIFO Queues," in Proceedings of the 18th International Symposium on Distributed Computing, 2004, pp. 117--131.
"HSA Programmer's Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer's Guide, and Object Format (BRIG)," HSA Foundation, Spring 2013.
M. Harris, "Optimizing Parallel Reduction in CUDA," NVIDIA. [Online]. Available: http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf.
J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Commun ACM, vol. 51, no. 1, pp. 107--113, Jan. 2008.
W. Thies, M. Karczmarek, and S. P. Amarasinghe, "StreamIt: A Language for Streaming Applications," in Proceedings of the 11th International Conference on Compiler Construction, London, UK, 2002, pp. 179--196.
J. Sugerman, K. Fatahalian, S. Boulos, K. Akeley, and P. Hanrahan, "GRAMPS: A Programming Model for Graphics Pipelines," ACM Trans Graph, vol. 28, no. 1, pp. 4:1--4:11, Feb. 2009.
M. Frigo, C. E. Leiserson, and K. H. Randall, "The Implementation of the Cilk-5 Multithreaded Language," in Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, New York, N.Y., USA, 1998, pp. 212--223.
G. Diamos and S. Yalamanchili, "Speculative Execution on Multi-GPU Systems," in 2010 IEEE International Symposium on Parallel Distributed Processing (IPDPS), 2010, pp. 1--12.
Y. Guo, R. Barik, R. Raman, and V. Sarkar, "Work-First and Help-First Scheduling Policies for Async-Finish Task Parallelism," in IEEE International Symposium on Parallel Distributed Processing, 2009. IPDPS 2009, 2009, pp. 1--12.
N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 Simulator," SIGARCH Comput Arch. News, vol. 39, no. 2, pp. 1--7, Aug. 2011.
P. Conway and B. Hughes, "The AMD Opteron Northbridge Architecture," IEEE Micro, vol. 27, no. 2, pp. 10--21, Mar. 2007.
B. A. Hechtman and D. J. Sorin, "Exploring Memory Consistency for Massively-Threaded Throughput-Oriented Processors," in Proceedings of the 40th Annual International Symposium on Computer Architecture, New York, N.Y., USA, 2013, pp. 201--212.
B. A. Hechtman, S. Che, D. R. Hower, Y. Tian, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "QuickRelease: A Throughput-oriented Approach to Release Consistency on GPUs," presented at the 20th IEEE International Symposium On High Performance Computer Architecture (HPCA-2014).
V. Strassen, "The Asymptotic Spectrum of Tensors and the Exponent of Matrix Multiplication," in Proceedings of the 27th Annual Symposium on Foundations of Computer Science, Washington, D.C., USA, 1986, pp. 49--54.
AMD Corporation, "AMD Accelerated Parallel Processing SDK." [Online]. Available: http://developer.amd.com/tools-and-sdks/.
A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in IEEE International Symposium on Performance Analysis of Systems and Software, 2009. ISPASS 2009, 2009, pp. 163--174.
T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-Conscious Thread Scheduling for Massively Multithreaded Processors," IEEE Micro, vol. 33, no. 3, pp. 78--85, May 2013.
M. Steffen and J. Zambreno, "Improving SIMT Efficiency of Global Rendering Algorithms with Architectural Support for Dynamic Micro-Kernels," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Washington, D.C., USA, 2010, pp. 237--248.
J. Hoberock, V. Lu, Y. Jia, and J. C. Hart, "Stream Compaction for Deferred Shading," in Proceedings of the Conference on High Performance Graphics 2009, New York, N.Y., USA, 2009, pp. 173--180.
M. Steinberger, B. Kainz, B. Kerbl, S. Hauswiesner, M. Kenzel, and D. Schmalstieg, "Softshell: Dynamic Scheduling on GPUs," ACM Trans Graph, vol. 31, no. 6, pp. 161:1--161:11, Nov. 2012.
T. Aila and S. Laine, "Understanding the Efficiency of Ray Traversal on GPUs," in Proceedings of the Conference on High Performance Graphics 2009, New York, N.Y., USA, 2009, pp. 145--149.
S. Tzeng, A. Patney, and J. D. Owens, "Task Management for Irregular-Parallel Workloads on the GPU," in Proceedings of the Conference on High Performance Graphics, Aire-la-Ville, Switzerland, 2010, pp. 29--37.
W. W. L. Fung, I. Singh, A. Brownsword, and T. M. Aamodt, "Hardware Transactional Memory for GPU Architectures," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, New York, N.Y., USA, 2011, pp. 296--307.
A. Parashar, M. Pellauer, M. Adler, B. Ahsan, N. Crago, D. Lustig, V. Pavlov, A. Zhai, M. Gambhir, A. Jaleel, R. Allmon, R. Rayess, S. Maresh, and J. Emer, "Triggered Instructions: A Control Paradigm for Spatially-Programmed Architectures," in Proceedings of the 40th Annual International Symposium on Computer Architecture, New York, N.Y., USA, 2013, pp. 142--153.

Information & Contributors


Published In

ISCA '14: Proceeding of the 41st Annual International Symposium on Computer Architecture
June 2014
566 pages



IEEE Press

  • Research-article



Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Cited By

  • (2018) ComP-net. Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, pp. 1-13. DOI: 10.1145/3243176.3243179. Online publication date: 1-Nov-2018
  • (2018) Warp-Consolidation. Proceedings of the 2018 International Conference on Supercomputing, pp. 53-64. DOI: 10.1145/3205289.3205294. Online publication date: 12-Jun-2018
  • (2018) Taming irregular applications via advanced dynamic parallelism on GPUs. Proceedings of the 15th ACM International Conference on Computing Frontiers, pp. 146-154. DOI: 10.1145/3203217.3203243. Online publication date: 8-May-2018
  • (2018) Juggler. ACM SIGPLAN Notices 53:1, pp. 54-67. DOI: 10.1145/3200691.3178492. Online publication date: 10-Feb-2018
  • (2018) Juggler. Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 54-67. DOI: 10.1145/3178487.3178492. Online publication date: 10-Feb-2018
  • (2018) TAPAS. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 245-257. DOI: 10.1109/MICRO.2018.00028. Online publication date: 20-Oct-2018
  • (2017) Improving the Efficiency of GPGPU Work-Queue Through Data Awareness. ACM Transactions on Architecture and Code Optimization 14:4, pp. 1-22. DOI: 10.1145/3151035. Online publication date: 5-Dec-2017
  • (2017) Gravel. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. DOI: 10.1145/3126908.3126914. Online publication date: 12-Nov-2017
  • (2016) LaPerm. ACM SIGARCH Computer Architecture News 44:3, pp. 583-595. DOI: 10.1145/3007787.3001199. Online publication date: 18-Jun-2016
  • (2016) Exploring the Efficiency of the OpenCL Pipe Semantic on an FPGA. ACM SIGARCH Computer Architecture News 43:4, pp. 52-57. DOI: 10.1145/2927964.2927974. Online publication date: 22-Apr-2016
