Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
10.5555/2665671.2665701acmconferencesArticle/Chapter ViewAbstractPublication PagesiscaConference Proceedingsconference-collections
research-article

Fine-grain task aggregation and coordination on GPUs

Published: 14 June 2014 Publication History

Abstract

In general-purpose graphics processing unit (GPGPU) computing, data is processed by concurrent threads execut-ing the same function. This model, dubbed single-instruction/multiple-thread (SIMT), requires programmers to coordinate the synchronous execution of similar opera-tions across thousands of data elements. To alleviate this programmer burden, Gaster and Howes outlined the chan-nel abstraction, which facilitates dynamically aggregating asynchronously produced fine-grain work into coarser-grain tasks. However, no practical implementation has been proposed
To this end, we propose and evaluate the first channel im-plementation. To demonstrate the utility of channels, we present a case study that maps the fine-grain, recursive task spawning in the Cilk programming language to channels by representing it as a flow graph. To support data-parallel recursion in bounded memory, we propose a hardware mechanism that allows wavefronts to yield their execution resources. Through channels and wavefront yield, we im-plement four Cilk benchmarks. We show that Cilk can scale with the GPU architecture, achieving speedups of as much as 4.3x on eight compute units

References

[1]
G. E. Moore, "Cramming More Components onto Integrated Circuits," Proc. IEEE, vol. 86, no. 1, pp. 82--85, Jan. 1998.
[2]
S. R. Gutta, D. Foley, A. Naini, R. Wasmuth, and D. Cherepacha, "A Low-Power Integrated x86--64 and Graphics Processor for Mobile Computing Devices," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011, pp. 270--272.
[3]
M. Yuffe, E. Knoll, M. Mehalel, J. Shor, and T. Kurts, "A Fully Inte-grated Multi-CPU, GPU and Memory Controller 32nm Processor," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2011, pp. 264--266.
[4]
"Bringing High-End Graphics to Handheld Devices," NVIDIA, 2011.
[5]
G. Kyriazis, "Heterogeneous System Architecture: A Technical Review," AMD, Aug. 2012.
[6]
"CUDA C Programming Guide." {Online}. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/.
[7]
"OpenCL 2.0 Reference Pages." {Online}. Available: http://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/.
[8]
B. R. Gaster and L. Howes, "Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck?" Computer, vol. 45, no. 8, pp. 42--52, Aug. 2012.
[9]
"Intel Threading Building Blocks." {Online}. Available: http://www.threadingbuildingblocks.org/.
[10]
S. Min, C. Iancu, and K. Yelick, "Hierarchical work stealing on manycore clusters," in Fifth Conference on Partitioned Global Address Space Programming Models, 2011.
[11]
J. Valois, "Implementing Lock-Free Queues," in Proceedings of the Seventh International Conference on Parallel and Distributed Computing Systems, 1994, pp. 64--69.
[12]
C. Gong and J. M. Wing, "A Library of Concurrent Objects and Their Proofs of Correctness," Carnegie Mellon University, Technical Report, 1990.
[13]
A. Gottlieb, B. D. Lubachevsky, and L. Rudolph, "Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors," ACM Trans Program Lang Syst, vol. 5, no. 2, pp. 164--189, Apr. 1983.
[14]
M. M. Michael and M. L. Scott, "Nonblocking Algorithms and Preemption-safe Locking on Multiprogrammed Shared Memory Multiprocessors," J Parallel Distrib Comput, vol. 51, no. 1, pp. 1--26, May 1998.
[15]
E. Ladan-mozes and N. Shavit, "An Optimistic Approach to Lock-Free FIFO queues," in Proceedings of the 18th International Symposium on Distributed Computing, 2004, pp. 117--131.
[16]
"HSA Programmer's Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer's Guide, and Object Format (BRIG)," HSA Foundation, Spring 2013.
[17]
M. Harris, "Optimizing Parallel Reduction in CUDA," NVIDIA. {Online}. Available: http://developer.download.nvidia.com/assets/ cuda/files/reduction.pdf.
[18]
J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Commun ACM, vol. 51, no. 1, pp. 107--113, Jan. 2008.
[19]
W. Thies, M. Karczmarek, and S. P. Amarasinghe, "StreamIt: A Language for Streaming Applications," in Proceedings of the 11th International Conference on Compiler Construction, London, UK, 2002, pp. 179--196.
[20]
J. Sugerman, K. Fatahalian, S. Boulos, K. Akeley, and P. Hanrahan, "GRAMPS: A Programming Model for Graphics Pipelines," ACM Trans Graph, vol. 28, no. 1, pp. 4:1--4:11, Feb. 2009.
[21]
M. Frigo, C. E. Leiserson, and K. H. Randall, "The Implementation of the Cilk-5 Multithreaded Language," in Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, New York, N.Y., USA, 1998, pp. 212--223.
[22]
G. Diamos and S. Yalamanchili, "Speculative Execution on Multi-GPU Systems," in 2010 IEEE International Symposium on Parallel Distributed Processing (IPDPS), 2010, pp. 1--12.
[23]
Y. Guo, R. Barik, R. Raman, and V. Sarkar, "Work-First and Help-First Scheduling Policies for Async-Finish Task Parallelism," in IEEE International Symposium on Parallel Distributed Processing, 2009. IPDPS 2009, 2009, pp. 1--12.
[24]
N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, "The gem5 Simulator," SIGARCH Comput Arch. News, vol. 39, no. 2, pp. 1--7, Aug. 2011.
[25]
P. Conway and B. Hughes, "The AMD Opteron Northbridge Architecture," IEEE Micro, vol. 27, no. 2, pp. 10--21, Mar. 2007.
[26]
B. A. Hechtman and D. J. Sorin, "Exploring Memory Consistency for Massively-Threaded Throughput-Oriented Processors," in Proceedings of the 40th Annual International Symposium on Computer Architecture, New York, N.Y., USA, 2013, pp. 201--212.
[27]
B. A. Hechtman, S. Che, D. R. Hower, Y. Tian, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "QuickRelease: A Throughput-oriented Approach to Release Consistency on GPUs," presented at the 20th IEEE International Symposium On High Performance Computer Architecture (HPCA-2014).
[28]
V. Strassen, "The Asymptotic Spectrum of Tensors and the Exponent of Matrix Multiplication," in Proceedings of the 27th Annual Symposium on Foundations of Computer Science, Washington, D.C., USA, 1986, pp. 49--54.
[29]
AMD Corporation, "AMD Accelerated Parallel Processing SDK." {Online}. Available: http://developer.amd.com/tools-and-sdks/.
[30]
A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA Workloads Using a Detailed GPU Simulator," in IEEE International Symposium on Performance Analysis of Systems and Software, 2009. ISPASS 2009, 2009, pp. 163--174.
[31]
T. G. Rogers, M. O'Connor, and T. M. Aamodt, "Cache-Conscious Thread Scheduling for Massively Multithreaded Processors," IEEE Micro, vol. 33, no. 3, pp. 78--85, May 2013.
[32]
M. Steffen and J. Zambreno, "Improving SIMT Efficiency of Global Rendering Algorithms with Architectural Support for Dynamic Micro-Kernels," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, Washington, D.C., USA, 2010, pp. 237--248.
[33]
J. Hoberock, V. Lu, Y. Jia, and J. C. Hart, "Stream Compaction for Deferred Shading," in Proceedings of the Conference on High Performance Graphics 2009, New York, N.Y., USA, 2009, pp. 173--180.
[34]
M. Steinberger, B. Kainz, B. Kerbl, S. Hauswiesner, M. Kenzel, and D. Schmalstieg, "Softshell: Dynamic Scheduling on GPUs," ACM Trans Graph, vol. 31, no. 6, pp. 161:1--161:11, Nov. 2012.
[35]
T. Aila and S. Laine, "Understanding the Efficiency of Ray Traversal on GPUs," in Proceedings of the Conference on High Performance Graphics 2009, New York, N.Y., USA, 2009, pp. 145--149.
[36]
S. Tzeng, A. Patney, and J. D. Owens, "Task Management for Irregular-Parallel Workloads on the GPU," in Proceedings of the Conference on High Performance Graphics, Airela-Ville, Switzerland, 2010, pp. 29--37.
[37]
W. W. L. Fung, I. Singh, A. Brownsword, and T. M. Aamodt, "Hardware Transactional Memory for GPU Architectures," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, New York, N.Y., USA, 2011, pp. 296--307.
[38]
A. Parashar, M. Pellauer, M. Adler, B. Ahsan, N. Crago, D. Lustig, V. Pavlov, A. Zhai, M. Gambhir, A. Jaleel, R. Allmon, R. Rayess, S. Maresh, and J. Emer, "Triggered Instructions: A Control Paradigm for Spatially-Programmed Architectures," in Proceedings of the 40th Annual International Symposium on Computer Architecture, New York, N.Y., USA, 2013, pp. 142--153.

Cited By

View all
  • (2018)ComP-netProceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques10.1145/3243176.3243179(1-13)Online publication date: 1-Nov-2018
  • (2018)Warp-ConsolidationProceedings of the 2018 International Conference on Supercomputing10.1145/3205289.3205294(53-64)Online publication date: 12-Jun-2018
  • (2018)Taming irregular applications via advanced dynamic parallelism on GPUsProceedings of the 15th ACM International Conference on Computing Frontiers10.1145/3203217.3203243(146-154)Online publication date: 8-May-2018
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
ISCA '14: Proceeding of the 41st annual international symposium on Computer architecuture
June 2014
566 pages
ISBN:9781479943944

Sponsors

Publisher

IEEE Press

Publication History

Published: 14 June 2014

Check for updates

Qualifiers

  • Research-article

Conference

ISCA'14
Sponsor:

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Upcoming Conference

ISCA '25

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)20
  • Downloads (Last 6 weeks)3
Reflects downloads up to 23 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2018)ComP-netProceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques10.1145/3243176.3243179(1-13)Online publication date: 1-Nov-2018
  • (2018)Warp-ConsolidationProceedings of the 2018 International Conference on Supercomputing10.1145/3205289.3205294(53-64)Online publication date: 12-Jun-2018
  • (2018)Taming irregular applications via advanced dynamic parallelism on GPUsProceedings of the 15th ACM International Conference on Computing Frontiers10.1145/3203217.3203243(146-154)Online publication date: 8-May-2018
  • (2018)JugglerACM SIGPLAN Notices10.1145/3200691.317849253:1(54-67)Online publication date: 10-Feb-2018
  • (2018)JugglerProceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/3178487.3178492(54-67)Online publication date: 10-Feb-2018
  • (2018)TAPASProceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture10.1109/MICRO.2018.00028(245-257)Online publication date: 20-Oct-2018
  • (2017)Improving the Efficiency of GPGPU Work-Queue Through Data AwarenessACM Transactions on Architecture and Code Optimization10.1145/315103514:4(1-22)Online publication date: 5-Dec-2017
  • (2017)GravelProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3126908.3126914(1-12)Online publication date: 12-Nov-2017
  • (2016)LaPermACM SIGARCH Computer Architecture News10.1145/3007787.300119944:3(583-595)Online publication date: 18-Jun-2016
  • (2016)Exploring the Efficiency of the OpenCL Pipe Semantic on an FPGAACM SIGARCH Computer Architecture News10.1145/2927964.292797443:4(52-57)Online publication date: 22-Apr-2016
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media