- research-article, January 2021
Software-Defined Design Space Exploration for an Efficient DNN Accelerator Architecture
IEEE Transactions on Computers (ITCO), Volume 70, Issue 1, Pages 45–56, https://doi.org/10.1109/TC.2020.2983694
Deep neural networks (DNNs) have been shown to outperform conventional machine learning algorithms across a wide range of applications, e.g., image recognition, object detection, robotics, and natural language processing. However, the high computational ...
- research-article, December 2020
Pushing the limits of narrow precision inferencing at cloud scale with Microsoft Floating Point
- Bita Rouhani,
- Daniel Lo,
- Ritchie Zhao,
- Ming Liu,
- Jeremy Fowers,
- Kalin Ovtcharov,
- Anna Vinogradsky,
- Sarah Massengill,
- Lita Yang,
- Ray Bittner,
- Alessandro Forin,
- Haishan Zhu,
- Taesik Na,
- Prerak Patel,
- Shuai Che,
- Lok Chand Koppaka,
- Xia Song,
- Subhojit Som,
- Kaustav Das,
- Saurabh Tiwary,
- Steve Reinhardt,
- Sitaram Lanka,
- Eric Chung,
- Doug Burger
NIPS '20: Proceedings of the 34th International Conference on Neural Information Processing Systems, Article No.: 861, Pages 10271–10281
In this paper, we explore the limits of Microsoft Floating Point (MSFP), a new class of datatypes developed for production cloud-scale inferencing on custom hardware. Through the co-evolution of hardware design and algorithms, MSFP16 incurs 3× lower ...
- research-article, November 2017
Gravel: fine-grain GPU-initiated network messages
SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 23, Pages 1–12, https://doi.org/10.1145/3126908.3126914
Distributed systems incorporate GPUs because they provide massive parallelism in an energy-efficient manner. Unfortunately, existing programming models make it difficult to route a GPU-initiated network message. The traditional coprocessor model forces ...
- article, June 2017
Programming GPGPU Graph Applications with Linear Algebra Building Blocks
International Journal of Parallel Programming (IJPP), Volume 45, Issue 3, Pages 657–679, https://doi.org/10.1007/s10766-016-0448-z
Graph applications are common in scientific and enterprise computing. Recent research used graphics processing units (GPUs) to accelerate graph workloads. These applications tend to present characteristics that are challenging for SIMD execution. To ...
- research-article, May 2017
Work Stealing in a Shared Virtual-Memory Heterogeneous Environment: A Case Study with Betweenness Centrality
CF '17: Proceedings of the Computing Frontiers Conference, Pages 164–173, https://doi.org/10.1145/3075564.3075567
This paper uses betweenness centrality as a case study to research efficient work stealing in a heterogeneous system environment. Betweenness centrality is an important algorithm in graph processing. It presents multiple-level parallelism and is an ...
- research-article, October 2016
Challenges of Programming a System with Heterogeneous Memories and Heterogeneous Processors: A Programmer's View
MEMSYS '16: Proceedings of the Second International Symposium on Memory Systems, Pages 99–103, https://doi.org/10.1145/2989081.2989097
Recently there has been significant development and innovation in both frontiers of Heterogeneous Memory and Heterogeneous Compute domains. This paper summarizes the challenges, surveys related work, and proposes possible research directions to exploit ...
- research-article, October 2016
Software Assisted Hardware Cache Coherence for Heterogeneous Processors
MEMSYS '16: Proceedings of the Second International Symposium on Memory Systems, Pages 279–288, https://doi.org/10.1145/2989081.2989092
Current trends suggest that future computing platforms will be increasingly heterogeneous. While these heterogeneous processors physically integrate disparate computing elements like CPUs and GPUs on a single chip, their programmability critically ...
- short-paper, May 2016
Betweenness Centrality in an HSA-enabled System
HPGP '16: Proceedings of the ACM Workshop on High Performance Graph Processing, Pages 35–38, https://doi.org/10.1145/2915516.2915526
This paper studies different approaches to implementing betweenness centrality in a heterogeneous system. Betweenness centrality is an important algorithm in graph processing. It presents multiple levels of parallelism when processing a graph, and is an ...
- research-article, March 2016
Implementing directed acyclic graphs with the heterogeneous system architecture
GPGPU '16: Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit, Pages 53–62, https://doi.org/10.1145/2884045.2884052
Achieving optimal performance on heterogeneous computing systems requires a programming model that supports the execution of asynchronous, multi-stream, and out-of-order tasks in a shared memory environment. Asynchronous dependency-driven tasking is one ...
- Article, May 2015
Graph Coloring on the GPU and Some Techniques to Improve Load Imbalance
IPDPSW '15: Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, Pages 610–617, https://doi.org/10.1109/IPDPSW.2015.74
Graphics processing units (GPUs) have been increasingly used to accelerate irregular applications such as graph and sparse-matrix computation. Graph coloring is a key building block for many graph applications. The first step of many graph applications ...
- research-article, March 2015
Synchronization Using Remote-Scope Promotion
ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, Pages 73–86, https://doi.org/10.1145/2694344.2694350
Heterogeneous system architecture (HSA) and OpenCL define scoped synchronization to facilitate low overhead communication across a subset of threads. Scoped synchronization works well for static sharing patterns, where consumer threads are known a ...
Also Published in:
ACM SIGPLAN Notices: Volume 50 Issue 4
ACM SIGARCH Computer Architecture News: Volume 43 Issue 1
- Article, April 2015
SPEC ACCEL: A Standard Application Suite for Measuring Hardware Accelerator Performance
- Guido Juckeland,
- William Brantley,
- Sunita Chandrasekaran,
- Barbara Chapman,
- Shuai Che,
- Mathew Colgrove,
- Huiyu Feng,
- Alexander Grund,
- Robert Henschel,
- Wen-Mei W. Hwu,
- Huian Li,
- Matthias S. Müller,
- Wolfgang E. Nagel,
- Maxim Perminov,
- Pavel Shelepugin,
- Kevin Skadron,
- John Stratton,
- Alexey Titov,
- Ke Wang,
- Matthijs van Waveren,
- Brian Whitney,
- Sandra Wienke,
- Rengan Xu,
- Kalyan Kumaran
High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, Pages 46–67, https://doi.org/10.1007/978-3-319-17248-4_3
Hybrid nodes with hardware accelerators are becoming very common in systems today. Users often find it difficult to characterize and understand the performance advantage of such accelerators for their applications. The SPEC High Performance Group (...
- Article, May 2014
Dymaxion++: A Directive-Based API to Optimize Data Layout and Memory Mapping for Heterogeneous Systems
IPDPSW '14: Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, Pages 916–924, https://doi.org/10.1109/IPDPSW.2014.104
There has been a growing trend in using heterogeneous systems with CPUs and GPUs to solve diverse compute problems. However, high application performance on these platforms relies on efficient memory accesses. For many applications, CPUs and GPUs prefer ...
- research-article, May 2014
BenchFriend: Correlating the performance of GPU benchmarks
International Journal of High Performance Computing Applications (SAGE-HPCA), Volume 28, Issue 2, Pages 238–250, https://doi.org/10.1177/1094342013507960
Graphics processing units (GPUs) have become an important platform for general-purpose computing, thanks to their high parallel throughput and high memory bandwidth. GPUs present significantly different architectures from CPUs and require specific ...
- research-article, May 2013
Load balancing in a changing world: dealing with heterogeneity and performance variability
CF '13: Proceedings of the ACM International Conference on Computing Frontiers, Article No.: 21, Pages 1–10, https://doi.org/10.1145/2482767.2482794
Fully utilizing the power of modern heterogeneous systems requires judiciously dividing work across all of the available computational devices. Existing approaches for partitioning work require offline training and generate fixed partitions that fail to ...
- research-article, November 2011
Dymaxion: optimizing memory access patterns for heterogeneous systems
SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 13, Pages 1–11, https://doi.org/10.1145/2063384.2063401
Graphics processors (GPUs) have emerged as an important platform for general purpose computing. GPUs offer a large number of parallel cores and have access to high memory bandwidth; however, data structure layouts in GPU memory often lead to suboptimal ...
- Article, November 2011
Using cycle stacks to understand scaling bottlenecks in multi-threaded workloads
IISWC '11: Proceedings of the 2011 IEEE International Symposium on Workload Characterization, Pages 38–49, https://doi.org/10.1109/IISWC.2011.6114195
This paper proposes a methodology for analyzing parallel performance by building cycle stacks. A cycle stack quantifies where the cycles have gone, and provides hints towards optimization opportunities. We make the case that this is particularly ...
- Article, December 2010
A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads
IISWC '10: Proceedings of the IEEE International Symposium on Workload Characterization (IISWC'10), Pages 1–11, https://doi.org/10.1109/IISWC.2010.5650274
The recently released Rodinia benchmark suite enables users to evaluate heterogeneous systems including both accelerators, such as GPUs, and multicore CPUs. As Rodinia sees higher levels of acceptance, it becomes important that researchers understand ...
- Article, October 2009
Rodinia: A benchmark suite for heterogeneous computing
IISWC '09: Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), Pages 44–54, https://doi.org/10.1109/IISWC.2009.5306797
This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multi-core CPU and ...
- article, October 2008
A performance study of general-purpose applications on graphics processors using CUDA
Journal of Parallel and Distributed Computing (JPDC), Volume 68, Issue 10, Pages 1370–1380, https://doi.org/10.1016/j.jpdc.2008.05.014
Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of ...