Author: Che, Shuai : Search

21 Results for: Author: Che, Shuai

Searched The ACM Guide to Computing Literature (3,765,699 records)

Showing 1 - 20 of 21 Results

research-article
January 2021
Software-Defined Design Space Exploration for an Efficient DNN Accelerator Architecture
IEEE Transactions on Computers (ITCO), Volume 70, Issue 1, Pages 45–56, https://doi.org/10.1109/TC.2020.2983694
Deep neural networks (DNNs) have been shown to outperform conventional machine learning algorithms across a wide range of applications, e.g., image recognition, object detection, robotics, and natural language processing. However, the high computational ...
Total Citations: 6
research-article
Free
December 2020
Pushing the limits of narrow precision inferencing at cloud scale with Microsoft Floating Point
NIPS '20: Proceedings of the 34th International Conference on Neural Information Processing Systems, Article No.: 861, Pages 10271–10281

In this paper, we explore the limits of Microsoft Floating Point (MSFP), a new class of datatypes developed for production cloud-scale inferencing on custom hardware. Through the co-evolution of hardware design and algorithms, MSFP16 incurs 3 × lower ...
Total Citations: 1
Total Downloads: 174
Last 12 Months: 104
Last 6 weeks: 14
research-article
November 2017
Gravel: fine-grain GPU-initiated network messages
SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 23, Pages 1–12, https://doi.org/10.1145/3126908.3126914

Distributed systems incorporate GPUs because they provide massive parallelism in an energy-efficient manner. Unfortunately, existing programming models make it difficult to route a GPU-initiated network message. The traditional coprocessor model forces ...
Total Citations: 5
Total Downloads: 237
Last 12 Months: 4
Last 6 weeks: 0
article
June 2017
Programming GPGPU Graph Applications with Linear Algebra Building Blocks
International Journal of Parallel Programming (IJPP), Volume 45, Issue 3, Pages 657–679, https://doi.org/10.1007/s10766-016-0448-z

Graph applications are common in scientific and enterprise computing. Recent research used graphics processing units (GPUs) to accelerate graph workloads. These applications tend to present characteristics that are challenging for SIMD execution. To ...
Total Citations: 4
research-article
May 2017
Work Stealing in a Shared Virtual-Memory Heterogeneous Environment: A Case Study with Betweenness Centrality
CF'17: Proceedings of the Computing Frontiers Conference, Pages 164–173, https://doi.org/10.1145/3075564.3075567

This paper uses betweenness centrality as a case study to research efficient work stealing in a heterogeneous system environment. Betweenness centrality is an important algorithm in graph processing. It presents multiple-level parallelism and is an ...
Total Citations: 2
Total Downloads: 172
Last 12 Months: 13
Last 6 weeks: 1
research-article
October 2016
Challenges of Programming a System with Heterogeneous Memories and Heterogeneous Processors: A Programmer's View
MEMSYS '16: Proceedings of the Second International Symposium on Memory Systems, Pages 99–103, https://doi.org/10.1145/2989081.2989097

Recently there has been significant development and innovation in both frontiers of Heterogeneous Memory and Heterogeneous Compute domains. This paper summarizes the challenges, surveys related work, and proposes possible research directions to exploit ...
Total Citations: 1
Total Downloads: 244
Last 12 Months: 10
Last 6 weeks: 1
research-article
October 2016
Software Assisted Hardware Cache Coherence for Heterogeneous Processors
MEMSYS '16: Proceedings of the Second International Symposium on Memory Systems, Pages 279–288, https://doi.org/10.1145/2989081.2989092

Current trends suggest that future computing platforms will be increasingly heterogeneous. While these heterogeneous processors physically integrate disparate computing elements like CPUs and GPUs on a single chip, their programmability critically ...
Total Citations: 3
Total Downloads: 241
Last 12 Months: 41
Last 6 weeks: 5
short-paper
May 2016
Betweenness Centrality in an HSA-enabled System
HPGP '16: Proceedings of the ACM Workshop on High Performance Graph Processing, Pages 35–38, https://doi.org/10.1145/2915516.2915526

This paper studies different approaches to implementing betweenness centrality in a heterogeneous system. Betweenness centrality is an important algorithm in graph processing. It presents multiple levels of parallelism when processing a graph, and is an ...
Total Citations: 1
Total Downloads: 152
Last 12 Months: 1
Last 6 weeks: 0
research-article
March 2016
Implementing directed acyclic graphs with the heterogeneous system architecture
GPGPU '16: Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit, Pages 53–62, https://doi.org/10.1145/2884045.2884052

Achieving optimal performance on heterogeneous computing systems requires a programming model that supports the execution of asynchronous, multi-stream, and out-of-order tasks in a shared memory environment. Asynchronous dependency-driven tasking is one ...
Total Citations: 14
Total Downloads: 448
Last 12 Months: 33
Last 6 weeks: 3
Article
May 2015
Graph Coloring on the GPU and Some Techniques to Improve Load Imbalance
IPDPSW '15: Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, Pages 610–617, https://doi.org/10.1109/IPDPSW.2015.74

Graphics processing units (GPUs) have been increasingly used to accelerate irregular applications such as graph and sparse-matrix computation. Graph coloring is a key building block for many graph applications. The first step of many graph applications ...
Total Citations: 2
research-article
March 2015
Synchronization Using Remote-Scope Promotion
ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, Pages 73–86, https://doi.org/10.1145/2694344.2694350

Heterogeneous system architecture (HSA) and OpenCL define scoped synchronization to facilitate low overhead communication across a subset of threads. Scoped synchronization works well for static sharing patterns, where consumer threads are known a ...
Also Published in: ACM SIGPLAN Notices: Volume 50 Issue 4; ACM SIGARCH Computer Architecture News: Volume 43 Issue 1
Total Citations: 27
Total Downloads: 472
Last 12 Months: 28
Last 6 weeks: 2
Article
April 2015
SPEC ACCEL: A Standard Application Suite for Measuring Hardware Accelerator Performance
High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, Pages 46–67, https://doi.org/10.1007/978-3-319-17248-4_3
Hybrid nodes with hardware accelerators are becoming very common in systems today. Users often find it difficult to characterize and understand the performance advantage of such accelerators for their applications. The SPEC High Performance Group (...
Total Citations: 2
Article
May 2014
Dymaxion++: A Directive-Based API to Optimize Data Layout and Memory Mapping for Heterogeneous Systems
IPDPSW '14: Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, Pages 916–924, https://doi.org/10.1109/IPDPSW.2014.104

There has been a growing trend in using heterogeneous systems with CPUs and GPUs to solve diverse compute problems. However, high application performance on these platforms relies on efficient memory accesses. For many applications, CPUs and GPUs prefer ...
Total Citations: 1
research-article
May 2014
BenchFriend: Correlating the performance of GPU benchmarks
- Shuai Che,
- Kevin Skadron
International Journal of High Performance Computing Applications (SAGE-HPCA), Volume 28, Issue 2, Pages 238–250, https://doi.org/10.1177/1094342013507960

Graphics processing units (GPUs) have become an important platform for general-purpose computing, thanks to their high parallel throughput and high memory bandwidth. GPUs present significantly different architectures from CPUs and require specific ...
Total Citations: 11
research-article
May 2013
Load balancing in a changing world: dealing with heterogeneity and performance variability
CF '13: Proceedings of the ACM International Conference on Computing Frontiers, Article No.: 21, Pages 1–10, https://doi.org/10.1145/2482767.2482794

Fully utilizing the power of modern heterogeneous systems requires judiciously dividing work across all of the available computational devices. Existing approaches for partitioning work require offline training and generate fixed partitions that fail to ...
Total Citations: 45
Total Downloads: 391
Last 12 Months: 13
Last 6 weeks: 3
research-article
November 2011
Dymaxion: optimizing memory access patterns for heterogeneous systems
SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 13, Pages 1–11, https://doi.org/10.1145/2063384.2063401

Graphics processors (GPUs) have emerged as an important platform for general purpose computing. GPUs offer a large number of parallel cores and have access to high memory bandwidth; however, data structure layouts in GPU memory often lead to suboptimal ...
Total Citations: 83
Total Downloads: 855
Last 12 Months: 46
Last 6 weeks: 3
Article
November 2011
Using cycle stacks to understand scaling bottlenecks in multi-threaded workloads
IISWC '11: Proceedings of the 2011 IEEE International Symposium on Workload Characterization, Pages 38–49, https://doi.org/10.1109/IISWC.2011.6114195

This paper proposes a methodology for analyzing parallel performance by building cycle stacks. A cycle stack quantifies where the cycles have gone, and provides hints towards optimization opportunities. We make the case that this is particularly ...
Total Citations: 15
Article
December 2010
A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads
IISWC '10: Proceedings of the IEEE International Symposium on Workload Characterization (IISWC'10), Pages 1–11, https://doi.org/10.1109/IISWC.2010.5650274

The recently released Rodinia benchmark suite enables users to evaluate heterogeneous systems including both accelerators, such as GPUs, and multicore CPUs. As Rodinia sees higher levels of acceptance, it becomes important that researchers understand ...
Total Citations: 84
Article
October 2009
Rodinia: A benchmark suite for heterogeneous computing
IISWC '09: Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), Pages 44–54, https://doi.org/10.1109/IISWC.2009.5306797

This paper presents and characterizes Rodinia, a benchmark suite for heterogeneous computing. To help architects study emerging platforms such as GPUs (Graphics Processing Units), Rodinia includes applications and kernels which target multi-core CPU and ...
Total Citations: 936
article
October 2008
A performance study of general-purpose applications on graphics processors using CUDA
Journal of Parallel and Distributed Computing (JPDC), Volume 68, Issue 10, Pages 1370–1380, https://doi.org/10.1016/j.jpdc.2008.05.014

Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of ...
Total Citations: 116
