Author: Gao, Guang R : Search

research-article

Extending an asynchronous runtime system for high throughput applications: A case study

Journal of Parallel and Distributed Computing (JPDC), Volume 163, Issue C, Pages 214–231, https://doi.org/10.1016/j.jpdc.2022.01.027

Highlights

Asynchronous Many Task Runtimes effectively map to Big Data Domain Frameworks.

Abstract

Current supercomputers are mostly composed of vast numbers of nodes enhanced with accelerators (usually in the form of GPUs). However, having these heterogeneous designs at the forefront has exposed the software toolchains and ...

research-article

A Profile-Based AI-Assisted Dynamic Scheduling Approach for Heterogeneous Architectures

International Journal of Parallel Programming (IJPP), Volume 50, Issue 1, Pages 115–151, https://doi.org/10.1007/s10766-021-00721-2

Abstract

While heterogeneous architectures are increasingly popular with High Performance Computing systems, their effectiveness depends on how efficient the scheduler is at allocating workloads onto appropriate computing devices and how communication and ...

research-article

Public Access

E.T.: re-thinking self-attention for transformer models on GPUs

SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 25, Pages 1–18, https://doi.org/10.1145/3458817.3476138

Transformer-based deep learning models have become a ubiquitous vehicle to drive a variety of Natural Language Processing (NLP) related tasks beyond their accuracy ceiling. However, these models also suffer from two pronounced challenges, that is, ...

editorial

Guest Editorial: Special issue on Network and Parallel Computing for Emerging Architectures and Applications

International Journal of Parallel Programming (IJPP), Volume 49, Issue 5, Pages 625–627, https://doi.org/10.1007/s10766-021-00720-3

research-article

Open Access

Generating Fine-Grain Multithreaded Applications Using a Multigrain Approach

ACM Transactions on Architecture and Code Optimization (TACO), Volume 14, Issue 4, Article No.: 47, Pages 1–26, https://doi.org/10.1145/3155288

The recent evolution in the hardware landscape, aimed at producing high-performance computing systems capable of reaching extreme-scale performance, has reignited the interest in fine-grain multithreading, particularly at the intranode level. Indeed, ...

research-article

HAMR

International Journal of High Performance Computing Applications (SAGE-HPCA), Volume 31, Issue 5, Pages 361–374, https://doi.org/10.1177/1094342016672080

As the attention given to big data grows, cluster computing systems for distributed processing of large data sets become the mainstream and critical requirement in high performance distributed system research. One of the most successful systems is ...

research-article

Free

Leveraging access port positions to accelerate page table walk in DWM-based main memory

DATE '17: Proceedings of the Conference on Design, Automation & Test in Europe, Pages 1454–1459

Domain Wall Memory (DWM) with ultra-high density and comparable read/write latency to DRAM is an attractive replacement for CMOS-based devices. Unlike DRAM, DWM has non-uniform data access latency that is proportional to the number of shift operations. ...

Article

Toward a Parallel Turing Machine Model

Network and Parallel Computing, Pages 191–204, https://doi.org/10.1007/978-3-319-47099-3_16

Abstract

In the field of parallel computing, the late leader Ken Kennedy raised a concern in the early 1990s: “Is Parallel Computing Dead?” Now, we have witnessed the tremendous momentum of the “second spring” of parallel computing in recent years. But, ...

proceeding

Network and Parallel Computing: 13th IFIP WG 10.3 International Conference, NPC 2016, Xi'an, China, October 28-29, 2016, Proceedings

Article

Energy efficient multi-level tiling for dense matrix multiplication on many-core architecture

IGSC '15: Proceedings of the 2015 Sixth International Green and Sustainable Computing Conference (IGSC), Pages 1–6, https://doi.org/10.1109/IGCC.2015.7393735

With computing systems marching toward the exascale and big data era, power consumption has become more and more important for system design. Energy efficiency is becoming one of the critical dimensions in the computer system design space and has been ...

article

Author Rebuttal to Rocha et al. "Comments on Minimizing Buffer Requirements under Rate-Optimal Schedule in Regular Dataflow Networks"

Journal of Signal Processing Systems (JSPS), Volume 81, Issue 1, Pages 135–136, https://doi.org/10.1007/s11265-015-0980-x

research-article

FreshBreeze

Procedia Computer Science (PROCS), Volume 51, Issue C, Pages 2573–2582, https://doi.org/10.1016/j.procs.2015.05.365

The DDDAS paradigm, unifying applications, mathematical modeling, and sensors, is now more relevant than ever with the advent of Large-Scale/Big-Data and Big-Computing. Large-Scale-Dynamic-Data (advertised as the next wave of Big Data) includes the ...

research-article

TERAFLUX

Microprocessors & Microsystems (MSYS), Volume 38, Issue 8, Pages 976–990, https://doi.org/10.1016/j.micpro.2014.04.001

Scalable architecture for manycore, tera-device computing. Task-parallel programming models combining dataflow and stateful computations. Parallel simulation of large-scale multi-node architectures. Fault detection and recovery for task-...

book

Design Methods and Applications for Distributed Embedded Systems: IFIP 18th World Computer Congress, TC10 Working Conference on Distributed and ... 2004), 22-27 August, 2004 Toulouse, France

book

Design Methods and Applications for Distributed Embedded Systems: IFIP 18th World Computer Congress, TC10 Working Conference on Distributed and ... in Information and Communication Technology)

The ever decreasing price/performance ratio of microcontrollers makes it economically attractive to replace more and more conventional mechanical or electronic control systems within many products by embedded real-time computer systems. An embedded real-...

Article

An implementation of the codelet model

Euro-Par'13: Proceedings of the 19th international conference on Parallel Processing, Pages 633–644, https://doi.org/10.1007/978-3-642-40047-6_63

Chip architectures are shifting from few, faster, functionally heavy cores to abundant, slower, simpler cores to address pressing physical limitations such as energy consumption and heat expenditure. As architectural trends continue to fluctuate, we ...

article

StreamTMC: Stream compilation for tiled multi-core architectures

Journal of Parallel and Distributed Computing (JPDC), Volume 73, Issue 4, Pages 484–494, https://doi.org/10.1016/j.jpdc.2012.12.001

Tiled multi-core architectures have become an important kind of multi-core design because of their good scalability and low power consumption. Stream programming has been productively applied to a number of important application domains. It provides an ...

research-article

Software Pipelining for Stream Programs on Resource Constrained Multicore Architectures

IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 23, Issue 12, Pages 2338–2350, https://doi.org/10.1109/TPDS.2012.41

The stream programming model has been productively applied to a number of important application domains. Software pipelining is an important code scheduling technique for stream programs. However, the multicore evolution has presented a new dimension of ...

Article

Determinacy and Repeatability of Parallel Program Schemata

DFM '12: Proceedings of the 2012 Data-Flow Execution Models for Extreme Scale Computing, Pages 1–9, https://doi.org/10.1109/DFM.2012.10

The concept of "determinism" of parallel programs and parallel systems has received a lot of attention since the dawn of computing, with multiple proposals for formal and informal definitions of deterministic execution. In this paper, we present precise ...

Article

Source Code Partitioning in Program Optimization

ICPADS '11: Proceedings of the 2011 IEEE 17th International Conference on Parallel and Distributed Systems, Pages 56–63, https://doi.org/10.1109/ICPADS.2011.125

Program analysis and program optimization seek to improve program performance. There are optimization techniques which are applied to various scopes such as a source file, function or basic block. Inter-procedural program optimization techniques have ...
