- research-article, August 2024
PRoof: A Comprehensive Hierarchical Profiling Framework for Deep Neural Networks with Roofline Analysis
ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing, August 2024, Pages 822–832. https://doi.org/10.1145/3673038.3673116
The increasing diversity of deep neural network (DNN) models and hardware platforms necessitates effective model profiling for high-performance inference deployment. Current DNN profiling tools suffer from either limited optimization insights due to the ...
Jigsaw: Accelerating SpMM with Vector Sparsity on Sparse Tensor Core
ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing, August 2024, Pages 1124–1134. https://doi.org/10.1145/3673038.3673108
As deep learning models continue to grow larger, model pruning is employed to reduce memory footprint and computation complexity, which generates a large number of sparse matrix-matrix multiplications (SpMM) with unstructured sparsity (e.g., vector ...
- research-article, February 2024
Tetris: Accelerating Sparse Convolution by Exploiting Memory Reuse on GPU
PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, March 2024, Pages 229–242. https://doi.org/10.1145/3627535.3638471
Convolutional neural networks (CNNs) have achieved remarkable success in various application fields. Although model compression techniques mitigate the ever-increasing resource demands of large CNN models, the compressed models usually exhibit irregular ...
EasyScale: Elastic Training with Consistent Accuracy and Improved Utilization on GPUs
- Mingzhen Li,
- Wencong Xiao,
- Hailong Yang,
- Biao Sun,
- Hanyu Zhao,
- Shiru Ren,
- Zhongzhi Luan,
- Xianyan Jia,
- Yi Liu,
- Yong Li,
- Wei Lin,
- Depei Qian
SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2023, Article No.: 55, Pages 1–14. https://doi.org/10.1145/3581784.3607054
Distributed synchronized GPU training is commonly used for deep learning. The resource constraint of using a fixed number of GPUs makes large-scale training jobs suffer from long queuing time for resource allocation, and lowers the cluster utilization. ...
- research-article, November 2023
TrivialSpy: Identifying Software Triviality via Fine-grained and Dataflow-based Value Profiling
SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2023, Article No.: 90, Pages 1–13. https://doi.org/10.1145/3581784.3607052
Trivial operations cause software inefficiencies that waste functional units and memory bandwidth executing useless instructions. Although previous works have identified a significant amount of trivial operations in widely used programs, the proposed ...
- Article, March 2024
gGMED: Towards GPU Accelerated Geometric Modeling Evaluation and Derivative Processes
- research-article, October 2023
Adaptive Auto-Tuning Framework for Global Exploration of Stencil Optimization on GPUs
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 35, Issue 1, Jan. 2024, Pages 20–33. https://doi.org/10.1109/TPDS.2023.3325630
Stencil computations are widely used in high performance computing (HPC) applications. Many HPC platforms utilize the high computation capability of GPUs to accelerate stencil computations. In recent years, stencils have become more diverse in terms of ...
- research-article, September 2023
Towards optimized tensor code generation for deep learning on sunway many-core processor
- Jianjin Liao,
- Xuegui Zheng,
- Hailong Yang,
- Rujun Sun,
- Jun Xu,
- Lin Gan,
- Guangwen Yang,
- Zhongzhi Luan,
- Depei Qian,
- Mingzhen Li,
- Changxi Liu
Frontiers of Computer Science: Selected Publications from Chinese Universities (FCS), Volume 18, Issue 2, Apr 2024. https://doi.org/10.1007/s11704-022-2440-7
Abstract: The flourishing of deep learning frameworks and hardware platforms demands an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among the existing deep learning ...
- research-article, September 2023
Improving Log-Based Anomaly Detection by Pre-Training Hierarchical Transformers
IEEE Transactions on Computers (ITCO), Volume 72, Issue 9, Sept. 2023, Pages 2656–2667. https://doi.org/10.1109/TC.2023.3257518
Pre-trained models, such as BERT, have resulted in significant improvements in many natural language processing (NLP) applications. However, due to differences in word distribution and domain ...
- research-article, September 2023
Exploiting Subgraph Similarities for Efficient Auto-tuning of Tensor Programs
ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing, August 2023, Pages 786–796. https://doi.org/10.1145/3605573.3605596
The requirement for deploying deep learning (DL) models efficiently has boosted the research of DL compilers. In particular, the difficulty of generating optimized tensor programs has driven DL compilers to commonly adopt auto-tuning approaches. ...
- discussion, August 2023
Input-Aware Sparse Tensor Storage Format Selection for Optimizing MTTKRP
This installment of Computer’s series highlighting the work published in IEEE Computer Society journals comes from IEEE Transactions on Computers.
- research-article, June 2023
HAOTuner: A Hardware Adaptive Operator Auto-Tuner for Dynamic Shape Tensor Compilers
IEEE Transactions on Computers (ITCO), Volume 72, Issue 11, Nov. 2023, Pages 3178–3190. https://doi.org/10.1109/TC.2023.3288758
Deep learning compilers with auto-tuners have the ability to generate high-performance programs, particularly tensor programs on accelerators. However, the performance of these tensor programs is shape-sensitive and hardware resource-sensitive. When the ...
- research-article, June 2023
BiRFIA: Selective Binary Rewriting for Function Interception on ARM
ICS '23: Proceedings of the 37th ACM International Conference on Supercomputing, June 2023, Pages 87–98. https://doi.org/10.1145/3577193.3593701
Function interception of fully-optimized binaries is widely used for optimization, with its ability to accurately collect runtime information and detect inefficiencies at the function level. However, the implementation of function interception with ...
- research-article, June 2023
LogEncoder: Log-Based Contrastive Representation Learning for Anomaly Detection
IEEE Transactions on Network and Service Management (ITNSM), Volume 20, Issue 2, June 2023, Pages 1378–1391. https://doi.org/10.1109/TNSM.2023.3239522
In recent years, cloud computing centers have grown rapidly in size. Analyzing system logs is an important way to monitor quality of service. However, systems produce massive amounts of logs, and it is impractical to analyze them manually. ...
- research-article, January 2023
VClinic: A Portable and Efficient Framework for Fine-Grained Value Profilers
ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, January 2023, Pages 892–904. https://doi.org/10.1145/3575693.3576934
Fine-grained value profilers reveal a promising way to accurately detect value-related software inefficiencies with binary instrumentation. Due to the architecture-dependent implementation details of binary instrumentation, existing value profilers ...
- research-article, November 2022
CoGNN: efficient scheduling for concurrent GNN training on GPUs
- Qingxiao Sun,
- Yi Liu,
- Hailong Yang,
- Ruizhe Zhang,
- Ming Dun,
- Mingzhen Li,
- Xiaoyan Liu,
- Wencong Xiao,
- Yong Li,
- Zhongzhi Luan,
- Depei Qian
SC '22: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, November 2022, Article No.: 39, Pages 1–15
Graph neural networks (GNNs) suffer from low GPU utilization due to frequent memory accesses. Existing concurrent training mechanisms cannot be directly adapted to GNNs because they fail to consider the impact of input irregularity. This requires pre-...
- research-article, November 2022
swSpAMM: optimizing large-scale sparse approximate matrix multiplication on Sunway Taihulight
Frontiers of Computer Science: Selected Publications from Chinese Universities (FCS), Volume 17, Issue 4, Aug 2023. https://doi.org/10.1007/s11704-022-1749-6
Abstract: Although matrix multiplication plays an essential role in a wide range of applications, previous works only focus on optimizing dense or sparse matrix multiplications. The Sparse Approximate Matrix Multiply (SpAMM) is an algorithm to accelerate ...
- research-article, January 2023
Black-Box Attacks to Log-Based Anomaly Detection
CNSM '22: Proceedings of the 18th International Conference on Network and Service Management, October 2022, Article No.: 61, Pages 1–7
Anomaly detection is the key to Quality of Service (QoS) in many modern systems. Logs, which record the run-time information of the system, are widely used for anomaly detection. The security of log-based anomaly detection has not been well ...
- research-article, October 2022
QoS-aware dynamic resource allocation with improved utilization and energy efficiency on GPU
Abstract: Although GPUs have been indispensable in data centers, meeting the Quality of Service (QoS) under task consolidation on GPU is extremely challenging. Previous works mostly rely on static task or resource scheduling and cannot ...
- research-article, January 2023
Vectorizing SpMV by Exploiting Dynamic Regular Patterns
ICPP '22: Proceedings of the 51st International Conference on Parallel Processing, August 2022, Article No.: 53, Pages 1–12. https://doi.org/10.1145/3545008.3545042
Modern optimizing compilers can exploit memory access and computation patterns to generate vectorized code. However, such patterns in irregular programs such as SpMV are unknown until runtime due to the input dependence. Thus, either the compiler's static ...