- research-article, August 2024
PRoof: A Comprehensive Hierarchical Profiling Framework for Deep Neural Networks with Roofline Analysis
ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing, August 2024, Pages 822–832. https://doi.org/10.1145/3673038.3673116
The increasing diversity of deep neural network (DNN) models and hardware platforms necessitates effective model profiling for high-performance inference deployment. Current DNN profiling tools suffer from either limited optimization insights due to the ...
Jigsaw: Accelerating SpMM with Vector Sparsity on Sparse Tensor Core
ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing, August 2024, Pages 1124–1134. https://doi.org/10.1145/3673038.3673108
As deep learning models continue to grow larger, model pruning is employed to reduce memory footprint and computation complexity, which generates a large number of sparse matrix-matrix multiplications (SpMM) with unstructured sparsity (e.g., vector ...
- research-article, February 2024
Tetris: Accelerating Sparse Convolution by Exploiting Memory Reuse on GPU
PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, March 2024, Pages 229–242. https://doi.org/10.1145/3627535.3638471
Convolutional neural networks (CNNs) have achieved remarkable success in various application fields. Although model compression techniques mitigate the ever-increasing resource demands of large CNN models, the compressed models usually exhibit irregular ...
EasyScale: Elastic Training with Consistent Accuracy and Improved Utilization on GPUs
- Mingzhen Li,
- Wencong Xiao,
- Hailong Yang,
- Biao Sun,
- Hanyu Zhao,
- Shiru Ren,
- Zhongzhi Luan,
- Xianyan Jia,
- Yi Liu,
- Yong Li,
- Wei Lin,
- Depei Qian
SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2023, Article No.: 55, Pages 1–14. https://doi.org/10.1145/3581784.3607054
Distributed synchronized GPU training is commonly used for deep learning. The resource constraint of using a fixed number of GPUs makes large-scale training jobs suffer from long queuing time for resource allocation, and lowers the cluster utilization. ...
- research-article, November 2023
TrivialSpy: Identifying Software Triviality via Fine-grained and Dataflow-based Value Profiling
SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2023, Article No.: 90, Pages 1–13. https://doi.org/10.1145/3581784.3607052
Trivial operations cause software inefficiencies that waste functional units and memory bandwidth executing useless instructions. Although previous works have identified a significant amount of trivial operations in widely used programs, the proposed ...
- Article, March 2024
gGMED: Towards GPU Accelerated Geometric Modeling Evaluation and Derivative Processes
- research-article, October 2023
Adaptive Auto-Tuning Framework for Global Exploration of Stencil Optimization on GPUs
IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume 35, Issue 1, Jan. 2024, Pages 20–33. https://doi.org/10.1109/TPDS.2023.3325630
Stencil computations are widely used in high performance computing (HPC) applications. Many HPC platforms utilize the high computation capability of GPUs to accelerate stencil computations. In recent years, stencils have become more diverse in terms of ...
- research-article, September 2023
Towards optimized tensor code generation for deep learning on sunway many-core processor
- Jianjin Liao,
- Xuegui Zheng,
- Hailong Yang,
- Rujun Sun,
- Jun Xu,
- Lin Gan,
- Guangwen Yang,
- Zhongzhi Luan,
- Depei Qian,
- Mingzhen Li,
- Changxi Liu
Frontiers of Computer Science: Selected Publications from Chinese Universities (FCS), Volume 18, Issue 2, Apr 2024. https://doi.org/10.1007/s11704-022-2440-7
Abstract: The flourishing of deep learning frameworks and hardware platforms demands an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among the existing deep learning ...
- research-article, September 2023
Improving Log-Based Anomaly Detection by Pre-Training Hierarchical Transformers
IEEE Transactions on Computers (ITCO), Volume 72, Issue 9, Sept. 2023, Pages 2656–2667. https://doi.org/10.1109/TC.2023.3257518
Pre-trained models, such as BERT, have resulted in significant improvements in many natural language processing (NLP) applications. However, due to differences in word distribution and domain ...
- research-article, September 2023
Exploiting Subgraph Similarities for Efficient Auto-tuning of Tensor Programs
ICPP '23: Proceedings of the 52nd International Conference on Parallel Processing, August 2023, Pages 786–796. https://doi.org/10.1145/3605573.3605596
The requirement for deploying deep learning (DL) models efficiently has boosted the research of DL compilers. In particular, the difficulty of generating optimized tensor programs has driven DL compilers to commonly adopt auto-tuning approaches. ...
- discussion, August 2023
Input-Aware Sparse Tensor Storage Format Selection for Optimizing MTTKRP
This installment of Computer’s series highlighting the work published in IEEE Computer Society journals comes from IEEE Transactions on Computers.
- research-article, June 2023
HAOTuner: A Hardware Adaptive Operator Auto-Tuner for Dynamic Shape Tensor Compilers
IEEE Transactions on Computers (ITCO), Volume 72, Issue 11, Nov. 2023, Pages 3178–3190. https://doi.org/10.1109/TC.2023.3288758
Deep learning compilers with auto-tuners have the ability to generate high-performance programs, particularly tensor programs on accelerators. However, the performance of these tensor programs is shape-sensitive and hardware resource-sensitive. When the ...
- research-article, June 2023
BiRFIA: Selective Binary Rewriting for Function Interception on ARM
ICS '23: Proceedings of the 37th ACM International Conference on Supercomputing, June 2023, Pages 87–98. https://doi.org/10.1145/3577193.3593701
Function interception of fully-optimized binaries is widely used for optimization, with its ability to accurately collect runtime information and detect inefficiencies at the function level. However, the implementation of function interception with ...
- research-article, June 2023
LogEncoder: Log-Based Contrastive Representation Learning for Anomaly Detection
IEEE Transactions on Network and Service Management (ITNSM), Volume 20, Issue 2, June 2023, Pages 1378–1391. https://doi.org/10.1109/TNSM.2023.3239522
In recent years, cloud computing centers have grown rapidly in size. Analyzing system logs is an important way to monitor quality of service. However, systems produce massive amounts of logs, and it is impractical to analyze them manually. ...
- research-article, January 2023
VClinic: A Portable and Efficient Framework for Fine-Grained Value Profilers
ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, January 2023, Pages 892–904. https://doi.org/10.1145/3575693.3576934
Fine-grained value profilers reveal a promising way to accurately detect value-related software inefficiencies with binary instrumentation. Due to the architecture-dependent implementation details of binary instrumentation, existing value profilers ...
- research-article, November 2022
CoGNN: efficient scheduling for concurrent GNN training on GPUs
- Qingxiao Sun,
- Yi Liu,
- Hailong Yang,
- Ruizhe Zhang,
- Ming Dun,
- Mingzhen Li,
- Xiaoyan Liu,
- Wencong Xiao,
- Yong Li,
- Zhongzhi Luan,
- Depei Qian
SC '22: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, November 2022, Article No.: 39, Pages 1–15
Graph neural networks (GNNs) suffer from low GPU utilization due to frequent memory accesses. Existing concurrent training mechanisms cannot be directly adapted to GNNs because they fail to consider the impact of input irregularity. This requires pre-...
- research-article, November 2022
swSpAMM: optimizing large-scale sparse approximate matrix multiplication on Sunway Taihulight
Frontiers of Computer Science: Selected Publications from Chinese Universities (FCS), Volume 17, Issue 4, Aug 2023. https://doi.org/10.1007/s11704-022-1749-6
Abstract: Although matrix multiplication plays an essential role in a wide range of applications, previous works only focus on optimizing dense or sparse matrix multiplications. The Sparse Approximate Matrix Multiply (SpAMM) is an algorithm to accelerate ...
- research-article, January 2023
Black-Box Attacks to Log-Based Anomaly Detection
CNSM '22: Proceedings of the 18th International Conference on Network and Service Management, October 2022, Article No.: 61, Pages 1–7
Anomaly detection is the key to Quality of Service (QoS) in many modern systems. Logs, which record the run-time information of the system, are widely used for anomaly detection. The security of log-based anomaly detection has not been well ...
- research-article, October 2022
QoS-aware dynamic resource allocation with improved utilization and energy efficiency on GPU
Abstract: Although GPUs have been indispensable in data centers, meeting the Quality of Service (QoS) under task consolidation on GPU is extremely challenging. Previous works mostly rely on static task or resource scheduling and cannot ...
- research-article, January 2023
Vectorizing SpMV by Exploiting Dynamic Regular Patterns
ICPP '22: Proceedings of the 51st International Conference on Parallel Processing, August 2022, Article No.: 53, Pages 1–12. https://doi.org/10.1145/3545008.3545042
Modern optimizing compilers can exploit memory access and computation patterns to generate vectorized code. However, such patterns in irregular programs such as SpMV are unknown until runtime due to the input dependence. Thus, either the compiler's static ...