- research-article, February 2025, JUST ACCEPTED
Comprehensive Evaluation and Opportunity Discovery for Deterministic Concurrency Control
ACM Transactions on Architecture and Code Optimization (TACO), Just Accepted, https://doi.org/10.1145/3715126
Deterministic concurrency control (DCC) guarantees that the same input transactions produce the same serializable result. It offers benefits in both distributed databases and blockchain systems. Dozens of DCC algorithms have emerged in the past decade. ...
- research-article, November 2024
An Optimized GPU Implementation for GIST Descriptor
ACM Transactions on Architecture and Code Optimization (TACO), Volume 21, Issue 4, Article No.: 78, Pages 1–24, https://doi.org/10.1145/3689339
The GIST descriptor is a classic feature descriptor primarily used for scene categorization and recognition tasks. It drives a bank of Gabor filters, which respond to edges and textures at various scales and orientations to capture the spatial structures ...
- research-article, November 2024, JUST ACCEPTED
RT-GNN: Accelerating Sparse Graph Neural Networks by Tensor-CUDA Kernel Fusion
ACM Transactions on Architecture and Code Optimization (TACO), Just Accepted, https://doi.org/10.1145/3702001
Graph Neural Networks (GNNs) have achieved remarkable successes in various graph-based learning tasks, thanks to their ability to leverage advanced GPUs. However, GNNs currently face challenges arising from the concurrent use of advanced Tensor Cores (TCs)...
- research-article, November 2024, JUST ACCEPTED
ApSpGEMM: Accelerating Large-scale SpGEMM with Heterogeneous Collaboration and Adaptive Panel
ACM Transactions on Architecture and Code Optimization (TACO), Just Accepted, https://doi.org/10.1145/3703352
The Sparse General Matrix-Matrix multiplication (SpGEMM) is a fundamental component for many applications, such as algebraic multigrid methods (AMG), graphic processing, and deep learning. However, the unbearable latency of computing high-dimensional, ...
- research-article, November 2024, JUST ACCEPTED
Characterizing and Understanding HGNN Training on GPUs
ACM Transactions on Architecture and Code Optimization (TACO), Just Accepted, https://doi.org/10.1145/3703356
Owing to their remarkable representation capabilities for heterogeneous graph data, Heterogeneous Graph Neural Networks (HGNNs) have been widely adopted in many critical real-world domains such as recommendation systems and medical analysis. Prior to ...
- research-article, October 2024, JUST ACCEPTED
MemoriaNova: Optimizing Memory-Aware Model Inference for Edge Computing
ACM Transactions on Architecture and Code Optimization (TACO), Just Accepted, https://doi.org/10.1145/3701997
In recent years, deploying deep learning models on edge devices has become pervasive, driven by the increasing demand for intelligent edge computing solutions across various industries. From industrial automation to intelligent surveillance and healthcare,...
- research-article, September 2024
GraphSER: Distance-Aware Stream-Based Edge Repartition for Many-Core Systems
ACM Transactions on Architecture and Code Optimization (TACO), Volume 21, Issue 3, Article No.: 48, Pages 1–25, https://doi.org/10.1145/3661998
With the explosive growth of graph data, distributed graph processing has become popular, and many graph hardware accelerators use distributed frameworks. Graph partitioning is the foundation of distributed graph processing. However, dynamic changes in graph ...
- research-article, September 2024
An Example of Parallel Merkle Tree Traversal: Post-Quantum Leighton-Micali Signature on the GPU
ACM Transactions on Architecture and Code Optimization (TACO), Volume 21, Issue 3, Article No.: 44, Pages 1–25, https://doi.org/10.1145/3659209
The hash-based signature (HBS) is the most conservative and time-consuming among many post-quantum cryptography (PQC) algorithms. Two HBSs, LMS and XMSS, are the only PQC algorithms standardised by the National Institute of Standards and Technology (NIST) ...
- research-article, May 2024
TEA+: A Novel Temporal Graph Random Walk Engine with Hybrid Storage Architecture
- Chengying Huan,
- Yongchao Liu,
- Heng Zhang,
- Shuaiwen Song,
- Santosh Pandey,
- Shiyang Chen,
- Xiangfei Fang,
- Yue Jin,
- Baptiste Lepers,
- Yanjun Wu,
- Hang Liu
ACM Transactions on Architecture and Code Optimization (TACO), Volume 21, Issue 2, Article No.: 37, Pages 1–26, https://doi.org/10.1145/3652604
Many real-world networks are characterized by being temporal and dynamic, wherein the temporal information signifies the changes in connections, such as the addition or removal of links between nodes. Employing random walks on these temporal networks is a ...
- research-article, March 2024
Cost-aware Service Placement and Scheduling in the Edge-Cloud Continuum
ACM Transactions on Architecture and Code Optimization (TACO), Volume 21, Issue 2, Article No.: 29, Pages 1–24, https://doi.org/10.1145/3640823
The edge to data center computing continuum is the aggregation of computing resources located anywhere between the network edge (e.g., close to 5G antennas), and servers in traditional data centers. Kubernetes is the de facto standard for the ...
- research-article, February 2024
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-efficient Genome Analysis
- Can Firtina,
- Kamlesh Pillai,
- Gurpreet S. Kalsi,
- Bharathwaj Suresh,
- Damla Senol Cali,
- Jeremie S. Kim,
- Taha Shahroodi,
- Meryem Banu Cavlak,
- Joël Lindegger,
- Mohammed Alser,
- Juan Gómez Luna,
- Sreenivas Subramoney,
- Onur Mutlu
ACM Transactions on Architecture and Code Optimization (TACO), Volume 21, Issue 1, Article No.: 19, Pages 1–29, https://doi.org/10.1145/3632950
Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures, where states ...
- research-article, December 2023
Autovesk: Automatic Vectorized Code Generation from Unstructured Static Kernels Using Graph Transformations
ACM Transactions on Architecture and Code Optimization (TACO), Volume 21, Issue 1, Article No.: 4, Pages 1–25, https://doi.org/10.1145/3631709
Leveraging the SIMD capability of modern CPU architectures is mandatory to take full advantage of their increased performance. To exploit this capability, binary executables must be vectorized, either manually by developers or automatically by a tool. For ...
- research-article, December 2023
PARALiA: A Performance Aware Runtime for Auto-tuning Linear Algebra on Heterogeneous Systems
ACM Transactions on Architecture and Code Optimization (TACO), Volume 20, Issue 4, Article No.: 52, Pages 1–25, https://doi.org/10.1145/3624569
Dense linear algebra operations appear very frequently in high-performance computing (HPC) applications, rendering their performance crucial to achieve optimal scalability. As many modern HPC clusters contain multi-GPU nodes, BLAS operations are ...
- research-article, March 2023
Source Matching and Rewriting for MLIR Using String-Based Automata
ACM Transactions on Architecture and Code Optimization (TACO), Volume 20, Issue 2, Article No.: 22, Pages 1–26, https://doi.org/10.1145/3571283
A typical compiler flow relies on a uni-directional sequence of translation/optimization steps that lower the program abstract representation, making it hard to preserve higher-level program information across each transformation step. On the other hand, ...
- research-article, December 2022
XEngine: Optimal Tensor Rematerialization for Neural Networks in Heterogeneous Environments
ACM Transactions on Architecture and Code Optimization (TACO), Volume 20, Issue 1, Article No.: 17, Pages 1–25, https://doi.org/10.1145/3568956
Memory efficiency is crucial in training deep learning networks on resource-restricted devices. During backpropagation, forward tensors are used to calculate gradients. Despite the option of keeping those dependencies in memory until they are reused in ...
- research-article, March 2022
CARL: Compiler Assigned Reference Leasing
ACM Transactions on Architecture and Code Optimization (TACO), Volume 19, Issue 1, Article No.: 15, Pages 1–28, https://doi.org/10.1145/3498730
Data movement is a common performance bottleneck, and its chief remedy is caching. Traditional cache management is transparent to the workload: data that should be kept in cache are determined by the recency information only, while the program information,...
- research-article, March 2022
Weaving Synchronous Reactions into the Fabric of SSA-form Compilers
ACM Transactions on Architecture and Code Optimization (TACO), Volume 19, Issue 2, Article No.: 22, Pages 1–25, https://doi.org/10.1145/3506706
We investigate the programming of reactive systems combining closed-loop control with performance-intensive components such as Machine Learning (ML). Reactive control systems are often safety-critical and associated with real-time execution requirements, ...
- research-article, September 2021
SortCache: Intelligent Cache Management for Accelerating Sparse Data Workloads
ACM Transactions on Architecture and Code Optimization (TACO), Volume 18, Issue 4, Article No.: 56, Pages 1–24, https://doi.org/10.1145/3473332
Sparse data applications have irregular access patterns that stymie modern memory architectures. Although hyper-sparse workloads have received considerable attention in the past, moderately-sparse workloads prevalent in machine learning applications, ...
- research-article, July 2021
All-gather Algorithms Resilient to Imbalanced Process Arrival Patterns
ACM Transactions on Architecture and Code Optimization (TACO), Volume 18, Issue 4, Article No.: 41, Pages 1–22, https://doi.org/10.1145/3460122
Two novel algorithms for the all-gather operation resilient to imbalanced process arrival patterns (PATs) are presented. The first one, Background Disseminated Ring (BDR), is based on the regular parallel ring algorithm often supplied in MPI ...
- research-article, June 2021
KernelFaRer: Replacing Native-Code Idioms with High-Performance Library Calls
- João P. L. De Carvalho,
- Braedy Kuzma,
- Ivan Korostelev,
- José Nelson Amaral,
- Christopher Barton,
- José Moreira,
- Guido Araujo
ACM Transactions on Architecture and Code Optimization (TACO), Volume 18, Issue 3, Article No.: 38, Pages 1–22, https://doi.org/10.1145/3459010
Well-crafted libraries deliver much higher performance than code generated by sophisticated application programmers using advanced optimizing compilers. When a code pattern for which a well-tuned library implementation exists is found in the source code ...