- research-article, August 2024
Massively Parallel Inverse Block-sorting Transforms for bzip2 Decompression on GPUs
ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing, August 2024, Pages 856–865. https://doi.org/10.1145/3673038.3673067
Lossless data compression has evolved into an indispensable tool for reducing data transfer times in heterogeneous systems. However, performing decompression on host systems can create performance bottlenecks. Accelerator libraries, such as nvCOMP, ...
- research-article, July 2024, JUST ACCEPTED
- research-article, April 2024, JUST ACCEPTED
gem5-NVDLA: A Simulation Framework for Compiling, Scheduling and Architecture Evaluation on AI System-on-Chips
ACM Transactions on Design Automation of Electronic Systems (TODAES), Just Accepted. https://doi.org/10.1145/3661997
Recent years have seen an increasing trend in designing AI accelerators together with the rest of the system, including CPUs and memory hierarchy. This trend calls for high-quality simulators or analytical models that enable such kind of co-exploration. ...
- research-article, April 2024
GSCore: Efficient Radiance Field Rendering via Architectural Support for 3D Gaussian Splatting
ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, April 2024, Pages 497–511. https://doi.org/10.1145/3620666.3651385
This paper presents GSCore, a hardware acceleration unit that efficiently executes the rendering pipeline of 3D Gaussian Splatting with algorithmic optimizations. GSCore builds on the observations from an in-depth analysis of Gaussian-based radiance ...
- research-article, April 2024
IANUS: Integrated Accelerator based on NPU-PIM Unified Memory System
- Minseok Seo,
- Xuan Truong Nguyen,
- Seok Joong Hwang,
- Yongkee Kwon,
- Guhyun Kim,
- Chanwook Park,
- Ilkon Kim,
- Jaehan Park,
- Jeongbin Kim,
- Woojae Shin,
- Jongsoon Won,
- Haerang Choi,
- Kyuyoung Kim,
- Daehan Kwon,
- Chunseok Jeong,
- Sangheon Lee,
- Yongseok Choi,
- Wooseok Byun,
- Seungcheol Baek,
- Hyuk-Jae Lee,
- John Kim
ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, April 2024, Pages 545–560. https://doi.org/10.1145/3620666.3651324
Accelerating end-to-end inference of transformer-based large language models (LLMs) is a critical component of AI services in datacenters. However, the diverse compute characteristics of LLMs' end-to-end inference present challenges as previously ...
- keynote, April 2024
My Fifteen Year Journey of Deploying FPGA Accelerated Solutions
FPGA '24: Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, April 2024, Page 142. https://doi.org/10.1145/3626202.3644813
While FPGAs have been investigated for accelerating computing workloads in academia for many decades, industry started adopting FPGAs as an accelerator only in the last decade, but even those deployments have been fairly limited. This talk describes my ...
- research-article, February 2024
Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey
ACM Transactions on Architecture and Code Optimization (TACO), Volume 21, Issue 2, Article No.: 27, Pages 1–26. https://doi.org/10.1145/3640542
Performance in scientific and engineering applications such as computational physics, algebraic graph problems or Convolutional Neural Networks (CNN), is dominated by the manipulation of large sparse matrices: matrices with a large number of zero elements. ...
- research-article, January 2024
SparGD: A Sparse GEMM Accelerator with Dynamic Dataflow
ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 29, Issue 2, Article No.: 26, Pages 1–32. https://doi.org/10.1145/3634703
Deep learning has become a highly popular research field, and previously deep learning algorithms ran primarily on CPUs and GPUs. However, with the rapid development of deep learning, it was discovered that existing processors could not meet the specific ...
- research-article, December 2023
Symphony: Orchestrating Sparse and Dense Tensors with Hierarchical Heterogeneous Processing
- Michael Pellauer,
- Jason Clemons,
- Vignesh Balaji,
- Neal Crago,
- Aamer Jaleel,
- Donghyuk Lee,
- Mike O’Connor,
- Angshuman Parashar,
- Sean Treichler,
- Po-An Tsai,
- Stephen W. Keckler,
- Joel S. Emer
ACM Transactions on Computer Systems (TOCS), Volume 41, Issue 1-4, Article No.: 4, Pages 1–30. https://doi.org/10.1145/3630007
Sparse tensor algorithms are becoming widespread, particularly in the domains of deep learning, graph and data analytics, and scientific computing. Current high-performance broad-domain architectures, such as GPUs, often suffer memory system ...
- research-article, August 2024
Bang for the Buck: Evaluating the cost-effectiveness of Heterogeneous Edge Platforms for Neural Network Workloads
SEC '23: Proceedings of the Eighth ACM/IEEE Symposium on Edge Computing, December 2023, Pages 94–107. https://doi.org/10.1145/3583740.3628437
Machine learning (ML) applications have experienced remarkable growth and integration into various domains. However, challenges with cloud-based deployments, such as latency, privacy, reliability, bandwidth and connectivity, have driven the popularity of ...
- research-article, November 2023
Optimizing High-Performance Linpack for Exascale Accelerated Architectures
SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2023, Article No.: 49, Pages 1–12. https://doi.org/10.1145/3581784.3607066
We detail the performance optimizations made in rocHPL, AMD's open-source implementation of the High-Performance Linpack (HPL) benchmark targeting accelerated node architectures designed for exascale systems such as the Frontier supercomputer. The ...
- research-article, December 2023
Improving Data Reuse in NPU On-chip Memory with Interleaved Gradient Order for DNN Training
MICRO '23: Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, October 2023, Pages 438–451. https://doi.org/10.1145/3613424.3614299
During training tasks for machine learning models with neural processing units (NPUs), the most time-consuming part is the backward pass, which incurs significant overheads due to off-chip memory accesses. For NPUs, to mitigate the long latency and ...
- research-article, October 2023
FPGA-based Deep Learning Inference Accelerators: Where Are We Standing?
ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 16, Issue 4, Article No.: 60, Pages 1–32. https://doi.org/10.1145/3613963
Recently, artificial intelligence applications have become part of almost all emerging technologies around us. Neural networks, in particular, have shown significant advantages and have been widely adopted over other approaches in machine learning. In ...
- research-article, June 2023
CPU-free Computing: A Vision with a Blueprint
HOTOS '23: Proceedings of the 19th Workshop on Hot Topics in Operating Systems, June 2023, Pages 1–14. https://doi.org/10.1145/3593856.3595906
Since the inception of computing, we have been reliant on CPU-powered architectures. However, today this reliance is challenged by manufacturing limitations (CMOS scaling), performance expectations (stalled clocks, Turing tax), and security concerns (...
- research-article, June 2023
MTIA: First Generation Silicon Targeting Meta's Recommendation Systems
- Amin Firoozshahian,
- Joel Coburn,
- Roman Levenstein,
- Rakesh Nattoji,
- Ashwin Kamath,
- Olivia Wu,
- Gurdeepak Grewal,
- Harish Aepala,
- Bhasker Jakka,
- Bob Dreyer,
- Adam Hutchin,
- Utku Diril,
- Krishnakumar Nair,
- Ehsan K. Aredestani,
- Martin Schatz,
- Yuchen Hao,
- Rakesh Komuravelli,
- Kunming Ho,
- Sameer Abu Asal,
- Joe Shajrawi,
- Kevin Quinn,
- Nagesh Sreedhara,
- Pankaj Kansal,
- Willie Wei,
- Dheepak Jayaraman,
- Linda Cheng,
- Pritam Chopda,
- Eric Wang,
- Ajay Bikumandla,
- Arun Karthik Sengottuvel,
- Krishna Thottempudi,
- Ashwin Narasimha,
- Brian Dodds,
- Cao Gao,
- Jiyuan Zhang,
- Mohammed Al-Sanabani,
- Ana Zehtabioskuie,
- Jordan Fix,
- Hangchen Yu,
- Richard Li,
- Kaustubh Gondkar,
- Jack Montgomery,
- Mike Tsai,
- Saritha Dwarakapuram,
- Sanjay Desai,
- Nili Avidan,
- Poorvaja Ramani,
- Karthik Narayanan,
- Ajit Mathews,
- Sethu Gopal,
- Maxim Naumov,
- Vijay Rao,
- Krishna Noru,
- Harikrishna Reddy,
- Prahlad Venkatapuram,
- Alexis Bjorlin
ISCA '23: Proceedings of the 50th Annual International Symposium on Computer Architecture, June 2023, Article No.: 80, Pages 1–13. https://doi.org/10.1145/3579371.3589348
Meta has traditionally relied on using CPU-based servers for running inference workloads, specifically Deep Learning Recommendation Models (DLRM), but the increasing compute and memory requirements of these models have pushed the company towards using ...
- research-article, June 2023
Profiling Hyperscale Big Data Processing
- Abraham Gonzalez,
- Aasheesh Kolli,
- Samira Khan,
- Sihang Liu,
- Vidushi Dadu,
- Sagar Karandikar,
- Jichuan Chang,
- Krste Asanovic,
- Parthasarathy Ranganathan
ISCA '23: Proceedings of the 50th Annual International Symposium on Computer Architecture, June 2023, Article No.: 47, Pages 1–16. https://doi.org/10.1145/3579371.3589082
Computing demand continues to grow exponentially, largely driven by "big data" processing on hyperscale data stores. At the same time, the slowdown in Moore's law is leading the industry to embrace custom computing in large-scale systems. Taken together, ...
- research-article, June 2023
NeuRex: A Case for Neural Rendering Acceleration
ISCA '23: Proceedings of the 50th Annual International Symposium on Computer Architecture, June 2023, Article No.: 21, Pages 1–13. https://doi.org/10.1145/3579371.3589056
This paper presents NeuRex, an accelerator architecture that efficiently performs the modern neural rendering pipeline with an algorithmic enhancement and supporting hardware. NeuRex leverages the insights from an in-depth analysis of the state-of-the-...
- research-article, March 2023
Cohort: Software-Oriented Acceleration for Heterogeneous SoCs
ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, March 2023, Pages 105–117. https://doi.org/10.1145/3582016.3582059
Philosophically, our approaches to acceleration focus on the extreme. We must optimise accelerators to the maximum, leaving software to fix any hardware-software mismatches. Today’s software abstractions for programming accelerators leak hardware ...
- short-paper, February 2023
BOBBER: A Prototyping Platform for Batteryless Intermittent Accelerators
FPGA '23: Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, February 2023, Pages 221–228. https://doi.org/10.1145/3543622.3573046
Batteryless systems offer promising platforms to support pervasive, near-sensor intelligence in a sustainable manner. These systems solely rely on ambient energy sources that often provide limited power. One common approach to designing batteryless ...
- research-article, January 2023
Towards a Machine Learning-Assisted Kernel with LAKE
- Henrique Fingler,
- Isha Tarte,
- Hangchen Yu,
- Ariel Szekely,
- Bodun Hu,
- Aditya Akella,
- Christopher J. Rossbach
ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, January 2023, Pages 846–861. https://doi.org/10.1145/3575693.3575697
The complexity of modern operating systems (OSes), rapid diversification of hardware, and steady evolution of machine learning (ML) motivate us to explore the potential of ML to improve decision-making in OS kernels. We conjecture that ML can better ...