Welcome to the third volume of ASPLOS'24: the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. This document is mostly dedicated to the 2024 fall cycle but also provides some statistics summarizing all three cycles.
We introduced several notable changes to ASPLOS this year, most of which were discussed in our previous messages from the program chairs in Volumes 1 and 2, including: (1) significantly increasing the program committee size to over 220 members (more than twice the size of last year's); (2) foregoing synchronous program committee (PC) meetings and instead making all decisions online; (3) overhauling the review assignment process; (4) developing an automated script that identifies submission format violations, uncovering, e.g., disallowed vertical space manipulations that "squeeze" space; (5) introducing the new ASPLOS role of Program Vice Chairs to cope with the increased number of submissions and the added load caused by foregoing synchronous PC meetings; and (6) characterizing a systemic problem that ASPLOS faces in reviewing quantum computing submissions, describing how we addressed it, and highlighting how we believe it should be handled in the future.
Assuming readers have read our previous messages, here we describe only the differences between the current cycle and the previous ones. These include: (1) finally unifying the submission and acceptance paper formatting instructions (forgoing the `jpaper' class) to rid authors of accepted papers of the need to reformat; (2) describing the methodology we employed to select best papers, which we believe ensures quality and hope will persist; and (3) reporting the ethical incidents we encountered and how we handled them. In the final, fourth volume, when the outcome of the ASPLOS'24 fall major revisions becomes known, we plan to conduct a broader analysis of all the data we have gathered throughout the year.
Following are some key statistics of the fall cycle: 340 submissions were finalized (43% more than last year's fall count and 17% fewer than our summer cycle's), of which 111 are related to accelerators/FPGAs/GPUs, 105 to machine learning, 54 to security, 50 to datacenter/cloud, and 50 to storage/memory; 183 (54%) submissions were promoted to the second review round; 39 (11.5%) papers were accepted (of which 19 were awarded artifact evaluation badges); the authors of 33 (9.7%) submissions were invited to submit major revisions, which are currently under review (these will be addressed in the fourth volume of ASPLOS'24 and will be presented at ASPLOS'25 if accepted); 1,368 reviews were uploaded; and 4,949 comments were generated during online discussions, of which 4,070 were dedicated to the submissions that made it to the second review round.
This year, in the submission form, we asked authors to specify which of the three ASPLOS research areas relate to their submitted work. Analyzing this data revealed that 80%, 39%, and 29% of the submissions were categorized by their authors as related to architecture, operating systems, and programming languages, respectively, the largest gap between architecture and the other two areas that we have observed across the cycles. About 46% of the fall submissions are "interdisciplinary," that is, associated with two or more of the three areas.
Overall, throughout all the ASPLOS'24 cycles, we received 922 submissions, constituting a 1.54x increase compared to last year. Our reviewers submitted a total of 3,634 reviews containing more than 2.6 million words, and we also generated 12,655 online comments consisting of nearly 1.2 million words. As planned, PC members submitted an average of 15.7 reviews and a median of 15, and external review committee (ERC) members submitted an average of 4.7 and a median of 5.
We accepted 170 papers thus far, written by 1100 authors, leading to an 18.4% acceptance rate, with the aforementioned 33 major revisions still under review. Assuming that the revision acceptance rate will be similar to that of previous cycles, we estimate that ASPLOS'24 will accept nearly 200 (!) papers, namely, 21%–22% of the submissions.
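The projection above follows directly from the reported counts. A minimal sketch in Python reproduces the "nearly 200 papers, 21%–22%" estimate; the 80% and 90% revision acceptance rates used here are our own bracketing assumptions, since the text cites "previous cycles" without giving an exact figure:

```python
# Sanity-check the ASPLOS'24 acceptance-rate projection from the reported counts.
submitted = 922          # total submissions across all three cycles
accepted_so_far = 170    # papers accepted before the fall revisions resolve
pending_revisions = 33   # fall major revisions still under review

rate_so_far = accepted_so_far / submitted
print(f"so far: {rate_so_far:.1%}")  # -> 18.4%

# Hypothetical revision acceptance rates (assumed, not stated in the text)
for rev_rate in (0.80, 0.90):
    total = accepted_so_far + round(pending_revisions * rev_rate)
    print(f"{rev_rate:.0%} of revisions accepted -> "
          f"{total} papers, {total / submitted:.1%} overall")
```

At 80% the projected total is 196 papers (21.3%) and at 90% it is 200 papers (21.7%), consistent with the chairs' 21%–22% estimate.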
The ASPLOS'24 program consists of 193 papers: the 170 papers we accepted thus far and, in addition, 23 major revisions from the fall cycle of ASPLOS'23, which were re-reviewed and accepted. The full details are available in the PDF of the front matter.
Proceeding Downloads
Societal infrastructure in the age of Artificial General Intelligence
Today, we are at an inflection point in computing where emerging Generative AI services are placing unprecedented demand for compute while the existing architectural patterns for improving efficiency have stalled. In this talk, we will discuss the likely ...
Challenges and Opportunities for Systems Using CXL Memory
We are at the start of the technology cycle for compute express link (CXL) memory, which is a significant opportunity and challenge for architecture, operating systems, and programming languages. The 3.0 CXL specification allows multiple, physically ...
Harnessing the Power of Specialization for Sustainable Computing
Computing is critical to address some of the most pressing needs of humanity today, including climate change mitigation and adaptation. However, it is also the source of a significant and steadily increasing carbon toll, attributed in part to the ...
AWS Trainium: The Journey for Designing and Optimizing Full Stack ML Hardware
Machine learning accelerators present a unique set of design challenges across chip architecture, instruction set, server design, compiler, and both inter- and intra-chip connectivity. With AWS Trainium, we've utilized AWS's end-to-end ownership from ...
8-bit Transformer Inference and Fine-tuning for Edge Accelerators
Transformer models achieve state-of-the-art accuracy on natural language processing (NLP) and vision tasks, but demand significant computation and memory resources, which makes it difficult to perform inference and training (fine-tuning) on edge ...
A Midsummer Night’s Tree: Efficient and High Performance Secure SCM
Secure memory is a highly desirable property to prevent memory corruption-based attacks. The emergence of nonvolatile, storage class memory (SCM) devices presents new challenges for secure memory. Metadata for integrity verification, organized in a ...
A shared compilation stack for distributed-memory parallelism in stencil DSLs
- George Bisbas,
- Anton Lydike,
- Emilien Bauer,
- Nick Brown,
- Mathieu Fehr,
- Lawrence Mitchell,
- Gabriel Rodriguez-Canal,
- Maurice Jamieson,
- Paul H. J. Kelly,
- Michel Steuwer,
- Tobias Grosser
Domain Specific Languages (DSLs) increase programmer productivity and provide high performance. Their targeted abstractions allow scientists to express problems at a high level, providing rich details that optimizing compilers can exploit to target ...
Accelerating Multi-Scalar Multiplication for Efficient Zero Knowledge Proofs with Multi-GPU Systems
Zero-knowledge proof is a cryptographic primitive that allows for the validation of statements without disclosing any sensitive information, foundational in applications like verifiable outsourcing and digital currency. However, the extensive proof ...
ACES: Accelerating Sparse Matrix Multiplication with Adaptive Execution Flow and Concurrency-Aware Cache Optimizations
Sparse matrix-matrix multiplication (SpMM) is a critical computational kernel in numerous scientific and machine learning applications. SpMM involves massive irregular memory accesses and poses great challenges to conventional cache-based computer ...
AdaPipe: Optimizing Pipeline Parallelism with Adaptive Recomputation and Partitioning
Large language models (LLMs) have demonstrated powerful capabilities, requiring huge memory with their increasing sizes and sequence lengths, thus demanding larger parallel systems. The broadly adopted pipeline parallelism introduces even heavier and ...
AERO: Adaptive Erase Operation for Improving Lifetime and Performance of Modern NAND Flash-Based SSDs
This work investigates a new erase scheme in NAND flash memory to improve the lifetime and performance of modern solid-state drives (SSDs). In NAND flash memory, an erase operation applies a high voltage (e.g., > 20 V) to flash cells for a long time (...
AUDIBLE: A Convolution-Based Resource Allocator for Oversubscribing Burstable Virtual Machines
- Seyedali Jokar Jandaghi,
- Kaveh Mahdaviani,
- Amirhossein Mirhosseini,
- Sameh Elnikety,
- Cristiana Amza,
- Bianca Schroeder
In an effort to increase the utilization of data center resources, cloud providers have introduced a new type of virtual machine (VM) offering, called a burstable VM (BVM). Our work is the first to study the characteristics of burstable VMs (based on ...
BeeZip: Towards An Organized and Scalable Architecture for Data Compression
Data compression plays a critical role in operating systems and large-scale computing workloads. Its primary objective is to reduce network bandwidth consumption and memory/storage capacity utilization. Given the need to manipulate hash tables, and ...
Boost Linear Algebra Computation Performance via Efficient VNNI Utilization
Intel's Vector Neural Network Instruction (VNNI) provides higher efficiency on calculating dense linear algebra (DLA) computations than conventional SIMD instructions. However, existing auto-vectorizers frequently deliver suboptimal utilization of VNNI ...
C4CAM: A Compiler for CAM-based In-memory Accelerators
- Hamid Farzaneh,
- Joao Paulo Cardoso De Lima,
- Mengyuan Li,
- Asif Ali Khan,
- Xiaobo Sharon Hu,
- Jeronimo Castrillon
Machine learning and data analytics applications increasingly suffer from the high latency and energy consumption of conventional von Neumann architectures. Recently, several in-memory and near-memory systems have been proposed to overcome this von ...
Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
Efficiently training large language models (LLMs) necessitates the adoption of hybrid parallel methods, integrating multiple communications collectives within distributed partitioned graphs. Overcoming communication bottlenecks is crucial and is often ...
Characterizing a Memory Allocator at Warehouse Scale
- Zhuangzhuang Zhou,
- Vaibhav Gogte,
- Nilay Vaish,
- Chris Kennelly,
- Patrick Xia,
- Svilen Kanev,
- Tipp Moseley,
- Christina Delimitrou,
- Parthasarathy Ranganathan
Memory allocation constitutes a substantial component of warehouse-scale computation. Optimizing the memory allocator not only reduces the datacenter tax, but also improves application performance, leading to significant cost savings.
We present the ...
Characterizing Power Management Opportunities for LLMs in the Cloud
- Pratyush Patel,
- Esha Choukse,
- Chaojie Zhang,
- Íñigo Goiri,
- Brijesh Warrier,
- Nithish Mahalingam,
- Ricardo Bianchini
Recent innovations in large language models (LLMs) and their myriad use cases have rapidly driven up the compute demand for datacenter GPUs. Several cloud providers and other enterprises plan to substantially grow their datacenter capacity to support ...
CSSTs: A Dynamic Data Structure for Partial Orders in Concurrent Execution Analysis
Dynamic analyses are a standard approach to analyzing and testing concurrent programs. Such techniques observe program traces σ and analyze them to infer the presence or absence of bugs. At its core, each analysis maintains a partial order P that ...
Dr. DNA: Combating Silent Data Corruptions in Deep Learning using Distribution of Neuron Activations
Deep neural networks (DNNs) have been widely adopted in various safety-critical applications such as computer vision and autonomous driving. However, as technology scales and applications diversify, coupled with the increasing heterogeneity of underlying ...
DTC-SpMM: Bridging the Gap in Accelerating General Sparse Matrix Multiplication with Tensor Cores
Sparse Matrix-Matrix Multiplication (SpMM) is a building-block operation in scientific computing and machine learning applications. Recent advancements in hardware, notably Tensor Cores (TCs), have created promising opportunities for accelerating SpMM. ...
Energy-Adaptive Buffering for Efficient, Responsive, and Persistent Batteryless Systems
Batteryless energy harvesting systems enable a wide array of new sensing, computation, and communication platforms untethered by power delivery or battery maintenance demands. Energy harvesters charge a buffer capacitor from an unreliable environmental ...
Enforcing C/C++ Type and Scope at Runtime for Control-Flow and Data-Flow Integrity
Control-flow hijacking and data-oriented attacks are becoming more sophisticated. These attacks, especially data-oriented attacks, can result in critical security threats, such as leaking an SSL key. Data-oriented attacks are hard to defend against with ...
EVT: Accelerating Deep Learning Training with Epilogue Visitor Tree
As deep learning models become increasingly complex, the deep learning compilers are critical for enhancing the system efficiency and unlocking hidden optimization opportunities. Although excellent speedups have been achieved in inference workloads, ...
Explainable Port Mapping Inference with Sparse Performance Counters for AMD's Zen Architectures
Performance models are instrumental for optimizing performance-sensitive code. When modeling the use of functional units of out-of-order x86-64 CPUs, data availability varies by the manufacturer: Instruction-to-port mappings for Intel's processors are ...
FaaSMem: Improving Memory Efficiency of Serverless Computing with Memory Pool Architecture
- Chuhao Xu,
- Yiyu Liu,
- Zijun Li,
- Quan Chen,
- Han Zhao,
- Deze Zeng,
- Qian Peng,
- Xueqi Wu,
- Haifeng Zhao,
- Senbo Fu,
- Minyi Guo
In serverless computing, an idle container is not recycled directly, in order to mitigate time-consuming cold container startups. These idle containers still occupy memory, exacerbating the memory shortage of today's data centers. By offloading their ...
FEASTA: A Flexible and Efficient Accelerator for Sparse Tensor Algebra in Machine Learning
- Kai Zhong,
- Zhenhua Zhu,
- Guohao Dai,
- Hongyi Wang,
- Xinhao Yang,
- Haoyu Zhang,
- Jin Si,
- Qiuli Mao,
- Shulin Zeng,
- Ke Hong,
- Genghan Zhang,
- Huazhong Yang,
- Yu Wang
Recently, sparse tensor algebra (SpTA) has played an increasingly important role in machine learning. However, due to the unstructured sparsity of SpTA, general-purpose processors (e.g., GPUs and CPUs) are inefficient because of the underutilized hardware ...
Felix: Optimizing Tensor Programs with Gradient Descent
Obtaining high-performance implementations of tensor programs such as deep neural networks on a wide range of hardware remains a challenging task. Search-based tensor program optimizers can automatically find high-performance programs on a given hardware ...
Fermihedral: On the Optimal Compilation for Fermion-to-Qubit Encoding
This paper introduces Fermihedral, a compiler framework focusing on discovering the optimal Fermion-to-qubit encoding for targeted Fermionic Hamiltonians. Fermion-to-qubit encoding is a crucial step in harnessing quantum computing for efficient ...
Flexible Non-intrusive Dynamic Instrumentation for WebAssembly
A key strength of managed runtimes over hardware is the ability to gain detailed insight into the dynamic execution of programs with instrumentation. Analyses such as code coverage, execution frequency, tracing, and debugging, are all made easier in a ...
Acceptance Rates
| Year | Submitted | Accepted | Rate |
|---|---|---|---|
| ASPLOS '19 | 351 | 74 | 21% |
| ASPLOS '18 | 319 | 56 | 18% |
| ASPLOS '17 | 320 | 53 | 17% |
| ASPLOS '16 | 232 | 53 | 23% |
| ASPLOS '15 | 287 | 48 | 17% |
| ASPLOS '14 | 217 | 49 | 23% |
| ASPLOS XV | 181 | 32 | 18% |
| ASPLOS XIII | 127 | 31 | 24% |
| ASPLOS XII | 158 | 38 | 24% |
| ASPLOS X | 175 | 24 | 14% |
| ASPLOS IX | 114 | 24 | 21% |
| ASPLOS VIII | 123 | 28 | 23% |
| ASPLOS VII | 109 | 25 | 23% |
| Overall | 2,713 | 535 | 20% |