DOI:
10.1145/3332466
PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
ACM 2020 Proceeding
Publisher:
  • Association for Computing Machinery
  • New York
  • NY
  • United States
Conference:
PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, California, February 22-26, 2020
ISBN:
978-1-4503-6818-6
Published:
19 February 2020

Abstract

It is our great pleasure to welcome you to PPoPP 2020, the 25th ACM Symposium on Principles and Practice of Parallel Programming, in San Diego, USA. PPoPP is the premier forum for leading work on all aspects of parallel programming, including theoretical foundations, techniques, languages, compilers, runtime systems, tools, and practical experience. Given the rise of parallel architectures in the consumer market (desktops, laptops, and mobile devices) and data centers, we made an effort to attract work that addresses new parallel workloads and issues that arise out of extreme-scale applications or cloud platforms, as well as techniques and tools that improve the productivity of parallel programming or work towards improved synergy with such emerging architectures.

research-article
Kite: efficient and available release consistency for the datacenter

Key-Value Stores (KVSs) came into prominence as highly-available, eventually consistent (EC), "NoSQL" Databases, but have quickly transformed into general-purpose, programmable storage systems. Thus, EC, while relevant, is no longer sufficient. ...

Oak: a scalable off-heap allocated key-value map

Efficient ordered in-memory key-value (KV-)maps are paramount for the scalability of modern data platforms. In managed languages like Java, KV-maps face unique challenges due to the high overhead of garbage collection (GC).

We present Oak, a scalable ...

Optimizing batched Winograd convolution on GPUs

In this paper, we present an optimized implementation for single-precision Winograd convolution on NVIDIA Volta and Turing GPUs. Compared with the state-of-the-art Winograd convolution in cuDNN 7.6.1, our implementation achieves up to 2.13X speedup on ...

Taming unbalanced training workloads in deep learning with partial collective operations

Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a ...

Scalable top-k retrieval with Sparta

Many big data processing applications rely on a top-k retrieval building block, which selects (or approximates) the k highest-scoring data items based on an aggregation of features. In web search, for instance, a document's score is the sum of its ...

waveSZ: a hardware-algorithm co-design of efficient lossy compression for scientific data

Error-bounded lossy compression is critical to the success of extreme-scale scientific research because of ever-increasing volumes of data produced by today's high-performance computing (HPC) applications. Not only can error-controlled lossy compressors ...

Scaling concurrent queues by using HTM to profit from failed atomic operations

Queues are fundamental concurrent data structures, but despite years of research, even the state-of-the-art queues scale poorly. This poor scalability occurs because of contended atomic read-modify-write (RMW) operations.

This paper makes a first step ...

A wait-free universal construction for large objects

Concurrency has been a subject of study for more than 50 years. Still, many developers struggle to adapt their sequential code to be accessed concurrently. This need has pushed for generic solutions and specific concurrent data structures.

Wait-free ...

Fast concurrent data sketches

Data sketches are approximate succinct summaries of long data streams. They are widely used for processing massive amounts of data and answering statistical queries about it. Existing libraries producing sketches are very fast, but do not allow ...

Universal wait-free memory reclamation

In this paper, we present a universal memory reclamation scheme, Wait-Free Eras (WFE), for deleted memory blocks in wait-free concurrent data structures. WFE's key innovation is that it is completely wait-free. Although some prior techniques provide ...

research-article
Public Access
Using sample-based time series data for automated diagnosis of scalability losses in parallel programs

The performance of many parallel applications has failed to scale as fast as successive generations of hardware on which these applications execute. To understand the cause of scalability losses, experts use performance tools to monitor and analyze ...

research-article
Scaling out speculative execution of finite-state machines with parallel merge

A finite-state machine (FSM) is a key component for many important applications, such as Huffman decoding, regular expression matching and HTML tokenization. Due to its inherent dependencies and unpredictable memory access pattern, FSM computations are ...

research-article
On the fly MHP analysis

May-Happen-in-Parallel (MHP) analysis forms the basis for many problems of program analysis and program understanding. MHP analysis can also be used by IDEs (integrated-development-environments) to help programmers to refactor parallel-programs, ...

research-article
Open Access
Detecting and reproducing error-code propagation bugs in MPI implementations

We present an approach to automatically detect and reproduce error code propagation bugs in MPI implementations. Specifically, we combine static analysis and program repair for bug detection, and apply fault injection to reproduce error propagation bugs ...

Parallel and distributed bounded model checking of multi-threaded programs

We introduce a structure-aware parallel technique for context-bounded analysis of concurrent programs. The key intuition consists in decomposing the set of concurrent traces into symbolic subsets that are separately explored by multiple instances of the ...

research-article
Open Access
Parallel determinacy race detection for futures

The use of futures can generate arbitrary dependences in the computation, making it difficult to detect races efficiently. Algorithms proposed by prior work to detect races on programs with futures all have to execute the program sequentially. We ...

Practical parallel hypergraph algorithms

While there has been significant work on parallel graph processing, there has been surprisingly little work on high-performance hypergraph processing. This paper presents a collection of efficient parallel algorithms for hypergraph processing, ...

research-article
Public Access
A supernodal all-pairs shortest path algorithm

We show how to exploit graph sparsity in the Floyd-Warshall algorithm for the all-pairs shortest path (APSP) problem. Floyd-Warshall is an attractive choice for APSP on high-performing systems due to its structural similarity to solving dense linear ...

research-article
Public Access
Increasing the parallelism of graph coloring via shortcutting

Graph coloring is an assignment of colors to the vertices of a graph such that no two adjacent vertices get the same color. It is a key building block in many applications. Finding a coloring with a minimal number of colors is often only part of the ...

Non-blocking interpolation search trees with doubly-logarithmic running time

Balanced search trees typically use key comparisons to guide their operations, and achieve logarithmic running time. By relying on numerical properties of the keys, interpolation search achieves lower search complexity and better performance. Although ...

YewPar: skeletons for exact combinatorial search

Combinatorial search is central to many applications, yet the huge irregular search trees and the need to respect search heuristics make it hard to parallelise. We aim to improve the reuse of intricate parallel search implementations by providing the ...

XIndex: a scalable learned index for multicore data storage

We present XIndex, a concurrent ordered index designed for fast queries. Similar to a recent proposal of the learned index, XIndex uses learned models to optimize index efficiency. Compared with the learned index, XIndex is able to effectively handle ...

Overlapping host-to-device copy and computation using hidden unified memory

In this paper, we propose a runtime, called HUM, which hides host-to-device memory copy time without any code modification. It overlaps the host-to-device memory copy with host computation or CUDA kernel computation by exploiting Unified Memory and ...

research-article
GPU initiated OpenSHMEM: correct and efficient intra-kernel networking for dGPUs

Current state-of-the-art in GPU networking utilizes a host-centric, kernel-boundary communication model that reduces performance and increases code complexity. To address these concerns, recent works have explored performing network operations from ...

No barrier in the road: a comprehensive study and optimization of ARM barriers

In this paper, we present the first comprehensive performance characterization and optimization of ARM barriers on both mobile and server platforms. We draw a set of observations through several abstracted models and validate them in scenarios where ...

spECK: accelerating GPU sparse matrix-matrix multiplication through lightweight analysis

Sparse general matrix-matrix multiplication on GPUs is challenging due to the varying sparsity patterns of sparse matrices. Existing solutions achieve good performance for certain types of matrices, but fail to accelerate all kinds of matrices in the ...

research-article
A novel data transformation and execution strategy for accelerating sparse matrix multiplication on GPUs

SpMM (multiplication of a sparse matrix and a dense matrix) and SDDMM (sampled dense-dense matrix multiplication) are at the core of many scientific, machine learning, and data mining applications. Because of the irregular memory accesses, the two ...

MatRox: modular approach for improving data locality in hierarchical (Mat)rix App(Rox)imation

Hierarchical matrix approximations have gained significant traction in the machine learning and scientific community as they exploit available low-rank structures in kernel methods to compress the kernel matrix. The resulting compressed matrix, HMatrix, ...

poster
Public Access
A parallel sparse tensor benchmark suite on CPUs and GPUs

Tensor computations present significant performance challenges that impact a wide spectrum of applications. Efforts on improving the performance of tensor computations include exploring data layout, execution scheduling, and parallelism in common tensor ...

poster
Nesting and composition in transactional data structure libraries

Transactional data structure libraries (TDSL) combine the ease-of-programming of transactions with the high performance and scalability of custom-tailored concurrent data structures. They can be very efficient thanks to their ability to exploit data ...

Contributors
  • University of California, Riverside

Acceptance Rates

PPoPP '20 Paper Acceptance Rate 28 of 121 submissions, 23%;
Overall Acceptance Rate 230 of 1,014 submissions, 23%
Year        Submitted  Accepted  Rate
PPoPP '21         150        31   21%
PPoPP '20         121        28   23%
PPoPP '19         152        29   19%
PPoPP '17         132        29   22%
PPoPP '14         184        28   15%
PPoPP '07          65        22   34%
PPoPP '03          45        20   44%
PPoPP '99          79        17   22%
PPoPP '97          86        26   30%
Overall         1,014       230   23%
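
The rates listed above can be recomputed from the submitted and accepted counts. The following is a minimal sketch, assuming each rate is the accepted count divided by the submitted count, rounded to the nearest whole percent. The names `COUNTS`, `acceptance_rate`, and `overall` are illustrative and are not part of the ACM page.

```python
# Submitted and accepted counts copied from the acceptance-rate listing above.
COUNTS = {
    "PPoPP '21": (150, 31),
    "PPoPP '20": (121, 28),
    "PPoPP '19": (152, 29),
    "PPoPP '17": (132, 29),
    "PPoPP '14": (184, 28),
    "PPoPP '07": (65, 22),
    "PPoPP '03": (45, 20),
    "PPoPP '99": (79, 17),
    "PPoPP '97": (86, 26),
}


def acceptance_rate(submitted, accepted):
    """Return accepted/submitted as a whole-number percentage (assumed rounding)."""
    return round(100 * accepted / submitted)


def overall(counts):
    """Sum submissions and acceptances over all listed years."""
    submitted = sum(s for s, _ in counts.values())
    accepted = sum(a for _, a in counts.values())
    return submitted, accepted, acceptance_rate(submitted, accepted)


if __name__ == "__main__":
    for year, (submitted, accepted) in COUNTS.items():
        print(f"{year}: {accepted} of {submitted} = {acceptance_rate(submitted, accepted)}%")
    print("Overall: {1} of {0} = {2}%".format(*overall(COUNTS)))
```

Under that rounding assumption, the per-year figures and the overall totals (230 of 1,014) agree with the listing.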