PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
ACM 2021 Proceedings
Publisher:
  • Association for Computing Machinery, New York, NY, United States
Conference:
PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Virtual Event, Republic of Korea, 27 February 2021
ISBN:
978-1-4503-8294-6
Published:
17 February 2021
Abstract

PPoPP is the premier forum for leading work on all aspects of parallel programming, including theoretical foundations, techniques, languages, compilers, runtime systems, tools, and practical experience. Given the rise of parallel architectures in the consumer market (desktops, laptops, and mobile devices) and data centers, we made an effort to attract work that addresses new parallel workloads and issues that arise out of extreme-scale applications or cloud platforms. In addition, we tried to attract techniques and tools that improve parallel programming productivity or work towards improved synergy with such emerging architectures.

Efficient algorithms for persistent transactional memory

Durable techniques coupled with transactional semantics give application developers the guarantee that data is saved consistently in persistent memory (PM), even in the event of a non-corrupting failure. Persistence fences and flush instructions ...
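The write-flush-fence ordering that such persistence instructions impose can be sketched as a generic undo-log update. This is an illustration, not the paper's algorithm: `pm_flush` and `pm_fence` are placeholders (on x86 PM hardware they would correspond to `clwb`/`clflushopt` and `sfence`); here they are compiler barriers only, so the sketch runs on ordinary DRAM.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Placeholder for a cache-line writeback (clwb/clflushopt on real PM hardware).
inline void pm_flush(const void*) { std::atomic_signal_fence(std::memory_order_seq_cst); }
// Placeholder for the persistence fence (sfence on x86).
inline void pm_fence()            { std::atomic_signal_fence(std::memory_order_seq_cst); }

// One durable update of a single word via undo logging:
// 1. persist the old value to the log, 2. update in place, 3. mark committed.
struct UndoLog { uint64_t old_val; bool valid; };

void durable_store(uint64_t& slot, uint64_t new_val, UndoLog& log) {
    log.old_val = slot;            // record undo information
    log.valid   = true;
    pm_flush(&log); pm_fence();    // log must be durable before the update
    slot = new_val;                // update the data in place
    pm_flush(&slot); pm_fence();   // data durable before invalidating the log
    log.valid = false;             // commit: recovery no longer rolls back
    pm_flush(&log); pm_fence();
}
```

Each fence orders the preceding flush before the next store becomes durable; without it, a crash could persist the update before the undo record.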

Investigating the semantics of futures in transactional memory systems

This paper investigates the problem of integrating two powerful abstractions for concurrent programming, namely futures and transactional memory. Our focus is on specifying the semantics of execution of "transactional futures", i.e., futures that ...

Constant-time snapshots with applications to concurrent data structures

Given a concurrent data structure, we present an approach for efficiently taking snapshots of its constituent CAS objects. More specifically, we support a constant-time operation that returns a snapshot handle. This snapshot handle can later be used to ...
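The constant-time snapshot handle can be illustrated with a simplified, sequential multiversion sketch: taking a snapshot just reads and bumps a global epoch counter, so the snapshot operation itself is O(1). The paper's actual scheme achieves this for concurrent CAS objects; this single-threaded sketch (all names are illustrative) only conveys the versioning idea.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

static uint64_t g_epoch = 0;  // global epoch; a snapshot handle is an epoch value

struct VersionedRegister {
    std::vector<std::pair<uint64_t, int>> history;  // (epoch written, value)
    void write(int v) { history.emplace_back(g_epoch, v); }
};

// O(1): the handle is simply the current epoch; bump it so later writes
// are distinguishable from writes that happened before the snapshot.
uint64_t take_snapshot() { return g_epoch++; }

// Read the register's value as of snapshot handle s: the last write whose
// epoch is <= s. (Old versions could be garbage-collected once no handle
// can reach them; omitted here.)
int read_at(const VersionedRegister& r, uint64_t s) {
    int v = 0;  // default if never written before the snapshot
    for (const auto& [e, val] : r.history)
        if (e <= s) v = val;
    return v;
}
```

A handle taken before a write keeps seeing the old value, while fresh reads see the new one.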

Reasoning about recursive tree traversals

Traversals are commonly seen in tree data structures, and performance-enhancing transformations between tree traversals are critical for many applications. Existing approaches to reasoning about tree traversals and their transformations are ad hoc, with ...

Synthesizing optimal collective algorithms

Collective communication algorithms are an important component of distributed computation. Indeed, in the case of deep learning, collective communication is the Amdahl's-law bottleneck of data-parallel training.

This paper introduces SCCL (for Synthesized ...
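As background for what such synthesized collectives replace, the textbook ring allreduce (reduce-scatter followed by allgather over P ranks) can be simulated sequentially. This is the classic hand-designed baseline, not SCCL's synthesized output; one array element per chunk keeps the sketch small.

```cpp
#include <cassert>
#include <vector>

using Buf = std::vector<std::vector<int>>;  // bufs[rank][chunk], one int per chunk

// Simulate ring allreduce over P ranks: after the call every rank holds the
// element-wise sum. Each simulated step snapshots the values "sent" so all
// ranks logically communicate simultaneously.
void ring_allreduce(Buf& bufs) {
    const int P = static_cast<int>(bufs.size());
    // Phase 1: reduce-scatter -- rank r ends up owning fully reduced chunk (r+1)%P.
    for (int s = 0; s < P - 1; ++s) {
        std::vector<int> sent(P);
        for (int r = 0; r < P; ++r) sent[r] = bufs[r][(r - s + P) % P];
        for (int r = 0; r < P; ++r) {
            int from = (r - 1 + P) % P;
            bufs[r][(from - s + P) % P] += sent[from];  // accumulate received chunk
        }
    }
    // Phase 2: allgather -- circulate the reduced chunks so every rank has all of them.
    for (int s = 0; s < P - 1; ++s) {
        std::vector<int> sent(P);
        for (int r = 0; r < P; ++r) sent[r] = bufs[r][(r - s + 1 + P) % P];
        for (int r = 0; r < P; ++r) {
            int from = (r - 1 + P) % P;
            bufs[r][(from - s + 1 + P) % P] = sent[from];  // overwrite with reduced chunk
        }
    }
}
```

Both phases take P-1 steps, so the algorithm is bandwidth-optimal but latency grows linearly in P, which is exactly the kind of trade-off a synthesis tool can navigate per topology.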

Parallel binary code analysis

Binary code analysis is widely used to help assess a program's correctness, performance, and provenance. Binary analysis applications often construct control flow graphs, analyze data flow, and use debugging information to understand how machine code ...

Compiler support for near data computing

Recent works from both hardware and software domains offer various optimizations that try to take advantage of near data computing (NDC) opportunities. While the results from these works indicate performance improvements of various magnitudes, the ...

Scaling implicit parallelism via dynamic control replication

We present dynamic control replication, a run-time program analysis that enables scalable execution of implicitly parallel programs on large machines through a distributed and efficient dynamic dependence analysis. Dynamic control replication ...

Understanding and bridging the gaps in current GNN performance optimizations

Graph Neural Network (GNN) has recently drawn a rapid increase of interest in many domains for its effectiveness in learning over graphs. Maximizing its performance is essential for many tasks, but is still only preliminarily understood. In this work, we ...

A fast work-efficient SSSP algorithm for GPUs

This paper presents a new Single Source Shortest Path (SSSP) algorithm for GPUs. Our key advancement is an improved work scheduler, which is central to the performance of SSSP algorithms. Previous GPU solutions for SSSP use simple work schedulers that ...
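A simple bucketed work scheduler in the delta-stepping style shows what such schedulers organize: vertices are grouped into buckets of width `delta` by tentative distance and processed in bucket order. This sequential sketch is generic background (it omits the light/heavy edge split and is not the paper's GPU scheduler).

```cpp
#include <cassert>
#include <limits>
#include <vector>

struct Edge { int to; int w; };

// Bucketed SSSP: relaxations that improve a distance re-insert the vertex
// into the bucket for its new tentative distance; stale entries are skipped.
std::vector<int> sssp(const std::vector<std::vector<Edge>>& g, int src, int delta) {
    const int INF = std::numeric_limits<int>::max();
    std::vector<int> dist(g.size(), INF);
    dist[src] = 0;
    std::vector<std::vector<int>> buckets(1, std::vector<int>{src});
    for (std::size_t b = 0; b < buckets.size(); ++b) {
        while (!buckets[b].empty()) {           // re-relax within the bucket
            std::vector<int> frontier;
            frontier.swap(buckets[b]);
            for (int u : frontier) {
                if (dist[u] / delta != static_cast<int>(b)) continue;  // stale entry
                for (const Edge& e : g[u]) {
                    if (dist[u] + e.w < dist[e.to]) {
                        dist[e.to] = dist[u] + e.w;
                        std::size_t nb = dist[e.to] / delta;
                        if (nb >= buckets.size()) buckets.resize(nb + 1);
                        buckets[nb].push_back(e.to);   // schedule for its bucket
                    }
                }
            }
        }
    }
    return dist;
}
```

With `delta = 1` this degenerates toward Dijkstra-like ordering; larger `delta` exposes more parallel work per bucket at the cost of some wasted re-relaxations, which is the trade-off a GPU scheduler must balance.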

ShadowVM: accelerating data plane for data analytics with bare metal CPUs and GPUs

With the development of the big data ecosystem, large-scale data analytics has become more prevalent in the past few years. Systems such as Apache Spark provide a flexible approach to scalable processing of massive data. However, they are not designed for ...

BiPart: a parallel and deterministic hypergraph partitioner

Hypergraph partitioning is used in many problem domains including VLSI design, linear algebra, Boolean satisfiability, and data mining. Most versions of this problem are NP-complete or NP-hard, so practical hypergraph partitioners generate approximate ...

NBR: neutralization based reclamation

Safe memory reclamation (SMR) algorithms suffer from a trade-off between bounding unreclaimed memory and the speed of reclamation. Hazard pointer (HP) based algorithms bound unreclaimed memory at all times, but tend to be slower than other approaches. ...
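The hazard-pointer scheme this trade-off refers to can be sketched minimally: a reader publishes the pointer it is about to dereference, and `retire` defers `delete` until no published hazard pointer matches. This is the classic HP pattern in a single-threaded demo form (fixed slot count, linear scan), not the paper's NBR algorithm.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <vector>

constexpr int MAX_THREADS = 4;
std::atomic<void*> g_hazard[MAX_THREADS];   // one published hazard slot per thread

// Publish the pointer we are about to dereference; re-read until the value
// we published is still the current one (otherwise it may already be retired).
void* protect(int tid, const std::atomic<void*>& src) {
    void* p;
    do {
        p = src.load();
        g_hazard[tid].store(p);
    } while (p != src.load());
    return p;
}
void clear(int tid) { g_hazard[tid].store(nullptr); }

std::vector<void*> g_retired;  // nodes removed but not yet reclaimable
int g_reclaimed = 0;           // demo counter

void retire(void* p) { g_retired.push_back(p); }

// Free every retired node that no hazard pointer currently protects.
void scan_and_reclaim() {
    std::vector<void*> hazards;
    for (auto& h : g_hazard) hazards.push_back(h.load());
    std::vector<void*> still_protected;
    for (void* p : g_retired) {
        if (std::find(hazards.begin(), hazards.end(), p) != hazards.end()) {
            still_protected.push_back(p);      // keep deferred
        } else {
            delete static_cast<int*>(p);       // demo nodes are heap ints
            ++g_reclaimed;
        }
    }
    g_retired.swap(still_protected);
}
```

Unreclaimed memory is bounded by the number of hazard slots, but every `scan_and_reclaim` pays for a scan of all slots, which is the speed cost the abstract alludes to.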

Efficiently reclaiming memory in concurrent search data structures while bounding wasted memory

Nonblocking data structures face a safe memory reclamation (SMR) problem. In these algorithms, a node removed from the data structure cannot be reclaimed (freed) immediately, as other threads may be about to access it. The goal of an SMR scheme is to ...

OrcGC: automatic lock-free memory reclamation

Dynamic lock-free data structures require a memory reclamation scheme with similar progress guarantees. To date, lock-free reclamation schemes have been applied to data structures on a case-by-case basis, often with modifications to the data structure's algorithm.

In this paper ...

Are dynamic memory managers on GPUs slow?: a survey and benchmarks

Dynamic memory management on GPUs is generally understood to be a challenging topic. On current GPUs, hundreds of thousands of threads might concurrently allocate new memory or free previously allocated memory. This leads to problems with thread ...

GPTune: multitask learning for autotuning exascale applications

Multitask learning has proven to be useful in the field of machine learning when additional knowledge is available to help a prediction task. We adapt this paradigm to develop autotuning frameworks, where the objective is to find the optimal performance ...

I/O lower bounds for auto-tuning of convolutions in CNNs

Convolution is the most time-consuming part in the computation of convolutional neural networks (CNNs), which have achieved great success in numerous practical applications. Due to the complex data dependency and the increase in the amount of model ...

ApproxTuner: a compiler and runtime system for adaptive approximations

Manually optimizing the tradeoffs between accuracy, performance and energy for resource-intensive applications with flexible accuracy or precision requirements is extremely difficult. We present ApproxTuner, an automatic framework for accuracy-aware ...

EGEMM-TC: accelerating scientific computing on tensor cores with extended precision

Nvidia Tensor Cores achieve high performance with half-precision matrix inputs tailored towards deep learning workloads. However, this limits the application of Tensor Cores especially in the area of scientific computing with high precision ...

Efficiently running SpMV on long vector architectures

Sparse Matrix-Vector multiplication (SpMV) is an essential kernel for parallel numerical applications. SpMV displays sparse and irregular data accesses, which complicate its vectorization. These difficulties frequently cause SpMV to achieve non-optimal ...
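The irregular access pattern in question is visible in the plain CSR formulation of the kernel: the inner loop reads `x[idx[j]]`, a data-dependent gather that defeats straightforward vectorization. A minimal reference version (standard CSR, not the paper's vectorized scheme):

```cpp
#include <cassert>
#include <vector>

// Compressed Sparse Row matrix.
struct CSR {
    int rows;
    std::vector<int> ptr;     // row start offsets, size rows+1
    std::vector<int> idx;     // column index of each nonzero
    std::vector<double> val;  // value of each nonzero
};

// y = A * x. The access x[A.idx[j]] is an irregular gather: its addresses
// depend on the sparsity pattern, which is what complicates vectorization.
std::vector<double> spmv(const CSR& A, const std::vector<double>& x) {
    std::vector<double> y(A.rows, 0.0);
    for (int i = 0; i < A.rows; ++i)
        for (int j = A.ptr[i]; j < A.ptr[i + 1]; ++j)
            y[i] += A.val[j] * x[A.idx[j]];
    return y;
}
```

Row lengths also vary, so a naive mapping of rows to vector lanes leaves many lanes idle on short rows.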

Improving communication by optimizing on-node data movement with data layout

We present optimizations to improve communication performance by reducing on-node data movement for a class of distributed memory applications. The primary concept is to eliminate the data movement associated with packing and unpacking subsets of the ...
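The packing step being eliminated is the familiar copy of a non-contiguous subset into a contiguous send buffer. A minimal sketch, assuming a row-major matrix and a single column to send (illustrative names, not the paper's code):

```cpp
#include <cassert>
#include <vector>

// Pack one column of a row-major rows x cols matrix into a contiguous
// buffer, as done before sending it to another node. The strided gather
// (stride = cols) is exactly the on-node data movement a communication-
// friendly data layout can avoid.
std::vector<double> pack_column(const std::vector<double>& m,
                                int rows, int cols, int c) {
    std::vector<double> buf(rows);
    for (int i = 0; i < rows; ++i)
        buf[i] = m[i * cols + c];  // strided read, contiguous write
    return buf;
}
```

If the halo data were stored contiguously in the first place, the send could proceed directly from the application buffer with no copy.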

Sparta: high-performance, element-wise sparse tensor contraction on heterogeneous memory

Sparse tensor contractions appear commonly in many applications. Efficiently computing the product of two sparse tensors is challenging: It not only inherits the challenges from common sparse matrix-matrix multiplication (SpGEMM), i.e., indirect memory access ...

Advanced synchronization techniques for task-based runtime systems

Task-based programming models like OmpSs-2 and OpenMP provide a flexible data-flow execution model to exploit dynamic, irregular and nested parallelism. Providing an efficient implementation that scales well with small granularity tasks remains a ...

An ownership policy and deadlock detector for promises

Task-parallel programs often enjoy deadlock freedom under certain restrictions, such as the use of structured join operations, as in Cilk and X10, or the use of asynchronous task futures together with deadlock-avoiding policies such as Known Joins or ...

Understanding a program's resiliency through error propagation

Aggressive technology scaling trends have worsened the transient fault problem in high-performance computing (HPC) systems. Some faults are benign, but others can lead to silent data corruption (SDC), which represents a serious problem; a fault ...

Lightweight preemptive user-level threads

Many-to-many mapping models for user- to kernel-level threads (or "M:N threads") have been extensively studied for decades as a lightweight substitute for current Pthreads implementations that provide a simple one-to-one mapping ("1:1 threads"). M:N ...

TurboTransformers: an efficient GPU serving system for transformer models

The transformer is the most critical algorithm innovation of the Natural Language Processing (NLP) field in recent years. Unlike Recurrent Neural Network (RNN) models, transformers can process along the sequence-length dimension in parallel, ...

Extracting clean performance models from tainted programs

Performance models are well-known instruments to understand the scaling behavior of parallel applications. They express how performance changes as key execution parameters, such as the number of processes or the size of the input problem, vary. Besides ...

Modernizing parallel code with pattern analysis

Fifty years of parallel programming has generated a substantial legacy parallel codebase, creating a new portability challenge: re-parallelizing already parallel code. Our solution exploits inherently portable parallel patterns, and addresses the ...

Contributors
  • Technion - Israel Institute of Technology

    Acceptance Rates

    PPoPP '21 Paper Acceptance Rate 31 of 150 submissions, 21%;
    Overall Acceptance Rate 230 of 1,014 submissions, 23%
    Year        Submitted  Accepted  Rate
    PPoPP '21         150        31   21%
    PPoPP '20         121        28   23%
    PPoPP '19         152        29   19%
    PPoPP '17         132        29   22%
    PPoPP '14         184        28   15%
    PPoPP '07          65        22   34%
    PPoPP '03          45        20   44%
    PPoPP '99          79        17   22%
    PPOPP '97          86        26   30%
    Overall         1,014       230   23%