PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
ACM 2021 Proceedings
Publisher:
  • Association for Computing Machinery, New York, NY, United States
Conference:
PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Virtual Event, Republic of Korea, 27 February 2021
ISBN:
978-1-4503-8294-6
Published:
17 February 2021
Abstract

PPoPP is the premier forum for leading work on all aspects of parallel programming, including theoretical foundations, techniques, languages, compilers, runtime systems, tools, and practical experience. Given the rise of parallel architectures in the consumer market (desktops, laptops, and mobile devices) and data centers, we made an effort to attract work that addresses new parallel workloads and issues that arise out of extreme-scale applications or cloud platforms. In addition, we tried to attract techniques and tools that improve parallel programming productivity or work towards improved synergy with such emerging architectures.

Efficient algorithms for persistent transactional memory

Durable techniques coupled with transactional semantics give application developers the guarantee that data is saved consistently in persistent memory (PM), even in the event of a non-corrupting failure. Persistence fences and flush instructions ...
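The write-flush-fence ordering that such persistence instructions impose can be sketched as a generic undo-log update. This is an illustration, not the paper's algorithm: `pm_flush` and `pm_fence` are placeholders (on x86 PM hardware they would correspond to `clwb`/`clflushopt` and `sfence`); here they are compiler barriers only, so the sketch runs on ordinary DRAM.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Placeholder for a cache-line writeback (clwb/clflushopt on real PM hardware).
inline void pm_flush(const void*) { std::atomic_signal_fence(std::memory_order_seq_cst); }
// Placeholder for the persistence fence (sfence on x86).
inline void pm_fence()            { std::atomic_signal_fence(std::memory_order_seq_cst); }

// One durable update of a single word via undo logging:
// 1. persist the old value to the log, 2. update in place, 3. mark committed.
struct UndoLog { uint64_t old_val; bool valid; };

void durable_store(uint64_t& slot, uint64_t new_val, UndoLog& log) {
    log.old_val = slot;            // record undo information
    log.valid   = true;
    pm_flush(&log); pm_fence();    // log must be durable before the update
    slot = new_val;                // update the data in place
    pm_flush(&slot); pm_fence();   // data durable before invalidating the log
    log.valid = false;             // commit: recovery no longer rolls back
    pm_flush(&log); pm_fence();
}
```

Each fence orders the preceding flush before the next store becomes durable; without it, a crash could persist the update before the undo record.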

Investigating the semantics of futures in transactional memory systems

This paper investigates the problem of integrating two powerful abstractions for concurrent programming, namely futures and transactional memory. Our focus is on specifying the semantics of execution of "transactional futures", i.e., futures that ...

Constant-time snapshots with applications to concurrent data structures

Given a concurrent data structure, we present an approach for efficiently taking snapshots of its constituent CAS objects. More specifically, we support a constant-time operation that returns a snapshot handle. This snapshot handle can later be used to ...
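The constant-time snapshot handle can be illustrated with a simplified, sequential multiversion sketch: taking a snapshot just reads and bumps a global epoch counter, so the snapshot operation itself is O(1). The paper's actual scheme achieves this for concurrent CAS objects; this single-threaded sketch (all names are illustrative) only conveys the versioning idea.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

static uint64_t g_epoch = 0;  // global epoch; a snapshot handle is an epoch value

struct VersionedRegister {
    std::vector<std::pair<uint64_t, int>> history;  // (epoch written, value)
    void write(int v) { history.emplace_back(g_epoch, v); }
};

// O(1): the handle is simply the current epoch; bump it so later writes
// are distinguishable from writes that happened before the snapshot.
uint64_t take_snapshot() { return g_epoch++; }

// Read the register's value as of snapshot handle s: the last write whose
// epoch is <= s. (Old versions could be garbage-collected once no handle
// can reach them; omitted here.)
int read_at(const VersionedRegister& r, uint64_t s) {
    int v = 0;  // default if never written before the snapshot
    for (const auto& [e, val] : r.history)
        if (e <= s) v = val;
    return v;
}
```

A handle taken before a write keeps seeing the old value, while fresh reads see the new one.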

Reasoning about recursive tree traversals

Traversals are commonly seen in tree data structures, and performance-enhancing transformations between tree traversals are critical for many applications. Existing approaches to reasoning about tree traversals and their transformations are ad hoc, with ...

Synthesizing optimal collective algorithms

Collective communication algorithms are an important component of distributed computation. Indeed, in the case of deep learning, collective communication is the Amdahl's-law bottleneck of data-parallel training.

This paper introduces SCCL (for Synthesized ...
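As background for what such synthesized collectives replace, the textbook ring allreduce (reduce-scatter followed by allgather over P ranks) can be simulated sequentially. This is the classic hand-designed baseline, not SCCL's synthesized output; one array element per chunk keeps the sketch small.

```cpp
#include <cassert>
#include <vector>

using Buf = std::vector<std::vector<int>>;  // bufs[rank][chunk], one int per chunk

// Simulate ring allreduce over P ranks: after the call every rank holds the
// element-wise sum. Each simulated step snapshots the values "sent" so all
// ranks logically communicate simultaneously.
void ring_allreduce(Buf& bufs) {
    const int P = static_cast<int>(bufs.size());
    // Phase 1: reduce-scatter -- rank r ends up owning fully reduced chunk (r+1)%P.
    for (int s = 0; s < P - 1; ++s) {
        std::vector<int> sent(P);
        for (int r = 0; r < P; ++r) sent[r] = bufs[r][(r - s + P) % P];
        for (int r = 0; r < P; ++r) {
            int from = (r - 1 + P) % P;
            bufs[r][(from - s + P) % P] += sent[from];  // accumulate received chunk
        }
    }
    // Phase 2: allgather -- circulate the reduced chunks so every rank has all of them.
    for (int s = 0; s < P - 1; ++s) {
        std::vector<int> sent(P);
        for (int r = 0; r < P; ++r) sent[r] = bufs[r][(r - s + 1 + P) % P];
        for (int r = 0; r < P; ++r) {
            int from = (r - 1 + P) % P;
            bufs[r][(from - s + 1 + P) % P] = sent[from];  // overwrite with reduced chunk
        }
    }
}
```

Both phases take P-1 steps, so the algorithm is bandwidth-optimal but latency grows linearly in P, which is exactly the kind of trade-off a synthesis tool can navigate per topology.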

Parallel binary code analysis

Binary code analysis is widely used to help assess a program's correctness, performance, and provenance. Binary analysis applications often construct control flow graphs, analyze data flow, and use debugging information to understand how machine code ...

Compiler support for near data computing

Recent works from both hardware and software domains offer various optimizations that try to take advantage of near data computing (NDC) opportunities. While the results from these works indicate performance improvements of various magnitudes, the ...

Scaling implicit parallelism via dynamic control replication

We present dynamic control replication, a run-time program analysis that enables scalable execution of implicitly parallel programs on large machines through a distributed and efficient dynamic dependence analysis. Dynamic control replication ...

Understanding and bridging the gaps in current GNN performance optimizations

Graph Neural Network (GNN) has recently drawn a rapid increase of interest in many domains for its effectiveness in learning over graphs. Maximizing its performance is essential for many tasks, but is still only preliminarily understood. In this work, we ...

A fast work-efficient SSSP algorithm for GPUs

This paper presents a new Single Source Shortest Path (SSSP) algorithm for GPUs. Our key advancement is an improved work scheduler, which is central to the performance of SSSP algorithms. Previous GPU solutions for SSSP use simple work schedulers that ...
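A simple bucketed work scheduler in the delta-stepping style shows what such schedulers organize: vertices are grouped into buckets of width `delta` by tentative distance and processed in bucket order. This sequential sketch is generic background (it omits the light/heavy edge split and is not the paper's GPU scheduler).

```cpp
#include <cassert>
#include <limits>
#include <vector>

struct Edge { int to; int w; };

// Bucketed SSSP: relaxations that improve a distance re-insert the vertex
// into the bucket for its new tentative distance; stale entries are skipped.
std::vector<int> sssp(const std::vector<std::vector<Edge>>& g, int src, int delta) {
    const int INF = std::numeric_limits<int>::max();
    std::vector<int> dist(g.size(), INF);
    dist[src] = 0;
    std::vector<std::vector<int>> buckets(1, std::vector<int>{src});
    for (std::size_t b = 0; b < buckets.size(); ++b) {
        while (!buckets[b].empty()) {           // re-relax within the bucket
            std::vector<int> frontier;
            frontier.swap(buckets[b]);
            for (int u : frontier) {
                if (dist[u] / delta != static_cast<int>(b)) continue;  // stale entry
                for (const Edge& e : g[u]) {
                    if (dist[u] + e.w < dist[e.to]) {
                        dist[e.to] = dist[u] + e.w;
                        std::size_t nb = dist[e.to] / delta;
                        if (nb >= buckets.size()) buckets.resize(nb + 1);
                        buckets[nb].push_back(e.to);   // schedule for its bucket
                    }
                }
            }
        }
    }
    return dist;
}
```

With `delta = 1` this degenerates toward Dijkstra-like ordering; larger `delta` exposes more parallel work per bucket at the cost of some wasted re-relaxations, which is the trade-off a GPU scheduler must balance.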

ShadowVM: accelerating data plane for data analytics with bare metal CPUs and GPUs

With the development of the big data ecosystem, large-scale data analytics has become more prevalent in the past few years. Systems such as Apache Spark provide a flexible approach to scalable processing of massive data. However, they are not designed for ...

BiPart: a parallel and deterministic hypergraph partitioner

Hypergraph partitioning is used in many problem domains including VLSI design, linear algebra, Boolean satisfiability, and data mining. Most versions of this problem are NP-complete or NP-hard, so practical hypergraph partitioners generate approximate ...

NBR: neutralization based reclamation

Safe memory reclamation (SMR) algorithms suffer from a trade-off between bounding unreclaimed memory and the speed of reclamation. Hazard pointer (HP) based algorithms bound unreclaimed memory at all times, but tend to be slower than other approaches. ...
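The hazard-pointer scheme this trade-off refers to can be sketched minimally: a reader publishes the pointer it is about to dereference, and `retire` defers `delete` until no published hazard pointer matches. This is the classic HP pattern in a single-threaded demo form (fixed slot count, linear scan), not the paper's NBR algorithm.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <vector>

constexpr int MAX_THREADS = 4;
std::atomic<void*> g_hazard[MAX_THREADS];   // one published hazard slot per thread

// Publish the pointer we are about to dereference; re-read until the value
// we published is still the current one (otherwise it may already be retired).
void* protect(int tid, const std::atomic<void*>& src) {
    void* p;
    do {
        p = src.load();
        g_hazard[tid].store(p);
    } while (p != src.load());
    return p;
}
void clear(int tid) { g_hazard[tid].store(nullptr); }

std::vector<void*> g_retired;  // nodes removed but not yet reclaimable
int g_reclaimed = 0;           // demo counter

void retire(void* p) { g_retired.push_back(p); }

// Free every retired node that no hazard pointer currently protects.
void scan_and_reclaim() {
    std::vector<void*> hazards;
    for (auto& h : g_hazard) hazards.push_back(h.load());
    std::vector<void*> still_protected;
    for (void* p : g_retired) {
        if (std::find(hazards.begin(), hazards.end(), p) != hazards.end()) {
            still_protected.push_back(p);      // keep deferred
        } else {
            delete static_cast<int*>(p);       // demo nodes are heap ints
            ++g_reclaimed;
        }
    }
    g_retired.swap(still_protected);
}
```

Unreclaimed memory is bounded by the number of hazard slots, but every `scan_and_reclaim` pays for a scan of all slots, which is the speed cost the abstract alludes to.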

Efficiently reclaiming memory in concurrent search data structures while bounding wasted memory

Nonblocking data structures face a safe memory reclamation (SMR) problem. In these algorithms, a node removed from the data structure cannot be reclaimed (freed) immediately, as other threads may be about to access it. The goal of an SMR scheme is to ...

OrcGC: automatic lock-free memory reclamation

Dynamic lock-free data structures require a memory reclamation scheme with similar progress guarantees. To date, lock-free reclamation schemes have been applied to data structures on a case-by-case basis, often with modifications to the data structure's algorithm.

In this paper ...

Are dynamic memory managers on GPUs slow?: a survey and benchmarks

Dynamic memory management on GPUs is generally understood to be a challenging topic. On current GPUs, hundreds of thousands of threads might concurrently allocate new memory or free previously allocated memory. This leads to problems with thread ...

GPTune: multitask learning for autotuning exascale applications

Multitask learning has proven to be useful in the field of machine learning when additional knowledge is available to help a prediction task. We adapt this paradigm to develop autotuning frameworks, where the objective is to find the optimal performance ...

I/O lower bounds for auto-tuning of convolutions in CNNs

Convolution is the most time-consuming part in the computation of convolutional neural networks (CNNs), which have achieved great success in numerous practical applications. Due to the complex data dependency and the increase in the amount of model ...

ApproxTuner: a compiler and runtime system for adaptive approximations

Manually optimizing the tradeoffs between accuracy, performance and energy for resource-intensive applications with flexible accuracy or precision requirements is extremely difficult. We present ApproxTuner, an automatic framework for accuracy-aware ...

EGEMM-TC: accelerating scientific computing on tensor cores with extended precision

Nvidia Tensor Cores achieve high performance with half-precision matrix inputs tailored towards deep learning workloads. However, this limits the application of Tensor Cores especially in the area of scientific computing with high precision ...

Efficiently running SpMV on long vector architectures

Sparse Matrix-Vector multiplication (SpMV) is an essential kernel for parallel numerical applications. SpMV displays sparse and irregular data accesses, which complicate its vectorization. These difficulties frequently cause SpMV to achieve non-optimal ...
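The irregular access pattern in question is visible in the plain CSR formulation of the kernel: the inner loop reads `x[idx[j]]`, a data-dependent gather that defeats straightforward vectorization. A minimal reference version (standard CSR, not the paper's vectorized scheme):

```cpp
#include <cassert>
#include <vector>

// Compressed Sparse Row matrix.
struct CSR {
    int rows;
    std::vector<int> ptr;     // row start offsets, size rows+1
    std::vector<int> idx;     // column index of each nonzero
    std::vector<double> val;  // value of each nonzero
};

// y = A * x. The access x[A.idx[j]] is an irregular gather: its addresses
// depend on the sparsity pattern, which is what complicates vectorization.
std::vector<double> spmv(const CSR& A, const std::vector<double>& x) {
    std::vector<double> y(A.rows, 0.0);
    for (int i = 0; i < A.rows; ++i)
        for (int j = A.ptr[i]; j < A.ptr[i + 1]; ++j)
            y[i] += A.val[j] * x[A.idx[j]];
    return y;
}
```

Row lengths also vary, so a naive mapping of rows to vector lanes leaves many lanes idle on short rows.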

Improving communication by optimizing on-node data movement with data layout

We present optimizations to improve communication performance by reducing on-node data movement for a class of distributed memory applications. The primary concept is to eliminate the data movement associated with packing and unpacking subsets of the ...
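The packing step being eliminated is the familiar copy of a non-contiguous subset into a contiguous send buffer. A minimal sketch, assuming a row-major matrix and a single column to send (illustrative names, not the paper's code):

```cpp
#include <cassert>
#include <vector>

// Pack one column of a row-major rows x cols matrix into a contiguous
// buffer, as done before sending it to another node. The strided gather
// (stride = cols) is exactly the on-node data movement a communication-
// friendly data layout can avoid.
std::vector<double> pack_column(const std::vector<double>& m,
                                int rows, int cols, int c) {
    std::vector<double> buf(rows);
    for (int i = 0; i < rows; ++i)
        buf[i] = m[i * cols + c];  // strided read, contiguous write
    return buf;
}
```

If the halo data were stored contiguously in the first place, the send could proceed directly from the application buffer with no copy.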

Sparta: high-performance, element-wise sparse tensor contraction on heterogeneous memory

Sparse tensor contractions appear commonly in many applications. Efficiently computing the product of two sparse tensors is challenging: It not only inherits the challenges from common sparse matrix-matrix multiplication (SpGEMM), i.e., indirect memory access ...

Advanced synchronization techniques for task-based runtime systems

Task-based programming models like OmpSs-2 and OpenMP provide a flexible data-flow execution model to exploit dynamic, irregular and nested parallelism. Providing an efficient implementation that scales well with small granularity tasks remains a ...

An ownership policy and deadlock detector for promises

Task-parallel programs often enjoy deadlock freedom under certain restrictions, such as the use of structured join operations, as in Cilk and X10, or the use of asynchronous task futures together with deadlock-avoiding policies such as Known Joins or ...

Understanding a program's resiliency through error propagation

Aggressive technology scaling trends have worsened the transient fault problem in high-performance computing (HPC) systems. Some faults are benign, but others can lead to silent data corruption (SDC), which represents a serious problem; a fault ...

Lightweight preemptive user-level threads

Many-to-many mapping models for user- to kernel-level threads (or "M:N threads") have been extensively studied for decades as a lightweight substitute for current Pthreads implementations that provide a simple one-to-one mapping ("1:1 threads"). M:N ...

TurboTransformers: an efficient GPU serving system for transformer models

The transformer is the most critical algorithm innovation of the Natural Language Processing (NLP) field in recent years. Unlike Recurrent Neural Network (RNN) models, transformers can process along the sequence-length dimension in parallel, ...

Extracting clean performance models from tainted programs

Performance models are well-known instruments to understand the scaling behavior of parallel applications. They express how performance changes as key execution parameters, such as the number of processes or the size of the input problem, vary. Besides ...

Modernizing parallel code with pattern analysis

Fifty years of parallel programming has generated a substantial legacy parallel codebase, creating a new portability challenge: re-parallelizing already parallel code. Our solution exploits inherently portable parallel patterns, and addresses the ...

Contributors
  • Technion - Israel Institute of Technology

    Acceptance Rates

    PPoPP '21 Paper Acceptance Rate 31 of 150 submissions, 21%;
    Overall Acceptance Rate 230 of 1,014 submissions, 23%
    Year        Submitted  Accepted  Rate
    PPoPP '21         150        31   21%
    PPoPP '20         121        28   23%
    PPoPP '19         152        29   19%
    PPoPP '17         132        29   22%
    PPoPP '14         184        28   15%
    PPoPP '07          65        22   34%
    PPoPP '03          45        20   44%
    PPoPP '99          79        17   22%
    PPOPP '97          86        26   30%
    Overall         1,014       230   23%