Abstract. We survey some of our recent work on interactive modeling, simulation, and control of large-scale crowds and traffic for urban scenes. The driving applications of our work include real-time simulation for computer games, virtual environments, and avatar-based online 3D social networks. We also present some preliminary results and proof-of-concept demonstrations. Keywords: Velocity Obstacle, Multi-Agent Simulation.
We present a new local collision avoidance algorithm between multiple agents for real-time simulations. Our approach extends the notion of velocity obstacles from robotics and formulates the conditions for collision-free navigation as a quadratic optimization problem. We use a discrete optimization method to efficiently compute the motion of each agent. The resulting algorithm can be parallelized by exploiting data-parallelism and thread-level parallelism. The overall approach, ClearPath, is general and can robustly handle dense scenarios with tens or hundreds of thousands of heterogeneous agents in a few milliseconds. Compared to prior collision avoidance algorithms, we observe more than an order of magnitude performance improvement.
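ClearPath itself solves the collision-avoidance constraints exactly as a discrete optimization; the Python sketch below is only a rough illustration of the underlying velocity-obstacle idea. It samples candidate velocities and keeps the one closest to the agent's preferred velocity that stays outside every neighbor's velocity obstacle. All function and parameter names here (`in_velocity_obstacle`, `horizon`, `max_speed`, and so on) are hypothetical and not taken from the paper.

```python
import numpy as np

def in_velocity_obstacle(rel_pos, rel_vel, combined_radius, horizon=5.0):
    """Approximate check: does this relative velocity bring the agents closer
    than their combined radius at the time of closest approach within the horizon?"""
    if np.dot(rel_pos, rel_pos) <= combined_radius ** 2:
        return True                      # already overlapping
    speed_sq = np.dot(rel_vel, rel_vel)
    if speed_sq == 0.0:
        return False
    t = np.dot(rel_pos, rel_vel) / speed_sq
    if t < 0.0 or t > horizon:
        return False
    closest = rel_pos - t * rel_vel      # separation at closest approach
    return np.dot(closest, closest) < combined_radius ** 2

def choose_velocity(pos, pref_vel, neighbors, radius, max_speed=2.0, samples=200, rng=None):
    """Pick the sampled velocity closest to the preferred one that avoids all
    neighbors' velocity obstacles. `neighbors` is a list of (pos, vel, radius)."""
    rng = np.random.default_rng(0) if rng is None else rng
    candidates = [np.asarray(pref_vel)] + [rng.uniform(-max_speed, max_speed, 2)
                                           for _ in range(samples)]
    best, best_cost = None, float("inf")
    for v in candidates:
        if any(in_velocity_obstacle(n_pos - pos, v - n_vel, radius + n_rad)
               for n_pos, n_vel, n_rad in neighbors):
            continue
        cost = np.linalg.norm(v - pref_vel)
        if cost < best_cost:
            best, best_cost = v, cost
    return best if best is not None else np.zeros(2)   # no feasible sample: stop
```

Because each agent's velocity choice depends only on its own neighbors, a loop of this kind over all agents is what makes the approach amenable to data-parallel execution.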
We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort reported in the literature, and is up to 4 times faster than the graphics-based GPUSort. It is also highly competitive with CPU implementations, being up to 3.5 times faster than comparable routines on an 8-core 2.33 GHz Intel Core2 Xeon system. Our merge sort is the fastest published comparison-based GPU sort and is also competitive with multi-core routines. To achieve this performance, we carefully design our algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. We exploit the high-speed on-chip shared memory provided by NVIDIA's Tesla architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well suited for other manycore processors.
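As a structural illustration of the scan-based radix sort the abstract refers to, here is a short serial Python sketch of one histogram/scan/scatter pass repeated over the key's digits. The GPU implementation performs each of these three steps in parallel per thread block, but the data flow is the same; this is a generic textbook LSD radix sort, not the paper's CUDA code.

```python
def radix_sort(keys, bits_per_pass=4, key_bits=32):
    """LSD radix sort built on a digit histogram plus an exclusive prefix scan."""
    buckets = 1 << bits_per_pass
    mask = buckets - 1
    for shift in range(0, key_bits, bits_per_pass):
        # 1. Histogram of digit values for this pass.
        counts = [0] * buckets
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # 2. Exclusive prefix scan turns counts into scatter offsets.
        offsets, running = [0] * buckets, 0
        for d in range(buckets):
            offsets[d] = running
            running += counts[d]
        # 3. Stable scatter into the output buffer.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys
```

For example, `radix_sort([170, 45, 75, 90, 802, 24, 2, 66])` returns the keys in ascending order; on a GPU the histogram, scan, and scatter phases each become a data-parallel kernel.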
Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect of emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels, which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such a large performance difference comes from, we perform a rigorous performance analysis and find that, after applying optimizations appropriate for both CPUs and GPUs, the performance gap between an Nvidia GTX280 processor and the Intel Core i7 960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU and analyze what architecture features contribute to the performance difference between the two processors.
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016
Stochastic Gradient Descent (SGD) is a popular optimization method used to train a variety of machine learning models. Most SGD work to date has concentrated on improving its statistical efficiency, in terms of rate of convergence to the optimal solution. At the same time, as the parallelism of modern CPUs continues to increase through progressively higher core counts, it is imperative to understand the parallel hardware efficiency of SGD, which is often at odds with its statistical efficiency. In this paper, we explore several modern parallelization methods for SGD on a shared-memory system, in the context of sparse and convex optimization problems. Specifically, we develop optimized parallel implementations of several SGD algorithms and show that their parallel efficiency is severely limited by inter-core communication. We propose a new, scalable, communication-avoiding, many-core-friendly implementation of SGD, called HogBatch, which exposes parallelism on several levels and minimizes inter-core communication.
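The abstract does not spell out HogBatch's exact update rule, so the sketch below only illustrates the general communication-avoiding pattern it alludes to: each worker performs many SGD steps against a private, possibly stale copy of the model and merges into the shared model once per mini-batch, so writes to shared state are infrequent. The least-squares objective, the averaging rule, and every name here are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np

def hogbatch_style_sgd(X, y, n_workers=4, batch_size=64, lr=0.1, epochs=5):
    """Communication-avoiding SGD sketch for a least-squares objective (illustrative only)."""
    n, d = X.shape
    w = np.zeros(d)                                    # shared model
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, n_workers * batch_size):
            chunk = order[start:start + n_workers * batch_size]
            deltas = []
            for wk in range(n_workers):
                idx = chunk[wk * batch_size:(wk + 1) * batch_size]
                if len(idx) == 0:
                    continue
                local = w.copy()                       # stale private snapshot, no locking
                for i in idx:
                    grad = (X[i] @ local - y[i]) * X[i]
                    local -= lr * grad                 # many cheap local updates
                deltas.append(local - w)
            for delta in deltas:                       # one shared write per worker per batch
                w += delta / n_workers
    return w
```

The point of the structure is that the inner loop touches only worker-private data, so on real hardware the shared model (and the cache lines holding it) is contended only once per batch rather than once per sample.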
Bayesian optimization is an effective methodology for the global optimization of functions with expensive evaluations. It relies on querying a distribution over functions defined by a relatively cheap surrogate model. An accurate model for this distribution over functions is critical to the effectiveness of the approach, and it is typically fit using Gaussian processes (GPs). However, since GPs scale cubically with the number of observations, it has been challenging to handle objectives whose optimization requires many evaluations, and, as such, to massively parallelize the optimization. In this work, we explore the use of neural networks as an alternative to GPs for modeling distributions over functions. We show that performing adaptive basis function regression with a neural network as the parametric form performs competitively with state-of-the-art GP-based approaches, but scales linearly with the number of observations rather than cubically. This allows us to achieve a previously intractable degree of parallelism in the optimization.
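A minimal sketch of the adaptive basis function regression idea: treat features produced by a trained neural network's last hidden layer as basis functions and fit a Bayesian linear model on top of them. In the snippet below, the feature matrix `Phi` is assumed to have been computed by such a network elsewhere, and the hyperparameters `alpha` and `beta` are placeholders; the key point is that fitting and prediction are linear in the number of observations N, with the cubic cost confined to the small basis dimension D.

```python
import numpy as np

def bayesian_linear_regression(Phi, y, alpha=1.0, beta=25.0):
    """Bayesian linear regression on basis features Phi (N x D).

    alpha: prior precision on the weights; beta: observation noise precision.
    Cost is O(N * D^2 + D^3), i.e. linear in N, unlike a GP's O(N^3)."""
    D = Phi.shape[1]
    A = alpha * np.eye(D) + beta * Phi.T @ Phi          # D x D posterior precision
    m = beta * np.linalg.solve(A, Phi.T @ y)            # posterior mean weights
    return m, A

def predict(phi_star, m, A, beta=25.0):
    """Posterior predictive mean and variance at new basis features phi_star (length D)."""
    mean = phi_star @ m
    var = 1.0 / beta + phi_star @ np.linalg.solve(A, phi_star)
    return mean, var
```

The predictive variances from `predict` are what an acquisition function would consume when proposing many candidate evaluations in parallel.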
The application of deep learning techniques has resulted in remarkable improvements in machine learning models. This paper provides detailed characterizations of deep learning models used in many Facebook social network services. We present the computational characteristics of our models, describe high-performance optimizations targeting existing systems, point out their limitations, and make suggestions for future general-purpose and accelerated inference hardware. We also highlight the need for better co-design of algorithms, numerics, and computing platforms to address the challenges of workloads often run in data centers.
Concurrency control on B+-trees is primarily achieved with latches, but serialization and contention can hinder scalability. As core counts on current processors increase, it is imperative to develop scalable latch-free techniques for concurrency control. We present PALM, a novel technique for performing multiple concurrent queries on in-memory B+-trees. PALM is based on the Bulk Synchronous Parallel model, which guarantees freedom from deadlocks and race conditions. Input queries are grouped and processed in atomic batches, and work proceeds in stages that preclude contention. Transitions between stages are accomplished with scalable point-to-point communication. PALM exploits data- and thread-level parallelism on modern many-core architectures, and performs 40M updates/second on trees with 128M keys, and 128M updates/second on trees with 512K keys, on the latest CPU architectures. Our throughput is 2.3X--19X that of state-of-the-art concurrent update algorithms on in-memory B+-trees.
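A full B+-tree is too long to reproduce here, but the staged, batch-oriented pattern the abstract describes can be sketched compactly. In the toy Python code below, a sorted list of leaves stands in for the tree: stage 1 routes every query in the batch to its leaf with read-only searches, and stage 2 groups operations so that each leaf is owned by a single worker, which is why no latches are needed (the groups are processed serially here). The real PALM adds further stages that propagate splits and merges up the tree; the data layout and names are assumptions for illustration only.

```python
from bisect import bisect_right
from collections import defaultdict

def palm_style_batch(leaves, queries):
    """Process a batch of point queries/updates in contention-free stages (sketch).

    leaves:  list of (lo_key, dict) pairs sorted by lo_key, standing in for B+-tree leaves.
    queries: list of ('get' | 'put', key, value) tuples forming one atomic batch."""
    lows = [lo for lo, _ in leaves]

    # Stage 1: every query independently finds its target leaf (read-only traversal).
    groups = defaultdict(list)
    for op, key, val in queries:
        leaf = max(0, bisect_right(lows, key) - 1)     # last leaf with lo_key <= key
        groups[leaf].append((op, key, val))

    # Stage 2: each leaf's group is applied as a unit; since no two groups share a
    # leaf, the groups could be handed to different workers without any locking.
    results = {}
    for leaf, ops in groups.items():
        data = leaves[leaf][1]
        for op, key, val in ops:
            if op == 'put':
                data[key] = val
            else:
                results[key] = data.get(key)
    return results
```

Grouping by leaf before touching any data is what turns a contended, latch-protected structure into a set of independent per-leaf tasks for the duration of the batch.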
Apache Spark is a popular framework for data analytics with attractive features such as fault tolerance and interoperability with the Hadoop ecosystem. Unfortunately, many analytics operations in Spark are an order of magnitude or more slower than native implementations written with high-performance computing tools such as MPI. There is a need to bridge this performance gap while retaining the benefits of the Spark ecosystem, such as availability, productivity, and fault tolerance. In this paper, we propose a system for integrating MPI with Spark and analyze the costs and benefits of doing so for four distributed graph and machine learning applications. We show that offloading computation to an MPI environment from within Spark provides 3.1−17.7× speedups on the four sparse applications, including all of the overheads. This opens up an avenue to reuse existing MPI libraries in Spark with little effort.
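The abstract describes offloading computation from Spark to an MPI environment; the PySpark snippet below is only a schematic of that offload pattern, under the simplifying assumption that data moves through the driver and the local file system. The paper's system transfers data far more efficiently and preserves fault tolerance; `mpi_program`, the pickle format, and the file-based hand-off are illustrative choices, not the paper's mechanism.

```python
import os
import pickle
import subprocess
import tempfile

def offload_to_mpi(rdd, mpi_program, n_ranks=16):
    """Schematic Spark-to-MPI offload: materialize an RDD, hand it to a native
    MPI job, and bring the result back as a new RDD. Purely illustrative."""
    with tempfile.TemporaryDirectory() as tmp:
        inp = os.path.join(tmp, "in.pkl")
        out = os.path.join(tmp, "out.pkl")
        with open(inp, "wb") as f:
            pickle.dump(rdd.collect(), f)              # gather the data on the driver
        subprocess.run(["mpirun", "-n", str(n_ranks), mpi_program, inp, out],
                       check=True)                     # run the native MPI kernel
        with open(out, "rb") as f:
            result = pickle.load(f)
    return rdd.context.parallelize(result)             # back into the Spark world
```

Even in this crude form, the pattern shows where the overheads the abstract counts come from: serializing data out of Spark, launching the MPI job, and reimporting the results.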