- research-article, June 2022
Preparing for performance analysis at exascale
ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing, Article No. 34, Pages 1–13. https://doi.org/10.1145/3524059.3532397
Performance tools for emerging heterogeneous exascale platforms must address two principal challenges when analyzing execution measurements. First, measurement of large-scale executions may record mountains of performance data. Second, performance ...
- research-article, June 2022
Low overhead and context sensitive profiling of GPU-accelerated applications
ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing, Article No. 1, Pages 1–13. https://doi.org/10.1145/3524059.3532388
As we near the end of Moore's law scaling, the next-generation computing platforms are increasingly exploring heterogeneous processors for acceleration. Graphics Processing Units (GPUs) are the most widely used accelerators. Meanwhile, applications are ...
- research-article, June 2020
Tools for top-down performance analysis of GPU-accelerated applications
ICS '20: Proceedings of the 34th ACM International Conference on Supercomputing, Article No. 26, Pages 1–12. https://doi.org/10.1145/3392717.3392752
This paper describes extensions to Rice University's HPCToolkit performance tools to support measurement and analysis of GPU-accelerated applications. To help developers understand the performance of accelerated applications as a whole, HPCToolkit's ...
- research-article, June 2018
Automated Analysis of Time Series Data to Understand Parallel Program Behaviors
ICS '18: Proceedings of the 2018 International Conference on Supercomputing, Pages 240–251. https://doi.org/10.1145/3205289.3205308
Traditionally, performance analysis tools have focused on collecting measurements, attributing them to program source code, and presenting them; responsibility for analysis and interpretation of measurement data falls to application developers. While ...
- proceeding, June 2015
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing
Welcome to the 29th ACM International Conference on Supercomputing (ICS), June 8-11, 2015 at Newport Beach, CA. ICS is well known as the premier technical forum where researchers present their latest results and share with colleagues their perspectives ...
- research-article, June 2014
Author retrospective: compilation techniques for block-cyclic distributions
ACM International Conference on Supercomputing 25th Anniversary Volume, Pages 29–31. https://doi.org/10.1145/2591635.2591651
Compilers for data-parallel languages use data distribution specifications to guide code generation for distributed-memory machines. Our 1994 paper described how to generate efficient code for programs that employ block-cyclic data distributions. In ...
- research-article, June 2014
Author retrospective for PTRAN's analysis and optimization techniques
ACM International Conference on Supercomputing 25th Anniversary Volume, Pages 1–3. https://doi.org/10.1145/2591635.2591638
The PTRAN (Parallel Translator) system at IBM had as its goal the analysis and optimization of sequential programs for parallel architectures. In this paper, we give our perspective on what has changed since PTRAN, and what is still relevant.
- research-article, June 2013
A new approach for performance analysis of OpenMP programs
ICS '13: Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, Pages 69–80. https://doi.org/10.1145/2464996.2465433
The number of hardware threads is growing with each new generation of multicore chips; thus, one must effectively use threads to fully exploit emerging processors. OpenMP is a popular directive-based programming model that helps programmers exploit ...
- research-article, May 2011
Scalable fine-grained call path tracing
ICS '11: Proceedings of the International Conference on Supercomputing, Pages 63–74. https://doi.org/10.1145/1995896.1995908
Applications must scale well to make efficient use of even medium-scale parallel systems. Because scaling problems are often difficult to diagnose, there is a critical need for scalable tools that guide scientists to the root causes of performance ...
- research-article, June 2009
Chunking parallel loops in the presence of synchronization
ICS '09: Proceedings of the 23rd International Conference on Supercomputing, Pages 181–192. https://doi.org/10.1145/1542275.1542304
Modern languages for shared-memory parallelism are moving from a bulk-synchronous Single Program Multiple Data (SPMD) execution model to lightweight Task Parallel execution models for improved productivity. This shift is intended to encourage ...
- research-article, June 2008
Phasers: a unified deadlock-free construct for collective and point-to-point synchronization
ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing, Pages 277–288. https://doi.org/10.1145/1375527.1375568
Coordination and synchronization of parallel tasks is a major source of complexity in parallel programming. These constructs take many forms in practice including mutual exclusion in accesses to shared resources, termination detection of child tasks, ...
- Article, June 2007
Scalability analysis of SPMD codes using expectations
ICS '07: Proceedings of the 21st Annual International Conference on Supercomputing, Pages 13–22. https://doi.org/10.1145/1274971.1274976
We present a new technique for identifying scalability bottlenecks in executions of single-program, multiple-data (SPMD) parallel programs, quantifying their impact on performance, and associating this information with the program source code. Our ...
- Article, June 2006
Profitable loop fusion and tiling using model-driven empirical search
ICS '06: Proceedings of the 20th Annual International Conference on Supercomputing, Pages 249–258. https://doi.org/10.1145/1183401.1183437
Loop fusion and tiling are both recognized as effective transformations for improving memory performance of scientific applications. However, because of their sensitivity to the underlying cache architecture and their interaction with each other, it is ...
- Article, June 2005
Low-overhead call path profiling of unmodified, optimized code
ICS '05: Proceedings of the 19th Annual International Conference on Supercomputing, Pages 81–90. https://doi.org/10.1145/1088149.1088161
Call path profiling associates resource consumption with the calling context in which resources were consumed. We describe the design and implementation of a low-overhead call path profiler based on stack sampling. The profiler uses a novel sample-...
- Article, June 2002
Experiences tuning SMG98: a semicoarsening multigrid benchmark based on the hypre library
ICS '02: Proceedings of the 16th International Conference on Supercomputing, Pages 305–314. https://doi.org/10.1145/514191.514233
LLNL's hypre library is an object-oriented library for the solution of sparse linear systems on parallel computers. While hypre facilitates rapid-prototyping of complex parallel applications, our experience is that without careful attention to temporal ...
- Article, June 2001
Tools for application-oriented performance tuning
ICS '01: Proceedings of the 15th International Conference on Supercomputing, Pages 154–165. https://doi.org/10.1145/377792.377826
Application performance tuning is a complex process that requires assembling various types of information and correlating it with source code to pinpoint the causes of performance bottlenecks. Existing performance tools don't adequately support this ...
- Article, June 2001
Optimizing strategies for telescoping languages: procedure strength reduction and procedure vectorization
ICS '01: Proceedings of the 15th International Conference on Supercomputing, Pages 92–101. https://doi.org/10.1145/377792.377812
At Rice University, we have undertaken a project to construct a framework for generating high-level problem solving languages that can achieve high performance on a variety of platforms. The underlying strategy, called telescoping languages, builds ...
- Article, May 2000
Fast greedy weighted fusion
ICS '00: Proceedings of the 14th International Conference on Supercomputing, Pages 131–140. https://doi.org/10.1145/335231.335244
Loop fusion is important to optimizing compilers because it is an important tool in managing the memory hierarchy. By fusing loops that use the same data elements, we can reduce the distance between accesses to the same datum and avoid costly cache ...