Newsletter Downloads
Using simulation to design extreme-scale applications and architectures: programming model exploration
A key problem facing application developers is that they are expected to utilize extreme levels of parallelism soon after delivery of future leadership class machines, but developing applications capable of exposing sufficient concurrency is a time ...
Performance analysis of the OP2 framework on many-core architectures
We present a performance analysis and benchmarking study of the OP2 "active" library, which provides an abstraction framework for the solution of parallel unstructured mesh applications. OP2 aims to decouple the scientific specification of the ...
Benchmarking and modelling of POWER7, Westmere, BG/P, and GPUs: an industry case study
This paper introduces an industry strength, multi-purpose, benchmark: Shamrock. Developed at the Atomic Weapons Establishment (AWE), Shamrock is a two dimensional (2D) structured hydrocode; one of its aims is to assess the impacts of a change in ...
Performance analysis of a hybrid MPI/CUDA implementation of the NAS-LU benchmark
We present the performance analysis of a port of the LU benchmark from the NAS Parallel Benchmark (NPB) suite to NVIDIA's Compute Unified Device Architecture (CUDA), and report on the optimisation efforts employed to take advantage of this platform. ...
Memory Trace Compression and Replay for SPMD Systems using Extended PRSDs
Concurrency levels in large-scale supercomputers are rising exponentially, and shared-memory nodes with hundreds of cores and non-uniform memory access latencies are expected within the next decade. However, even current petascale systems with tens of ...
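The core idea behind regular section descriptors (RSDs), which the PRSDs in the title extend, is to replace a long strided address sequence with a compact (base, stride, count) tuple. The C sketch below is a minimal single-level illustration of that idea only; it is not the nested, parameterized PRSD encoding or the replay machinery the paper describes, and all names in it are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical single-level regular section descriptor: represents the
 * address sequence base, base+stride, ..., base+(count-1)*stride. */
typedef struct {
    uint64_t base;
    int64_t  stride;
    uint64_t count;
} rsd_t;

/* Greedily compress a raw address trace into RSDs.
 * Returns the number of descriptors written to out. */
static size_t rsd_compress(const uint64_t *trace, size_t n, rsd_t *out)
{
    size_t n_out = 0;
    size_t i = 0;
    while (i < n) {
        rsd_t d = { trace[i], 0, 1 };
        if (i + 1 < n) {
            d.stride = (int64_t)(trace[i + 1] - trace[i]);
            while (i + d.count < n &&
                   (int64_t)(trace[i + d.count] - trace[i + d.count - 1]) == d.stride)
                d.count++;
        }
        out[n_out++] = d;
        i += d.count;
    }
    return n_out;
}

int main(void)
{
    /* A unit-stride scan followed by a larger strided access pattern. */
    uint64_t trace[] = { 0, 8, 16, 24, 32, 1000, 1064, 1128, 1192 };
    rsd_t out[sizeof trace / sizeof trace[0]];
    size_t m = rsd_compress(trace, sizeof trace / sizeof trace[0], out);
    for (size_t k = 0; k < m; k++)
        printf("base=%llu stride=%lld count=%llu\n",
               (unsigned long long)out[k].base, (long long)out[k].stride,
               (unsigned long long)out[k].count);
    return 0;
}
```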
The structural simulation toolkit
A. F. Rodrigues, K. S. Hemmert, B. W. Barrett, C. Kersey, R. Oldfield, M. Weston, R. Risen, J. Cook, P. Rosenfeld, E. Cooper-Balis, B. Jacob
As supercomputers grow, understanding their behavior and performance has become increasingly challenging. New hurdles in scalability, programmability, power consumption, reliability, cost, and cooling are emerging, along with new technologies such as 3D ...
Parallel memory prediction for fused linear algebra kernels
The performance of many scientific programs is limited by data movement. Loop fusion is one optimization used to increase the speed of memory bound operations. To automate loop fusion for matrix computations, we developed the Build to Order (BTO) ...
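As a hedged illustration of the loop fusion the abstract refers to (not BTO's actual generated code), consider two BLAS-like vector updates that each stream over the same input. Fusing them halves the number of passes over memory; the function names below are made up for the example.

```c
#include <stddef.h>

/* Unfused: two separate passes, so x is read from memory twice. */
void axpy_then_scale_unfused(size_t n, double alpha, double beta,
                             const double *x, double *y, double *z)
{
    for (size_t i = 0; i < n; i++)      /* pass 1: y += alpha * x */
        y[i] += alpha * x[i];
    for (size_t i = 0; i < n; i++)      /* pass 2: z  = beta  * x */
        z[i] = beta * x[i];
}

/* Fused: one pass, so each x[i] is loaded once and reused for both updates. */
void axpy_then_scale_fused(size_t n, double alpha, double beta,
                           const double *x, double *y, double *z)
{
    for (size_t i = 0; i < n; i++) {
        double xi = x[i];
        y[i] += alpha * xi;
        z[i] = beta * xi;
    }
}
```

For memory-bound kernels like these the fused version's advantage comes entirely from reduced data movement, which is the effect the abstract attributes to loop fusion.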
A fast GEMM implementation on the Cypress GPU
We present benchmark results of optimized dense matrix multiplication kernels for Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision. Our SGEMM and DGEMM kernels show ~2 Tflop/s and ...
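For readers unfamiliar with the kernel being benchmarked, GEMM computes C = alpha*A*B + beta*C. The sketch below is a plain, unoptimized C reference implementation included only to define the operation; it bears no relation to the tuned Cypress GPU kernels the abstract reports on.

```c
/* Naive reference GEMM: C = alpha * A * B + beta * C, with row-major
 * m x k, k x n and m x n matrices. Optimized implementations block for
 * cache/registers and vectorize; this version only defines the operation. */
void gemm_ref(int m, int n, int k, double alpha, const double *A,
              const double *B, double beta, double *C)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```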
Performance characteristics of hybrid MPI/OpenMP implementations of NAS parallel benchmarks SP and BT on large-scale multicore supercomputers
The NAS Parallel Benchmarks (NPB) are well-known applications with the fixed algorithms for evaluating parallel systems and tools. Multicore supercomputers provide a natural programming paradigm for hybrid programs, whereby OpenMP can be used with the ...
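The hybrid model the abstract discusses pairs MPI between nodes with OpenMP threads inside each node. Below is a minimal, self-contained sketch of that structure (not code from NPB SP or BT): each rank initializes MPI with thread support, computes a local partial sum with an OpenMP parallel loop, and reduces across ranks.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Request thread support because OpenMP threads run inside each rank. */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank owns a slice of a global index space (remainder ignored). */
    const long n_global = 1L << 24;
    long local = n_global / size;
    long begin = rank * local;

    double partial = 0.0;
    /* Intra-node parallelism: OpenMP threads share the rank's slice. */
    #pragma omp parallel for reduction(+:partial)
    for (long i = begin; i < begin + local; i++)
        partial += 1.0 / (double)(i + 1);

    /* Inter-node parallelism: MPI combines the per-rank results. */
    double total = 0.0;
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("partial harmonic sum = %f (%d ranks x %d threads)\n",
               total, size, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}
```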
A framework for architecture-level power, area, and thermal simulation and its application to network-on-chip design exploration
We describe the integrated power, area and thermal modeling framework in the Structural Simulation Toolkit (SST) for large-scale high performance computer simulation. It integrates various power and thermal modeling tools and computes run-time energy ...
Should we worry about memory loss?
In recent years the High Performance Computing (HPC) industry has benefited from the development of higher density multi-core processors. With recent chips capable of executing up to 32 tasks in parallel, this rate of growth also shows no sign of ...
A statistical performance model of the Opteron processor
Cycle-accurate simulation is the dominant methodology for processor design space analysis and performance prediction. However, with the prevalence of multi-core, multi-threaded architectures, this method has become highly impractical as the sole means ...
Preliminary design examination of the ParalleX system from a software and hardware perspective
Exascale systems, expected to emerge by the end of the next decade, will require the exploitation of billion-way parallelism at multiple hierarchical levels in order to achieve the desired sustained performance. While traditional approaches to ...
Energy-aware metrics for benchmarking heterogeneous systems
With the advent of heterogeneous computing systems consisting of multi-core CPUs and many-core GPUs, robust methods are needed to facilitate fair benchmark comparisons between different systems. In this paper we present a benchmarking methodology for ...
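Two figures of merit commonly used for such comparisons are throughput per watt and the energy-delay product (EDP); whether the paper uses exactly these metrics is not stated in the snippet, so treat the helper below as a generic illustration with hypothetical names and made-up numbers.

```c
#include <stdio.h>

/* Generic energy-aware figures of merit for one benchmark run.
 * energy_j    : measured energy to solution in joules
 * runtime_s   : wall-clock time in seconds
 * flops_total : useful floating-point operations performed */
typedef struct {
    double flops_per_watt;  /* throughput per unit power    */
    double edp;             /* energy-delay product (J * s) */
} energy_metrics_t;

static energy_metrics_t energy_metrics(double energy_j, double runtime_s,
                                       double flops_total)
{
    energy_metrics_t m;
    double avg_power_w = energy_j / runtime_s;           /* P = E / t       */
    m.flops_per_watt   = (flops_total / runtime_s) / avg_power_w;
    m.edp              = energy_j * runtime_s;           /* lower is better */
    return m;
}

int main(void)
{
    /* Made-up numbers: a 120 s run consuming 30 kJ for 1e13 flops. */
    energy_metrics_t m = energy_metrics(30000.0, 120.0, 1e13);
    printf("GFLOPS/W = %.2f, EDP = %.0f J*s\n", m.flops_per_watt / 1e9, m.edp);
    return 0;
}
```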