Research Interests:
We investigate operating system noise, which we identify as one of the main reasons for a lack of synchronicity in parallel applications. Using a microbenchmark, we measure the noise on several contemporary platforms and find that, even with a general-purpose operating system, noise can be limited if certain precautions are taken. We then inject artificially generated noise into a massively parallel system and measure its influence on the performance of collective operations. Our experiments indicate that on extreme-scale platforms, the performance is correlated with the largest interruption to the application, even if the probability of such an interruption on a single process is extremely small. We demonstrate that synchronizing the noise can significantly reduce its negative influence.
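As a rough illustration of the measurement idea (not the authors' actual microbenchmark), a fixed-work detour loop can expose noise by timestamping a tight loop and attributing any unusually large gap to external interference. The threshold and duration below are arbitrary assumptions:

```python
import time

def measure_noise(duration_s=0.1, threshold_ns=5_000):
    """Detour-style noise probe: repeatedly read a high-resolution
    clock; gaps much larger than a normal loop iteration are
    attributed to OS interference (interrupts, daemons, scheduling).
    Returns the list of observed detour lengths in nanoseconds."""
    detours = []
    end = time.perf_counter_ns() + int(duration_s * 1e9)
    prev = time.perf_counter_ns()
    while prev < end:
        now = time.perf_counter_ns()
        gap = now - prev
        if gap > threshold_ns:       # far longer than one iteration
            detours.append(gap)
        prev = now
    return detours

detours = measure_noise()
```

On a quiet, pinned core the list stays short; on a noisy general-purpose system the detour count and the largest gap grow, which is exactly the quantity the abstract says extreme-scale performance correlates with.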
Cloud resources promise to be an avenue to address new categories of scientific applications, including data-intensive science applications, on-demand/surge computing, and applications that require customized software environments. However, there is a limited understanding of how to operate and use clouds for scientific applications. Magellan, a project funded through the Department of Energy's (DOE) Advanced Scientific Computing Research (ASCR) program, is investigating the use of cloud computing for science at the Argonne Leadership Computing Facility (ALCF) and the National Energy Research Scientific Computing Center (NERSC). In this paper, we detail the experiences to date at both sites and identify the gaps and open challenges from both a resource provider as well as an application perspective.
Argonne National Laboratory is owned by the United States Government and operated by The University of Chicago under the provisions of a contract with the Department of Energy. Disclaimer: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor The University of Chicago, nor any of their employees or officers, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of document authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
The goal of Magellan, a project funded through the Department of Energy's Office of Advanced Scientific Computing Research (ASCR), was to investigate the potential role of cloud computing in addressing the computing needs of the DOE Office of Science.
Mira, Argonne's petascale IBM Blue Gene/Q system, ushers in a new era of scientific supercomputing at the Argonne Leadership Computing Facility. An engineering marvel, the 10-petaflops supercomputer is capable of carrying out 10 quadrillion calculations per second. As a machine for open science, any researcher with a question that requires large-scale computing resources can submit a proposal for time on Mira, typically in allocations of millions of core-hours, to run programs for their experiments. This adds up to billions of hours of computing time per year.
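The "billions of hours" figure can be checked with back-of-envelope arithmetic. The node and core counts below come from public Blue Gene/Q specifications, not from the abstract itself:

```python
# Rough capacity check for Mira (assumed public specs, not from the text):
nodes = 49_152                 # Blue Gene/Q node count
cores_per_node = 16            # PowerPC A2 cores per node
hours_per_year = 24 * 365

core_hours = nodes * cores_per_node * hours_per_year
print(core_hours)              # 6_889_144_320, i.e. ~6.9 billion core-hours/year
```

So even a single allocation of tens of millions of core-hours consumes only a small fraction of the machine's annual capacity.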
The Advanced Computing Lab at Los Alamos National Laboratory is engaged in cluster-based systems development to support large-scale applications and to learn how future systems and software may be designed. In this paper we describe our large SGI cluster and our experimental Linux cluster, outline some important software available on both of them, and discuss some performance results.
NeuroBuilder is a comprehensive package for building neurobiologically based networks. It gives the user a range of options for specifying both the single-unit dynamics (from simple neural-network-style units to complex dynamics) and the three-dimensional structure and connectivity needed to describe any biologically based system.
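The kind of interface this describes, pluggable per-unit dynamics plus explicit 3-D positions and connectivity, can be sketched as follows. All names here are hypothetical illustrations, not NeuroBuilder's actual API:

```python
from dataclasses import dataclass, field
import math

@dataclass
class Unit:
    position: tuple          # (x, y, z) location in 3-D space
    dynamics: object         # callable: (state, total_input) -> new state
    state: float = 0.0

@dataclass
class Network:
    units: list = field(default_factory=list)
    edges: list = field(default_factory=list)   # (src, dst, weight)

    def add_unit(self, position, dynamics):
        self.units.append(Unit(position, dynamics))
        return len(self.units) - 1

    def connect(self, src, dst, weight=1.0):
        self.edges.append((src, dst, weight))

    def step(self):
        # gather weighted input for each unit, then apply its dynamics
        inputs = [0.0] * len(self.units)
        for s, d, w in self.edges:
            inputs[d] += w * self.units[s].state
        for u, x in zip(self.units, inputs):
            u.state = u.dynamics(u.state, x)

# Example: two leaky units with a sigmoidal transfer function
leaky = lambda s, x: 0.9 * s + math.tanh(x)
net = Network()
a = net.add_unit((0.0, 0.0, 0.0), leaky)
b = net.add_unit((1.0, 0.0, 0.0), leaky)
net.connect(a, b, weight=0.5)
net.units[a].state = 1.0
net.step()
```

Swapping the `dynamics` callable is what spans the range the abstract mentions, from simple rate units up to arbitrarily complex single-cell models, while the edge list carries the spatial connectivity.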
Petascale HPC systems are among the largest systems in the world. Intrepid, one such system, is a 40,000 node, 556 teraflop Blue Gene/P system that has been deployed at Argonne National Laboratory. In this paper, we provide some background about the system and our administration experiences. In particular, due to the scale of the system, we have faced a variety of issues, some surprising to us, that are not common in the commodity world. We discuss our expectations, these issues, and approaches we have used to address them.
This paper presents a framework to support transparent, live migration of virtual GPU accelerators in a virtualized execution environment. Migration is a critical capability in such environments because it provides support for fault tolerance, on-demand system maintenance, resource management, and load balancing in the mapping of virtual to physical GPUs. Techniques to increase responsiveness and reduce migration overhead are explored. The system is evaluated by using four application kernels and is demonstrated to provide low migration overheads. Through transparent load balancing, our system provides a speedup of 1.7 to 1.9 for three of the four application kernels.
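Transparency in such a design rests on one level of indirection: the application holds a stable virtual handle while a proxy maps it to a physical backend that can be swapped after quiescing in-flight calls. The sketch below illustrates that pattern only; the class and method names are invented for illustration, not taken from the paper's framework:

```python
import threading

class VirtualGPU:
    """Stable handle the application keeps across migrations."""
    def __init__(self, backend):
        self._backend = backend
        self._lock = threading.Lock()   # blocks calls while migrating

    def launch(self, kernel):
        with self._lock:
            return self._backend.launch(kernel)

    def migrate(self, new_backend):
        # quiesce callers, move device state, switch the mapping
        with self._lock:
            new_backend.restore(self._backend.checkpoint())
            self._backend = new_backend

class FakeBackend:
    """Stand-in for a physical GPU; real systems would checkpoint
    device memory and driver state here."""
    def __init__(self, name):
        self.name, self.state = name, {}
    def launch(self, kernel):
        return f"{kernel} on {self.name}"
    def checkpoint(self):
        return dict(self.state)
    def restore(self, state):
        self.state = state

gpu = VirtualGPU(FakeBackend("gpu0"))
gpu.launch("saxpy")                  # served by gpu0
gpu.migrate(FakeBackend("gpu1"))     # backend swapped under the lock
result = gpu.launch("saxpy")         # same handle, now served by gpu1
```

Load balancing then reduces to calling `migrate` when a physical GPU is oversubscribed; the migration overhead the paper measures corresponds to the time spent inside that locked region.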
Achieving high performance for distributed I/O on a wide-area network continues to be an elusive holy grail. Despite enhancements in network hardware as well as software stacks, achieving high performance remains a challenge. In this paper, our worldwide team took a completely new and non-traditional approach to distributed I/O, called ParaMEDIC (Parallel Metadata Environment for Distributed I/O and Computing), which uses application-specific transformation of data into orders-of-magnitude smaller metadata before ...
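The core transformation idea can be shown with a toy stand-in (not the paper's implementation): when both sites hold a replicated database, results can cross the wide-area link as small indices ("metadata") and be regenerated in full at the far end. The database, function names, and record format below are all invented for illustration:

```python
import json

# Assumed replicated at both the compute and storage sites
database = ["rec-%04d" % i for i in range(10_000)]

def to_metadata(results):
    # Ship only the records' positions in the shared database,
    # not the records themselves, across the WAN.
    return json.dumps([database.index(r) for r in results])

def from_metadata(wire):
    # Regenerate the full results at the remote site.
    return [database[i] for i in json.loads(wire)]

results = [database[i] for i in (3, 42, 9001)]
wire = to_metadata(results)          # much smaller than the raw payload
restored = from_metadata(wire)
```

Real ParaMEDIC transforms are application-specific (the savings come from domain knowledge about what the remote site can recompute), but the shape of the trade is the same: a little extra computation at both ends in exchange for orders-of-magnitude less data on the wire.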