
    Susan Coghlan

    We investigate operating system noise, which we identify as one of the main reasons for a lack of synchronicity in parallel applications. Using a microbenchmark, we measure the noise on several contemporary platforms and find that, even with a general-purpose operating system, noise can be limited if certain precautions are taken. We then inject artificially generated noise into a massively parallel system and measure its influence on the performance of collective operations. Our experiments indicate that on extreme-scale platforms, the performance is correlated with the largest interruption to the application, even if the probability of such an interruption on a single process is extremely small. We demonstrate that synchronizing the noise can significantly reduce its negative influence.
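    The measurement idea can be illustrated with a fixed-work-quantum probe: time a constant amount of computation many times and treat any sample that runs noticeably longer than the fastest one as an interruption. The C sketch below is illustrative only and is not the paper's microbenchmark; the sample count, the work size, and the use of the gap between minimum and maximum as a proxy for the largest interruption are all arbitrary choices.

```c
/* A minimal fixed-work-quantum style noise probe: repeatedly time a constant
 * amount of arithmetic; samples that take noticeably longer than the minimum
 * indicate interruptions (OS noise). Illustrative sketch only. */
#include <stdio.h>
#include <time.h>

#define SAMPLES 100000
#define WORK    20000          /* iterations of dummy work per sample */

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    static double elapsed[SAMPLES];
    volatile double x = 1.0;

    for (int s = 0; s < SAMPLES; s++) {
        double t0 = now_ns();
        for (int i = 0; i < WORK; i++)      /* fixed work quantum */
            x = x * 1.0000001 + 0.0000001;
        elapsed[s] = now_ns() - t0;
    }

    double min = elapsed[0], max = elapsed[0];
    for (int s = 1; s < SAMPLES; s++) {
        if (elapsed[s] < min) min = elapsed[s];
        if (elapsed[s] > max) max = elapsed[s];
    }
    /* max - min approximates the largest single interruption observed */
    printf("min %.0f ns  max %.0f ns  largest detour %.0f ns\n",
           min, max, max - min);
    return 0;
}
```

    Pinning the process to a single core and repeating the run makes the detour statistic more reproducible.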
    Cloud resources promise to be an avenue to address new categories of scientific applications, including data-intensive science applications, on-demand/surge computing, and applications that require customized software environments. However, there is limited understanding of how to operate and use clouds for scientific applications. Magellan, a project funded through the Department of Energy's (DOE) Advanced Scientific Computing Research (ASCR) program, is investigating the use of cloud computing for science at the Argonne Leadership Computing Facility (ALCF) and the National Energy Research Scientific Computing Facility (NERSC). In this paper, we detail the experiences to date at both sites and identify the gaps and open challenges from both a resource provider and an application perspective.
    Argonne National Laboratory is owned by the United States Government and operated by The University of Chicago under the provisions of a contract with the Department of Energy. Disclaimer: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor The University of Chicago, nor any of their employees or officers, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of document authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
    The goal of Magellan, a project funded through the Department of Energy's (DOE) Advanced Scientific Computing Research (ASCR) program, was to investigate the potential role of cloud computing in addressing the computing needs of scientific applications.
    Mira, Argonne's petascale IBM Blue Gene/Q system, ushers in a new era of scientific supercomputing at the Argonne Leadership Computing Facility. An engineering marvel, the 10-petaflops supercomputer is capable of carrying out 10 quadrillion calculations per second. As a machine for open science, Mira is available to any researcher with a question that requires large-scale computing resources; researchers submit proposals for time on Mira, typically in allocations of millions of core-hours, to run programs for their experiments. This adds up to billions of hours of computing time per year.
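    For a sense of where the "billions of hours" figure comes from, the sketch below does the arithmetic under an assumed configuration of 49,152 nodes with 16 cores each (the commonly cited figures for Mira, not values stated in this text) running around the clock, which works out to roughly 6.9 billion core-hours per year.

```c
/* Back-of-the-envelope core-hour budget for a Mira-class machine.
 * Node and core counts are assumptions for illustration. */
#include <stdio.h>

int main(void) {
    const double nodes          = 49152;    /* assumed node count     */
    const double cores_per_node = 16;       /* assumed cores per node */
    const double hours_per_year = 24 * 365;

    double core_hours = nodes * cores_per_node * hours_per_year;
    printf("~%.1f billion core-hours per year\n", core_hours / 1e9);
    return 0;
}
```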
    The Advanced Computing Lab at Los Alamos National Laboratory is engaged in cluster-based systems development to support large-scale applications and to learn how future systems and software may be designed. In this paper we describe our large SGI cluster and our experimental Linux cluster, outline some important software available on both of them, and discuss some performance results.
    NeuroBuilder is a comprehensive package for building neurobiologically based networks. It gives the user a range of options for specifying both the single-unit dynamics (from neural-network-like neurons to complex dynamics) and the three-dimensional structure and connectivity needed to describe any biologically based system.
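    As a rough illustration of the two ingredients such a package combines, the sketch below pairs a trivial per-unit dynamical rule with a distance-based connectivity rule over units placed in 3-D space. It is not NeuroBuilder's API; all names and constants are invented for the example.

```c
/* Illustrative sketch (not NeuroBuilder's actual API): units with positions
 * in 3-D space, connected when within a radius, updated with simple leaky
 * dynamics. */
#include <math.h>
#include <stdio.h>

#define N_UNITS 50

typedef struct {
    double x, y, z;   /* position in 3-D space      */
    double v;         /* state of the unit dynamics */
} Unit;

static double dist(const Unit *a, const Unit *b) {
    return sqrt((a->x - b->x) * (a->x - b->x) +
                (a->y - b->y) * (a->y - b->y) +
                (a->z - b->z) * (a->z - b->z));
}

int main(void) {
    Unit u[N_UNITS];
    int  conn[N_UNITS][N_UNITS] = {{0}};

    /* place units on a coarse 3-D grid */
    for (int i = 0; i < N_UNITS; i++) {
        u[i].x = i % 5; u[i].y = (i / 5) % 5; u[i].z = i / 25;
        u[i].v = 0.0;
    }

    /* distance-based connectivity rule */
    for (int i = 0; i < N_UNITS; i++)
        for (int j = 0; j < N_UNITS; j++)
            conn[i][j] = (i != j && dist(&u[i], &u[j]) < 1.5);

    /* a few steps of leaky-integrator dynamics driven by connected units */
    for (int t = 0; t < 10; t++)
        for (int i = 0; i < N_UNITS; i++) {
            double input = 0.1;                 /* constant external drive */
            for (int j = 0; j < N_UNITS; j++)
                if (conn[j][i]) input += 0.01 * u[j].v;
            u[i].v += 0.1 * (input - u[i].v);   /* leaky integration */
        }

    printf("unit 0 state after 10 steps: %f\n", u[0].v);
    return 0;
}
```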
    Petascale HPC systems are among the largest systems in the world. Intrepid, one such system, is a 40,000 node, 556 teraflop Blue Gene/P system that has been deployed at Argonne National Laboratory. In this paper, we provide some background about the system and our administration experiences. In particular, due to the scale of the system, we have faced a variety of issues, some surprising to us, that are not common in the commodity world. We discuss our expectations, these issues, and approaches we have used to address them.
    This paper presents a framework to support transparent, live migration of virtual GPU accelerators in a virtualized execution environment. Migration is a critical capability in such environments because it provides support for fault tolerance, on-demand system maintenance, resource management, and load balancing in the mapping of virtual to physical GPUs. Techniques to increase responsiveness and reduce migration overhead are explored. The system is evaluated by using four application kernels and is demonstrated to provide low migration overheads. Through transparent load balancing, our system provides a speedup of 1.7 to 1.9 for three of the four application kernels.
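    At a conceptual level, live migration of a virtual GPU amounts to quiescing the virtual device, snapshotting its memory and context, re-creating it on another physical GPU, and resuming. The sketch below only simulates those steps with a plain struct; none of the names correspond to the framework described in the paper.

```c
/* Conceptual sketch of the steps in live migration of a virtual GPU.
 * The "device" here is just a struct in host memory. */
#include <stdio.h>
#include <string.h>

typedef struct {
    int    physical_gpu;      /* which physical GPU backs the virtual one */
    size_t mem_bytes;         /* size of device memory to transfer        */
    char   mem[1024];         /* stand-in for device memory contents      */
    int    paused;
} VirtualGPU;

/* Step 1: stop issuing new work and drain in-flight kernels (simulated). */
static void quiesce(VirtualGPU *v) { v->paused = 1; }

/* Steps 2-3: snapshot state and restore it on a different physical GPU. */
static void migrate(const VirtualGPU *src, VirtualGPU *dst, int target_gpu) {
    memcpy(dst, src, sizeof *dst);   /* copy memory image and context */
    dst->physical_gpu = target_gpu;  /* rebind to the new physical GPU */
    dst->paused = 0;                 /* Step 4: resume on the target   */
}

int main(void) {
    VirtualGPU a = { .physical_gpu = 0, .mem_bytes = sizeof a.mem };
    strcpy(a.mem, "application working set");

    quiesce(&a);
    VirtualGPU b;
    migrate(&a, &b, /*target_gpu=*/1);

    printf("resumed on GPU %d, paused=%d, mem=\"%s\"\n",
           b.physical_gpu, b.paused, b.mem);
    return 0;
}
```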
    Power consumption is becoming a critical factor as we continue our quest toward exascale computing. Yet the actual power utilization of a complete system is an insufficiently studied research area. Estimating the power consumption of a large-scale system is a nontrivial task because a large number of components are involved and because power requirements are affected by (unpredictable) workloads. Clearly needed is a power-monitoring infrastructure that can provide timely and accurate feedback to system developers and application writers so that they can optimize the use of this precious resource. Many existing large-scale installations do feature power-monitoring sensors; however, these are part of environmental- and health-monitoring subsystems and were not designed with application-level power consumption measurements in mind. In this paper, we evaluate the existing power monitoring of IBM Blue Gene systems, with the goal of understanding what capabilities are available and how they fare with respect to spatial and temporal resolution, accuracy, latency, and other characteristics. We find that with a careful choice of dedicated microbenchmarks, we can obtain meaningful power consumption data even on Blue Gene/P, where the interval between available data points is measured in minutes. We next evaluate the monitoring subsystem on Blue Gene/Q and are able to study the power characteristics of its FPU and memory subsystems. We find the monitoring subsystem capable of providing second-scale resolution of power data, conveniently separated between node components, with a latency of seven seconds. This represents a significant improvement in power-monitoring infrastructure, and we hope future systems will enable real-time power measurement in order to better understand application behavior at a finer granularity.
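    One way to attribute such coarse-grained samples to a specific workload, sketched below, is to run a micro-benchmark phase long enough to span several sensor readings and average the readings collected during the phase. The sensor-reading function here is a placeholder, and the ten-second update period is an assumption made for the example, not a property of the Blue Gene monitoring subsystem.

```c
/* Illustrative sketch: attribute coarse power samples to a benchmark phase
 * by running the kernel across several sensor readings and averaging them.
 * read_power_watts() is a placeholder for the real monitoring interface. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static double read_power_watts(void) {
    /* Placeholder sensor: in practice this would query the system's
     * environmental/power monitoring interface. */
    return 2000.0;
}

int main(void) {
    const int phase_seconds   = 60;  /* run the kernel for a full minute      */
    const int sample_interval = 10;  /* assumed sensor update period, seconds */
    double    sum = 0.0;
    int       n   = 0;

    time_t start = time(NULL);
    while (time(NULL) - start < phase_seconds) {
        /* ... micro-benchmark kernel (e.g. FPU- or memory-bound loop) ... */
        sum += read_power_watts();
        n++;
        sleep(sample_interval);
    }
    printf("average power over phase: %.1f W (%d samples)\n", sum / n, n);
    return 0;
}
```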
    A varied collection of scientific and engineering codes has been adapted and enhanced to take advantage of the IBM Blue Gene®/Q architecture and thus enable research that was previously out of reach. Computational research teams from a number of disciplines collaborated with the staff of the Argonne Leadership Computing Facility to assess which of Blue Gene/Q's many novel features could be exploited for each application, equipping it to tackle existing problem classes with greater fidelity and, in some cases, to address new phenomena. The quad floating-point units and the five-dimensional torus interconnect are among the features that were demonstrated to be effective for a number of important applications. Furthermore, data obtained from the hardware counters provided insights that were valuable in guiding the code modifications. Hardware features and programming techniques that were effective across multiple codes are documented as well. We confirmed that no significant code rewrite is needed to run today's production codes with good performance on Mira, an IBM Blue Gene/Q supercomputer, and performance improvements are already demonstrated, even though our measurements were all made on pre-production software and hardware. The application domains included biology, materials science, combustion, chemistry, nuclear physics, and the industrial-scale design of nuclear reactors, jet engines, and the efficiency of transportation systems.
    Achieving high performance for distributed I/O on a wide-area network continues to be an elusive holy grail. Despite enhancements in network hardware and software stacks, achieving high performance remains a challenge. In this paper, our worldwide team took a completely new and non-traditional approach to distributed I/O, called ParaMEDIC (Parallel Metadata Environment for Distributed I/O and Computing), which uses application-specific transformation of data into orders-of-magnitude smaller metadata before ...
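    The core idea, as described here, is to trade bulk data movement for a much smaller, application-specific description from which the output can be regenerated at the destination. The sketch below illustrates that trade with a toy filter: only the indices of matching records cross the (notional) wide-area link, and the receiving side reconstructs the records from data it already holds. The record format and filter are invented for the example and are not ParaMEDIC's actual implementation.

```c
/* Conceptual sketch of the metadata-instead-of-data idea: ship compact
 * application-specific metadata (here, just indices of "interesting"
 * records) and regenerate the full output at the destination. */
#include <stdio.h>

#define N_RECORDS 1000000

/* Both ends are assumed to hold the same input; a record's content can be
 * regenerated from its index (here via a toy hash). */
static unsigned record_value(int i) { return (unsigned)i * 2654435761u % 1000u; }

int main(void) {
    /* Compute side: find interesting records, keep only their indices. */
    static int metadata[N_RECORDS];
    int n_meta = 0;
    for (int i = 0; i < N_RECORDS; i++)
        if (record_value(i) == 999u)        /* application-specific filter */
            metadata[n_meta++] = i;

    /* Only the index list crosses the WAN: a few kilobytes of metadata
     * instead of the full result set. */
    printf("shipping %d indices instead of %d records\n", n_meta, N_RECORDS);

    /* I/O side: regenerate the full output locally from the indices. */
    for (int k = 0; k < n_meta && k < 5; k++)
        printf("record %d -> value %u\n", metadata[k], record_value(metadata[k]));
    return 0;
}
```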