Generally speaking, a “module” is used as an “encapsulation mechanism” to tie together a set of declarations of variables and operations upon them. Although there is no standard way to instantiate or use a module, the general idea is that a module describes the implementation of all the values of a given type. We believe that this is too inflexible to provide enough control: one should be able to use different implementations (given by different modules) for variables (and values) of the same type. When incorporated properly into the notation, this finer grain of control allows one to program at a high level of abstraction and then to indicate how various pieces of the program should be implemented. It provides simple, effective access to earlier-written modules, so that they are usable in a more flexible manner than is possible in current notations. It generalizes to provide the ability to indicate structural transformations, in a disciplined fashion, in order to achieve efficiency with respect to time or space. However, the program will still be understood at the abstract level and the transformations or implementations will be looked at only to deal with efficiency concerns. Finally, some so-called “data types” (e.g. stack and subranges of the integers) can more properly be looked upon simply as restricted implementations of more general types (e.g. sequence and integer). Thus, the notion of subtype becomes less important.
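The paper's notation is its own; as a rough illustration in C (not the authors' language), the idea that two variables of one abstract type can bind different implementations can be sketched with a per-variable operations table. All names here (Seq, SeqOps, make_array_seq) are invented for this sketch.

#include <stdio.h>
#include <stdlib.h>

/* One abstract type, many implementations: each variable carries its own
   operations table, so two Seq variables may use different modules. */
typedef struct Seq Seq;
typedef struct {
    void (*push)(Seq *, int);
    int  (*pop)(Seq *);
} SeqOps;
struct Seq { const SeqOps *ops; void *state; };

/* A fixed-capacity array implementation: effectively a stack, i.e. a
   "restricted implementation" of the more general sequence type. */
typedef struct { int data[100]; int n; } ArrayState;
static void array_push(Seq *s, int v) { ArrayState *a = s->state; a->data[a->n++] = v; }
static int  array_pop(Seq *s)         { ArrayState *a = s->state; return a->data[--a->n]; }
static const SeqOps array_ops = { array_push, array_pop };

Seq make_array_seq(void) {
    Seq s = { &array_ops, calloc(1, sizeof(ArrayState)) };
    return s;
}

int main(void) {
    Seq s = make_array_seq();   /* a linked-list module could bind here instead */
    s.ops->push(&s, 42);
    printf("%d\n", s.ops->pop(&s));
    free(s.state);
    return 0;
}

A second module exporting the same SeqOps shape could back another variable of the same abstract type, which is exactly the finer grain of control the abstract argues for.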
An asynchronous work-stealing implementation of dynamic load balancing is developed using Unified Parallel C (UPC) and evaluated using the Unbalanced Tree Search (UTS) benchmark [1]. The UTS benchmark presents a synthetic tree-structured search space that is highly ...
We present a distributed memory parallel implementation of the unbalanced tree search (UTS) benchmark using MPI and investigate MPI's ability to efficiently support irregular and nested parallelism through continuous dynamic load balancing. Two load balancing methods are ...
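The abstract is cut off before describing the two methods, so the sketch below shows only the generic ingredient such schemes share: a busy worker polls for steal requests with MPI_Iprobe and answers with part of its work. The message tags and the integer "work count" are invented for illustration; the paper's actual protocol is not reproduced here.

#include <mpi.h>
#include <stdio.h>

#define TAG_STEAL_REQ 1
#define TAG_WORK      2

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size >= 2) {
        if (rank == 1) {                  /* victim: polls between work units */
            int local_work = 10;          /* pretend queue depth */
            int flag = 0;
            MPI_Status st;
            while (!flag)                 /* real code would expand tree nodes here */
                MPI_Iprobe(0, TAG_STEAL_REQ, MPI_COMM_WORLD, &flag, &st);
            int dummy;
            MPI_Recv(&dummy, 1, MPI_INT, 0, TAG_STEAL_REQ,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            int share = local_work / 2;   /* donate half, keep half */
            local_work -= share;
            MPI_Send(&share, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
        } else if (rank == 0) {           /* thief: idle worker asks for work */
            int dummy = 0, got;
            MPI_Send(&dummy, 1, MPI_INT, 1, TAG_STEAL_REQ, MPI_COMM_WORLD);
            MPI_Recv(&got, 1, MPI_INT, 1, TAG_WORK,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 0 stole %d work units\n", got);
        }
    }
    MPI_Finalize();
    return 0;
}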
The UTS benchmark is used to evaluate the expression and performance of task parallelism in OpenMP 3.0 as implemented in a number of recently released compilers and run-time systems. UTS performs parallel search of an irregular and unpredictable search space, as arises, e.g., in combinatorial optimization problems. As such, UTS presents a highly unbalanced task graph that challenges scheduling, load balancing, termination detection, and task coarsening strategies. Expressiveness and scalability are compared for OpenMP 3.0, Cilk, Cilk++, Intel Threading Building Blocks, as well as an OpenMP implementation of the benchmark without tasks that performs all scheduling, load balancing, and termination detection explicitly. Current OpenMP 3.0 run-time implementations generally exhibit poor behavior on the UTS benchmark. We identify inadequate load balancing strategies and overhead costs as primary factors contributing to poor performance and scalability.
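The task idiom being evaluated looks roughly like the following OpenMP 3.0 sketch. A real UTS tree derives each node's child count from a hash, producing the extreme imbalance the benchmark is about; the fixed binary tree here only shows the task/taskwait structure, and count_nodes is our own name.

#include <stdio.h>

/* Count nodes of a tree by spawning one task per subtree. UTS itself
   computes children from a SHA-1 hash; a balanced tree keeps this short. */
static long count_nodes(int depth) {
    if (depth == 0) return 1;
    long left = 0, right = 0;
    #pragma omp task shared(left)
    left = count_nodes(depth - 1);
    #pragma omp task shared(right)
    right = count_nodes(depth - 1);
    #pragma omp taskwait               /* join children before returning */
    return 1 + left + right;
}

int main(void) {
    long total = 0;
    #pragma omp parallel
    #pragma omp single                 /* one thread seeds the task pool */
    total = count_nodes(16);
    printf("nodes: %ld\n", total);     /* 2^17 - 1 = 131071 */
    return 0;
}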
Multiphase flow implementations of the lattice Boltzmann method (LBM) are widely applied to the study of porous medium systems. In this work, we construct a new variant of the popular “color” LBM for two-phase flow in which a three-dimensional, 19-velocity (D3Q19) lattice is used to compute the momentum transport solution while a three-dimensional, seven-velocity (D3Q7) lattice is used to compute the mass transport solution. Based on this formulation, we implement a novel heterogeneous GPU-accelerated algorithm in which the mass transport solution is computed by multiple shared memory CPU cores programmed using OpenMP while a concurrent solution of the momentum transport is performed using a GPU. The heterogeneous solution is demonstrated to provide speedup of 2.6× as compared to the multi-core CPU solution and 1.8× compared to the GPU solution due to concurrent utilization of both CPU and GPU bandwidth. Furthermore, we verify that the proposed formulation provides an accurate physi...
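The concurrency structure (not the physics) can be sketched as follows; launch_momentum_d3q19 and mass_transport_d3q7_slice are stand-ins for the paper's kernels, not real APIs.

#include <stdio.h>
#include <omp.h>

/* Stand-ins for the paper's kernels; these are not real APIs. */
void launch_momentum_d3q19(void) { /* GPU kernel launch + device sync here */ }
void mass_transport_d3q7_slice(int tid, int nthreads) {
    (void)tid; (void)nthreads;     /* D3Q7 update for this thread's slab */
}

void timestep(void) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int n   = omp_get_num_threads();
        if (tid == 0)
            launch_momentum_d3q19();            /* GPU runs concurrently... */
        else
            mass_transport_d3q7_slice(tid, n);  /* ...with the CPU cores */
    }   /* implicit barrier: both solutions complete before the next step */
}

int main(void) { timestep(); puts("timestep done"); return 0; }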
Understanding on-node application power and performance characteristics is critical to the push toward exascale computing. In this paper, we present an analysis of factors that impact both performance and energy usage of OpenMP applications. Using hardware performance counters in the Intel Sandy Bridge x86-64 architecture, we measure energy usage and power draw for a variety of OpenMP programs: simple micro-benchmarks, a task parallel benchmark suite, and a hydrodynamics mini-app of a few thousand lines. The evaluation reveals substantial variations in energy usage depending on the algorithm, the compiler, the optimization level, the number of threads, and even the temperature of the chip. Variations of 20% were common and in the extreme were over 2X. In most cases, performance increases and energy usage decreases as more threads are used. However, for programs with sub-linear speedup, minimal energy usage often occurs at a lower thread count than peak performance. Our findings info...
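The paper reads energy through hardware performance counters; one accessible stand-in on Linux is the powercap sysfs file shown below. The path varies by system and the counter wraps, which this sketch ignores.

#include <stdio.h>

/* Cumulative package energy in microjoules via Linux powercap; may need
   elevated permissions. The paper itself reads the counters directly. */
static long long read_energy_uj(void) {
    FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    long long uj = -1;
    if (f) { if (fscanf(f, "%lld", &uj) != 1) uj = -1; fclose(f); }
    return uj;
}

int main(void) {
    long long before = read_energy_uj();
    volatile double x = 0;                     /* stand-in for a measured region */
    for (long i = 0; i < 100000000L; i++) x += 1e-9;
    long long after = read_energy_uj();
    if (before >= 0 && after >= 0)
        printf("energy used: %.3f J\n", (after - before) / 1e6);
    else
        printf("powercap interface not readable on this system\n");
    return 0;
}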
Task parallelism raises the level of abstraction in shared memory parallel programming to simplify the development of complex applications. However, task parallel applications can exhibit poor performance due to thread idleness, scheduling overheads, and work time inflation -- additional time spent by threads in a multithreaded computation beyond the time required to perform the same work in a sequential computation. We identify the contributions of each factor to lost efficiency in various task parallel OpenMP applications and diagnose the causes of work time inflation in those applications. Increased data access latency can cause significant work time inflation in NUMA systems. Our locality framework for task parallel OpenMP programs mitigates this cause of work time inflation. Our extensions to the Qthreads library demonstrate that locality-aware scheduling can improve performance up to 3X compared to the Intel OpenMP task scheduler.
Protein function prediction is one of the central problems in computational biology. We present a novel automated protein structure-based function prediction method using libraries of local residue packing patterns that are common to most proteins in a known functional family. Critical to this approach is the representation of a protein structure as a graph where residue vertices (residue name used as a vertex label) are connected by geometrical proximity edges. The approach employs two steps. First, it uses a fast subgraph mining algorithm to find all occurrences of family-specific labeled subgraphs for all well characterized protein structural and functional families. Second, it queries a new structure for occurrences of a set of motifs characteristic of a known family, using a graph index to speed up Ullmann's subgraph isomorphism algorithm. The confidence of function inference from structure depends on the number of family-specific motifs found in the query structure compared...
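The representation step can be sketched directly: residues become labeled vertices, and an edge joins any pair within a distance cutoff. The 8.5 Angstrom cutoff and the single representative coordinate per residue are illustrative assumptions, not the paper's parameters.

#include <stdio.h>

#define MAXEDGES 4096
typedef struct { char name[4]; double x, y, z; } Residue;

/* Connect residues whose representative atoms lie within a cutoff. */
static int build_edges(const Residue *r, int n, int edges[][2]) {
    int m = 0;
    double cutoff2 = 8.5 * 8.5;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            double dx = r[i].x - r[j].x, dy = r[i].y - r[j].y, dz = r[i].z - r[j].z;
            if (dx*dx + dy*dy + dz*dz <= cutoff2 && m < MAXEDGES) {
                edges[m][0] = i; edges[m][1] = j; m++;
            }
        }
    return m;
}

int main(void) {
    Residue r[3] = { {"ALA",0,0,0}, {"GLY",3,0,0}, {"HIS",20,0,0} };
    int edges[MAXEDGES][2];
    int m = build_edges(r, 3, edges);
    printf("%d proximity edge(s): %s-%s\n", m, r[edges[0][0]].name, r[edges[0][1]].name);
    return 0;
}

Mining family-specific subgraphs and matching them with an indexed subgraph isomorphism test then operate on graphs built this way.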
The recent addition of task parallelism to the OpenMP shared memory API allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the OpenMP run-time system. This is a welcome development for scientific computing as supercomputer nodes grow "fatter" with multicore and manycore processors. But efficient scheduling of tasks on modern multi-socket multicore shared memory systems requires careful consideration of an increasingly complex memory hierarchy, including shared caches and NUMA characteristics. In this paper, we propose a hierarchical scheduling strategy that leverages different methods at different levels of the hierarchy. By allowing one thread to steal work on behalf of all of the threads within a single chip that share a cache, our scheduler limits the number of costly remote steals. For cores on the same chip, a shared LIFO queue allows exploitation of cache locality between sibling tasks...
The recent addition of task parallelism to the OpenMP shared memory API allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the OpenMP run-time system. Efficient scheduling of tasks on modern multi-socket multicore shared memory systems requires careful consideration of an increasingly complex memory hierarchy, including shared caches and non-uniform memory access (NUMA) characteristics. In order to evaluate scheduling strategies, we extended the open source Qthreads threading library to implement different scheduler designs, accepting OpenMP programs through the ROSE compiler. Our comprehensive performance study of diverse OpenMP task-parallel benchmarks compares seven different task-parallel run-time scheduler implementations on an Intel Nehalem multi-socket multicore system: our proposed hierarchical work-stealing scheduler, a per-core work-stealing scheduler, a centralized scheduler, and LIFO and FIFO...
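A toy version of one ingredient, a chip-level shared LIFO from which one core can steal a batch on behalf of its siblings, might look like this in C with pthreads; the structures are invented and far simpler than Qthreads' actual schedulers.

#include <pthread.h>
#include <stdio.h>

typedef struct { int tasks[256]; int top; pthread_mutex_t lock; } ChipQueue;

static void cq_push(ChipQueue *q, int t) {
    pthread_mutex_lock(&q->lock);
    q->tasks[q->top++] = t;
    pthread_mutex_unlock(&q->lock);
}

static int cq_pop(ChipQueue *q) {     /* LIFO: newest (cache-hot) task first */
    pthread_mutex_lock(&q->lock);
    int t = q->top > 0 ? q->tasks[--q->top] : -1;
    pthread_mutex_unlock(&q->lock);
    return t;
}

/* One steal moves a batch, serving every core on the thief's chip. Tasks
   are buffered locally so the two locks are never held at once. */
static int cq_steal_half(ChipQueue *victim, ChipQueue *thief) {
    int buf[128], n = 0;
    pthread_mutex_lock(&victim->lock);
    int take = victim->top / 2;
    while (n < take) buf[n++] = victim->tasks[--victim->top];
    pthread_mutex_unlock(&victim->lock);
    for (int i = 0; i < n; i++) cq_push(thief, buf[i]);
    return n;
}

int main(void) {
    ChipQueue a = {.top = 0, .lock = PTHREAD_MUTEX_INITIALIZER};
    ChipQueue b = {.top = 0, .lock = PTHREAD_MUTEX_INITIALIZER};
    for (int i = 0; i < 8; i++) cq_push(&a, i);
    printf("stole %d tasks, next local pop: %d\n", cq_steal_half(&a, &b), cq_pop(&b));
    return 0;
}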
Existing methods for detecting RNA intermediates resulting from exonuclease degradation are low-throughput and laborious. In addition, mapping the 3' ends of RNA molecules to the genome after high-throughput sequencing is challenging, particularly if the 3' ends contain post-transcriptional modifications. To address these problems, we developed EnD-Seq, a high-throughput sequencing protocol that preserves the 3' end of RNA molecules, and AppEnD, a computational method for analyzing high-throughput sequencing data. Together these allow determination of the 3' ends of RNA molecules, including nontemplated additions. Applying EnD-Seq and AppEnD to histone mRNAs revealed that a significant fraction of cytoplasmic histone mRNAs end in one or two uridines, which have replaced the 1-2 nt at the 3' end of mature histone mRNA, maintaining the length of the histone transcripts. Histone mRNAs in fly embryos and ovaries show the same pattern, but with different tail nucleotid...
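Distinguishing templated from nontemplated bases requires alignment against the reference, which AppEnD handles; the toy below shows only the final step for the histone case, counting terminal U's (read as T in sequenced cDNA). Sequences and names are made up.

#include <stdio.h>
#include <string.h>

/* Count terminal T's on a read, a stand-in for nontemplated 3' U tails.
   Real analysis must first decide which 3' bases are templated. */
static int trailing_u(const char *read) {
    int n = 0, len = (int)strlen(read);
    while (n < len && read[len - 1 - n] == 'T') n++;
    return n;
}

int main(void) {
    const char *reads[] = { "ACGGCTCAGAGCCACCCAT", "ACGGCTCAGAGCCACCCATT" };
    for (int i = 0; i < 2; i++)
        printf("read %d: %d trailing U(s)\n", i, trailing_u(reads[i]));
    return 0;
}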
Frequent itemset mining is a popular and important first step in analyzing data sets across a broad range of applications. The traditional, "exact" approach for finding frequent itemsets requires that every item in the itemset occurs in each supporting transaction. However, real data is typically subject to noise, and in the presence of such noise, traditional itemset mining may fail
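One common way to state the relaxation the abstract hints at: a transaction supports an itemset if it contains at least a given fraction of its items rather than all of them. The 0.6 fraction below is an arbitrary illustrative choice, not the paper's definition.

#include <stdio.h>

static int contains(const int *t, int tn, int item) {
    for (int i = 0; i < tn; i++) if (t[i] == item) return 1;
    return 0;
}

/* frac = 1.0 reproduces exact support; frac < 1.0 tolerates missing items. */
static int supports(const int *t, int tn, const int *itemset, int k, double frac) {
    int hit = 0;
    for (int i = 0; i < k; i++) hit += contains(t, tn, itemset[i]);
    return (double)hit >= frac * (double)k;
}

int main(void) {
    int txn[] = {1, 2, 3, 5};        /* noise dropped item 4 from this basket */
    int itemset[] = {1, 2, 4};
    printf("exact support: %d, relaxed (60%%) support: %d\n",
           supports(txn, 4, itemset, 3, 1.0),
           supports(txn, 4, itemset, 3, 0.6));
    return 0;
}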
This article summarizes the Sigma program for molecular dynamics simulation and describes a generic web browser-based interface (“WASP”) applicable to programs with complex, hierarchical command structures. Use of the interface is illustrated with its application to the Sigma program (“Wigma”).
We present a novel parallel implementation of N-body gravitational simulation. Our algorithm uses graphics hardware to accelerate local computation, and is optimized to account for low bandwidth between the CPU and the graphics card, as well as low bandwidth across the network. The number of bodies that can be simulated with our implementation is limited only by the memory
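The kernel being accelerated is all-pairs gravitation: O(n²) arithmetic over O(n) data, which is why it tolerates the low CPU-GPU and network bandwidth the paper optimizes for. A plain C version (G = 1 units, invented softening constant):

#include <stdio.h>
#include <math.h>

#define N 4
typedef struct { double x, y, z, m; } Body;

/* Direct-summation accelerations; each body interacts with every other. */
static void accel(const Body *b, double a[][3]) {
    double eps2 = 1e-6;                       /* softening avoids divide-by-zero */
    for (int i = 0; i < N; i++) {
        a[i][0] = a[i][1] = a[i][2] = 0;
        for (int j = 0; j < N; j++) {
            if (j == i) continue;
            double dx = b[j].x - b[i].x, dy = b[j].y - b[i].y, dz = b[j].z - b[i].z;
            double r2 = dx*dx + dy*dy + dz*dz + eps2;
            double w = b[j].m / (r2 * sqrt(r2));
            a[i][0] += w * dx; a[i][1] += w * dy; a[i][2] += w * dz;
        }
    }
}

int main(void) {
    Body b[N] = { {0,0,0,1}, {1,0,0,1}, {0,1,0,1}, {0,0,1,1} };
    double a[N][3];
    accel(b, a);
    printf("a0 = (%g, %g, %g)\n", a[0][0], a[0][1], a[0][2]);
    return 0;
}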
[Garbled proceedings front matter: program committee listing (Mellor-Crummey, Rice University; Michael O'Boyle, University of Edinburgh; Paul Petersen, Intel; Keshav Pingali, University of Texas; Vivek Sarkar, Rice University) and contents entries including a paper by Nagarajan Kanna, Jaspal Subhlok, Edgar Gabriel, Eshwar Rohit, and David Anderson, and "The STAPL pList".]
Effective optimization of FPS Array Processor assembly language (APAL) is difficult. Instructions must be rearranged and consolidated to minimize periods during which the functional units remain idle or perform unnecessary tasks. Register conflicts and branches ...
Modern dialects of Fortran enjoy wide use and good support on high-performance computers as performance-oriented programming languages. By providing the ability to express nested data parallelism, modern Fortran dialects enable irregular computations to be incorporated into existing applications with minimal rewriting and without sacrificing performance within the regular portions of the application. Since performance of nested
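The shape of computation in question, an outer data-parallel loop whose inner extent varies per element, can be sketched in C with OpenMP (the paper itself targets Fortran dialects):

#include <stdio.h>

int main(void) {
    /* Ragged structure in CSR style: row lengths 1, 3, 5. */
    int rowptr[] = {0, 1, 4, 9};
    double val[] = {1, 1, 2, 3, 1, 2, 3, 4, 5};
    double rowsum[3];
    #pragma omp parallel for
    for (int i = 0; i < 3; i++) {
        double s = 0;
        for (int j = rowptr[i]; j < rowptr[i+1]; j++)  /* irregular inner extent */
            s += val[j];
        rowsum[i] = s;
    }
    printf("%g %g %g\n", rowsum[0], rowsum[1], rowsum[2]);
    return 0;
}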
Protein local structure comparison aims to recognize structural similarities between parts of proteins. It is an active topic in bioinformatics research, integrating computer science concepts in computational geometry and graph theory with empirical observations and physical principles from biochemistry. It has important biological applications, including protein function prediction. In this chapter, we provide an introduction to the protein local structure comparison problem including challenges and applications. Current approaches to the problem are reviewed. Particular consideration is given to the discovery of local structure common to a group of related proteins. We present a new algorithm for this problem that uses a graph-based representation of protein structure and finds recurring subgraphs among a group of protein graphs.
In recent years technological advances have made the construction of large-scale parallel computers economically attractive. These machines have the potential to provide fast solutions to computationally demanding problems that arise in computational science, real-time control, computer simulation, large database manipulation and other areas. However, applications that exploit this performance potential have been slow to appear; such applications have proved exceptionally
this article. Space-filling, ball-stick, or solvent-accessible-surface representations may seem more attractive; however, these presentations can slow down graphical response. When running computation and display on separate platforms, as we used to do (and may again, depending on available machines), network bandwidth is a potential concern. The upper bound of computing performance on our system is about 15,000 atomic updates per second, which corresponds to about 180
