Evaluation of compiler optimizations for Fortran D on MIMD distributed memory machines
The Fortran D compiler uses data decomposition specifications to automatically translate Fortran programs for execution on MIMD distributed-memory machines. This paper introduces and classifies a number of advanced optimizations needed to achieve ...
Evaluation of compiler generated parallel programs on three multicomputers
Distributed memory parallel processors (DMPPs) have no hardware support for a global address space. However, conventional programs written in a sequential imperative language such as Fortran typically manipulate a few large arrays. The Oxygen compiler, ...
Automatic data mapping for distributed-memory parallel computers
The performance of a program on a distributed-memory parallel computer depends on the algorithms employed, the structure and speed of the machine's communication network, and the ways in which data are distributed to the processors. This paper addresses ...
Characterizing memory performance in vector multiprocessors
We propose a set of three memory performance measures directed at vector multiprocessors. One is the port reservation time, which is closely related to the commonly used memory bandwidth measure. The second, the vector fill time, is the latency ...
Performance analysis of the CM-2, a massively parallel SIMD computer
The performance evaluation process for a massively parallel distributed memory SIMD computer is described generally. The performance in basic computation, grid communication, and computation with grid communication is analyzed. A practical performance ...
Evaluation of the lock mechanism in a snooping cache
This paper discusses the design concepts of a lock mechanism for a Parallel Inference Machine (the PIM/c prototype) and investigates the performance of the mechanism in detail.
Lock operations are extremely frequent on the PIM; however, lock contention ...
Processor allocation and loop scheduling on multiprocessor computers
This paper is concerned with the automatic exploitation of the parallelism detected in a sequential program. The target machine is a shared memory multiprocessor.
The main goal is minimizing the completion time of the program. To achieve this, one has ...
Low level scheduling using the hierarchical task graph
This paper introduces a new efficient instruction scheduling algorithm that can schedule across basic blocks. Scheduling globally, across basic blocks, is done by using an extension of the control flow graph (CFG) that combines both data and control ...
Deriving good transformations for mapping nested loops on hierarchical parallel machines in polynomial time
We present a computationally efficient method for deriving the most appropriate transformation and mapping of a nested loop for a given hierarchical parallel machine. This method is in the context of our systematic and general theory of unimodular loop ...
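The defining property a method like this relies on is that a unimodular transformation (integer matrix with determinant ±1) permutes the iteration space bijectively. The sketch below is a hypothetical illustration of that property, assuming a 2-deep rectangular nest; the matrix T is an arbitrary skew-plus-interchange, not one derived by the paper's method:

```python
# Illustration of a unimodular loop transformation on a 2-deep nest.
# T = [[0, 1], [1, 1]] maps (i, j) -> (j, i + j); det(T) = -1, so T is
# unimodular and the mapping loses and duplicates no iterations.

N = 4  # iteration-space extent in each dimension (illustrative)

def transform(i, j):
    return (j, i + j)

original = {(i, j) for i in range(N) for j in range(N)}
transformed = {transform(i, j) for (i, j) in original}

# Bijectivity on the iteration space: no two iterations collapse.
print(len(original), len(transformed))   # 16 16
```

Because the mapping is a bijection on integer points, the transformed nest executes exactly the original iterations in a new order, which is what makes searching over unimodular matrices a legality-preserving way to choose a loop ordering.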
ABCL/onEM-4: a new software/hardware architecture for object-oriented concurrent computing on an extended dataflow supercomputer
The trend towards object-oriented software construction is becoming more and more prevalent, and parallel programming cannot be an exception. In the context of parallel computation, it is often natural to model the computation as message passing between ...
Tolerating data access latency with register preloading
By exploiting fine grain parallelism, superscalar processors can potentially increase the performance of future supercomputers. However, supercomputers typically have a long access delay to their first level memory which can severely restrict the ...
Supercomputing and transputers
We study the degree to which parallel supercomputers can be scaled. Necessary measures for achieving maximum scalability are discussed, and a case study is presented. To this purpose, a new class of “supermassively parallel architectures” is ...
Automatic software cache coherence through vectorization
Access latency in large-scale shared-memory multiprocessors is a concern since most (if not all) memory is one or more hops away through an interconnection network. Providing processors with one or more levels of cache is an accepted way to reduce the ...
Life span strategy—a compiler-based approach to cache coherence
In this paper, a cache coherence strategy with a combined software and hardware approach is proposed for large-scale multiprocessor systems. The new strategy has the scalability advantages of existing software strategies and does not rely on shared ...
Conflict-free access of vectors with power-of-two strides
An address mapping and an access order is presented for conflict-free access to vectors with any initial address and power-of-two strides. We show that for this conflict-free access it is necessary that the memory be unmatched and present an ...
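The paper's specific address mapping is not reproduced in the snippet above, but the general idea of skewing bank assignments so that power-of-two strides do not pile onto one module can be shown with a classic XOR-interleaving scheme. This is an assumed illustration of the problem setting, not the paper's mapping:

```python
# Hedged illustration of skewed memory-bank assignment, assuming 8 banks.
# bank(addr) = (addr XOR (addr >> log2(BANKS))) mod BANKS is a well-known
# XOR-interleaving; it is NOT the mapping proposed in the paper.

BANKS = 8        # number of memory modules (power of two)
LOG_BANKS = 3

def bank(addr):
    return (addr ^ (addr >> LOG_BANKS)) % BANKS

# With plain interleaving (addr % BANKS), a stride-8 access pattern hits a
# single bank 8 times in a row. Under the XOR skew, power-of-two strides
# spread across all eight banks:
for stride in (1, 2, 4, 8):
    banks_hit = {bank(stride * k) for k in range(BANKS)}
    print(stride, len(banks_hit))   # each stride touches all 8 banks
```

A plain low-order interleave already handles stride 1; the point of skewing is that strides 2, 4, and 8 also become conflict-free, at the cost of a slightly more complex address decoder.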
Parallel program visualization using SIEVE.1
In this paper we introduce a new model for the design of performance analysis and visualization tools. The system integrates static code analysis, relational database designs and a spreadsheet model of interactive programming. This system provides a ...
The CODE 2.0 graphical parallel programming language
CODE 2.0 is a graphical parallel programming system that targets the three goals of ease of use, portability, and production of efficient parallel code. Ease of use is provided by an integrated graphical/textual interface, a powerful dynamic model of ...
Paralex: an environment for parallel programming in distributed systems
Modern distributed systems consisting of powerful workstations and high-speed interconnection networks are an economical alternative to special-purpose supercomputers. The technical issues that need to be addressed in exploiting the parallelism ...
Exploiting heterogeneous parallelism on a multithreaded multiprocessor
This paper describes an integrated architecture, compiler, runtime, and operating system solution to exploiting heterogeneous parallelism. The architecture is a pipelined multi-threaded multiprocessor, enabling the execution of very fine (multiple ...
An architectural framework for migration from CISC to higher performance platforms
We describe a novel architectural framework that allows software applications written for a given Complex Instruction Set Computer (CISC) to migrate to a different, higher performance architecture, without a significant investment on the part of the ...
Manchester data-flow: a progress report
The Manchester Data-Flow Machine, MDFM, has evolved continuously during the past decade. By the time the prototype uniprocessor hardware system was decommissioned, in 1989, the putative multi-processor architecture comprised separate Processing Elements ...
Array abstractions using semantic analysis of trapezoid congruences
With the growing use of vector supercomputers, efficient and accurate data structure analyses are needed. What we propose in this paper is to use the quite general framework of Cousot's abstract interpretation for the particular analysis of multi-...
A comprehensive approach to parallel data flow analysis
We present a comprehensive approach to performing data flow analysis in parallel. We first identify three types of parallelism inherent in the data flow solution process: independent-problem parallelism, separate-unit parallelism and algorithmic ...
Compile-time analysis of communicating processes
We present an algorithm for analyzing deadlock and for constructing sequentializations of a class of communicating sequential processes. The algorithm may be used for deadlock detection in parallel and distributed programs at compile time, or for ...
Register requirements of pipelined processors
To enable concurrent instruction execution, scientific computers generally rely on pipelining, which combines with faster system clocks to achieve greater throughput. Each concurrently executing instruction requires buffer space, usually implemented as ...
Benchmarking a vector-processor prototype based on multithreaded streaming/FIFO vector (MSFV) architecture
This paper presents the benchmark results on a vector-processor prototype based on the MSFV (multithreaded streaming/FIFO vector) architecture. The MSFV architecture is single-chip oriented, and thus its main objective is to save the off-chip memory ...
On storage schemes for parallel array access
In parallel matrix manipulation operations, some data patterns need to be accessed in one memory cycle without conflict. Investigating the frequently used data patterns, we propose a powerful skewing scheme which allows most frequently used data ...
A general algorithm for data dependence analysis
With the development of ever more sophisticated data flow analysis algorithms, traditional data dependence tests based on elementary loop information will not be sufficient in the future. In this paper, quite general algorithms are presented for solving ...
On exact data dependence analysis
The GCD test and the Banerjee-Wolfe test are the two tests traditionally used to determine statement data dependence, subject to direction vectors, in automatic vectorization / parallelization of loops. In an earlier study [14] a sufficient condition ...
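The GCD test mentioned in this abstract has a compact textbook form for single-subscript affine accesses; the sketch below shows that form, not the paper's refinement, and the example subscripts are made up for illustration:

```python
from math import gcd

# Textbook GCD dependence test: for accesses A[a*i + b] and A[c*j + d]
# inside a loop, an integer solution to a*i - c*j = d - b exists iff
# gcd(a, c) divides (d - b). If it does not, the references are provably
# independent; if it does, dependence is merely *possible* -- the test is
# conservative, since it ignores loop bounds and direction vectors.
def gcd_test(a, b, c, d):
    return (d - b) % gcd(a, c) == 0

# A[2*i] vs A[2*i + 1]: even vs odd subscripts can never coincide.
print(gcd_test(2, 0, 2, 1))   # False -> provably no dependence
# A[4*i] vs A[2*j + 2]: gcd(4, 2) = 2 divides 2, so dependence is possible.
print(gcd_test(4, 0, 2, 2))   # True -> dependence cannot be ruled out
```

The second case is exactly where the conservatism bites: the GCD test says "maybe", and exact tests of the kind this paper studies are needed to give a definite answer.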
Array privatization for parallel execution of loops
In recent experiments, array privatization played a critical role in successful parallelization of several real programs. This paper presents compiler algorithms for the program analysis for this transformation. The paper also addresses issues in the ...
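The loop pattern that array privatization targets can be sketched in a few lines. This is a hypothetical example, not one from the paper: the work array `t` is completely written before it is read in every outer iteration, so each iteration may receive a private copy, removing the loop-carried anti- and output dependences that would otherwise serialize the loop:

```python
# Before privatization: one shared work array t serializes the i-loop,
# because every iteration overwrites t (output/anti dependences on t).
def with_shared_temp(a, m=4):
    t = [0.0] * m
    out = []
    for x in a:
        for k in range(m):          # t is fully defined before any use
            t[k] = x * k
        out.append(sum(t))
    return out

# After privatization: t is local to each iteration, so the iterations of
# the outer loop are independent and could run in parallel.
def with_private_temp(a, m=4):
    out = []
    for x in a:
        t = [x * k for k in range(m)]   # private copy per iteration
        out.append(sum(t))
    return out

data = [1.0, 2.0, 3.0]
print(with_shared_temp(data) == with_private_temp(data))   # True
```

The compiler analysis the abstract refers to must prove the "defined before used on every path" property for each outer iteration; only then is handing each iteration its own copy of the array semantics-preserving.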
Proceedings of the 6th international conference on Supercomputing