Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
Volume 17, Issue 2April 1989Special issue: Proceedings of ASPLOS-III: the third international conference on architecture support for programming languages and operating systems
Bibliometrics
Skip Table Of Content Section
article
Free
Architecture and compiler tradeoffs for a long instruction wordprocessor

A very long instruction word (VLIW) processor exploits parallelism by controlling multiple operations in a single instruction word. This paper describes the architecture and compiler tradeoffs in the design of iWarp, a VLIW single-chip microprocessor ...

article
Free
Tradeoffs in instruction format design for horizontal architectures

With recent improvements in software techniques and the enhanced level of fine grain parallelism made available by such techniques, there has been an increased interest in horizontal architectures and large instruction words that are capable of issuing ...

article
Free
Overlapped loop support in the Cydra 5

The CydraTM 5 architecture adds unique support for overlapping successive iterations of a loop to a very long instruction word (VLIW) base. This architecture allows highly parallel loop execution for a much larger class of loops than can be vectorized, ...

article
Free
Architectural support for synchronous task communication

This paper describes the motivation for a set of intertask communication primitives, the hardware support of these primitives, the architecture used in the Sylvan project which studies these issues, and the experience gained from various experiments ...

article
Free
The fuzzy barrier: a mechanism for high speed synchronization of processors

Parallel programs are commonly written using barriers to synchronize parallel processes. Upon reaching a barrier, a processor must stall until all participating processors reach the barrier. A software implementation of the barrier mechanism using ...

article
Free
Efficient synchronization primitives for large-scale cache-coherent multiprocessors

This paper proposes a set of efficient primitives for process synchronization in multiprocessors. The only assumptions made in developing the set of primitives are that hardware combining is not implemented in the inter-connect, and (in one case) that ...

article
Free
A software instruction counter

Although several recent papers have proposed architectural support for program debugging and profiling, most processors do not yet provide even basic facilities, such as an instruction counter. As a result, system developers have been forced to invent ...

article
Free
Efficient debugging primitives for multiprocessors

Existing kernel-level debugging primitives are inappropriate for instrumenting complex sequential or parallel programs. These functions incur a heavy overhead in their use of system calls and process switches. Context switches are used to alternately ...

article
Free
Sheaved memory: architectural support for state saving and restoration in pages systems

The concept of read-one/write-many paged memory is introduced and given the name sheaved memory. It is shown that sheaved memory is useful for efficiently maintaining checkpoints in main memory and for providing state saving and state restoration for ...

article
Free
Reference history, page size, and migration daemons in local/remote architectures

We address the problem of paged main memory management in the local/remote architecture subclass of shared memory multiprocessors. We consider the case where the operating system has primary responsibility and uses page migration as its main tool. We ...

article
Free
Translation lookaside buffer consistency: a software approach

We discuss the translation lookaside buffer (TLB) consistency problem for multiprocessors, and introduce the Mach shootdown algorithm for maintaining TLB consistency in software. This algorithm has been implemented on several multiprocessors, and is in ...

article
Free
Failure correction techniques for large disk arrays

The ever increasing need for I/O bandwidth will be met with ever larger arrays of disks. These arrays require redundancy to protect against data loss. This paper examines alternative choices for encodings, or codes, that reliably store information in ...

article
Free
A unified vector/scalar floating-point architecture

In this paper we present a unified approach to vector and scalar computation, using a single register file for both scalar operands and vector elements. The goal of this architecture is to yield improved scalar performance while broadening the range of ...

article
Free
Data buffering: run-time versus compile-time support

Data-dependency, branch, and memory-access penalties are main constraints on the performance of high-speed microprocessors. The memory-access penalties concern both penalties imposed by external memory (e.g. cache) or by under utilization of the local ...

article
Free
article
Free
A real-time support processor for ada tasking

Task synchronization in Ada causes excessive run-time overhead due to the complex semantics of the rendezvous. To demonstrate that the speed can be increased by two orders of magnitude by using special purpose hardware, a single chip VLSI support ...

article
Free
The runtime environment for Scheme, a Scheme implementation on the 88000

We are implementing a Scheme development system for the Motorola 88000. The core of the implementation is an optimizing native code compiler, together with a carefully designed runtime system. This paper describes our experiences with the 88000 as a ...

article
Free
Program optimization for instruction caches

This paper presents an optimization algorithm for reducing instruction cache misses. The algorithm uses profile information to reposition programs in memory so that a direct-mapped cache behaves much like an optimal cache with full associativity and ...

article
Free
Using registers to optimize cross-domain call performance

This paper describes a new technique to improve the performance of cross-domain calls and returns in a capability-based computer system. Using register optimization information obtained from the compiler, a trusted linker can minimize the number of ...

article
Free
The design of nectar: a network backplane for heterogeneous multicomputers

Nectar is a “network backplane” for use in heterogeneous multicomputers. The initial system consists of a star-shaped fiber-optic network with an aggregate bandwidth of 1.6 gigabits/second and a switching latency of 700 nanoseconds. The system can be ...

article
Free
A message driven OR-parallel machine

A message driven architecture for the execution of OR-parallel logic languages is proposed. The computational model is based on well known compilation techniques for Logic Languages. We present first the multiple binding mechanism for the OR-parallel ...

article
Free
Evaluating the performance of software cache coherence

In a shared-memory multiprocessor with private caches, cached copies of a data item must be kept consistent. This is called cache coherence. Both hardware and software coherence schemes have been proposed. Software techniques are attractive because they ...

article
Free
Analysis of cache invalidation patterns in multiprocessors

To make shared-memory multiprocessors scalable, researchers are now exploring cache coherence protocols that do not rely on broadcast, but instead send invalidation messages to individual caches that contain stale data. The feasibility of such directory-...

article
Free
The effect of sharing on the cache and bus performance of parallel programs

Bus bandwidth ultimately limits the performance, and therefore the scale, of bus-based, shared memory multiprocessors. Previous studies have extrapolated from uniprocessor measurements and simulations to estimate the performance of these machines. In ...

article
Free
Available instruction-level parallelism for superscalar and superpipelined machines

Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to ...

article
Free
Micro-optimization of floating-point operations

This paper describes micro-optimization, a technique for reducing the operation count and time required to perform floating-point calculations. Micro-optimization involves breaking floating-point operations into their constituent micro-operations and ...

article
Free
Limits on multiple instruction issue

This paper investigates the limitations on designing a processor which can sustain an execution rate of greater than one instruction per cycle on highly-optimized, non-scientific applications. We have used trace-driven simulations to determine that ...

Subjects

Comments