- Sponsor:
- SIGARCH
Evaluating the performance of four snooping cache coherency protocols
Write-invalidate and write-broadcast coherency protocols have been criticized for being unable to achieve good bus performance across all cache configurations. In particular, write-invalidate performance can suffer as block size increases; and large ...
Multi-level shared caching techniques for scalability in VMP-M/C
The problem of building a scalable shared memory multiprocessor can be reduced to that of building a scalable memory hierarchy, assuming interprocessor communication is handled by the memory system. In this paper, we describe the VMP-MC design, a ...
Design and performance of a coherent cache for parallel logic programming architectures
This paper describes the design and performance of a tightly-coupled shared-memory coherent cache optimized for the execution of parallel logic programming architectures. The cache utilizes a copy-back write-allocation protocol having five states and a ...
The Epsilon dataflow processor
The εpsilon dataflow architecture is designed for high speed uniprocessor execution as well as for parallel operation in a multiprocessor system. The εpsilon architecture directly matches ready operands, thus eliminating the need for associative ...
An architecture of a dataflow single chip processor
A highly parallel (more than a thousand) dataflow machine EM-4 is now under development. The EM-4 design principle is to construct a high performance computer using a compact architecture by overcoming several defects of dataflow machines. Constructing ...
Exploiting data parallelism in signal processing on a dataflow machine
This paper will show that the massive data parallelism inherent to most signal processing tasks may be easily mapped onto the parallel structure of a data flow machine. A special system called STRUCTFLOW has been designed to optimize the static data ...
Architectural mechanisms to support sparse vector processing
We discuss the algorithmic steps involved in common sparse matrix problems, with particular emphasis on linear programming by the revised simplex method. We then propose new architectural mechanisms which are being built into an experimental machine, ...
A dynamic storage scheme for conflict-free vector access
Previous investigations into data storage schemes have focused on finding a storage scheme that permits conflict-free access for a set of frequently encountered access patterns. This paper considers an alternative approach. Rather than forcing a single ...
SIMP (Single Instruction stream/Multiple instruction Pipelining): a novel high-speed single-processor architecture
SIMP is a novel multiple instruction-pipeline parallel architecture. It is targeted for enhancing the performance of SISD processors drastically by exploiting both temporal and spatial parallelisms, and for keeping program compatibility as well. Degree ...
2-D SIMD algorithms in the perfect shuffle networks
This paper studies a set of basic algorithms for SIMD Perfect Shuffle networks. These algorithms were studied in several papers, but for the 1-D case, where the size of the problem N is the same as the number of processors P. For the 2-D case of N = L *...
Systematic hardware adaptation of systolic algorithms
In this paper we propose a methodology to adapt Systolic Algorithms to the hardware selected for their implementation. Systolic Algorithms obtained can be efficiently implemented using Pipelined Functional Units. The methodology is based on two ...
Task migration in hypercube multiprocessors
Allocation and deallocation of subcubes usually result in a fragmented hypercube where even if a sufficient number of hypercube nodes are available, they do not form a subcube large enough to execute an incoming task. As the fragmentation in ...
Characteristics of performance-optimal multi-level cache hierarchies
The increasing speed of new generation processors will exacerbate the already large difference between CPU cycle times and main memory access times. As this difference grows, it will be increasingly difficult to build single-level caches that are both ...
Supporting reference and dirty bits in SPUR's virtual address cache
Virtual address caches can provide faster access times than physical address caches, because translation is only required on cache misses. However, because we don't check the translation information on each cache access, maintaining reference and dirty ...
Inexpensive implementations of set-associativity
The traditional approach to implementing wide set-associativity is expensive, requiring a wide tag memory (directory) and many comparators. Here we examine alternative implementations of associativity that use hardware similar to that used to implement ...
Organization and performance of a two-level virtual-real cache hierarchy
We propose and analyze a two-level cache organization that provides high memory bandwidth. The first-level cache is accessed directly by virtual addresses. It is small, fast, and, without the burden of address translation, can easily be optimized to ...
High performance communications in processor networks
In order to provide an arbitrary and fully dynamic connectivity in a network of processors, transport mechanisms must be implemented, which provide the propagation of data from processor to processor, based on addresses contained within a packet of ...
Introducing memory into the switch elements of multiprocessor interconnection networks
As VLSI technology continues to improve, circuit area is gradually being replaced by pin restrictions as the limiting factor in design. Thus, it is reasonable to anticipate that on-chip memory will become increasingly inexpensive since it is a simple, ...
Using feedback to control tree saturation in multistage interconnection networks
In this paper, we propose the use of feedback schemes in multiprocessors which use an interconnection network with distributed routing control. We show that by altering system behavior so as to minimize the occurrence of a performance-degrading ...
Constructing replicated systems using processors with point-to-point communication links
Replicated processing with majority voting is a well known method of achieving fault tolerance. We consider the problem of constructing a distributed system composed of an arbitrarily large number of N-modular redundant (NMR) nodes, where each node ...
KCM: a knowledge crunching machine
- H. Benker,
- J. M. Beacco,
- M. Dorochevsky,
- Th. Jeffré,
- A. Pöhlmann,
- J. Noyé,
- B. Poterie,
- J. C. Syre,
- O. Thibault,
- G. Watzlawik
KCM (Knowledge Crunching Machine) is a high-performance back-end processor which, coupled to a UNIX desk-top workstation, provides a powerful and user-friendly Prolog environment catering for both development and execution of significant Prolog ...
A high performance Prolog processor with multiple function units
We describe the Parallel Unification Machine (PLUM), a Prolog processor that exploits fine grain parallelism using multiple function units executing in parallel. In most cases the execution of bookkeeping instructions is almost completely overlapped by ...
Evaluation of memory system for integrated Prolog processor IPP
This paper discusses an optimal memory system to realize a high performance integrated Prolog processor, the IPP. First, the memory access characteristics of Prolog are analyzed by a simulator, which simulates the execution of a Prolog program at a ...
A type driven hardware engine for Prolog clause retrieval over a large knowledge base
Whereas existing Prolog systems are very effective at handling small knowledge bases, they are not very efficient at and often incapable of handling large sets of clauses. Large knowledge bases which may comprise millions of clauses and are shared by a ...
Comparing software and hardware schemes for reducing the cost of branches
Pipelining has become a common technique to increase throughput of the instruction fetch, instruction decode, and instruction execution portions of modern computers. Branch instructions disrupt the flow of instructions through the pipeline, increasing ...
Improving performance of small on-chip instruction caches
Most current single-chip processors employ an on-chip instruction cache to improve performance. A miss in this instruction cache will cause an external memory reference which must compete with data references for access to the external memory, thus ...
Achieving high instruction cache performance with an optimizing compiler
Increasing the execution power requires a high instruction issue bandwidth, and decreasing instruction encoding and applying some code improving techniques cause code expansion. Therefore, the instruction memory hierarchy performance has become an ...
The impact of code density on instruction cache performance
The widespread use of reduced-instruction-set computers has generated a lot of interest in the tradeoff between the density of an instruction set and the size of the instruction cache. In this paper we present and justify a method that predicts the ...
Can dataflow subsume von Neumann computing?
We explore the question: “What can a von Neumann processor borrow from dataflow to make it more suitable for a multiprocessor?” Starting with a simple, “RISC-like” instruction set, we show how to change the underlying processor organization to make it ...
Exploring the benefits of multiple hardware contexts in a multiprocessor architecture: preliminary results
A fundamental problem that any scalable multiprocessor must address is the ability to tolerate high latency memory operations. This paper explores the extent to which multiple hardware contexts per processor can help to mitigate the negative effects of ...
Index Terms
- Proceedings of the 16th annual international symposium on Computer architecture