article
Free
Separating data and control transfer in distributed operating systems

Advances in processor architecture and technology have resulted in workstations in the 100+ MIPS range. As well, newer local-area networks such as ATM promise a ten- to hundred-fold increase in throughput, much reduced latency, greater scalability, and ...

article
Free
Scheduling and page migration for multiprocessor compute servers

Several cache-coherent shared-memory multiprocessors have been developed that are scalable and offer a very tight coupling between the processing resources. They are therefore quite attractive for use as compute servers for multiprogramming and parallel ...

article
Free
Reactive synchronization algorithms for multiprocessors

Synchronization algorithms that are efficient across a wide range of applications and operating conditions are hard to design because their performance depends on unpredictable run-time factors. The designer of a synchronization algorithm has a choice ...

article
Free
Integration of message passing and shared memory in the Stanford FLASH multiprocessor

The advantages of using message passing over shared memory for certain types of communication and synchronization have provided an incentive to integrate both models within a single architecture. A key goal of the FLASH (FLexible Architecture for SHared ...

article
Free
Software overhead in messaging layers: where does the time go?

Despite improvements in network interfaces and software messaging layers, software communication overhead still dominates the hardware routing cost in most systems. In this study, we identify the sources of this overhead by analyzing software costs of ...

article
Free
Where is time spent in message-passing and shared-memory programs?

Message passing and shared memory are two techniques parallel programs use for coordination and communication. This paper studies the strengths and weaknesses of these two mechanisms by comparing equivalent, well-written message-passing and shared-...

article
Free
Performance of a hardware-assisted real-time garbage collector

Hardware-assisted real-time garbage collection offers high throughput and small worst-case bounds on the times required to allocate dynamic objects and to access the memory contained within previously allocated objects. Whether the proposed technology ...

article
Free
eNVy: a non-volatile, main memory storage system

This paper describes the architecture of eNVy, a large non-volatile main memory storage system built primarily with Flash memory. eNVy presents its storage space as a linear, memory mapped array rather than as an emulated disk in order to provide an ...

article
Free
Resource allocation in a high clock rate microprocessor

This paper discusses the design of a high clock rate (300MHz) processor. The architecture is described, and the goals for the design are explained. The performance of three processor models is evaluated using trace-driven simulation. A cost model is ...

article
Free
Hardware and software support for efficient exception handling

Program-synchronous exceptions, for example, breakpoints, watchpoints, illegal opcodes, and memory access violations, provide information about exceptional conditions, interrupting the program and vectoring to an operating system handler. Over the last ...

article
Free
A technique for monitoring run-time dynamics of an operating system and a microprocessor executing user applications

In this paper, we present a non-invasive and efficient technique for simulating applications complete with their operating system interaction. The technique involves booting and initiating an application on a hardware development system, capturing the ...

article
Free
Trap-driven simulation with Tapeworm II

Tapeworm II is a software-based simulation tool that evaluates the cache and TLB performance of multiple-task and operating system intensive workloads. Tapeworm resides in an OS kernel and causes a host machine's hardware to drive simulations with ...

article
Free
Contrasting characteristics and cache performance of technical and multi-user commercial workloads

Experience has shown that many widely used benchmarks are poor predictors of the performance of systems running commercial applications. Research into this anomaly has long been hampered by a lack of address traces from representative multi-user ...

article
Free
Avoiding conflict misses dynamically in large direct-mapped caches

This paper describes a method for improving the performance of a large direct-mapped cache by reducing the number of conflict misses. Our solution consists of two components: an inexpensive hardware device called a Cache Miss Lookaside (CML) buffer that ...

article
Free
Surpassing the TLB performance of superpages with less operating system support

Many commercial microprocessor architectures have added translation lookaside buffer (TLB) support for superpages. Superpages differ from segments because their size must be a power of two multiple of the base page size and they must be aligned in both ...

article
Free
Dynamic memory disambiguation using the memory conflict buffer

To exploit instruction level parallelism, compilers for VLIW and superscalar processors often employ static code scheduling. However, the available code reordering may be severely restricted due to ambiguous dependences between memory instructions. This ...

article
Free
AP1000+: architectural support of PUT/GET interface for parallelizing compiler

The scalability of distributed-memory parallel computers makes them attractive candidates for solving large-scale problems. New languages, such as HPF, FortranD, and VPP Fortran, have been developed to enable existing software to be easily ported to ...

article
Free
LCM: memory system support for parallel language implementation

Higher-level parallel programming languages can be difficult to implement efficiently on parallel machines. This paper shows how a flexible, compiler-controlled memory system can help achieve good performance for language constructs that previously ...

article
Free
The performance advantages of integrating block data transfer in cache-coherent multiprocessors

Integrating support for block data transfer has become an important emphasis in recent cache-coherent shared address space multiprocessors. This paper examines the potential performance benefits of adding this support. A set of ambitious hardware ...

article
Free
Improving the accuracy of static branch prediction using branch correlation

Recent work in history-based branch prediction uses novel hardware structures to capture branch correlation and increase branch prediction accuracy. We present a profile-based code transformation that exploits branch correlation to improve the accuracy ...

article
Free
Reducing branch costs via branch alignment

Several researchers have proposed algorithms for basic block reordering. We call these branch alignment algorithms. The primary emphasis of these algorithms has been on improving instruction cache locality, and the few studies concerned with branch ...

article
Free
Compiler optimizations for improving data locality

In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present ...

article
Free
DCG: an efficient, retargetable dynamic code generation system

Dynamic code generation allows aggressive optimization through the use of runtime information. Previous systems typically relied on ad hoc code generators that were not designed for retargetability, and did not shield the client from machine-specific ...

article
Free
The performance impact of flexibility in the Stanford FLASH multiprocessor

A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In the Stanford ...

article
Free
Simple compiler algorithms to reduce ownership overhead in cache coherence protocols

We study in this paper the design and efficiency of compiler algorithms that remove ownership overhead in shared-memory multiprocessors with write-invalidate protocols. These algorithms detect loads followed by stores to the same address. Such loads are ...

article
Free
Fine-grain access control for distributed shared memory

This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper ...

article
Free
Interleaving: a multithreading technique targeting multiprocessors and workstations

There is an increasing trend to use commodity microprocessors as the compute engines in large-scale multiprocessors. However, given that the majority of the microprocessors are sold in the workstation market, not in the multiprocessor market, it is only ...

article
Free
Hardware support for fast capability-based addressing

Traditional methods of providing protection in memory systems do so at the cost of increased context switch time and/or increased storage to record access permissions for processes. With the advent of computers that supported cycle-by-cycle ...

article
Free
The effectiveness of multiple hardware contexts

Multithreaded processors are used to tolerate long memory latencies. By executing threads loaded in multiple hardware contexts, an otherwise idle processor can keep busy, thus increasing its utilization. However, the larger size of a multi-thread ...
