Scott Mahlke

Abstract: An apparatus and method capable of reducing idle resources in a multicore device and improving the use of available resources in the multicore device are provided. The apparatus includes a static scheduling unit configured to generate one or more task groups, and to allocate the task groups to virtual cores by dividing or combining the tasks included in the task groups based on the execution time estimates of the task groups.
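As a rough illustration only (the abstract does not give the algorithm; the function names, the greedy longest-processing-time heuristic, and the single-number execution-time estimates below are all assumptions), a Python sketch of allocating task groups to virtual cores by estimated execution time:

from heapq import heapify, heappop, heappush

def allocate_task_groups(task_groups, num_virtual_cores):
    """Greedily assign task groups to virtual cores by execution-time estimate.

    task_groups: dict mapping group name -> estimated execution time.
    Returns dict mapping virtual core index -> list of group names.
    This is a plain longest-processing-time heuristic, not the patented method.
    """
    # Min-heap of (accumulated load, core index) so the least-loaded core is chosen first.
    cores = [(0.0, core) for core in range(num_virtual_cores)]
    heapify(cores)
    assignment = {core: [] for core in range(num_virtual_cores)}

    # Place the largest estimates first to keep the final loads balanced.
    for group, estimate in sorted(task_groups.items(), key=lambda item: -item[1]):
        load, core = heappop(cores)
        assignment[core].append(group)
        heappush(cores, (load + estimate, core))
    return assignment

if __name__ == "__main__":
    groups = {"g0": 8.0, "g1": 3.0, "g2": 5.0, "g3": 2.0}
    print(allocate_task_groups(groups, num_virtual_cores=2))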
S. Abraham G. Adams A. Agarwal D. Agrawal G. Alvarez R. Alverson C. Amza C. Anderson M. Annavaram J. Archibald J. Arora K. Asanovic T. Austin D. Bacon S. Bagchi N. Bagerzadeh H. Bal S. Banerjia M. Banikazemi L. Barroso S. Basu J. Bennett D. Bhandarkar R. Bianchini A. Bilas B. Black D. Blough K. Bolding R. Boppana S. Breach M. Brorsson E. Bugnion D. Burger B. Calder C. Cascaval J. Cavallaro D. Chaiken PY. Chang P. Cao J. Chapin M. Charney B. Chen CH. Chen J. Chen TF. Chen W. Chen YK. Chen TC. Chiueh S. Cho F. Chong P. Chou
An integrated circuit is provided with latency detecting circuitry for detecting signal generation latency within one or more functional circuits and generating a wearout response in response thereto. The wearout response can take a variety of forms, such as reducing the operating frequency, increasing the operating voltage, altering task allocation within a multiprocessor system, manufacturing test binning, and other wearout responses.
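Purely as a software analogue of the idea (the patent describes hardware circuitry; the thresholds and response labels below are invented), a sketch of mapping a measured latency margin to a wearout response:

def wearout_response(measured_latency_ns, nominal_latency_ns,
                     warn_margin=0.05, critical_margin=0.15):
    """Map a measured signal-generation latency to a coarse wearout response.

    Thresholds are illustrative only. Returns one of several response labels
    loosely mirroring those listed in the abstract (frequency, voltage,
    task allocation).
    """
    slowdown = (measured_latency_ns - nominal_latency_ns) / nominal_latency_ns
    if slowdown < warn_margin:
        return "no_action"
    if slowdown < critical_margin:
        # Mild degradation: trade a little performance for timing slack.
        return "reduce_operating_frequency"
    # Severe degradation: raise voltage and steer work away from this core.
    return "increase_voltage_and_reallocate_tasks"

if __name__ == "__main__":
    print(wearout_response(1.12, 1.00))   # -> reduce_operating_frequency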
Abstract: A compiler includes a branch statistics data analyzer to analyze branch statistics data of a branch instruction to construct a branch predictor function for the branch instruction. A branch prediction instruction generator is coupled to the branch statistics data analyzer to generate at least one prediction instruction to implement the branch predictor function. A main compiling engine is coupled to the branch prediction instruction generator to insert the prediction instruction before the branch instruction.
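A hedged sketch of the overall flow in Python (the trivial bias-based predictor and the PREDICT pseudo-instruction are placeholders of my own, not the patented encoding):

def build_predictor(branch_stats):
    """Derive a very simple predictor function from branch statistics.

    branch_stats maps a branch id to (times_taken, times_not_taken).
    The 'predictor function' here is just a static taken/not-taken bias,
    far simpler than whatever a real implementation would construct.
    """
    predictors = {}
    for branch_id, (taken, not_taken) in branch_stats.items():
        predictors[branch_id] = taken >= not_taken
    return predictors

def insert_prediction_instructions(instructions, predictors):
    """Insert a hypothetical PREDICT pseudo-instruction before each branch."""
    out = []
    for instr in instructions:
        if instr["op"] == "branch" and instr["id"] in predictors:
            hint = "taken" if predictors[instr["id"]] else "not_taken"
            out.append({"op": "predict", "id": instr["id"], "hint": hint})
        out.append(instr)
    return out

if __name__ == "__main__":
    stats = {"b1": (90, 10)}
    code = [{"op": "add"}, {"op": "branch", "id": "b1", "target": "L2"}]
    for instr in insert_prediction_instructions(code, build_predictor(stats)):
        print(instr)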
We set a goal of five reviews for each paper and largely met this goal. Most papers received five reviews and none received fewer than four, for an average of 4.97 reviews per paper. Each paper was reviewed by three committee members and two external reviewers. For each paper, one of the external reviews was assigned by us and the other was assigned by a committee member. We made every effort to assign papers to committee members and external reviewers with matching interests and research areas.
A data processing system is provided having a processor and analysing circuitry for identifying a SIMD instruction associated with a first SIMD instruction set, replacing it with a functionally-equivalent scalar representation, and marking that functionally-equivalent scalar representation.
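To make the replace-and-mark step concrete, a toy Python example (the 4-lane vadd4 form, the lane-suffixed register names, and the scalarized_from tag are all invented for illustration):

def scalarize_simd_add(instr):
    """Rewrite a toy 4-lane SIMD add into marked, functionally-equivalent scalar adds.

    instr is a dict like {"op": "vadd4", "dst": "v0", "src1": "v1", "src2": "v2"}.
    Each lane becomes a scalar add, tagged so that later stages can recognise
    the substituted code, loosely mirroring the marking step in the abstract.
    """
    if instr["op"] != "vadd4":
        return [instr]
    scalar_ops = []
    for lane in range(4):
        scalar_ops.append({
            "op": "add",
            "dst": f'{instr["dst"]}.{lane}',
            "src1": f'{instr["src1"]}.{lane}',
            "src2": f'{instr["src2"]}.{lane}',
            "scalarized_from": "vadd4",   # the marking
        })
    return scalar_ops

if __name__ == "__main__":
    for op in scalarize_simd_add({"op": "vadd4", "dst": "v0", "src1": "v1", "src2": "v2"}):
        print(op)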
Abstract: Application-specific extensions to the computational capabilities of a processor provide an efficient mechanism to meet the growing performance and power demands of embedded applications. Hardware, in the form of new function units (or coprocessors), and the corresponding instructions are added to a baseline processor to meet the critical computational demands of a target application. In this paper, the design of a system to automate the instruction set customization process is presented.
An instruction scheduling method and a processor using an instruction scheduling method are provided. The instruction scheduling method includes selecting a first instruction that has the highest priority from a plurality of instructions, allocating the selected first instruction and a first time slot to one of the functional units, and allocating a second instruction and a second time slot to one of the functional units, wherein the second instruction is dependent on the first instruction.
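A minimal Python sketch of priority-driven list scheduling in this spirit (unit latencies, the priority values, and the data layout are assumptions; the patent does not specify them):

def list_schedule(instructions, num_functional_units):
    """Greedy list scheduler: highest-priority ready instruction first.

    instructions: dict id -> {"priority": int, "deps": [ids]} (assumed acyclic).
    Returns dict id -> (functional_unit, time_slot). Every instruction is
    assumed to take one cycle.
    """
    schedule = {}
    time = 0
    remaining = dict(instructions)
    while remaining:
        # Instructions whose dependences were scheduled in an earlier slot.
        ready = [i for i, info in remaining.items()
                 if all(d in schedule and schedule[d][1] < time for d in info["deps"])]
        if ready:
            # Fill the functional units for this slot in priority order.
            ready.sort(key=lambda i: -remaining[i]["priority"])
            for unit, instr in enumerate(ready[:num_functional_units]):
                schedule[instr] = (unit, time)
                del remaining[instr]
        time += 1
    return schedule

if __name__ == "__main__":
    prog = {
        "i1": {"priority": 10, "deps": []},
        "i2": {"priority": 5, "deps": ["i1"]},   # dependent on i1, gets a later slot
        "i3": {"priority": 7, "deps": []},
    }
    print(list_schedule(prog, num_functional_units=2))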
Abstract: Transient faults are emerging as a critical reliability concern in modern microprocessors. Redundant hardware solutions are commonly deployed to detect transient faults, but they are less flexible and cost-effective than software solutions. However, software solutions are rendered impractical because of high performance overheads. To address this problem, this paper presents Runtime Asynchronous Fault Tolerance via Speculation (RAFT), the fastest transient fault detection technique known to date.
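For intuition only, a toy example of software-redundant execution with a comparison step; it shows the general idea of software fault detection, not RAFT's speculative, asynchronous mechanism:

import random

def run_redundantly(fn, *args):
    """Run fn twice and compare the results to flag a transient fault.

    Naive duplication like this is what makes plain software solutions slow;
    it carries none of the optimizations that give RAFT its low overhead.
    """
    leading = fn(*args)
    trailing = fn(*args)
    if leading != trailing:
        raise RuntimeError("transient fault detected: redundant results differ")
    return leading

def flaky_add(a, b, fault_rate=0.0):
    """Deterministic add, with an optional injected single-bit flip."""
    result = a + b
    if random.random() < fault_rate:
        result ^= 1 << random.randrange(8)   # simulate a bit flip
    return result

if __name__ == "__main__":
    print(run_redundantly(flaky_add, 2, 3))   # normally prints 5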
Provided are a computing apparatus and method based on SIMD architecture capable of supporting various SIMD widths without wasting resources. The computing apparatus includes a plurality of configurable execution cores (CECs) that have a plurality of execution modes, and a controller for detecting a loop region from a program, determining a Single Instruction Multiple Data (SIMD) width for the detected loop region, and determining an execution mode of the processor according to the determined SIMD width.
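A simple illustration of the width-selection step (the 128-bit datapath limit, the supported widths, and the divisibility rule are assumptions, not the patented method):

def choose_simd_width(trip_count, element_bits, supported_widths=(2, 4, 8, 16)):
    """Pick a SIMD width for a loop region.

    A deliberately simple rule: use the widest supported width that divides
    the loop trip count and fits a (made-up) 128-bit datapath. A controller
    like the one in the abstract would then put the configurable execution
    cores into the matching execution mode.
    """
    for width in sorted(supported_widths, reverse=True):
        if trip_count % width == 0 and width * element_bits <= 128:
            return width
    return 1   # fall back to scalar execution

if __name__ == "__main__":
    # 32-bit elements, 64 iterations -> 4 lanes on the assumed 128-bit datapath.
    print(choose_simd_width(trip_count=64, element_bits=32))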
Program Committee: Nader Bagherzadeh, University of California, Irvine; Luc Bouge, LIP, ENS Lyon, France; Bruce Childers, University of Pittsburgh; Jong Choi, IBM TJ Watson Research Center; Michel Cosnard, LORIA-INRIA, France; Jack W.
Abstract: In high-end embedded systems, coarse-grained reconfigurable architectures (CGRAs) continue to replace traditional ASIC designs. CGRAs offer high performance at low power consumption, yet provide flexibility through programmability. In this paper we introduce a recurrence cycle-aware scheduling technique for CGRAs. Our modulo scheduler groups operations belonging to a recurrence cycle into a clustered node and then computes a scheduling order for those clustered nodes.
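A sketch of the clustering step, assuming recurrence cycles correspond to strongly connected components of the data-flow graph; the ordering and modulo-scheduling machinery of the paper is left out:

def recurrence_clusters(dfg):
    """Group operations that lie on a recurrence cycle into clustered nodes.

    dfg maps an operation to the operations it feeds (a directed data-flow
    graph). Strongly connected components of size > 1 correspond to
    recurrence cycles; each becomes one cluster, everything else stays a
    singleton. Uses Kosaraju's algorithm with an iterative DFS.
    """
    order, seen = [], set()

    def dfs(node, graph, out):
        stack = [(node, iter(graph.get(node, ())))]
        seen.add(node)
        while stack:
            current, successors = stack[-1]
            advanced = False
            for succ in successors:
                if succ not in seen:
                    seen.add(succ)
                    stack.append((succ, iter(graph.get(succ, ()))))
                    advanced = True
                    break
            if not advanced:
                stack.pop()
                out.append(current)   # record finish order

    for node in dfg:
        if node not in seen:
            dfs(node, dfg, order)

    transpose = {n: [] for n in dfg}
    for node, succs in dfg.items():
        for succ in succs:
            transpose.setdefault(succ, []).append(node)

    clusters, seen = [], set()
    for node in reversed(order):
        if node not in seen:
            component = []
            dfs(node, transpose, component)
            clusters.append(sorted(component))
    return clusters

if __name__ == "__main__":
    # a -> b -> c -> a is a recurrence cycle; d is outside the cycle.
    dfg = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
    print(recurrence_clusters(dfg))   # [['a', 'b', 'c'], ['d']]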
An automated design system for VLIW processors explores a parameterized design space to assist in identifying candidate processor designs that satisfy desired design constraints, such as processor cost and performance. A VLIW synthesis process takes as input a specification of processor parameters and synthesizes a datapath specification, an instruction format design, and a control path specification. The synthesis process also extracts a machine description suitable to re-target a compiler.
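A much-simplified picture of the exploration loop (the parameters, cost/performance model, and constraints below are placeholders; the real system synthesizes datapaths, instruction formats, and machine descriptions):

from itertools import product

def explore_design_space(candidates, evaluate, max_cost, min_performance):
    """Enumerate a small parameterized design space and keep feasible points.

    candidates: dict of parameter name -> list of values (e.g. issue width,
    register count). evaluate: callable returning (cost, performance) for
    one configuration. Returns feasible designs sorted by cost.
    """
    feasible = []
    names = list(candidates)
    for values in product(*(candidates[n] for n in names)):
        config = dict(zip(names, values))
        cost, perf = evaluate(config)
        if cost <= max_cost and perf >= min_performance:
            feasible.append((config, cost, perf))
    # Prefer the cheapest design that still meets the performance bound.
    return sorted(feasible, key=lambda entry: entry[1])

def toy_model(config):
    """Invented cost/performance model: wider issue is faster but pricier."""
    cost = config["issue_width"] * 2 + config["registers"] / 16
    perf = config["issue_width"] * 0.8
    return cost, perf

if __name__ == "__main__":
    space = {"issue_width": [2, 4, 8], "registers": [32, 64, 128]}
    for config, cost, perf in explore_design_space(space, toy_model,
                                                   max_cost=12, min_performance=3.0):
        print(config, cost, perf)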
An information processor for executing a program comprising a plurality of separate program instructions is provided. The processor comprises processing logic operable to individually execute said separate program instructions of said program, an operand store operable to store operand values and an accelerator having a plurality of functional units.
An accelerator 120 is tightly coupled to the normal execution unit 110. The operand store, which could be a register file 130, a stack-based operand store, or another operand store, is shared by the execution unit and the accelerator unit. Operands may also be accessed as immediate values within the instructions themselves.
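A toy model of the shared operand store (class names and operations are invented; the reference numerals 110, 120, and 130 refer to figures not reproduced here):

class RegisterFile:
    """Shared operand store: both the execution unit and the accelerator use it."""
    def __init__(self, size=16):
        self.regs = [0] * size
    def read(self, index):
        return self.regs[index]
    def write(self, index, value):
        self.regs[index] = value

class ExecutionUnit:
    def __init__(self, rf):
        self.rf = rf
    def add(self, dst, src1, src2):
        self.rf.write(dst, self.rf.read(src1) + self.rf.read(src2))

class Accelerator:
    """Tightly coupled accelerator: no private copies, it operates on the same registers."""
    def __init__(self, rf):
        self.rf = rf
    def multiply_accumulate(self, dst, a, b):
        self.rf.write(dst, self.rf.read(dst) + self.rf.read(a) * self.rf.read(b))

if __name__ == "__main__":
    rf = RegisterFile()
    eu, acc = ExecutionUnit(rf), Accelerator(rf)
    rf.write(1, 3); rf.write(2, 4)
    eu.add(0, 1, 2)                   # r0 = 7, written by the execution unit
    acc.multiply_accumulate(0, 1, 2)  # r0 = 7 + 12 = 19, read and written by the accelerator
    print(rf.read(0))                 # 19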
Abstract: Coarse-Grained Reconfigurable Array (CGRA) processors accelerate inner loops of applications by exploiting instruction-level parallelism (ILP) and, in some cases, also data-level and task-level parallelism (DLP & TLP). The aim of this tutorial is to give insight into CGRA architectures and their compilation techniques for exploiting parallelism.
A system is provided which simplifies and speeds up the process of designing a computer system by evaluating the components of the memory hierarchy for any member of a broad family of processors in an application-specific manner. The system uses traces produced by a reference processor in the design space for a particular cache design and characterizes the differences in behavior between the reference processor and an arbitrarily chosen processor.
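A bare-bones example of the trace-driven evaluation step (a direct-mapped cache with made-up parameters; the system in the abstract additionally characterizes how behavior transfers from the reference processor to other members of the processor family):

def simulate_direct_mapped_cache(trace, num_lines, line_bytes=64):
    """Replay a memory-address trace through a direct-mapped cache model.

    trace: iterable of byte addresses, e.g. captured on a reference processor.
    Returns (hits, misses) for the chosen cache geometry.
    """
    tags = [None] * num_lines
    hits = misses = 0
    for address in trace:
        line = address // line_bytes
        index = line % num_lines
        tag = line // num_lines
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag   # fill on miss
    return hits, misses

if __name__ == "__main__":
    # A tiny synthetic trace with some reuse.
    trace = [0, 64, 0, 128, 64, 4096, 0]
    print(simulate_direct_mapped_cache(trace, num_lines=8))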
Abstract: The rapid advancements in the computational capabilities of the graphics processing unit (GPU) and the deployment of general programming models for these devices have made the vision of a desktop supercomputer a reality. It is now possible to assemble a system that provides TFLOPs of performance on scientific applications for the cost of a high-end laptop. While these devices have clearly changed the landscape of computing, there are two central problems that arise.
Abstract: To efficiently schedule superscalar and superpipelined processors, it is necessary to move instructions across branches. This requires increasing the scheduling scope beyond the basic block. Superblock scheduling, a static scheduling method, is a variant of trace scheduling that removes the bookkeeping complexity associated with branches into a trace (side entrances) by removing those entrances using a method called tail duplication.
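A structural sketch of tail duplication on a small control-flow graph (the block naming, the _dup suffix, and the single-pass handling of one side entrance are simplifications; a real compiler also copies instructions and patches branch targets):

def tail_duplicate(blocks, trace):
    """Remove a side entrance into a trace by duplicating its tail blocks.

    blocks: dict name -> {"succs": [...], "preds": [...]}.
    trace: list of block names forming the selected trace, in order.
    Finds the first block after the trace head with a predecessor outside
    the trace, duplicates that block and the rest of the trace, and
    retargets the outside predecessors to the duplicate chain.
    """
    for position in range(1, len(trace)):
        entry = trace[position]
        outside_preds = [p for p in blocks[entry]["preds"] if p not in trace]
        if not outside_preds:
            continue
        tail = trace[position:]
        in_tail = set(tail)
        # Create the duplicate chain; successors inside the tail point to copies.
        for name in tail:
            blocks[name + "_dup"] = {
                "succs": [s + "_dup" if s in in_tail else s
                          for s in blocks[name]["succs"]],
                "preds": [p + "_dup" for p in blocks[name]["preds"] if p in in_tail],
            }
        # Retarget every outside predecessor to the head of the duplicate chain.
        for pred in outside_preds:
            blocks[pred]["succs"] = [entry + "_dup" if s == entry else s
                                     for s in blocks[pred]["succs"]]
            blocks[entry]["preds"].remove(pred)
            blocks[entry + "_dup"]["preds"].append(pred)
        break   # one side entrance handled; real passes repeat as needed
    return blocks

if __name__ == "__main__":
    cfg = {
        "A": {"succs": ["B"], "preds": []},
        "B": {"succs": ["C"], "preds": ["A", "X"]},   # X is a side entrance into the trace
        "C": {"succs": [], "preds": ["B"]},
        "X": {"succs": ["B"], "preds": []},
    }
    result = tail_duplicate(cfg, trace=["A", "B", "C"])
    print(sorted(result))           # A, B, B_dup, C, C_dup, X
    print(result["X"]["succs"])     # ['B_dup']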
Alex Aletà Gheorghe Almasi Erik Altman David August Eduard Ayguadé Rosa M. Badía Ivan Baev Ron Barnes Rastislav Bodik Mike Boucher Ian Bratt Preston Briggs Brad Calder Steve Carr Calin Cascaval Deepak Chandra Ben Cheng Bruce Childers Michael Chu Nathan Clark Josep M.
