FDRA: A Framework for a Dynamically Reconfigurable Accelerator Supporting Multi-Level Parallelism
ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 17, Issue 1, Article No.: 4, Pages 1–26. https://doi.org/10.1145/3614224
Coarse-grained reconfigurable architectures (CGRAs) have emerged as promising accelerators due to their high flexibility and energy efficiency. However, existing open source works often lack integration of CGRAs with CPU systems and corresponding ...
- Research article, February 2023
A Sound and Complete Algorithm for Code Generation in Distance-Based ISA
CC 2023: Proceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction, February 2023, Pages 73–84. https://doi.org/10.1145/3578360.3580263
The single-thread performance of a processor core is essential even in the multicore era. However, increasing the processing width of a core to improve the single-thread performance leads to a super-linear increase in power consumption. To overcome ...
- Research article, March 2017
A mechanism for energy-efficient reuse of decoding and scheduling of x86 instruction streams
DATE '17: Proceedings of the Conference on Design, Automation & Test in Europe, March 2017, Pages 1472–1477
Current superscalar x86 processors decompose each CISC instruction (variable-length and with multiple addressing modes) into multiple RISC-like μops at runtime so they can be pipelined and scheduled for concurrent execution. This challenging and power-...
- Research article, December 2015
Integer Linear Programming-Based Scheduling for Transport Triggered Architectures
ACM Transactions on Architecture and Code Optimization (TACO), Volume 12, Issue 4, Article No.: 59, Pages 1–22. https://doi.org/10.1145/2845082
Static multi-issue machines, such as traditional Very Long Instruction Word (VLIW) architectures, move complexity from the hardware to the compiler. This is motivated by the ability to support high degrees of instruction-level parallelism without ...
- Research article, August 2015
Revisiting Clustered Microarchitecture for Future Superscalar Cores: A Case for Wide Issue Clusters
ACM Transactions on Architecture and Code Optimization (TACO), Volume 12, Issue 3, Article No.: 28, Pages 1–22. https://doi.org/10.1145/2800787
During the past 10 years, the clock frequency of high-end superscalar processors has not increased. Performance keeps growing mainly by integrating more cores on the same chip and by introducing new instruction set extensions. However, this benefits ...
- Research article, May 2015
An instrumentation approach for hardware-agnostic software characterization
CF '15: Proceedings of the 12th ACM International Conference on Computing Frontiers, May 2015, Article No.: 3, Pages 1–8. https://doi.org/10.1145/2742854.2742859
Simulators and empirical profiling data are often used to understand how suitable a specific hardware architecture is for an application. However, simulators can be slow, and empirical profiling-based methods can only provide insights about the existing ...
- Article, November 2014
Potential of Using a Reconfigurable System on a Superscalar Core for ILP Improvements
SBESC '14: Proceedings of the 2014 Brazilian Symposium on Computing Systems Engineering, November 2014, Pages 43–48. https://doi.org/10.1109/SBESC.2014.19
As technology scaling slows and energy efficiency becomes an increasingly important design constraint, superscalar processor designs seem to be reaching their performance limits under area and power constraints. As a result, new architectural paradigms ...
- Research article, August 2014
Warp-aware trace scheduling for GPUs
PACT '14: Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, August 2014, Pages 163–174. https://doi.org/10.1145/2628071.2628101
GPU performance depends not only on thread/warp-level parallelism (TLP) but also on instruction-level parallelism (ILP). It is not enough to schedule instructions within basic blocks; it is also necessary to exploit opportunities for ILP optimization ...
- Research article, December 2013
MLP-aware dynamic instruction window resizing for adaptively exploiting both ILP and MLP
MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, December 2013, Pages 37–48. https://doi.org/10.1145/2540708.2540713
It is difficult to improve the single-thread performance of a processor in memory-intensive programs because processors have hit the memory wall, i.e., the large speed discrepancy between the processors and the main memory. Exploiting memory-level ...
- Research article, October 2013
A unified view of non-monotonic core selection and application steering in heterogeneous chip multiprocessors
PACT '13: Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, October 2013, Pages 133–144
A single-ISA heterogeneous chip multiprocessor (HCMP) is an attractive substrate to improve single-thread performance and energy efficiency in the dark silicon era. We consider HCMPs comprised of non-monotonic core types where each core type is ...
- Research article, September 2013
Software thread integration for instruction-level parallelism
ACM Transactions on Embedded Computing Systems (TECS), Volume 13, Issue 1, Article No.: 8, Pages 1–23. https://doi.org/10.1145/2512466
Multimedia applications require a significantly higher level of performance than previous workloads of embedded systems. They have driven digital signal processor (DSP) makers to adopt high-performance architectures like VLIW (Very-Long Instruction Word) ...
- Article, December 2012
An Address-Based Compiling Optimization for FFT on Multi-cluster DSP
PAAP '12: Proceedings of the 2012 Fifth International Symposium on Parallel Architectures, Algorithms and Programming, December 2012, Pages 60–64. https://doi.org/10.1109/PAAP.2012.17
This paper presents a compiler optimization for FFT programs on multi-cluster DSPs based on analysis of memory addresses. We transform the loops to reduce the number of instructions in the innermost loop. The interrelationship between each two ...
- Article, May 2012
FabScalar: Automating Superscalar Core Design
- Niket Choudhary,
- Salil Wadhavkar,
- Tanmay Shah,
- Hiran Mayukh,
- Jayneel Gandhi,
- Brandon Dwiel,
- Sandeep Navada,
- Hashem Najaf-abadi,
- Eric Rotenberg
Providing multiple superscalar core types on a chip, each tailored to different classes of instruction-level behavior, is an exciting direction for increasing processor performance and energy efficiency. Unfortunately, processor design and verification ...
- Article, February 2012
On Optimizing the Longest Common Subsequence Problem by Loop Unrolling Along Wavefronts
PDP '12: Proceedings of the 2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing, February 2012, Pages 603–611. https://doi.org/10.1109/PDP.2012.49
Loop unrolling is a loop transformation that groups a few loop iterations into a super-iteration, exposing more independent instructions and decreasing the total loop overhead. This paper characterizes loop unrolling by the unrolling factor, ...
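As a generic illustration of the transformation this abstract describes (not the paper's wavefront-specific algorithm), a C loop unrolled by a factor of 4; the function names are hypothetical:

```c
#include <stddef.h>

/* Baseline: one add, one compare, and one branch per element. */
void sum_baseline(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Unrolled by a factor of 4: the four additions in the body are
 * independent, exposing instruction-level parallelism, and the loop
 * branch overhead is amortized over four elements. A scalar epilogue
 * handles the remaining n % 4 iterations. */
void sum_unrolled4(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = a[i]     + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)  /* epilogue for leftover iterations */
        out[i] = a[i] + b[i];
}
```

The unrolling factor (4 here) trades code size against exposed parallelism, which is exactly the parameter the paper above characterizes.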
- Research article, November 2011
Efficient Spilling Reduction for Software Pipelined Loops in Presence of Multiple Register Types in Embedded VLIW Processors
ACM Transactions on Embedded Computing Systems (TECS), Volume 10, Issue 4, Article No.: 47, Pages 1–25. https://doi.org/10.1145/2043662.2043671
Integrating register allocation and software pipelining of loops is an active research area. We focus on techniques that precondition the dependence graph before software pipelining in order to ensure that no register spill instructions are inserted by ...
- Research article, June 2011
Parallelism and data movement characterization of contemporary application classes
SPAA '11: Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures, June 2011, Pages 95–104. https://doi.org/10.1145/1989493.1989506
This paper presents a framework for characterizing the distribution of fine-grained parallelism, data movement, and communication-minimizing code partitions. Understanding the spectrum of parallelism available in applications, and how much data movement ...
- Research article, May 2011
Quantitative analysis of parallelism and data movement properties across the Berkeley computational motifs
CF '11: Proceedings of the 8th ACM International Conference on Computing Frontiers, May 2011, Article No.: 17, Pages 1–2. https://doi.org/10.1145/2016604.2016625
This work presents the first thorough quantitative study of the available instruction-level parallelism, basic-block-granularity thread parallelism, and data movement across the Berkeley dwarfs/computational motifs. Although this classification was ...
- Research article, February 2011
Computing Floating-Point Square Roots via Bivariate Polynomial Evaluation
IEEE Transactions on Computers (ITCO), Volume 60, Issue 2, February 2011, Pages 214–227. https://doi.org/10.1109/TC.2010.152
In this paper, we show how to reduce the computation of correctly rounded square roots of binary floating-point data to the fixed-point evaluation of some particular integer polynomials in two variables. By designing parallel and accurate evaluation ...
- Article, January 2011
A scheduling approach for distributed resource architectures with scarce communication resources
International Journal of High Performance Systems Architecture (IJHPSA), Volume 3, Issue 1, January 2011, Pages 12–22. https://doi.org/10.1504/IJHPSA.2011.038054
Advances in semiconductor fabrication technology will continue to enable exponential increase in the number of transistors available. However, conventional architectures, such as superscalars or VLIWs, will not be able to use the abundant on-chip ...
- Research article, October 2010
Mighty-morphing power-SIMD
CASES '10: Proceedings of the 2010 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, October 2010, Pages 67–76. https://doi.org/10.1145/1878921.1878934
In modern wireless devices, two broad classes of compute-intensive applications are common: those with high amounts of data-level parallelism, such as signal processing used in wireless baseband applications, and those that have little data-level ...