While a canonical out-of-order engine can effectively exploit implicit parallelism in sequential programs, its effectiveness is often hindered by instruction and data supply imperfections manifested as branch mispredictions and cache misses. Accurate and deep look-ahead guided by a slice of the executed program is a simple yet effective approach to mitigate the performance impact of branch mispredictions and cache misses. Unfortunately, program slice-guided look-ahead is often limited by the speed of the look-ahead code slice, especially for irregular programs. In this paper, we attempt to speed up the look-ahead agent using speculative parallelization, which is especially suited for the task. First, slicing for look-ahead tends to reduce important data dependences that prohibit successful speculative parallelization. Second, the task for look-ahead is not correctness-critical and thus naturally tolerates dependence violations. This enables an implementation to forgo violation de...
A shared cache is generally optimized to maximize overall throughput, fairness, or both. Increasingly in shared environments, especially compute clouds, users are unrelated to one another. In such circumstances, an overall gain in throughput does not justify an individual loss. This paper explores conservative sharing, which protects the cache occupancy of individual programs but still enables full cache sharing whenever there is unused space. Specifically, we present a new hardware-based mechanism called cache rationing. Each core/program is assigned a portion of the shared cache as its ration. The hardware support protects the ration so it cannot be taken away by peer programs while in use. However, a program can exceed its allocated ration if another program has unused blocks in its ration. This paper shows that rationing provides both full protection and full utilization of the cache. In addition, the same hardware support can enable energy-efficient caching and hardwar...
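The rationing policy described above can be illustrated with a toy software model. This is only a sketch under simplifying assumptions (a fully associative cache, LRU order, capacity equal to the sum of rations); the class and method names are hypothetical and do not come from the paper's hardware design. A requester within its ration may reclaim space from a peer that is over its ration, while a requester already at its ration may only evict its own blocks:

```python
from collections import OrderedDict

class RationedCache:
    """Toy model of cache rationing (an illustrative sketch, not the
    paper's hardware). Assumes capacity == sum of all rations."""
    def __init__(self, capacity, rations):
        self.capacity = capacity
        self.rations = dict(rations)      # owner -> allotted block count
        self.blocks = OrderedDict()       # addr -> owner, in LRU order

    def occupancy(self, owner):
        return sum(1 for o in self.blocks.values() if o == owner)

    def _lru_of(self, owner):
        # Oldest block belonging to `owner` (OrderedDict keeps LRU order).
        return next(a for a, o in self.blocks.items() if o == owner)

    def access(self, owner, addr):
        if addr in self.blocks:                 # hit: refresh LRU position
            self.blocks.move_to_end(addr)
            return "hit"
        if len(self.blocks) < self.capacity:    # unused space: free to take
            self.blocks[addr] = owner
            return "miss-fill"
        if self.occupancy(owner) >= self.rations[owner]:
            victim = self._lru_of(owner)        # at/over ration: self-evict
        else:
            # Under ration: reclaim from a peer that exceeded its ration,
            # falling back to our own LRU block.
            victim = next((a for a, o in self.blocks.items()
                           if self.occupancy(o) > self.rations[o]), None)
            victim = victim or self._lru_of(owner)
        del self.blocks[victim]
        self.blocks[addr] = owner
        return "miss-evict"
```

For example, with a 4-block cache rationed 2/2 between programs A and B, A can grow to all 4 blocks while B is idle, but as soon as B starts missing, B reclaims space from A's over-ration blocks without evicting B's own.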
Abstract—Thirst for high performance and increased functionality has forced designers to integrate a large number of transistors on the same die, and sometimes on the same substrate. System-on-a-chip (SOC) – a trend in which analog, digital, and RF circuits share a common substrate – is becoming popular primarily due to the ease of large-scale heterogeneous integration. However, one of the major concerns in such systems is the noise that couples from noise sources to more sensitive blocks through the common substrate. In the past, noise issues were investigated in analog circuits, while digital circuits were less vulnerable to such problems until recently. This paper presents an overview of the substrate coupling noise (SCN) phenomenon in high-speed SOCs and digital integrated circuits (ICs). Various design methodologies to efficiently model SCN are reviewed and their key aspects are discussed. A few simple yet important noise mitigation techniques and state-of-the-art solutions to deal with SCN problems...
In this paper, several passive termination schemes for high-speed on-chip serial interconnects and the trade-offs associated with them are presented. Signal degradation due to reflections in high-speed transmission lines (TLs) can be minimized by introducing an adequate termination circuit. The primary trade-off involved in a terminated TL is the optimization of bandwidth versus power consumption. Secondary trade-offs, i.e., area versus bandwidth and noise margin versus power consumption, are also discussed in brief. A few guidelines are proposed which are useful in achieving competing objectives, i.e., low latency, high bandwidth, and low power consumption in on-chip high-speed communication. The basis of the qualitative and quantitative analysis presented here is the state-of-the-art literature of recent research. In most cases, it is assumed that reflections occur mainly due to mismatching at the receiver end and that proper termination is provided at the transmitter end.
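The reflections discussed above are governed by the standard transmission-line reflection coefficient, Γ = (Z_L − Z_0)/(Z_L + Z_0), where Z_0 is the line's characteristic impedance and Z_L the termination impedance at the mismatched (here, receiver) end. The following is a minimal illustration of that textbook formula, not of any specific termination scheme from the paper:

```python
def reflection_coefficient(z_load, z0):
    """Gamma = (Z_L - Z_0) / (Z_L + Z_0): the fraction of the incident
    wave amplitude reflected at a termination of impedance Z_L on a
    line of characteristic impedance Z_0 (purely resistive case)."""
    return (z_load - z0) / (z_load + z0)

# A matched termination (Z_L == Z_0) absorbs the wave: no reflection.
print(reflection_coefficient(50.0, 50.0))   # 0.0
# A short (Z_L = 0) reflects the wave fully, inverted.
print(reflection_coefficient(0.0, 50.0))    # -1.0
# A mild mismatch (75 ohm load on a 50 ohm line) reflects 20%.
print(reflection_coefficient(75.0, 50.0))   # 0.2
```

This makes the bandwidth-power trade-off concrete: a smaller termination mismatch means a smaller |Γ| and cleaner signaling, but low-impedance resistive terminations draw more static power.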
In spite of the multicore revolution, high single-thread performance still plays an important role in ensuring a decent overall gain. Look-ahead is a proven strategy in uncovering implicit parallelism; however, a conventional out-of-order core quickly becomes resource-inefficient when looking beyond a short distance. An effective approach is to use an independent look-ahead thread running on a separate context, guided by a program slice known as the skeleton. We observe that fixed heuristics to generate skeletons are often suboptimal. As a consequence, the look-ahead agent is not able to target sufficient bottlenecks to reap all the benefits it should. In this paper, we present DRUT, a holistic hardware-software solution, which achieves good single-thread performance by tuning the look-ahead skeleton efficiently. First, we propose a number of dynamic transformations to branch-based code modules (we call them Do-It-Yourself or DIY) that enable a faster look-ahead thread without compromising the qu...
One well-known approach to mitigate the impact of branch mispredictions and cache misses is to enable deep look-ahead so as to overlap instruction and data supply with instruction processing. A continuous look-ahead process which uses a separate thread of control on another hardware context is one such approach, which we call decoupled look-ahead [1], [2]. However, in such look-ahead schemes, the look-ahead thread can often become the performance bottleneck. In this work, we explore speculative parallelization in a decoupled look-ahead agent. Intuitively, speculative parallelization is aptly suited to the task of speeding up the look-ahead agent for two reasons. First, the program slice for look-ahead does not contain all the data dependences embedded in the original program, providing more opportunities for parallelization. Second, the execution of the slice is only for look-ahead purposes and thus the environment is inherently more tolerant of dependence violations.
Branch prediction is one of the oldest performance-improving techniques which still finds relevance in modern architectures. While simple prediction techniques provide fast lookup and power efficiency, they suffer from a high misprediction rate. On the other hand, complex branch predictors – either neural-based or variants of two-level branch prediction – provide better prediction accuracy but consume more power, and their complexity increases exponentially. In addition, in complex prediction techniques the time taken to predict a branch is itself very high – ranging from 2 to 5 cycles – which is comparable to the execution time of actual branches. Branch prediction is essentially an optimization (minimization) problem where the emphasis is on achieving the lowest possible miss rate, low power consumption, and low complexity with minimum resources. In this survey paper we review the traditional two-level branch prediction techniques, their variants, and the underlying principle...
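The two-level scheme surveyed above can be sketched in a few lines. This is a minimal GAg-style model under simplifying assumptions (a single global history register indexing one table of 2-bit saturating counters, illustrative sizes); it is not any specific predictor from the surveyed papers:

```python
class TwoLevelPredictor:
    """Minimal two-level adaptive predictor (GAg-style sketch): the
    first level is a global history register (GHR) of recent branch
    outcomes; the second level is a pattern history table (PHT) of
    2-bit saturating counters indexed by the GHR."""
    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.ghr = 0                             # global history register
        self.pht = [1] * (1 << history_bits)     # init: weakly not-taken

    def predict(self):
        # Counter values 2..3 mean "predict taken", 0..1 "not taken".
        return self.pht[self.ghr] >= 2

    def update(self, taken):
        # Train the counter for the current history, then shift the
        # actual outcome into the history register.
        ctr = self.pht[self.ghr]
        self.pht[self.ghr] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        mask = (1 << self.history_bits) - 1
        self.ghr = ((self.ghr << 1) | int(taken)) & mask

# With 4 history bits, a repeating taken/taken/not-taken/taken pattern
# maps each position to a distinct history, so the predictor locks on.
p = TwoLevelPredictor()
pattern = [True, True, False, True]
for outcome in pattern * 20:                     # warm-up / training
    p.update(outcome)
for outcome in pattern:                          # steady state
    print(p.predict() == outcome)
    p.update(outcome)
```

The sketch also shows where the costs discussed above come from: prediction latency grows with the table lookup and hashing depth, and PHT storage (and hence power) grows exponentially with the number of history bits.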
Despite the proliferation of multi-core and multi-threaded architectures, exploiting implicit parallelism for a single semantic thread is still a crucial component in achieving high performance. While a canonical out-of-order engine can effectively uncover implicit parallelism in sequential programs, its effectiveness is often hindered by instruction and data supply imperfections (manifested as branch mispredictions and cache misses). Look-ahead is a tried-and-true strategy to exploit implicit parallelism, but can have resource-inefficient implementations such as in a conventional, monolithic out-of-order core. A more decoupled approach with an independent, dedicated look-ahead thread on a separate thread context can be a more flexible and effective implementation, especially in a multi-core environment. While capable of generating significant performance gains, the look-ahead agent often becomes the new speed limit; thus, we explore a range of software- and hardware-based techniques...
Papers by Raj Parihar