  • Magnus Själander is an Associate Professor at the Department of Computer Science and a Visiting Senior Lecturer at th...
While for sensor networks energy efficiency has always been one of the major design goals, over the past decade energy efficiency has also become increasingly important in other disciplines. An emerging class of applications does not require a perfect result; rather, an approximate result is sufficient. Approximate computing enables more energy-efficient design of computer systems, reducing, for example, the energy dissipation in data centers. In this poster abstract we argue for embracing the concept of approximate computing in the sensor networking community.
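As an illustration of the accuracy-for-energy trade-off (not taken from the poster itself), loop perforation is a well-known approximate-computing technique that skips a fraction of loop iterations. The function names and the sensor readings below are hypothetical:

```python
def mean_exact(samples):
    return sum(samples) / len(samples)

def mean_perforated(samples, skip=2):
    # Skip iterations: process every `skip`-th sample, cutting the
    # work (and hence energy) by roughly a factor of `skip`.
    subset = samples[::skip]
    return sum(subset) / len(subset)

# Hypothetical temperature readings from a sensor node.
readings = [20.0, 20.5, 21.0, 20.8, 20.2, 20.6, 20.9, 20.4]
exact = mean_exact(readings)            # 20.55
approx = mean_perforated(readings, 2)   # 20.525, at half the work
```

The approximate mean stays within a small fraction of a degree of the exact one while processing only half the samples, which is the kind of trade-off approximate computing exposes.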
MicroScope, and microarchitectural replay attacks in general, take advantage of the characteristics of speculative execution to trap the execution of the victim application in an infinite loop, enabling the attacker to amplify a side-channel attack by executing it indefinitely. Due to the nature of the replay, it can be used to effectively attack security-critical trusted execution environments (secure enclaves), even under conditions where a side-channel attack would not be possible. At the same time, unlike speculative side-channel attacks, MicroScope can be used to amplify the correct path of execution, rendering many existing speculative side-channel defences ineffective. In this work, we generalize microarchitectural replay attacks beyond MicroScope and present an efficient defence against them. We make the observation that such attacks rely on repeated squashes of so-called “replay handles” and that the instructions causing the side-channel must reside in the same reorder buff...
The complexity of hardware-software co-design continues to grow despite the relentless efforts of the EDA community. This makes the task of producing an optimal, yet functionally correct design even more challenging. To make the situation worse, the applications that a particular design is optimized for play a vital role in determining the competitiveness of the design. It is therefore imperative that a generic design is tailored prior to manufacturing. Encouragingly, as design complexity increases, innovative methods of alleviating these problems evolve. In this paper, we address the issues related to hardware adaptation for a specific suite of applications.
Energy efficiency is a first-order design goal for nearly all classes of processors, but it is particularly important in mobile and embedded systems. Data caches in such systems account for a large portion of the processor's energy usage, and thus techniques to improve the energy efficiency of the cache hierarchy are likely to have high impact. Our prior work reduced data cache energy via a tagless access buffer (TAB) that sits at the top of the cache hierarchy. Strided memory references are redirected from the level-one data cache (L1D) to the smaller, more energy-efficient TAB. These references need not access the data translation lookaside buffer (DTLB), and they can avoid unnecessary transfers from lower levels of the memory hierarchy. The original TAB implementation requires changing the immediate field of load and store instructions, necessitating substantial ISA modifications. Here we present a new TAB design that requires minimal instruction set changes, gives software m...
With the growing importance of electronic products in day-to-day life, the need for portable electronic products with low power consumption is increasing rapidly. In this paper, an area-efficient, high-speed, and low-power Multiply-Accumulate unit (MAC) with a carry look-ahead adder (CLA) as the final adder is designed. In the final-adder stage of the same MAC architecture, the carry-save adder (CSA), carry-select adder (CSLA), and carry-skip adder (CSKPA) are also used in place of the CLA to compare power and performance. These MAC designs were simulated and synthesized using Xilinx 8.1. The simulation results show that the MAC design with the CLA reduces area by 16.7%, increases speed by 1.95%, and reduces power consumption by 0.5%.
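A minimal sketch of the carry look-ahead principle behind the CLA final adder: each bit position computes a generate and a propagate signal, from which all carries can be derived without waiting for a ripple. The `cla_add` function below is an illustrative bit-level model, not the paper's hardware design:

```python
def cla_add(a, b, width=4):
    """Bit-level model of a width-bit carry look-ahead adder."""
    g = []  # generate:  g_i = a_i AND b_i
    p = []  # propagate: p_i = a_i XOR b_i
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g.append(ai & bi)
        p.append(ai ^ bi)
    c = [0] * (width + 1)  # c[0] is the carry-in
    for i in range(width):
        # c_{i+1} = g_i OR (p_i AND c_i); hardware flattens this
        # recurrence so all carries are available in parallel.
        c[i + 1] = g[i] | (p[i] & c[i])
    s = sum((p[i] ^ c[i]) << i for i in range(width))
    return s, c[width]  # (sum, carry-out)
```

In actual hardware the carry recurrence is expanded into a two-level logic expression per carry, which is what gives the CLA its speed advantage over a ripple-carry chain.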
Applications for Systems-on-Chip have very different requirements regarding operand precision. We propose a twin-precision FFT engine that can efficiently perform either the common full-precision operation or two simultaneous and independent half-precision operations in the same hardware block. We provide an evaluation of energy, delay, and area for synthesized 0.13-µm butterfly circuits, and show how many lower-precision operations are needed for the twin-precision FFT butterfly to perform better than a conventional, dedicated FFT butterfly.
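The twin-precision idea can be sketched functionally: an N-bit multiplier either computes one full-precision product or, with the cross partial products gated off, two independent half-precision products. The Python model below is a functional illustration with assumed widths and packing, not the synthesized circuit:

```python
N = 16            # assumed full operand width
HALF = N // 2
MASK = (1 << HALF) - 1

def twin_multiply(a, b, twin=False):
    if not twin:
        return a * b  # full-precision N x N mode
    # Twin mode: each operand packs two independent HALF-bit values.
    a_lo, a_hi = a & MASK, (a >> HALF) & MASK
    b_lo, b_hi = b & MASK, (b >> HALF) & MASK
    # Gating off the cross partial products (a_lo*b_hi, a_hi*b_lo)
    # leaves two independent half-precision results in one 2N-bit word.
    return (a_hi * b_hi) << N | (a_lo * b_lo)
```

For example, `twin_multiply(0x0302, 0x0504, twin=True)` multiplies 3 by 5 and 2 by 4 in the same invocation, packing both products into one result word.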
Side-channel attacks based on speculative execution access sensitive data and use transmitters to leak such data during wrong-path execution. Speculative side-channel defenses have been proposed to prevent such information leakage. In one class of defenses, speculative instructions are considered unsafe and are delayed until they become non-speculative. However, not all speculative instructions are unsafe: recent work demonstrates that speculatively invariant instructions are independent of a speculative control-flow path and are guaranteed to eventually execute and commit, regardless of the outcome of the performed speculation. Compile-time information coupled with run-time mechanisms can then selectively lift defenses for speculatively invariant instructions, regaining some of the performance lost to “delay” defenses. Unfortunately, speculative invariance can be easily mishandled with speculative interference to leak information using a new side-channel that we introduce in this paper....
Recent architectural approaches that address speculative side-channel attacks aim to prevent software from exposing the microarchitectural state changes of transient execution. The Delay-on-Miss technique is one such approach, which simply delays loads that miss in the L1 cache until they become non-speculative, resulting in no transient changes in the memory hierarchy. However, this costs performance, prompting the use of value prediction (VP) to regain some of the lost performance. Unfortunately, the problem cannot be solved by simply introducing a new kind of speculation (value prediction). Value-predicted loads have to be validated, which cannot be commenced until the load becomes non-speculative. Thus, value-predicted loads occupy the same amount of precious core resources (e.g., reorder buffer entries) as Delay-on-Miss. The end result is that VP only yields marginal benefits over Delay-on-Miss. In this paper, our insight is that we can achieve the same goal as VP (increasing performance by pro...
We introduce FlexCore, the first exemplar of a processor based on the FlexSoC processor paradigm. The FlexCore utilizes an exposed datapath for increased performance. Microbenchmarks yield a performance boost of a factor of two over a traditional five-stage pipeline with the same functional units as the FlexCore. We describe our approach to compiling for the FlexCore. A flexible interconnect allows the FlexCore datapath to be dynamically reconfigured as a consequence of code generation. Additionally, specialized functional units may be introduced and utilized within the same architecture and compilation framework. The exposed datapath requires a wide control word. Our evaluation of two microbenchmarks confirms that this increases the instruction bandwidth and memory footprint, which calls for efficient instruction decoding, as proposed in the FlexSoC paradigm.
During the last decade of integrated electronic design, ever more functionality has been integrated onto the same chip, paving the way for having a whole system on a single chip. The drive for ever more functionality increases the demands on circuit designers, who have to provide the foundation for all this functionality. The desire for increased functionality, and an associated capability to adapt to changing requirements, has led to the design of reconfigurable architectures. With the increased interest in and use of reconfigurable architectures, there is a need for flexible and reconfigurable computational units that can meet the demands of high speed, high throughput, low power, and area efficiency. Multipliers are complex to implement efficiently in hardware and continue to give designers headaches. They are therefore interesting to study when investigating how to design flexible and reconfigurable computational units. In this the...
There exist extensive ongoing research efforts on emerging technologies that have the potential to become an alternative to today’s CMOS technologies. A common feature among the investigated techno ...
The comfort of our daily lives has come to rely on a vast number of embedded systems, such as mobile phones, anti-spin systems for cars, and high-definition video. Improving the end-user experience under often stringent requirements, in terms of high performance, low power dissipation, and low cost, makes these systems complex and nontrivial to design. This thesis addresses design challenges in three different areas of embedded systems. The presented FlexCore processor intends to improve the programmability of heterogeneous embedded systems while maintaining the performance of application-specific accelerators. This is achieved by integrating accelerators into the datapath of a general-purpose processor, in combination with a wide control word consisting of all control signals in a FlexCore’s datapath. Furthermore, a FlexCore processor utilizes a flexible interconnect, which, together with the expressiveness of the wide control word, improves its performance. When designing new embedde...
Achieving high result-accuracy in approximate computing (AC) based real-time applications without violating power constraints of the underlying hardware is a challenging problem. Execution of such AC real-time tasks can be divided into the execution of the mandatory part to obtain a result of acceptable quality, followed by a partial/complete execution of the optional part to improve accuracy of the initially obtained result within the given time limit. However, enhancing result-accuracy at the cost of increased execution length might lead to deadline violations with higher energy usage. We propose Prepare, a novel hybrid offline-online approximate real-time task-scheduling approach, that first schedules AC-based tasks and determines operational processing speeds for each individual task constrained by system-wide power limit, deadline, and task-dependency. At runtime, by employing fine-grained DVFS, the energy-adaptive processing speed governing mechanism of Prepare reduces proces...
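The mandatory/optional execution model described above can be sketched as a simple greedy allocator: mandatory parts must all fit before the deadline, after which optional cycles are granted while slack remains. This is an illustrative simplification under assumed units (cycles, one fixed speed), not the Prepare algorithm itself:

```python
def schedule(tasks, deadline, speed=1.0):
    """tasks: list of (mandatory_cycles, optional_cycles).
    Returns per-task (mandatory, granted_optional) pairs, or None if
    the mandatory parts alone miss the deadline."""
    time = sum(m for m, _ in tasks) / speed
    if time > deadline:
        return None  # infeasible even with every optional part dropped
    plan = []
    for m, o in tasks:
        slack = deadline - time
        extra = min(o, slack * speed)  # optional cycles that still fit
        time += extra / speed
        plan.append((m, extra))
    return plan
```

With tasks `[(2, 4), (3, 1)]` and a deadline of 8, the mandatory parts take 5 units, so 3 units of slack go to the first task's optional part and the second gets none; accuracy improves exactly as far as the deadline allows.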
The FlexSoC program, launched in 2003, aims to develop new architectural techniques and programming models for high performance System on Chip (SoC). The target platforms for FlexSoC are embedded systems such as PDAs and high performance mobile phones. Important properties for embedded systems are long battery life, high performance, and the flexibility to adapt to new protocols and standards. This abstract presents some of the work conducted by the participants of the FlexSoC program.
Thesis for the Degree of Licentiate of Engineering: Efficient Reconfigurable Multipliers Based on the Twin-Precision Technique. Magnus Själander, Division of Computer Engineering, Department of ...
ABSTRACT The need for energy efficiency continues to grow for many classes of processors, including those for which performance remains vital. Data cache is crucial for good performance, but it also represents a significant portion of the processor's energy expenditure. We describe the implementation and use of a tagless access buffer (TAB) that greatly improves data access energy efficiency while slightly improving performance. The compiler recognizes memory reference patterns within loops and allocates these references to a TAB. This combined hardware/software approach reduces energy usage by (1) replacing many level-one data cache (L1D) accesses with accesses to the smaller, more power-efficient TAB; (2) removing the need to perform tag checks or data translation lookaside buffer (DTLB) lookups for TAB accesses; and (3) reducing DTLB lookups when transferring data between the L1D and the TAB. Accesses to the TAB occur earlier in the pipeline, and data lines are prefetched from lower memory levels, which results in a small performance improvement. In addition, we can avoid many unnecessary block transfers between other memory hierarchy levels by characterizing how data in the TAB are used. With a combined size equal to that of a conventional 32-entry register file, a four-entry TAB eliminates 40% of L1D accesses and 42% of DTLB accesses, on average. This configuration reduces data-access related energy by 35% while simultaneously decreasing execution time by 3%.
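The key effect, strided references being served by a small tagless buffer instead of the L1D, can be illustrated with a toy address trace. The 32-byte line size and the single-line refill policy below are assumptions made for the sketch, not the TAB's actual configuration:

```python
LINE_SIZE = 32  # assumed TAB line size in bytes

def tab_hits(addresses):
    """Count accesses served by the buffered line vs. L1D refills."""
    hits = refills = 0
    current_line = None
    for addr in addresses:
        line = addr // LINE_SIZE
        if line == current_line:
            hits += 1      # served by the TAB: no L1D or DTLB access
        else:
            refills += 1   # line crossing: one L1D access refills the TAB
            current_line = line
    return hits, refills

# A stride-4 byte-address loop: 32 accesses touch only 4 lines,
# so the vast majority of references bypass the L1D entirely.
addrs = [i * 4 for i in range(32)]
hits, refills = tab_hits(addrs)
```

With this trace, 28 of 32 accesses hit the buffered line and only 4 refills reach the L1D, which is the mechanism behind the energy reductions the abstract reports.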
ABSTRACT For performance reasons, all ways in set-associative level-one (L1) data caches are accessed in parallel for load operations, even though the requested data can only reside in one of the ways. Thus, a significant amount of energy is wasted when loads are performed. We propose a speculation technique that performs the tag comparison in parallel with the address calculation, leading to the access of only one way during the following cycle on successful speculations. The technique incurs no execution time penalty, has an insignificant area overhead, and does not require any customized SRAM implementation. Assuming a 16kB 4-way set-associative L1 data cache implemented in a 65-nm process technology, our evaluation based on 20 different MiBench benchmarks shows that the proposed technique on average leads to a 24% data cache energy reduction.
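The speculation can be modeled functionally: if adding the offset to the base register does not change the address bits above the line offset, a tag comparison started from the base value alone remains valid for the final address, and only the matching way needs to be read. The line size and associativity below are illustrative assumptions, not the evaluated configuration:

```python
OFFSET_BITS = 6  # assumed 64-byte cache lines
WAYS = 4         # assumed associativity

def speculate_way(base, offset):
    """True if the tag/index bits computed from the base register alone
    match those of the final address base + offset."""
    return (base >> OFFSET_BITS) == ((base + offset) >> OFFSET_BITS)

def ways_read(trace):
    """Data ways read over a trace of (base, offset) loads: one way on
    a successful speculation, all WAYS on a failed one."""
    return sum(1 if speculate_way(b, o) else WAYS for b, o in trace)
```

Since load offsets are typically small, most accesses stay within the line addressed by the base register, so the one-way case dominates and most of the parallel way reads are avoided.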
ABSTRACT As CPU data requests to the level-one (L1) data cache (DC) can represent as much as 25% of an embedded processor's total power dissipation, techniques that decrease L1 DC accesses can significantly enhance processor energy efficiency. Filter caches are known to efficiently decrease the number of accesses to instruction caches. However, due to the irregular access pattern of data accesses, a conventional data filter cache (DFC) has a high miss rate, which degrades processor performance. We propose to integrate a DFC with a fast address calculation technique to significantly reduce the impact of misses and to improve performance by enabling one-cycle loads. Furthermore, we show that DFC stalls can be eliminated even after unsuccessful fast address calculations, by simultaneously accessing the DFC and L1 DC on the following cycle. We quantitatively evaluate different DFC configurations, with and without the fast address calculation technique, using different write allocation policies, and qualitatively describe their impact on energy efficiency. The proposed design provides an efficient DFC that yields both energy and performance improvements.
Design Space Exploration for an Embedded Processor with Flexible Datapath Interconnect. Tung Thanh Hoang, Ulf Jälmbrant, Erik der Hagopian, Kasyab P. Subramaniyan, Magnus Själander, and Per Larsson-Edefors ...
Fine-grained control through the use of a wide control word can lead to high instruction-level parallelism, but unless compressed the words require a large memory footprint. A reconfigurable fixed-length decoding scheme can be created by taking advantage of the fact that an application only uses a subset of the datapath for its execution. We present the first complete implementation of
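The subset-of-control-words observation lends itself to a dictionary-style decoder sketch: the wide words an application actually uses are loaded into a decode table, and the instruction stream then carries only short fixed-length indices. The class below is a hypothetical functional model for illustration, not the implementation the abstract refers to:

```python
class Decoder:
    """Hypothetical fixed-length decoder: wide control words are stored
    once in a loadable dictionary; instructions carry short indices."""

    def __init__(self, dictionary):
        # The subset of wide control words this application needs,
        # written into the decode RAM at configuration time.
        self.dictionary = list(dictionary)

    def index_width(self):
        # Bits needed for a fixed-length index into the dictionary.
        return max(1, (len(self.dictionary) - 1).bit_length())

    def decode(self, index):
        return self.dictionary[index]

words = [0b1010, 0b0110, 0b1111]  # toy stand-ins for wide control words
dec = Decoder(words)              # 3 words -> 2-bit indices suffice
```

Because the index width depends only on how many distinct control words the application uses, the scheme keeps instruction fetch fixed-length and narrow while the reconfigurable table supplies the full-width control word each cycle.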

And 38 more