IoT devices, edge devices and embedded devices in general are ubiquitous. The energy consumption of such devices matters both because of the sheer number of devices deployed and because such devices are often battery-powered. Hence, improving the energy efficiency of such high-performance embedded systems is crucial. The first step towards decreasing energy consumption is measuring it accurately, since conclusions and decisions are based on the measurements. Given the importance of the measurements, it surprised us that most publications dedicate little space and effort to the description of their experimental setup. One important variable of the measurement system is the sampling frequency, i.e. how often the continuous signal's voltage and current are measured per second. In this paper, we systematically explore the impact of the sampling frequency on the accuracy of the measurement system. We measure the energy consumption of a Hardkernel Odroid-XU4 board executing nine Rodinia benchmarks with a wide range of runtimes and options at 4 kHz, the standard sampling frequency of our measurement system. We show that one must sample at no less than 350 Hz to obtain results equivalent to the original power traces. Sampling at 1 Hz (as the Hardkernel SmartPower2 does) results in a maximum error of 80%.
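The sampling-frequency effect described above can be illustrated numerically: the energy estimate is the integral of the power trace, so decimating the trace changes the result. The sketch below uses a synthetic trace — the 4 kHz reference rate matches the setup above, but the signal shape, runtime and noise level are invented for illustration:

```python
import numpy as np

FS_FULL = 4000          # Hz, reference sampling rate of the measurement system
DURATION = 10.0         # seconds, illustrative benchmark runtime

rng = np.random.default_rng(0)
t = np.arange(0, DURATION, 1.0 / FS_FULL)
# Synthetic workload power (watts): a baseline plus periodic bursts, purely illustrative.
power = 2.0 + 1.5 * (np.sin(2 * np.pi * 0.5 * t) > 0) + 0.1 * rng.standard_normal(t.size)

def energy(trace, fs):
    """Energy in joules: integrate power over time (rectangle rule)."""
    return np.sum(trace) / fs

e_full = energy(power, FS_FULL)

def downsample_error(factor):
    """Relative energy error when keeping only every `factor`-th sample."""
    sub = power[::factor]
    e_sub = energy(sub, FS_FULL / factor)
    return abs(e_sub - e_full) / e_full

# Lower effective sampling rates generally distort the energy estimate more.
for fs in (1000, 350, 10, 1):
    print(fs, "Hz ->", downsample_error(FS_FULL // fs))
```

With a bursty trace like this one, 1 Hz sampling can alias against the workload's phases and miss the bursts entirely, which is the kind of large error the experiment above quantifies.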
Today's electronics world depends heavily on processing devices, in both current and future developments. Even non-electronic industries have large amounts of data to process and are indirectly dependent on processors. The more processors an architecture incorporates, the lower the data handling and processing time, and thus the higher the efficiency. Multi-core processors have therefore become a regular part of processing-element design in the electronics industry. However, incorporating a large number of processors into a system architecture makes it difficult for them to communicate without deadlock or livelock. A Network on Chip (NoC) is a promising solution for communication among the on-chip processors, provided it is fast enough and consumes little energy. Furthermore, the latency among the multi-core processors should be kept optimal to keep up with the increasing data acquisition and processing demands of new and developing operating systems and software. This paper addresses energy-efficiency and latency-reduction methods and techniques for multi-core architectures.
In the multi-core technology epoch, Network on Chip (NoC) architectures have been acknowledged as a solution to the design challenges of Systems on Chip (SoCs). Communication plays a major role in the design of effective NoCs. To achieve better communication among the multiple cores in a NoC, an efficient routing algorithm is required. To evaluate the performance of NoCs, we focus on performance parameters such as throughput, energy and path length under different routing algorithms. In this paper, we analyze network partitioning based on routing paths and implement two routing algorithms for multicast messaging, named Path Based Shortest Path (PBSP) and All Pair Shortest Path (APSP). The algorithms are developed in C/C++ and evaluated in a network simulator for a 2D mesh NoC. Finally, we compare and analyze the two algorithms in terms of throughput and energy.
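As a sketch of what an All Pair Shortest Path (APSP) computation looks like on a 2D mesh NoC, the following uses the classic Floyd-Warshall algorithm on a small mesh. The 4x4 size and unit hop cost per link are illustrative assumptions, not the paper's exact configuration:

```python
# All-pairs shortest hop counts on a 4x4 2D mesh via Floyd-Warshall.
N = 4
nodes = [(x, y) for x in range(N) for y in range(N)]
idx = {n: i for i, n in enumerate(nodes)}
INF = float("inf")

# Initialize the distance matrix: 0 on the diagonal, 1 per mesh link.
dist = [[INF] * len(nodes) for _ in nodes]
for i in range(len(nodes)):
    dist[i][i] = 0
for (x, y) in nodes:
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nb = (x + dx, y + dy)
        if nb in idx:
            dist[idx[(x, y)]][idx[nb]] = 1

# Relax every pair through every possible intermediate node.
for k in range(len(nodes)):
    for i in range(len(nodes)):
        for j in range(len(nodes)):
            if dist[i][k] + dist[k][j] < dist[i][j]:
                dist[i][j] = dist[i][k] + dist[k][j]

# In a fault-free mesh, the hop distance equals the Manhattan distance.
print(dist[idx[(0, 0)]][idx[(3, 3)]])
```

A multicast router can precompute such a table once and then pick per-destination paths from it; the path-based variant instead chains destinations along a single route.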
We introduce and experimentally validate a new macro-level model of the CPU temperature/power relationship within nanometer-scale application processors or systems-on-chip. By adopting a holistic view, this model is able to take into account many of the physical effects that occur within such systems. Together with two algorithms described in the paper, our results can be used, for instance by engineers designing power or thermal management units, to cancel the temperature-induced bias on power measurements. This will help them gather temperature-neutral power data while running multiple instances of their benchmarks. Power requirements and system failure rates can also be decreased by controlling the CPU's thermal behavior. Although the temperature/power relationship is usually assumed to be exponential, there is a lack of publicly available physical temperature/power measurements to back up this assumption, a gap our paper fills. Via measurements on two pertinent platforms sporting nanometer-scale application processors, we show that the power/temperature relationship is indeed very likely exponential over a 20 °C to 85 °C temperature range. Our data suggest that, for application processors operating between 20 °C and 50 °C, a quadratic model is still accurate and a linear approximation is acceptable.
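The claim that a quadratic (or even linear) model suffices over a 20 °C to 50 °C window can be sanity-checked against a synthetic exponential curve. The leakage coefficients below are invented for illustration, not the paper's measured data:

```python
import numpy as np

# Hypothetical exponential leakage-power curve over the full 20-85 °C range.
T = np.linspace(20, 85, 66)           # temperature in °C
P = 0.5 * np.exp(0.03 * (T - 20))     # power in watts; coefficients are made up

def max_rel_err(deg, lo, hi):
    """Max relative error of a degree-`deg` polynomial fit on [lo, hi] °C."""
    m = (T >= lo) & (T <= hi)
    coeffs = np.polyfit(T[m], P[m], deg)
    fit = np.polyval(coeffs, T[m])
    return np.max(np.abs(fit - P[m]) / P[m])

# Over the narrow 20-50 °C window the quadratic fit tracks the exponential
# closely, and the linear fit is rougher but may still be usable.
print("quadratic:", max_rel_err(2, 20, 50))
print("linear:   ", max_rel_err(1, 20, 50))
```

The same comparison run over the full 20 °C to 85 °C range shows the polynomial errors growing, which is consistent with the abstract's restriction of the quadratic/linear approximations to the lower temperature band.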
In this paper, we address the problem of variation-aware selection of voltage levels for systems-on-chip (SoCs) that are organized into multiple frequency and voltage domains. Conventionally, the voltage levels for each domain, as well as the mapping between frequencies and voltages, are determined without considering variations. In the presence of variations, these choices are often suboptimal since the frequency versus voltage characteristics vary from one SoC instance to another and across different voltage domains within an instance. We present a two-pronged approach to address this problem. First, we propose breaking the conventional fixed coupling between voltage levels and frequencies and demonstrate that performing this association based on the characteristics of individual chip instances can lead to significant improvements in power and performance. Second, we show that voltage levels computed while accounting for variations can lead to further improvements. We present a methodology to determine a set of discrete voltage levels in a variation-aware manner by generating and quantizing the ideal voltage distribution for a given SoC. Our experiments on an 802.11 MAC processor SoC indicate that the proposed techniques lead to significant improvements in power and performance characteristics in the presence of variations. We obtained an improvement of up to 68% in parametric yield (the number of chips meeting power and performance targets) compared to conventional voltage scaling. As manufacturing-induced variations continue to increase with technology scaling, traditional design techniques based on nominal or worst-case analysis are becoming increasingly ineffective. These techniques lead to yield loss, waste due to large margins, or excessive effort resulting in missed time-to-market windows. It is becoming imperative to account for variations in all stages of the design cycle, starting from system-level design. In this work, we study the impact of variations on systems that use multiple supply voltages to meet power and performance targets, and we present techniques for variation-aware voltage level selection and scaling.
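One way to picture "generating and quantizing the ideal voltage distribution" is to sample per-instance ideal voltages, pick a few discrete levels, and map each demand to the lowest level that still meets it. The Gaussian distribution, the level count, and the equal-population binning below are illustrative assumptions, not the paper's methodology:

```python
import random

random.seed(1)
# Hypothetical ideal voltages (V) required by many chip instances/domains
# to hit a target frequency; the spread models process variation.
ideal = sorted(random.gauss(1.0, 0.05) for _ in range(1000))

def quantize_levels(samples, k):
    """Choose k discrete levels as the upper edges of equal-population bins."""
    n = len(samples)
    return [samples[(n * (i + 1)) // k - 1] for i in range(k)]

levels = quantize_levels(ideal, 4)

def assign(v, levels):
    """Lowest available level that still satisfies the ideal voltage v."""
    for lv in levels:
        if lv >= v:
            return lv
    return levels[-1]  # defensive clamp for demands above the top level

# Average voltage margin wasted by quantization (lower is better).
overhead = sum(assign(v, levels) - v for v in ideal) / len(ideal)
print(levels, overhead)
```

The point of the variation-aware step is that these levels are derived from the measured distribution of instances rather than fixed at design time, so the quantization overhead shrinks for the population actually manufactured.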
The design of scalable and reliable interconnection networks for Systems on Chip (SoCs) introduces new design constraints not present in current multicomputer systems. Although regular topologies are preferred for building NoCs, heterogeneous blocks, fabrication faults and reliability ...
This paper presents a novel approach to accelerate program execution by mapping repetitive traces of executed instructions, called Megablocks, to a runtime reconfigurable array of functional units. An offline tool suite extracts Megablocks from microprocessor instruction traces and generates a Reconfigurable Processing Unit (RPU) tailored for the execution of those Megablocks. The system is able to move computations transparently from the microprocessor to the RPU at runtime. A prototype implementation of the system ...
Current technology allows designers to implement complete embedded computing systems on a single FPGA. Using an FPGA as the implementation platform introduces greater flexibility into the design process and allows a new approach to embedded system design. Since there is no cost to reprogramming an FPGA, system performance can be measured on-chip in the runtime environment and the system's architecture can be altered based on an evaluation of the data to meet design requirements.
Three-Dimensional (3D) integration is a solution to the interconnect bottleneck in Two-Dimensional (2D) Multi-Processor Systems on Chip (MPSoCs). 3D IC design improves performance and decreases power consumption by replacing long horizontal interconnects with shorter vertical ones. As multicast communication is commonly used in various parallel applications, performance can be significantly improved by supporting multicast operations at the hardware level. In this paper, we propose a set of partitioning approaches, each with a different level of efficiency. In addition, we present an advantageous method named Recursive Partitioning (RP), in which the network is recursively partitioned until all partitions contain a comparable number of nodes. With this approach, the multicast traffic is distributed among several subsets and the network latency is considerably decreased. We also present a Minimal Adaptive Routing (MAR) algorithm for unicast and multicast traffic in 3D mesh Networks-on-Chip (NoCs). The idea behind the MAR algorithm is to utilize Hamiltonian paths to provide a set of alternative paths.
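The Recursive Partitioning (RP) idea — splitting the network until all partitions hold a comparable number of nodes — can be sketched as a recursive bisection of the node set. The 4x4x2 mesh size and the partition bound below are illustrative, not the paper's parameters:

```python
# Recursively bisect a 3D mesh's node set along its longest dimension
# until every partition holds at most `max_size` nodes.
def recursive_partition(nodes, max_size):
    if len(nodes) <= max_size:
        return [nodes]
    # Pick the dimension with the largest extent (x=0, y=1, z=2).
    extents = [max(n[d] for n in nodes) - min(n[d] for n in nodes) for d in range(3)]
    d = extents.index(max(extents))
    mid = sorted(n[d] for n in nodes)[len(nodes) // 2]
    lo = [n for n in nodes if n[d] < mid]
    hi = [n for n in nodes if n[d] >= mid]
    if not lo or not hi:       # degenerate split: fall back to halving the list
        half = len(nodes) // 2
        lo, hi = nodes[:half], nodes[half:]
    return recursive_partition(lo, max_size) + recursive_partition(hi, max_size)

mesh = [(x, y, z) for x in range(4) for y in range(4) for z in range(2)]
parts = recursive_partition(mesh, 8)
print(len(parts), [len(p) for p in parts])
```

Each resulting partition can then handle its own copy of a multicast message, which is how distributing traffic among subsets reduces contention and latency.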
Embedded processors are now widely used in systems-on-chip. The computational power of such processors and their ease of access to/from other embedded cores can be utilized to test SoCs. This paper presents software-based testing of embedded cores in a system chip using the embedded processor. We present a methodology to systematically generate test programs that test the processor and ...