Parallel Algorithms and Architectures for DSP Applications, 1991
We present methods to permute matrices in mesh-connected uniform square arrays with local control only. The permutations that we consider form a class called affine permutations, which includes transpose and many other row/column reorderings. We first present a general scheme where destination tags are generated on the fly, and a standard sorting algorithm on the mesh is used to route the elements to their respective destinations. We then specialize to four operations: in-place reflection, in-place rotation, and on-the-fly versions of these (that permute the matrix while it is loaded into the array), and show how they can be implemented very efficiently with local control only. We also develop a general theoretical model for prescheduled data transfers in distributed systems. This model can be applied to permutations, and we use it to verify one of the specialized operations.
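The general tag-and-sort scheme described in this abstract can be illustrated with a minimal sequential sketch. It rests on several assumptions that are not taken from the paper: an affine permutation is modeled here as the index map (i, j) -> (A·(i, j) + b) mod n, the names `affine_tags` and `permute_by_sorting` are hypothetical, and Python's built-in sort stands in for the mesh sorting algorithm.

```python
def affine_tags(n, a, b):
    """Compute a destination tag (row-major index) for every source cell (i, j).

    Assumption: the destination is (a @ (i, j) + b) mod n; the paper's exact
    definition of an affine permutation may differ.
    """
    tags = {}
    for i in range(n):
        for j in range(n):
            di = (a[0][0] * i + a[0][1] * j + b[0]) % n
            dj = (a[1][0] * i + a[1][1] * j + b[1]) % n
            tags[(i, j)] = di * n + dj
    return tags


def permute_by_sorting(m, a, b):
    """Route each element to its destination by sorting on the tags."""
    n = len(m)
    tags = affine_tags(n, a, b)
    if len(set(tags.values())) != n * n:
        raise ValueError("index map is not a permutation for this n")
    # Pair every element with its tag, then sort by tag only.
    tagged = [(tags[(i, j)], m[i][j]) for i in range(n) for j in range(n)]
    tagged.sort(key=lambda pair: pair[0])
    flat = [value for _, value in tagged]
    return [flat[r * n:(r + 1) * n] for r in range(n)]


matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Transpose: (i, j) -> (j, i)
transposed = permute_by_sorting(matrix, [[0, 1], [1, 0]], (0, 0))
# Reflection about the middle column: (i, j) -> (i, n - 1 - j)
reflected = permute_by_sorting(matrix, [[1, 0], [0, -1]], (0, 2))
```

Each tag depends only on the cell's own coordinates, which is what makes tag generation compatible with local control; the sort then does all of the routing.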
IEEE International Symposium on Circuits and Systems
Square systolic arrays for performing a class of matrix permutations are presented. This class can be described by the composition of two basic operations, namely clockwise rotation by 90°, and horizontal reflection about the middle column. Four different arrays are presented: in-place reflection, in-place rotation, and on-the-fly versions of these that permute the matrix as it is being loaded into
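As a small illustration of how the two basic operations named in this abstract compose, the following sketch works on plain Python lists rather than a systolic array. The function names are hypothetical and the code is not the paper's array design; it only shows that rotation followed by reflection yields the transpose, and that the two operations generate eight distinct permutations of a generic square matrix.

```python
def rotate_cw(m):
    """Rotate a square matrix clockwise by 90 degrees."""
    n = len(m)
    return [[m[n - 1 - j][i] for j in range(n)] for i in range(n)]


def reflect_h(m):
    """Reflect a matrix horizontally about its middle column."""
    return [row[::-1] for row in m]


def generated_permutations(m):
    """Collect every matrix reachable by composing the two operations."""
    start = tuple(map(tuple, m))
    seen = {start}
    frontier = [m]
    while frontier:
        current = frontier.pop()
        for op in (rotate_cw, reflect_h):
            nxt = op(current)
            key = tuple(map(tuple, nxt))
            if key not in seen:
                seen.add(key)
                frontier.append(nxt)
    return seen


matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Rotating clockwise and then reflecting gives the transpose.
transposed = reflect_h(rotate_cw(matrix))
count = len(generated_permutations(matrix))
```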
Following the successful WCET Tool Challenges in 2006 and 2008, the third event in this series was organized in 2011, again with support from the ARTIST DESIGN Network of Excellence. Following the practice established in the previous Challenges, the WCET Tool Challenge 2011 (WCC'11) defined two kinds of problems to be solved by the Challenge participants with their tools: WCET problems, which ask for bounds on the execution time, and flow-analysis problems, which ask for bounds on the number of times certain parts of ...
I take real pleasure in seeing the proceedings of the 12th International Workshop on Worst-Case Execution Time Analysis online already on the day of the workshop. This helps WCET'12 achieve its goal of facilitating discussion and interaction among participants as well as of returning value to the authors of the works that were accepted for presentation. I also feel personal satisfaction in having achieved the production of these proceedings as a tangible manifestation of the considerable effort that went into making WCET'12 happen, ...
Caches have become increasingly important with the widening gap between main memory and processor speeds. However, they are a source of unpredictability due to their characteristics, resulting in programs behaving in a different way than expected. Cache locking mechanisms adapt caches to the needs of real-time systems. Locking the cache is a solution that trades performance for predictability: at a cost of generally lower performance, the time of accessing the memory becomes predictable. This paper combines compile-time cache analysis with data cache locking to estimate the worst-case memory performance (WCMP) in a safe, tight and fast way. In order to get predictable cache behavior, we first lock the cache for those parts of the code where the static analysis fails. To minimize the performance degradation, our method loads the cache, if necessary, with data likely to be accessed. Experimental results show that this scheme is fully predictable, without compromising the performance of t...
Multi-core technology is recognized as a key component in developing new cost-efficient products. It can reduce the overall hardware cost through hardware consolidation. However, it also results in tremendous challenges related to the combination of predictability and performance. The AUTOSAR consortium has developed the worldwide standard for automotive embedded software systems. One of the prominent aspects of this standard is its support for multi-core systems. In this paper, the ongoing work on addressing the challenge of achieving a resource-efficient and predictable mapping of AUTOSAR runnables onto a multi-core system is discussed. The goal is to minimize the runnables' communication cost while meeting their timing and precedence constraints. The basic notion utilized in this research is to consider runnable granularity, which leads to increased flexibility in allocating runnables to various cores, compared with task granularity, in which all of the runnables hosted on a task must be allocated to the same core. This increased flexibility can potentially reduce communication cost. In addition, a heuristic algorithm is introduced to create a task set according to the mapping of runnables on the cores. In our current work, we are formulating the problem as an Integer Linear Programming (ILP) problem, so that conventional ILP solvers can be applied to derive a solution.