Interleaved multithreading is a technique in which the processor starts executing a different task when the current thread is stalled. However, whereas different forms of hardware multithreading have been extensively evaluated for superscalar processors, an evaluation of multithreading techniques in a VLIW architecture is frequently missing. The objective of this paper is to determine an efficient method of implementing interleaved hardware multithreading in the TriMedia and to evaluate its performance. The TriMedia is a multimedia VLIW processor designed by Philips Semiconductors. Currently, multithreading is not implemented in the TriMedia. First, the details of the interleaved multithreading method used are given. After that, the architectural changes made to a cycle-accurate simulator of the TriMedia are described. Then, the various test results are presented. Finally, we discuss the conclusions that can be drawn from the simulation results.
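As a rough illustration of the switch-on-stall policy described above, the following C sketch models a core that selects the next ready hardware thread context in round-robin order whenever the current thread stalls. The structure and names are illustrative assumptions, not the TriMedia's actual mechanism.

```c
/* Minimal sketch of interleaved ("switch-on-stall") multithreading.
 * Hypothetical model: on a stall (e.g., a cache miss), the core
 * switches to the next ready hardware context in round-robin order. */
#include <stdbool.h>
#include <stddef.h>

#define NUM_THREADS 4

typedef struct {
    bool stalled;     /* waiting on a long-latency event */
    unsigned pc;      /* program counter of this context */
} hw_context_t;

static hw_context_t ctx[NUM_THREADS];
static size_t current = 0;

/* Called each cycle; returns the context to issue from,
 * keeping the current thread while it is not stalled and
 * skipping stalled threads otherwise. */
static hw_context_t *select_context(void)
{
    for (size_t i = 0; i < NUM_THREADS; i++) {
        size_t cand = (current + i) % NUM_THREADS;
        if (!ctx[cand].stalled) {
            current = cand;
            return &ctx[cand];
        }
    }
    return NULL; /* all threads stalled: issue a pipeline bubble */
}
```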
Tile-based rendering is a promising technique for low-power 3D graphics platforms. This technique decomposes a scene into smaller regions called tiles and renders the tiles one by one. The advantage of this scheme is that a small memory integrated on the graphics accelerator can be used to store the color components and z (depth) values of one tile, so that accesses to these values are local, on-chip accesses, which consume significantly less power than off-chip accesses. Tile-based rendering, however, requires that the primitives (commonly triangles) and state changes are sorted into bins corresponding to the tiles. In this paper we determine the optimal state change operations (e.g., enable/disable z testing, create/delete a texture) that should be included for each tile. Experimental results obtained using several suitable 3D graphics workloads show that various trade-offs can be made and that better performance can usually be obtained at the cost of additional memory.
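To make the binning step concrete, here is a minimal C sketch that sorts each triangle into every tile its 2D bounding box overlaps. The screen dimensions and tile size are illustrative assumptions, and the paper's actual handling of state changes is not modeled.

```c
/* Sort triangles into per-tile bins using their 2D bounding boxes.
 * A hypothetical 640x480 screen with 32x32 tiles is assumed. */
#include <stddef.h>

#define TILE_SIZE  32
#define TILES_X    20   /* 640 / TILE_SIZE */
#define TILES_Y    15   /* 480 / TILE_SIZE */
#define MAX_BINNED 4096

typedef struct { float x[3], y[3]; } triangle_t;

typedef struct {
    const triangle_t *prims[MAX_BINNED];
    size_t count;
} tile_bin_t;

static tile_bin_t bins[TILES_Y][TILES_X];

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Append a triangle to the bin of every tile its bounding box overlaps. */
static void bin_triangle(const triangle_t *t)
{
    float xmin = t->x[0], xmax = t->x[0];
    float ymin = t->y[0], ymax = t->y[0];
    for (int i = 1; i < 3; i++) {
        if (t->x[i] < xmin) xmin = t->x[i];
        if (t->x[i] > xmax) xmax = t->x[i];
        if (t->y[i] < ymin) ymin = t->y[i];
        if (t->y[i] > ymax) ymax = t->y[i];
    }
    int tx0 = clampi((int)(xmin / TILE_SIZE), 0, TILES_X - 1);
    int tx1 = clampi((int)(xmax / TILE_SIZE), 0, TILES_X - 1);
    int ty0 = clampi((int)(ymin / TILE_SIZE), 0, TILES_Y - 1);
    int ty1 = clampi((int)(ymax / TILE_SIZE), 0, TILES_Y - 1);
    for (int ty = ty0; ty <= ty1; ty++)
        for (int tx = tx0; tx <= tx1; tx++)
            if (bins[ty][tx].count < MAX_BINNED)
                bins[ty][tx].prims[bins[ty][tx].count++] = t;
}
```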
Currently, there is much interest in wireless 3D graphics applications, in particular games. Since current 3D graphics accelerators consume too much power to be employed in mobile computing devices, several companies and universities have started to develop low-power 3D graphics accelerators. However, to the best of our knowledge, there is no publicly available benchmark suite appropriate for evaluating such devices. In this paper we present a set of 3D graphics benchmarks which can be considered typical 3D workloads of contemporary and emerging mobile devices. First, reasons why most 3D benchmarks employed for desktop computers are not suitable for mobile environments are given. After that, simulation results such as the number of triangles or fragments processed by a typical rasterization pipeline are presented. Finally, we discuss some architectural implications of the obtained results for low-power implementations.
2012 41st International Conference on Parallel Processing Workshops, 2012
Recently, several programming models have been proposed that aim to ease parallel programming. One of these programming models is StarSs. In StarSs, the programmer has to identify pieces of code that can be executed as tasks, as well as their inputs and outputs. Thereafter, the runtime system (RTS) determines the dependencies between tasks and schedules ready tasks onto worker cores. Previous work has shown, however, that the StarSs RTS may constitute a bottleneck that limits the scalability of the system, and proposed a hardware task management system called Nexus to eliminate this bottleneck. Nexus has several limitations, however. For example, the number of inputs and outputs of each task is limited to a fixed constant, and Nexus does not support double buffering. In this paper we present Nexus++, which addresses these as well as other limitations. Experimental results show that double buffering achieves a speedup of 54×/143× with/without modeling memory contention, respectively, and that Nexus++ significantly enhances the scalability of applications parallelized using StarSs.
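As a sketch of the kind of dependence tracking such a runtime performs, the C fragment below marks a newly submitted task as dependent on every earlier, unfinished task that produces one of its inputs. This is a deliberately simplified model under stated assumptions, not the StarSs or Nexus++ implementation.

```c
/* Toy dependence detection: a task depends on any earlier,
 * still-pending task whose output address matches one of its inputs.
 * Real runtimes track address ranges and use hash tables; this
 * linear scan is purely illustrative. */
#include <stdbool.h>
#include <stddef.h>

#define MAX_TASKS 256
#define MAX_DEPS  8

typedef struct {
    const void *inputs[MAX_DEPS];
    size_t n_inputs;
    const void *output;
    size_t n_unresolved;   /* producer tasks not yet finished */
    bool finished;
} task_t;

static task_t tasks[MAX_TASKS];
static size_t n_tasks = 0;

/* Records the task and returns true if it is immediately ready to run. */
static bool submit_task(task_t t)
{
    if (n_tasks >= MAX_TASKS)
        return false;              /* table full; illustrative limit */
    t.n_unresolved = 0;
    for (size_t i = 0; i < n_tasks; i++) {
        if (tasks[i].finished)
            continue;
        for (size_t j = 0; j < t.n_inputs; j++)
            if (tasks[i].output == t.inputs[j])
                t.n_unresolved++;
    }
    tasks[n_tasks++] = t;
    return t.n_unresolved == 0;
}
```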
In this paper we consider implementations of embedded 3D graphics and provide evidence indicating that 3D benchmarks employed for desktop computers are not suitable for mobile environments. Consequently, we present GraalBench, a set of 3D graphics workloads representative of contemporary and emerging mobile devices. In addition, we present detailed simulation results for a typical rasterization pipeline. The results show that the proposed benchmarks use only a part of the resources offered by current 3D graphics libraries. For instance, while each benchmark uses the texturing unit for more than 70% of the generated fragments, the alpha unit is employed for less than 13% of the fragments. The fog unit was used for 84% of the fragments by one benchmark, but the other benchmarks did not use it at all. Our experiments on the proposed suite suggest that the texturing, depth and blending units should be implemented in hardware, while, for instance, the dithering unit may be omitted from ...
Multimedia applications provide new, highly valuable services to the consumer and form, consequently, an important new workload for desktop systems. The increased computing power of the embedded processors required in baseband processing for new high-bandwidth wireless communication protocols (e.g., UMTS, CDMA-2000) can make multimedia processing possible also for mobile devices, such as cell phones. These devices ...
2009 Design, Automation & Test in Europe Conference & Exhibition, 2009
Caches often employ write-back instead of write-through, since write-back avoids unnecessary transfers for multiple writes to the same block. For several reasons, however, it is undesirable that a significant number of cache lines will be marked "dirty". Energy-efficient cache organizations, for example, often apply techniques that resize, reconfigure, or turn off (parts of) the cache. In such cache organizations, dirty lines have to be written back before the cache is reconfigured. The delay imposed by these write-backs or the required additional logic and buffers can significantly reduce the attained energy savings. A cache organization called the clean/dirty cache (CD-cache) is proposed that combines the properties of write-back and write-through. It avoids unnecessary transfers for recurring writes, while restricting the number of dirty lines to a hard limit. Detailed experimental results show that the CD-cache reduces the number of dirty lines significantly, while achieving similar or better performance. We also use the CD-cache to implement cache decay. Experimental results show that the CD-cache attains similar or higher performance than a normal decay cache, while using a significantly less complex design.
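The following C sketch captures the hybrid policy described above: writes use write-back semantics until a hard limit on dirty lines is reached, after which writes to clean lines are written through so the line stays clean. The helper functions and the global (rather than per-set) limit are hypothetical assumptions, not the paper's exact design.

```c
/* Minimal sketch of a clean/dirty cache write policy with a hard
 * limit on the number of dirty lines. Illustrative only. */
#include <stdbool.h>
#include <stddef.h>

#define NUM_LINES   512
#define DIRTY_LIMIT 64

typedef struct {
    bool valid;
    bool dirty;
} cache_line_t;

static cache_line_t lines[NUM_LINES];
static size_t dirty_count = 0;

/* Hypothetical helpers standing in for the datapath. */
extern void write_to_cache(size_t line, size_t offset, unsigned value);
extern void write_to_memory(size_t line, size_t offset, unsigned value);

static void handle_write_hit(size_t line, size_t offset, unsigned value)
{
    write_to_cache(line, offset, value);
    if (lines[line].dirty)
        return;                     /* recurring write: no extra traffic */
    if (dirty_count < DIRTY_LIMIT) {
        lines[line].dirty = true;   /* write-back behavior */
        dirty_count++;
    } else {
        /* dirty limit reached: write through, line remains clean */
        write_to_memory(line, offset, value);
    }
}
```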
This topic deals with architecture design and compilation for high performance systems. The areas of interest range from microprocessors to large-scale parallel machines; from general-purpose platforms to specialized hardware (e.g., graphic coprocessors, low-power embedded systems); and from hardware design to compiler technology. On the compilation side, topics of interest include programmer productivity issues, concurrent and/or sequential language aspects, program analysis, transformation, automatic discovery and/or management of parallelism at all levels, and the interaction between the compiler and the rest of the system. On the architecture side, the scope spans system architectures, processor micro-architecture, memory hierarchy, and multi-threading, and the impact of emerging trends.
Proceedings of the twelfth annual ACM symposium on Principles of distributed computing - PODC '93, 1993
In this paper we study the practical viability of the BSP model of parallel computation as proposed by Valiant. This model is intended for simulating the often-considered PRAM model on more realistic parallel computers with a fixed interconnection network. One of the main attributes of the BSP model is randomized routing. From experimentation on an existing parallel architecture, analytic models are derived which characterize the efficiency of this routing scheme. This characterization leads to the identification of the bottlenecks involved in building a parallel architecture in which the BSP model can efficiently be embedded.
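For reference, the standard BSP cost model (a textbook property of the model, not a result of this paper) charges each superstep for local work, communication, and synchronization, where w_s is the maximum local work in superstep s, h_s the maximum number of words sent or received by any processor, g the gap (inverse bandwidth) parameter, and l the synchronization latency:

```latex
T_{\text{superstep}} = w_s + g \cdot h_s + l,
\qquad
T_{\text{program}} = \sum_{s=1}^{S} \left( w_s + g \cdot h_s + l \right)
```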
This paper experimentally validates performance-related issues for parallel computation models on several parallel platforms (a MasPar MP-1 with 1024 processors, a 64-node GCel and a CM-5 of 64 ...
Current media ISA extensions such as Sun's VIS consist of SIMD-like instructions that operate on short vector registers. In order to exploit more parallelism in a superscalar processor provided with such instructions, the issue width has to be increased. In the Complex Streamed Instruction (CSI) set, exploiting more parallelism does not involve issuing more instructions. In this paper we study how the performance of superscalar processors extended with CSI or VIS scales with the amount of parallel execution hardware. Results show that the performance of the CSI-enhanced processor scales very well. For example, increasing the datapath width of the CSI execution unit from 16 to 32 bytes improves the kernel-level performance by a factor of 1.56 on average. The VIS-enhanced machine is unable to utilize large amounts of parallel execution hardware efficiently. Due to the huge number of instructions that need to be executed, the decode-issue logic constitutes a bottleneck.
Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques, 2001
An architectural paradigm designed to accelerate streaming operations on mixed-width data is presented and evaluated. The described Complex Streamed Instruction (CSI) set contains instructions that process data streams of arbitrary length. The number of bits or elements that will be processed in parallel is, therefore, not visible to the programmer, so no recompilation is needed in order to benefit from a wider datapath. CSI also eliminates many overhead instructions (such as instructions needed for data alignment and reorganization) often needed in applications utilizing media ISA extensions such as MMX and VIS, by replacing them with a hardware mechanism. Simulation results using several multimedia kernels demonstrate that CSI provides a performance improvement of up to a factor of 9.9 (4.0 on average) compared to Sun's VIS extension. For complete applications, the performance gain is 9% to 36%, with an average of 20%.
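To illustrate the point about arbitrary-length streams, consider the saturating byte addition below, a common multimedia kernel. A CSI-style unit could, in principle, consume this whole loop as one stream operation whose degree of hardware parallelism is invisible to the programmer, whereas a fixed-width SIMD extension exposes the vector width (and any alignment requirements) in the instruction set. The kernel is a generic example, not taken from the paper.

```c
#include <stddef.h>
#include <stdint.h>

/* Saturating 8-bit addition over streams of arbitrary length n.
 * Expressed sequentially; how many elements are processed per cycle
 * is an implementation property, not part of this code's contract. */
static void add_sat_u8(uint8_t *dst, const uint8_t *a,
                       const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        dst[i] = sum > 255 ? 255 : (uint8_t)sum;
    }
}
```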
Proceedings 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. IPPS/SPDP 1999, 1999
The Paderborn University BSP (PUB) library is a parallel C library based on the BSP model. The basic library supports buffered and unbuffered asynchronous communication between any pair of processors, and a mechanism for synchronizing the processors in a barrier style. In addition, it provides routines for collective communication on arbitrary subsets of processors, partition operations, and a zero-cost synchronization mechanism. Furthermore, some techniques used in its implementation deviate significantly from the techniques used in other BSP libraries.
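A BSP program, whether written with PUB or another BSP library, is organized as a sequence of supersteps: local computation, communication, then a barrier. The skeleton below uses hypothetical function names (bsp_pid, bsp_send, bsp_sync) purely to show this structure; it is not PUB's actual interface.

```c
/* Generic BSP superstep skeleton. All function names are hypothetical
 * placeholders, not the PUB library's real API. */
#include <stddef.h>

extern int  bsp_pid(void);                    /* this processor's id */
extern void bsp_send(int dst, const void *buf, size_t nbytes);
extern void bsp_sync(void);                   /* barrier: ends a superstep */

static void superstep_example(const double *local, size_t n)
{
    /* 1. Local computation: compute a partial sum. */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += local[i];

    /* 2. Communication: send the partial sum to processor 0. */
    if (bsp_pid() != 0)
        bsp_send(0, &sum, sizeof sum);

    /* 3. Barrier synchronization ends the superstep; messages sent
     *    here become visible in the next superstep. */
    bsp_sync();
}
```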
When peak performance is unnecessary, Dynamic Voltage Scaling (DVS) can be used to reduce the dynamic power consumption of embedded multiprocessors. In future technologies, however, static power consumption is expected to increase significantly. Then it will be more effective to limit the number of employed processors, and use a combination of DVS and processor shutdown. Scheduling heuristics are presented that determine the best trade-off between these three techniques: DVS, processor shutdown, and finding the optimal number of processors. Experimental results show that our approach reduces the total energy consumption by up to 25% for tight deadlines and by up to 57% for loose deadlines compared to DVS. We also compare the energy consumed by our scheduling algorithm to two lower bounds, and show that our best approach leaves little room for improvement.
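The reason DVS reduces dynamic energy can be stated with the usual first-order CMOS model (a textbook approximation, not a result of this paper): dynamic power scales quadratically with supply voltage and linearly with frequency, so, assuming frequency scales roughly linearly with voltage, running slower at a lower voltage reduces energy even though execution takes longer:

```latex
P_{\text{dyn}} \approx \alpha\, C\, V_{dd}^{2}\, f,
\qquad
E_{\text{dyn}} = P_{\text{dyn}} \cdot t \;\propto\; V_{dd}^{2}
\quad (\text{since } t \propto 1/f \text{ and } f \propto V_{dd}).
```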
Proceedings of the 2009 Workshop on Embedded Systems Education - WESS '09, 2009
The three technical universities in the Netherlands (Eindhoven University of Technology, Delft University of Technology and University of Twente), abbreviated as 3TU, started a joint master's program in Embedded Systems in 2006. Embedded Systems is an interdisciplinary area of Electrical Engineering, Computer Science, Mechanical Engineering and Applied Mathematics. This paper discusses the background of the program and presents its curriculum at the three sites.
Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures - SPAA '98, 1998
Lower and upper bounds for finding a minimum spanning tree (MST) in a weighted undirected graph on the BSP model are presented. We provide the first non-trivial lower bounds on the communication volume required to solve the MST problem. Let p denote the number of processors, n the number of nodes of the input graph, and m the number of edges of the input graph. We show that in the worst case, a total of Ω(β·min(m, pn)) bits need to be communicated in order to solve the MST problem, where β is the number of bits required to represent a single edge weight. This implies that if each message communicates at most β bits, any BSP algorithm for finding an MST requires communication time Ω(g·min(m/p, n)), where g is the gap parameter of the BSP model. In addition, we present two algorithms with communication requirements that match our lower bound in different situations. Both algorithms perform linear work for appropriate values of n, m and p, and use a number of supersteps that is bounded for arbitrarily large input sizes. The first algorithm is simple but can employ at most m/n processors efficiently. Hence, it should be applied in situations where the input graph is relatively dense. The second algorithm is a randomized algorithm that performs linear work with high probability, provided that m ≥ n log p.
Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, 2006
It is expected that (single chip) multiprocessors will increasingly be deployed to realize high-performance embedded systems. Because in current technologies the dynamic power consumption dominates the static power dissipation, an effective technique to reduce energy consumption is to employ as many processors as possible in order to finish the tasks as early as possible, and to use the remaining time before the deadline (the slack) to apply voltage scaling. We refer to this heuristic as Schedule and Stretch (S&S). However, since the static power consumption is expected to become more significant, this approach will no longer be efficient when leakage current is taken into account. In this paper, we first show for which combinations of leakage current, supply voltage, and clock frequency the static power consumption dominates the dynamic power dissipation. These results imply that, at a certain point, it is no longer advantageous from an energy perspective to employ as many processors as possible. Thereafter, a heuristic is presented to schedule the tasks on a number of processors that minimizes the total energy consumption. Experimental results obtained using a public task graph benchmark set show that our leakage-aware scheduling algorithm reduces the total energy consumption by up to 24% for tight deadlines (1.5× the critical path length) and by up to 67% for loose deadlines (8× the critical path length) compared to S&S.
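Under the same first-order model as above, the break-even point this paper investigates can be written as the condition under which static power exceeds dynamic power. This formulation is an illustrative restatement of that trade-off, not the paper's exact model:

```latex
P_{\text{static}} \approx V_{dd}\, I_{\text{leak}},
\qquad
P_{\text{static}} > P_{\text{dyn}}
\;\Longleftrightarrow\;
V_{dd}\, I_{\text{leak}} > \alpha\, C\, V_{dd}^{2}\, f
\;\Longleftrightarrow\;
I_{\text{leak}} > \alpha\, C\, V_{dd}\, f .
```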
2004 International Symposium on System-on-Chip, 2004. Proceedings., 2004
Tile-based rendering appears to be a promising technique for low-cost, low-power 3D graphics platforms. This technique decomposes a scene into tiles and renders the tiles independently. It requires, however, that the primitives are sorted into bins that correspond to the tiles, which can be very time-consuming and may require a lot of memory bandwidth. The most often used test to determine if a primitive and a tile overlap is the bounding box test. This test checks if the 2D axis-aligned bounding box of the primitive overlaps the tile and comprises four comparisons in the worst case. In this paper we show that the efficiency of the bounding box test can be improved significantly by adaptively varying the order in which the comparisons are performed, depending on the position of the current tile. Experimental results obtained using several 3D graphics workloads show that the dynamic bounding box test reduces the number of comparisons per primitive by 26% on average compared to the best-performing static version, in which the order of the comparisons is fixed.
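The idea can be sketched as follows: the static test performs the four box/tile comparisons in a fixed order, while an adaptive variant performs the comparison most likely to reject first for the current tile's position, so non-overlapping primitives are rejected after fewer comparisons. The ordering heuristic below (test against the screen half the tile lies in first) is an illustrative assumption, not necessarily the paper's exact scheme.

```c
#include <stdbool.h>

typedef struct { float xmin, xmax, ymin, ymax; } box_t;

/* Static version: fixed comparison order (up to four comparisons). */
static bool overlap_static(const box_t *bb, const box_t *tile)
{
    if (bb->xmax < tile->xmin) return false;
    if (bb->xmin > tile->xmax) return false;
    if (bb->ymax < tile->ymin) return false;
    if (bb->ymin > tile->ymax) return false;
    return true;
}

/* Adaptive version: for a tile in the left half of the screen, most
 * primitives lie to its right, so the comparison "box starts right of
 * the tile" is tried first (and analogously in y). */
static bool overlap_adaptive(const box_t *bb, const box_t *tile,
                             float screen_w, float screen_h)
{
    bool tile_left = tile->xmin < screen_w / 2;
    bool tile_low  = tile->ymin < screen_h / 2;

    if (tile_left) { if (bb->xmin > tile->xmax) return false;
                     if (bb->xmax < tile->xmin) return false; }
    else           { if (bb->xmax < tile->xmin) return false;
                     if (bb->xmin > tile->xmax) return false; }
    if (tile_low)  { if (bb->ymin > tile->ymax) return false;
                     if (bb->ymax < tile->ymin) return false; }
    else           { if (bb->ymax < tile->ymin) return false;
                     if (bb->ymin > tile->ymax) return false; }
    return true;
}
```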
Because mobile phones are omnipresent and equipped with displays, they are attractive platforms for rendering 3D images. However, because they are powered by batteries, a graphics accelerator for mobile phones should dissipate as little energy as possible. Since external memory accesses consume a significant amount of power, techniques that reduce the amount of external data traffic also reduce the power consumption. A technique that looks promising is tile-based rendering. This technique decomposes a scene into tiles and renders the tiles one by one. This allows the color components and z values of one tile to be stored in small, on-chip buffers, so that only the pixels visible in the final scene need to be stored in the external frame buffer. However, in a tile-based renderer each triangle may need to be sent to the graphics accelerator more than once, since it might overlap more than one tile. In this paper we measure the total amount of external data traffic produced by conventional and tile-based renderers using several representative OpenGL benchmark scenes. The results show that employing a tile size of 32×32 pixels generally yields the best trade-off between the amount of on-chip memory and the amount of external data traffic. In addition, the results show that overall, a tile-based architecture reduces the total amount of external data traffic by a factor of 1.96 compared to a traditional architecture.