2012 41st International Conference on Parallel Processing Workshops, 2012
ABSTRACT Recently, several programming models have been proposed that try to ease parallel programming. One of these programming models is StarSs. In StarSs, the programmer has to identify pieces of code that can be executed as tasks, as well as their inputs and outputs. Thereafter, the runtime system (RTS) determines the dependencies between tasks and schedules ready tasks onto worker cores. Previous work has shown, however, that the StarSs RTS may constitute a bottleneck that limits the scalability of the system and proposed a hardware task management system called Nexus to eliminate this bottleneck. Nexus has several limitations, however. For example, the number of inputs and outputs of each task is limited to a fixed constant and Nexus does not support double buffering. In this paper we present Nexus++, which addresses these as well as other limitations. Experimental results show that double buffering achieves a speedup of 54×/143× with/without modeling memory contention, respectively, and that Nexus++ significantly enhances the scalability of applications parallelized using StarSs.
In this paper we consider implementations of embedded 3D graphics and provide evidence indicating that 3D benchmarks employed for desktop computers are not suitable for mobile environments. Consequently, we present GraalBench, a set of 3D graphics workloads representative of contemporary and emerging mobile devices. In addition, we present detailed simulation results for a typical rasterization pipeline. The results show that the proposed benchmarks use only a part of the resources offered by current 3D graphics libraries. For instance, while each benchmark uses the texturing unit for more than 70% of the generated fragments, the alpha unit is employed for less than 13% of the fragments. The fog unit was used for 84% of the fragments by one benchmark, but the other benchmarks did not use it at all. Our experiments on the proposed suite suggest that the texturing, depth and blending units should be implemented in hardware, while, for instance, the dithering unit may be omitted from ...
Multimedia applications provide new, highly valuable services to the consumer and consequently form an important new workload for desktop systems. The increased computing power of the embedded processors required in baseband processing for new high-bandwidth wireless communication protocols (e.g., UMTS, CDMA-2000) can also make multimedia processing possible for mobile devices, such as cell phones. These devices
This topic deals with architecture design and compilation for high performance systems. The areas of interest range from microprocessors to large-scale parallel machines; from general-purpose platforms to specialized hardware (e.g., graphic coprocessors, low-power embedded systems); and from hardware design to compiler technology. On the compilation side, topics of interest include programmer productivity issues, concurrent and/or sequential language aspects, program analysis, transformation, automatic discovery and/or management of parallelism at all levels, and the interaction between the compiler and the rest of the system. On the architecture side, the scope spans system architectures, processor micro-architecture, memory hierarchy, and multi-threading, and the impact of emerging trends.
Proceedings of the twelfth annual ACM symposium on Principles of distributed computing - PODC '93, 1993
In this paper we study the practical viability of the BSP model of parallel computation as proposed by Valiant. This model is intended for simulating the often considered PRAM model on more realistic parallel computers with a fixed interconnection network. One of the main attributes of the BSP model is randomized routing. From experimentation on an existing parallel architecture, analytic models are derived which characterize the efficiency of this routing scheme. This characterization leads to the identification of the bottlenecks involved in building a parallel architecture in which the BSP model can efficiently be embedded.
This paper experimentally validates performance-related issues for parallel computation models on several parallel platforms (a MasPar MP-1 with 1024 processors, a 64-node GCel and a CM-5 of 64 ...
Papers by Ben Juurlink