The pipeline pattern for parallel programs is utilized in a wide array of scientific applications designed for execution on hybrid CPU-GPU architectures. However, there is a dearth of tools and libraries to support the implementation of software pipelines for hybrid architectures. We present the Hybrid Pipeline Framework (HyPi), which is intended to fill this gap. HyPi provides high-level abstractions in C++ for the implementation of pipelines on hybrid CPU-GPU architectures. It is a generic framework intended to support a wide range of applications. The complexities characteristic of such implementations, e.g., partitioning of input/output data structures, asynchronous memory transfer, and communication between the CPU and the GPU, are handled by the framework and are therefore hidden from the developer. HyPi exposes certain degrees of freedom that can be tuned to optimize the performance of an application based on application-specific requirements. We present a detailed account of the frame...
The advent of homogeneous many-core processors has been widely recognized as a major shift in the architecture of commodity computer systems. It has influenced the design of operating systems and programming models and has given a boost to high-level parallelization libraries. Future commodity systems will combine homogeneous many-core processors with graphics processing units and other special-purpose accelerators into a new class of architecture called hybrid systems. These systems provide improved hardware support for the specialized computational tasks of different applications. In contrast to programming for homogeneous architectures with existing tools, programming for such hybrid architectures is still in its infancy. Today, software development for such systems has to explicitly address vendor-specific and version-specific characteristics of the particular devices. As a result, refactoring of existing code becomes a tedious and error-prone task. We present our approach for automa...
InstantLab is our online experimentation platform that hosts exercises and experiments for operating systems and software engineering courses at HPI. In this paper, we discuss challenges and solutions for scaling InstantLab to provide experiment infrastructure for thousands of users in MOOC scenarios. We present InstantLab's XCloud architecture, which combines private cloud resources at HPI with public cloud infrastructures via ``cloud bursting''. This way, we can provide specialized experiments using VM co-location and heterogeneous compute devices (such as GPGPUs) that are not possible on public cloud infrastructures. Additionally, we discuss challenges and solutions for embedding special hardware, providing experiment feedback, and managing access control. We propose trust-based access control as a way to handle resource management in MOOC settings.
Massive open online courses enjoy a surge of popularity: numerous platforms such as Coursera, Udacity, edX, and many more offer a variety of high-quality courses to learners worldwide. The Hasso Plattner Institute operates its own platform, openHPI, where selected courses are available. These platforms are well suited for the presentation of video and reading material and offer opportunities for interaction in web forums. Assignments, however, are usually limited to simple quizzes or basic question-and-answer tasks, although a very important component of teaching is practical assignments that allow students to gather hands-on experience. With InstantLab, we have created a self-service web platform for interactive software experiments that is used in our curriculum: experiments are provided as VM images that are executed on our in-house private cloud and accessed over a remote-desktop connection embedded in the web browser, requiring no additional software on the users' computers. In ...
The advent of hybrid CPU-GPU architectures has significantly increased the number of raw FLOP/s. However, it is not obvious how these can be put to use when processing Big Data. In this paper, we present an approach for designing Big Data simulations for hybrid architectures, which is based on a hierarchical application of design patterns in parallel programming. We provide a detailed account of the step-by-step approach that results in efficient utilization of processing and memory resources while simultaneously improving developer productivity. Finally, we present our vision of automated tools that will further simplify the development of efficient parallel implementations for Big Data processing on hybrid architectures.
In recent years, the multi-core era has started to affect embedded systems, changing some of the rules: while on a single processor, Earliest Deadline First has been proven to be the best algorithm to guarantee the correct execution of prioritized tasks, Dhall et al. have shown that this approach is no longer feasible for multi-processor systems. A variety of new scheduling algorithms has been introduced, competing to be the answer to the challenges that multi-processor real-time scheduling imposes. In this paper, we study the solution space of prioritization-based task scheduling algorithms using genetic programming and state-of-the-art accelerator technologies. We demonstrate that this approach is indeed feasible for generating a wide variety of capable scheduling algorithms with pre-selected characteristics, the best of which outperform many existing approaches. For a static, predefined set of tasks, overfitting even allows us to produce optimal algorithms.
Within the bachelor project "DNA", the Distributed Control Lab (DCL), which facilitates the use of real-time control experiments, was combined with the ASG-C5 component of the Adaptive Services Grid (ASG) for dynamic placement of services. In the process, the ASG-C5 component was reimplemented. In addition, a scheduling mechanism and a component that augments the WSDL of a service with state management methods were created. The new system is able to interoperate with execution environments for various programming languages. An execution environment for .NET web services was realized. Furthermore, the data management was optimized for heterogeneous usage. In addition to porting the existing DCL experiments, tools for the creation of new services were added. The existing management interface was also reimplemented and extended with algorithms for the analysis of measurement data. The user interface was extended and revised in order to provide better usability for the experiments in the...
With the ongoing internationalization of virtual laboratories, the integration aspect becomes more important. The meanwhile commonly accepted 'glue' for such legacy systems are service-oriented architectures, based on standardized and accepted Web service standards. We present our concept of the 'experiment as a service', where the idea of service-based architectures is applied to virtual remote laboratories. In our laboratory middleware, experiments are represented as stateful service implementations and jobs as logical service instances of these implementations. We discuss performance, reliability, security, and monitoring issues in this approach, and show how the resulting infrastructure, the Distributed Control Lab, is applied in the European VetTrend project.
2014 15th International Conference on Parallel and Distributed Computing, Applications and Technologies, 2014
Blind Signal Separation is an algorithmic problem class that deals with the restoration of original signal data from a signal mixture. Implementations such as FastICA are optimized for parallelization on CPU or first-generation GPU hardware. With the advent of modern, compute-centered GPU hardware with powerful features such as dynamic parallelism support, these solutions no longer leverage the available hardware performance in the best possible way. We present an optimized implementation of the FastICA algorithm that is specifically tailored for next-generation GPU architectures such as Nvidia Kepler. Our proposal achieves a two-digit speedup factor in the prototype implementation compared to a multithreaded CPU implementation. Our custom matrix multiplication kernels, tailored specifically for this use case, contribute to the speedup by delivering better performance than the state-of-the-art CUBLAS library.
Scale-invariant feature transform (SIFT) is an algorithm to identify and track objects in a series of digital images. The algorithm can handle objects that change their location, scale, rotation, or illumination in subsequent images. This makes SIFT a promising choice for feature detection and tracking in computer vision applications. The only problem is that SIFT has a high computational overhead, which often forces system designers to choose a faster heuristic feature detection mechanism instead. We contribute a thorough performance analysis of various optimizations of SIFT tailored for the Scala high-level programming language. Our implementation is cache- and non-uniform memory access (NUMA)-aware and therefore achieves a higher speedup factor than related work. Additionally, we demonstrate how scalability can be achieved for modern server and cloud computer systems using the actor programming model provided by Scala and Akka.
2013 International Conference on Parallel and Distributed Systems, 2013
Modern server and desktop systems combine multiple computational cores and accelerator devices into a hybrid architecture. GPUs, as one class of such devices, provide dedicated processing power and memory capacities for data-parallel computation of 2D and 3D graphics. Although these cards have demonstrated their applicability in a variety of areas, they are almost exclusively used by special-purpose software. If such software is not running, the accelerator resources of the hybrid system remain unused. In this paper, we present an operating system extension that allows leveraging GPU accelerator memory for operating system purposes. Our approach utilizes graphics card memory as a cache for virtual memory pages, which can improve overall system responsiveness, especially under heavy load. Our prototypical implementation for Windows proves the potential of such an approach, but also identifies significant preconditions for widespread adoption in desktop systems.
Desktop software developers' interest in graphics hardware is increasing as a result of modern graphics cards' ability to act as compute devices that augment the main processor. This capability means parallel computing is no longer a task dedicated to the CPU. A trend toward heterogeneous computing combines the main processor and the graphics processing unit (GPU). This overview of how to utilize GPU compute power in the best possible way includes explanations of the primary GPU hardware concepts and the corresponding programming principles. On this foundation, the authors discuss a collection of commonly agreed-upon critical performance optimization strategies that are the key factor for achieving true scalability and performance improvements when moving parts of an application from a multithreaded to a GPU-enhanced version.