... Page: 12. A Hybrid Randomized Initialization Protocol for TDMA in Single-Hop Wireless Networks. Aleksandar Micic, Ivan Stojmenovic. Page: 13. A Limited-Global Fault Information Model for Dynamic Routing in 2-D Meshes. Zhen Jiang, Jie Wu. Page: 14. ...
ABSTRACT Energy is a scarce resource in Wireless Sensor Networks (WSN). Some studies show that more than 70% of energy is consumed in data transmission in WSN. Since most of the time, the sensed information is redundant due to geographically collocated sensors, ...
Performance and energy are the two most important objectives for optimization on heterogeneous high performance computing platforms. This work studies a mathematical problem motivated by the bi‐objective optimization of data‐parallel applications on such platforms for performance and energy. First, we formulate the problem and present an exact algorithm of polynomial complexity solving the problem where all the application profiles of objective type one are continuous and strictly increasing, and all the application profiles of objective type two are linear increasing. We then apply the algorithm to develop solutions for two related optimization problems of parallel applications on heterogeneous hybrid platforms, one for performance and dynamic energy and the other for performance and total energy. Our proposed solution methods are then employed to solve the two bi‐objective optimization problems for two data‐parallel applications, matrix multiplication and gene sequencing, on a hybrid platform employing five heterogeneous processors, namely, two different Intel multicore CPUs, an Nvidia K40c GPU, an Nvidia P100 PCIe GPU, and an Intel Xeon Phi.
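The bi-objective setting described above can be illustrated with a small brute-force sketch. This is not the exact polynomial algorithm from the abstract: it simply enumerates every split of a workload between two hypothetical processors and keeps the Pareto-optimal (time, energy) points. The function names and the profile coefficients in `time_fns` and `energy_fns` are assumptions made up for illustration.

```python
def enumerate_splits(n, time_fns, energy_fns):
    """List (time, energy, split) for every split of n work units over two processors."""
    points = []
    for x in range(n + 1):
        parts = (x, n - x)
        # Processors run in parallel, so execution time is the slowest one.
        t = max(f(w) for f, w in zip(time_fns, parts))
        # Energy is additive over processors.
        e = sum(g(w) for g, w in zip(energy_fns, parts))
        points.append((t, e, parts))
    return points


def pareto_front(points):
    """Keep the points that no other point beats in both time and energy."""
    front = []
    best_energy = float("inf")
    # After sorting by time, a point is non-dominated only if it
    # strictly lowers the best energy seen so far.
    for t, e, parts in sorted(points):
        if e < best_energy:
            front.append((t, e, parts))
            best_energy = e
    return front


# Hypothetical linear profiles: processor 0 is fast but power-hungry,
# processor 1 is slow but frugal.
time_fns = (lambda w: 1.0 * w, lambda w: 2.0 * w)
energy_fns = (lambda w: 3.0 * w, lambda w: 1.0 * w)

front = pareto_front(enumerate_splits(6, time_fns, energy_fns))
for t, e, parts in front:
    print(f"split={parts} time={t} energy={e}")
```

Each printed split trades execution time against energy; no single split is best in both objectives, which is why the problem yields a set of solutions rather than one.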
The communication layer of modern HPC platforms is becoming increasingly heterogeneous and hierarchical. As a result, even on platforms with homogeneous processors, the communication cost of many parallel applications will vary significantly depending on the mapping of their processes to the processors of the platform. The optimal mapping, which minimizes the communication cost of the application, will strongly depend on the network structure and performance as well as the logical communication flow of the application. In our previous work, we proposed a general approach and two approximate heuristic algorithms aimed at minimizing the communication cost of data-parallel applications that have a two-dimensional symmetric communication pattern on heterogeneous hierarchical networks, and tested these algorithms in the context of the parallel matrix multiplication application. In this paper, we develop a new algorithm that is built on top of one of these heuristic approaches in the context of a real-life application, MPDATA, which is one of the major parts of the EULAG geophysical model. We carefully study the communication flow of MPDATA and discover that even under the assumption of a perfectly homogeneous communication network, the logical communication links of this application will have different bandwidths, which makes the optimization of its communication cost particularly challenging. We propose a new algorithm that is based on the cost functions of one of our general heuristic algorithms and apply it to optimization of the communication cost of MPDATA, which has an asymmetric heterogeneous communication pattern. We also present experimental results demonstrating performance gains due to this optimization.
Load balancing is a widely accepted technique for performance optimization of scientific applications on parallel architectures. Indeed, balanced applications do not waste processor cycles waiting at points of synchronization and data exchange, thereby maximizing the utilization of processors. In this paper, we challenge the universality of the load-balancing approach to optimization of the performance of parallel applications. First, we formulate conditions that should be satisfied by the performance profile of an application in order for the application to achieve its best performance via load balancing. Then we use a real-life scientific application, the EULAG MPDATA kernel, to demonstrate that its performance profile on a modern parallel architecture, Intel Xeon Phi, significantly deviates from these conditions. Based on this observation, we propose a method of performance optimization of scientific applications through load imbalancing. In the case of a data-parallel application, the method uses functional performance models of the application to find a partitioning that minimizes its computation time but does not necessarily balance the load of processors. We apply this method to optimization of MPDATA on Intel Xeon Phi. Experimental results demonstrate that the performance of this carefully optimized load-balanced application can be further improved by 15 percent using the proposed load-imbalancing technique.
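A toy sketch can show why a load-balanced partition is not always the fastest one. It is not the method of the abstract: it replaces functional performance models with two made-up lookup tables (`TIME_A`, `TIME_B`, execution time per number of work units) and uses exhaustive search over two processors. The dip in `TIME_A` at six units stands in for the kind of non-smooth performance profile the abstract mentions.

```python
# Hypothetical execution times (arbitrary units) for 0..8 work units.
# TIME_A is deliberately non-monotonic: 6 units run faster than 4 or 5.
TIME_A = [0, 1, 2, 3, 7, 8, 6, 7, 8]
TIME_B = [0, 1, 2, 6, 7, 8, 9, 10, 11]


def makespan(split, tables):
    """Parallel execution time of a split: the slowest processor decides."""
    return max(table[w] for table, w in zip(tables, split))


def all_splits(n):
    """Every way to divide n work units between two processors."""
    return [(x, n - x) for x in range(n + 1)]


def most_balanced_partition(n, tables):
    """Split whose two processors finish as close together as possible."""
    return min(all_splits(n),
               key=lambda s: abs(tables[0][s[0]] - tables[1][s[1]]))


def fastest_partition(n, tables):
    """Split with the smallest makespan, balanced or not."""
    return min(all_splits(n), key=lambda s: makespan(s, tables))


tables = (TIME_A, TIME_B)
balanced = most_balanced_partition(8, tables)
fastest = fastest_partition(8, tables)
print("balanced:", balanced, "time", makespan(balanced, tables))
print("fastest: ", fastest, "time", makespan(fastest, tables))
```

With these tables the perfectly balanced split (4, 4) takes 7 time units, while the imbalanced split (6, 2) finishes in 6, even though the second processor sits idle for part of the run.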
Energy is one of the most important objectives for optimization on modern heterogeneous high performance computing (HPC) platforms. The tight integration of multicore CPUs with accelerators in these platforms presents several challenges to the optimization of multithreaded data-parallel applications for dynamic energy.
In this paper, we study the problem of partitioning a matrix over a small number of interconnected heterogeneous processors. This problem is crucial for data parallel dense linear algebra and other applications with similar communication patterns on modern hybrid servers, integrating several heterogeneous compute devices such as CPUs, GPUs and other accelerators. The objective is to balance the load of the heterogeneous devices while minimising the communication cost. While the problem has been solved for the case of two processors, it is still open for three and more processors. The state-of-the-art solution for the case of three processors uses a communication cost function, which does not accurately account for the total amount of data moved between processors and therefore leaves the question of its global optimality open. In this work, we propose a cost function, which accurately represents the total amount of data moved between processors. Then, we formulate and solve the problem of optimal partitioning of a square computational domain, using this accurate communication cost function. Finally, we propose and implement an original experimental methodology for accurate measurement of the communication time of parallel applications on hybrid heterogeneous servers, integrating multi-core CPUs and various accelerators. We apply this methodology to experimental validation of our mathematical result.
eLearning systems provide online quiz tools, which can be used for assessments, providing instantaneous feedback for both students and lecturers. Moodle, a popular open-source eLearning platform, supports quizzes with various types of questions, such as multiple choice, calculated answer, essay, etc. For courses in Computer Science, which often involve programming, the functionality of traditional quizzes is insufficient, as programming cannot be mapped to these existing quizzes. Automated execution and verification of code is already available in the context of computer programming contests. Integration of these existing techniques into eLearning systems provides the required functionality for Computer Science courses. In this work, we present CodeRunner, an extension to Moodle that allows students to submit code as a solution to an assigned question. This code is compiled, executed and the output is compared to the solution provided by the teacher. Feedback is provided to the student and they are given the opportunity to modify their solution and resubmit. This immediate feedback and the student's desire to get a green checkmark encourage students to learn through an iterative process until they achieve the correct solution. The automated system lifts the burden of correcting student submissions from course demonstrators and provides uniformity and equality in the final grade. CodeRunner was originally developed and introduced in the University of Canterbury, New Zealand. The main limitation of the original plugin was security. The execution of untrusted code, such as student submissions, on a production server which is also hosting websites, such as the eLearning platform itself, is a major security vulnerability. Fortunately, this vulnerability can be closed by isolating the code execution with the help of virtualization and cloud computing. We solved this problem by separating the Moodle quiz plugin from the code execution. 
Student source code is sent from the production Moodle server to a remote virtual server for execution and the output is then returned for grading. We introduced automatic code assessment at the UCD School of Computer Science and Informatics in 2014, for approx. 500 students in 4 courses. CodeRunner supports various programming languages: C, Python, Java and we extended this list with Bash Shell and Scheme. We conducted a survey of students, teachers and demonstrators on the automated code assessment. We found that students like to interact with the system, and were motivated to work at the problem until they got it right. Demonstrators and teachers liked the system and found it gave them more time to teach coding techniques. More time is required to set the questions when compared to a conventional lab or assignment; however, for medium to large sized classes the time saved in correcting outweighs this. In general, this system is unsuitable for teaching software engineering incorporating multiple files, complex structures and tools, and it is unable to automate the correcting of individual project-based assignments. However, it is ideal for introductory programming courses and simple problem solving exercises.
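The execute-and-compare step described above can be sketched in a few lines. This is not CodeRunner's implementation or API: `grade` is a hypothetical helper that runs a Python submission in a local child process, which provides none of the isolation the abstract calls for. In the system described, this step would run on a separate virtual server.

```python
import subprocess
import sys


def grade(source, expected_output, timeout_s=5.0):
    """Run a submission and compare its standard output to the expected text.

    Returns (passed, feedback). Assumption: a local child process stands in
    for the remote execution server, so there is no sandboxing here.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    if result.returncode != 0:
        # Report the error text so the student can fix and resubmit.
        return False, result.stderr.strip()
    actual = result.stdout.strip()
    expected = expected_output.strip()
    if actual == expected:
        return True, "correct"
    return False, f"expected {expected!r}, got {actual!r}"


passed, feedback = grade("print(2 + 3)", "5")
print(passed, feedback)
```

Returning feedback alongside the pass/fail flag mirrors the iterative loop in the abstract: the student reads the message, edits the solution, and submits again.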
Parallel matrix-matrix multiplication (PMM) of dense matrices is a foundational kernel of parallel linear algebra libraries in the high performance computing (HPC) domain. The problem of finding the optimal shape of matrices for efficient execution of PMM on heterogeneous platforms has an engrossing history comprising two distinct threads. The first thread focused purely on rectangular partitions, whereas the second thread relaxed the rectangular partition constraint to allow non-rectangular partitions. The research works in the second thread, however, are entirely theoretical. There is no software implementation that would facilitate experimental studies of the practical performance and optimality of the proposed partition shapes. We address this gap in this work. We propose an implementation of PMM based on non-rectangular partitions called SummaGen. To study its efficacy, we compare the performances of PMM for four partition shapes proven optimal for the three-processor case where the speeds of the processors are represented by positive real numbers. We conduct the experiments on a hybrid heterogeneous multi-accelerator NUMA node comprising three heterogeneous devices: a dual-socket Intel Haswell multicore CPU, an Nvidia K40 GPU, and an Intel Xeon Phi 3120P. We show that the four shapes exhibit equal performances (with an average percentage difference of 8%) for a range of problem sizes where the speeds are constant, confirming the optimality of these shapes in practice. We demonstrate further that the four shapes exhibit equal dynamic energy consumptions for this case. We also present a study of performances of PMM for the same partition shapes for a matrix decomposition using a load-imbalancing data partitioning algorithm employing functional performance models (FPMs). The peak and average performances of the implementation are 80% and 70% of the theoretical peak floating-point performance of the machine.
Several millions of execution flows will be executed in ultrascale computing systems (UCS), and the task for the programmer to understand their coherency and for the runtime to coordinate them is unfathomable. Moreover, given the large scale of UCS and its impact on reliability, the current static point of view is no longer sufficient. A runtime cannot consider restarting an application because of the failure of a single node, as statistically several nodes will fail every day. Classical management of these failures by programmers using checkpoint-restart is also too limited due to the overhead at such a scale. The article explores programming models and runtimes required to facilitate the task of scaling and extracting performance on continuously evolving platforms, while providing resilience and fault-tolerant mechanisms to tackle the increasing probability of failures throughout the whole software stack.
Ensuring that applications achieve efficient resource usage and fast execution times on today's complex heterogeneous high-performance computing platforms is a paramount problem. Essential efforts to reach this goal are the optimal partitioning of the data space between the processes composing a typical task/data-parallel application, and their right mapping and deployment on the platform. Computational and communication performance modeling that describes the platform and the application behaviors is an increasingly recognized approach. This paper discusses the utility of the τ-Lop analytic communication performance model in facing these issues and contributes a practical symbolic computation tool that represents, manipulates and accurately evaluates the formal communication cost expression derived from a hybrid kernel. We identify a set of scenarios where the tool could be applied, provide both basic and advanced usage examples, and evaluate the tool on real-life kernels.
