Estimating the execution time of nested loops or the volume of data transferred between processors is necessary for appropriate processor or data allocation. To achieve this goal, one needs to estimate the execution time of the loop body and thus the number of nested loop iterations. This work could be a preprocessing step in an automatic parallelizing compiler to enhance the performance of the resulting parallel program. A bounded convex polyhedron can be associated with each loop nest. The number of its integer points corresponds to the iteration space size. In this paper, we present an algorithm that approximates this number. The algorithm is not restricted to a fixed dimension. The worst-case complexity of the algorithm is infrequently reached in our context, where the nesting level is rather small and the loop bound expressions are not very complex.
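The approximation algorithm itself is not given in the abstract, but the quantity it estimates can be illustrated with a brute-force count of the integer points in a small iteration polytope (a sketch only; the loop nest and its bounds are illustrative, not from the paper):

```python
# Brute-force count of integer points in the iteration space of
#   for i in 0..N:  for j in 0..i:  body
# For affine loop bounds this space is a bounded convex polyhedron;
# its number of integer points equals the number of body executions.

def count_iterations(n):
    """Exact iteration count of the triangular nest above."""
    return sum(1 for i in range(n + 1) for j in range(i + 1))

# The closed form for this triangle is (n+1)(n+2)/2; an approximation
# algorithm would try to recover such an expression symbolically
# instead of enumerating points.
assert count_iterations(10) == 11 * 12 // 2
```

Enumeration is exact but exponential in the nesting depth, which is why symbolic approximation matters for a compiler.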
We discuss availability aspects of large software-based systems. We classify faults into Bohrbugs, Mandelbugs and aging-related bugs, and then examine mitigation methods for the last two bug types. We also consider quantitative approaches to availability assurance.
A reconfigurable architecture using distributed logic block processing elements (PEs) is presented. This distributed processor uses a low-cost interconnection network and local indirect VLIW memories to provide efficient algorithm implementations for portable battery-operated products. In order to provide optimal algorithm performance, the VLIWs loaded to each PE configure that PE for processing. By reloading the local VLIW memories, each PE is reconfigured for a new algorithm. Different levels of flexibility are feasible by varying the complexity of the distributed PEs in this architecture.
Monitoring the performance of the network connecting a real-time distributed system is very important. If the system is adaptive or dynamic, the resource manager can use this information to create or use new processes. We may be interested in determining how much load a host is placing on the network, or what the network load index is. In this paper, a simple technique for evaluating the current load of a network is proposed. If a computer is connected to several networks, then we can obtain the load index of that host for each network. We can also measure the load index of the network as applied by all the hosts. The dynamic resource manager of DeSiDeRaTa should use this technique to achieve its requirements. We have verified the technique with two benchmarks, LoadSim and DynBench.
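The abstract does not spell out how the load index is computed; a minimal sketch of one plausible definition — traffic observed over an interval divided by link capacity, with all names and counter values hypothetical — is:

```python
def load_index(bytes_before, bytes_after, interval_s, capacity_bps):
    """Fraction of link capacity used over a measurement interval.

    bytes_before / bytes_after are cumulative interface byte counters
    (e.g. sampled periodically from the OS); capacity_bps is the link
    speed in bits per second.
    """
    bits_transferred = (bytes_after - bytes_before) * 8
    return bits_transferred / (interval_s * capacity_bps)

# Example: 12.5 MB transferred in 1 s on a 100 Mbit/s link -> index 1.0
assert load_index(0, 12_500_000, 1.0, 100_000_000) == 1.0
```

A host attached to several networks would evaluate this per interface, yielding one index per network, as the abstract describes.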
The growing reliance on services provided by software applications places a high premium on the reliable and efficient operation of these applications. A number of these applications follow the event-driven software architecture style since this style fosters evolvability by separating event handling from event demultiplexing and dispatching functionality. The event demultiplexing capability, which appears repeatedly across a class of event-driven applications, can be codified into a reusable pattern, such as the Reactor pattern. In order to enable performance analysis of event-driven applications at design time, a model is needed that represents the event demultiplexing and handling functionality that lies at the heart of these applications. In this paper, we present a model of the Reactor pattern based on the well-established Stochastic Reward Net (SRN) modeling paradigm. We discuss how the model can be used to obtain several performance measures such as the throughput, loss probability and upper and lower bounds on the response time. We illustrate how the model can be used to obtain the performance metrics of a Virtual Private Network (VPN) service provided by a Virtual Router (VR). We validate the estimates of the performance measures obtained from the SRN model using simulation.
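The event demultiplexing and dispatching that the SRN model captures follows the classic Reactor structure, which can be sketched as follows (an in-memory event queue stands in for real socket handles, to keep the sketch self-contained):

```python
class Reactor:
    """Minimal single-threaded Reactor: handlers register for event
    types; the loop demultiplexes queued events and dispatches each
    one to its registered handler."""

    def __init__(self):
        self.handlers = {}
        self.queue = []

    def register(self, event_type, handler):
        self.handlers[event_type] = handler

    def post(self, event_type, payload):
        self.queue.append((event_type, payload))

    def run_once(self):
        # Demultiplexing step: take the next ready event...
        event_type, payload = self.queue.pop(0)
        # ...and dispatch it to the handler registered for its type.
        return self.handlers[event_type](payload)

reactor = Reactor()
reactor.register("read", lambda data: data.upper())
reactor.post("read", "ping")
assert reactor.run_once() == "PING"
```

The performance measures in the paper (throughput, loss probability, response-time bounds) arise from modeling the queue and the handler service times of exactly this dispatch loop.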
InfiniBand is becoming an important interconnect technology in high performance computing. Recent efforts in large-scale InfiniBand deployments are raising scalability questions in the HPC community. Open MPI, a new production-grade implementation of the MPI standard, provides several mechanisms to enhance InfiniBand scalability. Initial comparisons with MVAPICH, the most widely used InfiniBand MPI implementation, show similar performance but with much better scalability characteristics. Specifically, small message latency is improved by up to 10% in medium/large jobs and memory usage per host is reduced by as much as 300%. In addition, Open MPI provides predictable latency that is close to optimal without sacrificing bandwidth performance.
As the size of biological sequence databases continues to grow, the time to search these databases has grown proportionally. This has led to many parallel implementations of common sequence analysis suites. However, it has become clear that many of these parallel sequence analysis tools do not scale well for medium to large-sized clusters. In this paper we describe an enhanced version of MPI-HMMER. We improve on MPI-HMMER's scalability through the use of parallel I/O and a parallel file system. Our enhancements to the core HMMER search tools, hmmsearch and hmmpfam, allow for scalability through 256 nodes, where MPI-HMMER was previously limited to 64 nodes.
Phylogenetic analysis is becoming an increasingly important tool for customized drug treatments, epidemiological studies, and evolutionary analysis. The TCS method provides an important tool for dealing with genes at a population level. Existing software for TCS analysis takes an unreasonable amount of time for the analysis of significant numbers of taxa. This paper presents the TCS algorithms and describes initial attempts at parallelization. Performance results are also presented for the algorithm on several data sets.
The pervasiveness of Internet-based communication technologies is fostering new forms of distributed computing, namely large-scale, highly decentralized computing and mobile computing. In this context, new application domains such as M-commerce, mobile multimedia, and cooperative information systems demand adaptive and flexible middleware and frameworks which fully exploit logical mobility and support programmable coordination infrastructures. In this paper, we present a distributed computational model which extends an active object model by introducing the mobile active object concept in order to support a multi-paradigm design approach. Mobile active objects are autonomous network-aware entities, or mobile agents, which coordinate with one another through on-demand installable, event-based interaction spaces. The model is embedded in ActiWare, our customizable Java-based framework for the development of highly dynamic distributed applications.
EDGeS is a European-funded Framework Programme 7 project that aims to connect desktop and service grids together. While in a desktop grid personal computers pull jobs when they are idle, in service grids there is a scheduler that pushes jobs to available resources. The work in EDGeS goes well beyond conceptual solutions to bridge these grids together: it reaches as far as actual implementation, standardization, deployment, application porting, and training. One of the work packages of this project concerns monitoring the overall EDGeS infrastructure. Currently, this infrastructure includes two types of desktop grids, BOINC and XtremWeb, the EGEE service grid, and a couple of bridges to connect them. In this paper, we describe the monitoring effort in EDGeS: our technical approaches, the goals we have achieved, and the plans for future work.
This paper introduces the Auto-Pipe design flow and the X design language, and presents sample applications. The applications include the Triple-DES encryption standard and a subset of the signal-processing pipeline for VERITAS, a high-energy gamma-ray astrophysics experiment. These applications are discussed and their descriptions in X are presented. From X, simulations of alternative system designs and stage-to-device assignments are obtained and analyzed. The complete system will permit production of executable code and bit maps that may be downloaded onto real devices. Future work required to complete the Auto-Pipe design tool is discussed.
In this paper we discuss our initial experiences adapting OpenMP to enable it to serve as a programming model for high performance embedded systems. A high-level programming model such as OpenMP has the potential to increase programmer productivity, reducing the design/development costs and time to market for such systems. However, OpenMP needs to be extended if it is to meet the needs of embedded application developers, who require the ability to express multiple levels of parallelism, real-time and resource constraints, and to provide additional information in support of optimization. It must also be capable of supporting the mapping of different software tasks, or components, to the devices configured in a given architecture.
The high computational power of commodity PCs, combined with the emergence of low-latency, high-bandwidth interconnects, has escalated the trend toward cluster computing. Clusters with InfiniBand are being deployed, as reflected in the TOP 500 Supercomputer rankings. However, the increasing scale of these clusters has reduced the Mean Time Between Failures (MTBF) of components. The network is one such component, where failure of Network Interface Cards (NICs), cables, and/or switches breaks existing path(s) of communication. InfiniBand provides a hardware mechanism, Automatic Path Migration (APM), which allows user-transparent detection of and recovery from network fault(s) without application restart. In this paper, we design a set of modules which work together to provide network fault tolerance for user-level applications leveraging the APM feature. Our performance evaluation at the MPI layer shows that APM incurs negligible overhead in the absence of faults in the system. In the presence of network faults, APM incurs negligible overhead for reasonably long-running applications.
Traditional parallel programming models achieve synchronization with error-prone and complex-to-debug constructs such as locks and barriers. Transactional Memory (TM) is a promising new parallel programming abstraction that replaces conventional locks with critical sections expressed as transactions. Most TM research has focused on single-address-space parallel machines, leaving the area of distributed systems unexplored. In this paper we introduce a flexible Java Software TM (STM) to enable evaluation and prototyping of TM protocols on clusters. Our STM builds on top of the ProActive framework and uses the state-of-the-art DSTM2 as its underlying transactional engine. It does not rely on software or hardware distributed shared memory for execution. The system follows transactional semantics at object granularity, and its feasibility is evaluated with non-trivial TM-specific benchmarks.
Most commercial routers designed for networks of workstations (NOWs) implement wormhole switching. However, wormhole switching is not well suited for NOWs. The long wires required in this environment lead to large buffers to prevent buffer overflow during flow control signaling. Moreover, wire length is limited by buffer size. Virtual cut-through (VCT) achieves a higher throughput than wormhole switching. Moreover, the traditional disadvantages of VCT switching, such as buffer requirements and packetizing overhead, disappear in NOWs. In this paper, we show that VCT routers can be simpler than wormhole ones, while still achieving the advantages of using virtual channels and adaptive routing. We also propose a fully adaptive routing algorithm for VCT switching in NOWs. Moreover, we show that VCT routers outperform wormhole routers in a NOW environment at a lower cost.
This paper presents a multiprocessor architecture prototype on a Field Programmable Gate Array (FPGA) with support for hardware and software multithreading. Thanks to partial dynamic reconfiguration, this system can, at run time, spawn both software and hardware threads, sharing not only the general-purpose soft-cores present in the architecture but also area on the FPGA. While on a standard single-processor architecture partial dynamic reconfiguration requires the processor to stop working to instantiate the hardware threads, the proposed solution hides most of the reconfiguration latency through the parallel execution of software threads. We validate our framework on a JPEG 2000 encoder, showing how threads are spawned, executed, and joined independently of their hardware or software nature. We also show results confirming that, by using the proposed approach, we are able to hide the reconfiguration time.
This paper introduces a methodology that allows easy implementation of IP-Cores, focusing only on their functionality rather than on their interfaces and their integration into a given architecture. The proposed approach implements all the communication infrastructure needed by a component, described in VHDL, to be inserted into a real architecture that can be implemented on FPGAs, reducing the time to market of the final implementation of the system. To validate the entire methodology, we have compared, on the CoreConnect communication infrastructure, our results with those of the classical Xilinx design flow using EDK and ISE.
The size and complexity of current custom VLSI have forced the use of high-level programming languages to describe hardware, and of compiler and synthesis technology to map abstract designs into silicon. Since streaming data processing in DSP applications is typically described by loop constructs in a high-level language, loops are the most critical portions of the hardware description, and special techniques are developed to synthesize them optimally. In this paper, we introduce a new method for mapping and pipelining nested loops efficiently into hardware. It achieves fine-grain parallelism even on inner loops with strong intra- and inter-iteration data dependences and, by sharing resources economically, improves performance at the expense of a small amount of additional area. We implemented the transformation within the Nimble Compiler environment and evaluated its performance on several signal-processing benchmarks. The method achieves up to a 2x improvement in area efficiency compared to the best known optimization techniques.
This paper describes a study concerning the impact of MMX technology in the field of automatic vehicle guidance. Due to the high speed a vehicle can reach, this application field requires a very precise real-time response. After a brief description of the ARGO autonomous vehicle, the paper focuses on the requirements of this kind of application: the use of only visual information, the use of low-cost hardware, and the need for real-time processing. The paper then presents the way these problems have been solved using MMX technology, discusses some optimization techniques that have been successfully employed, and compares the results with those of a traditional scalar code.
Adaptive scientific computations require that periodic repartitioning (load balancing) occur dynamically to maintain load balance. Hypergraph partitioning is a successful model for minimizing communication volume in scientific computations, and partitioning software for the static case is widely available. In this paper, we present a new hypergraph model for the dynamic case, where we minimize the sum of communication in the application plus the migration cost to move data, thereby reducing total execution time. The new model can be solved using hypergraph partitioning with fixed vertices. We describe an implementation of a parallel multilevel repartitioning algorithm within the Zoltan load-balancing toolkit, which to our knowledge is the first code for dynamic load balancing based on hypergraph partitioning. Finally, we present experimental results that demonstrate the effectiveness of our approach on a Linux cluster with up to 64 processors. Our new algorithm compares favorably to the widely used ParMETIS partitioning software in terms of quality, and would have reduced total execution time in most of our test cases.

* Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company.

Much of the early work in load balancing focused on diffusive methods, where overloaded processors give work to neighboring processors that have lower than average loads. A quite different approach is to partition the new problem "from scratch" without accounting for existing partition assignments, and then try to remap partitions to minimize the migration cost. These two strategies have very different properties. Diffusive schemes are fast and have low migration cost, but may incur high communication volume. Scratch-remap schemes give low communication volume but may incur high migration cost.
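The objective minimized by the new model — application communication plus data-migration volume — can be sketched on a toy graph partition (an illustration of the cost function only, not of Zoltan's hypergraph code; all names are invented):

```python
def repartition_cost(edges, old_part, new_part, weights, alpha=1.0):
    """Total cost = communication volume under new_part
                  + alpha * volume of data migrated from old_part.

    edges: list of (u, v) vertex pairs; old_part / new_part map each
    vertex to a partition id; weights maps a vertex to the data size
    that must move if its partition changes; alpha scales one-time
    migration cost against per-iteration communication cost.
    """
    comm = sum(1 for u, v in edges if new_part[u] != new_part[v])
    migration = sum(weights[v] for v in new_part
                    if new_part[v] != old_part[v])
    return comm + alpha * migration

# Moving vertex 1 cuts one edge but migrates 5 units of data.
edges = [(0, 1), (1, 2)]
cost = repartition_cost(edges, {0: 0, 1: 0, 2: 1},
                        {0: 0, 1: 1, 2: 1}, {0: 5, 1: 5, 2: 5})
assert cost == 6.0
```

Minimizing the two terms jointly, rather than remapping after the fact, is what distinguishes this model from scratch-remap schemes.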
In this paper we review some of the most relevant aspects concerning Quality of Service in wireless networks, providing, along with the research issues we are currently pursuing, both the state of the art and our recent achievements. More specifically, we first focus on network survivability, that is, the ability of the network to maintain functionality in the face of a component failure. We then turn our attention to data access and network services in a distributed environment. Finally, we analyze a basic network optimization task: routing design in wireless ATM networks.
The four High Energy Physics (HEP) detectors at the Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN) are among the most important experiments in which the National Institute of Nuclear Physics (INFN) is actively involved. A Grid infrastructure of the Worldwide LHC Computing Grid (WLCG) has been developed by the HEP community, leveraging broader initiatives (e.g. EGEE in Europe, OSG in North America), as a framework to exchange and maintain data storage and provide computing infrastructure for the entire LHC community. INFN-CNAF in Bologna hosts the Italian Tier-1 site, which represents the biggest Italian center in the WLCG distributed computing effort. In the first part of this paper we describe the building of the Italian Tier-1 to cope with the WLCG computing requirements, focusing on some peculiarities; in the second part we analyze the INFN-CNAF contribution to the development of the grid middleware, stressing in particular the characteristics of the Virtual Organization Membership Service (VOMS), the de facto standard for authorization on a grid, and StoRM, an implementation of the Storage Resource Manager (SRM) specification for POSIX file systems. In particular, StoRM is used at INFN-CNAF in conjunction with the General Parallel File System (GPFS), and we are also testing an integration with Tivoli Storage Manager (TSM) to realize a complete Hierarchical Storage Management (HSM) system.
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures, enabling applications to periodically save their state and restart computation after a failure. Although a variety of automated system-level checkpointing solutions are currently available to HPC users, manual application-level checkpointing remains more popular due to its superior performance. This paper improves performance of automated checkpointing via a compiler analysis for incremental checkpointing. This analysis, which works with both sequential and OpenMP applications, significantly reduces checkpoint sizes and enables asynchronous checkpointing.
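The paper's compiler analysis is not reproduced here, but the runtime idea behind incremental checkpointing — writing out only the blocks that changed since the last checkpoint — can be sketched as follows (block size and hashing scheme are illustrative choices, not the paper's):

```python
import hashlib

def incremental_checkpoint(memory, block_size, last_hashes):
    """Return (dirty_blocks, new_hashes): only blocks whose content
    hash changed since the previous checkpoint need to be written.

    memory: a bytes snapshot of application state; last_hashes maps a
    block index to its digest from the previous checkpoint (empty on
    the first, full checkpoint).
    """
    dirty, new_hashes = {}, {}
    for i in range(0, len(memory), block_size):
        block = memory[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        idx = i // block_size
        new_hashes[idx] = digest
        if last_hashes.get(idx) != digest:
            dirty[idx] = block
    return dirty, new_hashes

# First checkpoint writes everything; after touching one byte,
# only the block containing it is written again.
mem = bytearray(b"a" * 16)
full, hashes = incremental_checkpoint(bytes(mem), 4, {})
assert len(full) == 4
mem[5] = ord("b")
dirty, _ = incremental_checkpoint(bytes(mem), 4, hashes)
assert list(dirty) == [1]
```

A compiler analysis like the paper's improves on this runtime hashing by proving, statically, which memory regions cannot have changed between checkpoints.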
Many scientists perform extensive computations by executing large bags of similar tasks (BoTs) in mixtures of computational environments, such as grids and clouds. Although the reliability and cost may vary considerably across these environments, no tool exists to assist scientists in the selection of environments that can both fulfill deadlines and fit budgets. To address this situation, in this work we introduce the ExPERT BoT scheduling framework. Our framework systematically selects from a large search space the Pareto-efficient scheduling strategies, that is, the strategies that deliver the best results for both makespan and cost. ExPERT then chooses from among them the best strategy according to a general, user-specified utility function. Through simulations and experiments in real production environments, we demonstrate that ExPERT can substantially reduce both makespan and cost in comparison to common scheduling strategies. For bioinformatics BoTs executed in a real mixed grid+cloud environment, we show how the scheduling strategy selected by ExPERT reduces both makespan and cost by 30%-70% in comparison to commonly used scheduling strategies.
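The first step described above — keeping only the Pareto-efficient (makespan, cost) strategies — can be sketched as follows (strategy names and numbers are invented for illustration):

```python
def pareto_front(strategies):
    """Return the strategies not dominated in both makespan and cost.

    strategies: list of (name, makespan, cost) tuples. A strategy is
    dominated if some other strategy is no worse in both objectives
    and strictly better in at least one.
    """
    front = []
    for name, m, c in strategies:
        dominated = any(
            (m2 <= m and c2 <= c) and (m2 < m or c2 < c)
            for _, m2, c2 in strategies
        )
        if not dominated:
            front.append((name, m, c))
    return front

candidates = [("all-cloud", 10, 90), ("all-grid", 40, 10),
              ("mixed", 15, 30), ("naive", 40, 95)]
# "naive" is dominated (all-cloud is faster and cheaper), so it drops out.
assert [s[0] for s in pareto_front(candidates)] == \
       ["all-cloud", "all-grid", "mixed"]
```

A user-specified utility function then picks a single strategy from this front, trading makespan against cost according to the user's preferences.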
With respect to different payment requirements, different e-payment techniques and methods are developed for specific application purposes. E-payment technology involves digitized cash, e-wallets, electronic credit/debit cards, payment transactions, and payment settlement between the various parties involved in a transaction, such as banks, ISPs, and customers. However, in B2C e-commerce, e-payment has not yet reached a mass market. There are many reasons behind this. One is the lack of multiple channels for payment. Another is the difficulty of enlarging the micro-payment market. Both strongly hinder a wider acceptance of e-payment. In this paper, we present an investigation of multiple-channel e-payment and micro-payment from both technical and market points of view.
This paper presents a new approach for the execution of coarse-grain (tiled) parallel SPMD code for applications derived from the explicit discretization of 2-dimensional PDE problems with finite-differencing schemes. Tiling is an efficient loop transformation for achieving coarse-grain parallelism in such algorithms, and rectangular tile shapes are the only feasible shapes that can be manually applied by program developers. However, rectangular tiling transformations are not always valid due to data dependencies, thus requiring the application of an appropriate skewing transformation prior to tiling in order to enable rectangular tile shapes. We employ cyclic mapping of tiles to processes and propose a method to determine an efficient rectangular tiling transformation for a fixed number of processes for 2-dimensional, skewed PDE problems. Our experimental results confirm the merit of coarse-grain execution in this family of applications and indicate that the proposed method leads to the selection of highly efficient tiling transformations.
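The role of the skewing transformation mentioned above can be illustrated on a two-index (time and space) finite-difference stencil: skewing the space index by the time index makes every dependence component non-negative, which is the validity condition for rectangular tiling (a sketch with an illustrative dependence set, not the paper's method):

```python
# Dependence vectors of a typical explicit 1-D-space stencil: each
# point (t, i) reads (t-1, i-1), (t-1, i), and (t-1, i+1).
deps = [(1, -1), (1, 0), (1, 1)]

def skew(dep, f=1):
    """Skewing (t, i) -> (t, i + f*t), applied to a dependence vector."""
    dt, di = dep
    return (dt, di + f * dt)

# After skewing with f=1, all dependence components are >= 0, so
# rectangular tiles no longer cut any dependence backwards.
skewed = [skew(d) for d in deps]
assert all(dt >= 0 and di >= 0 for dt, di in skewed)
```

The original dependence (1, -1) points "backwards" in space, which is what invalidates plain rectangular tiling before the skew is applied.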
Granularity control is an effective means for trading power consumption for performance on dense shared memory multiprocessors, such as multi-SMT and multi-CMP systems. With granularity control, the number of threads used to execute an application, or part of an application, is changed, thereby also changing the amount of work done by each active thread. In this paper, we analyze the energy/performance trade-off of varying thread granularity in parallel benchmarks written for shared memory systems. We use physical experimentation on a real multi-SMT system and a power estimation model based on the die areas of processor components and component activity factors obtained from a hardware event monitor. We also present HPPATCH, a runtime algorithm for live tuning of thread granularity, which attempts to simultaneously reduce both execution time and processor power consumption.
One of the most important features in image analysis and understanding is shape. Mathematical morphology is the branch of image processing that deals with shape analysis. All morphological transformations are defined in terms of two primitive operations, dilation and erosion. Since many applications require the solution of morphological problems in real time, time-efficient algorithms for these two operations are crucial. In this paper, efficient parallel algorithms for binary dilation and erosion are presented and evaluated for an advanced associative processor. Simulation results indicate that the achieved speedup is linear.
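To make the two primitive operations concrete, here is a minimal sequential sketch of binary dilation and erosion, the definitions that parallel (e.g. associative-processor) versions build on by distributing rows or pixels across processing elements. The structuring-element representation as offset pairs is an assumption for illustration, not the paper's implementation.

```python
# Minimal sketch of binary dilation and erosion on a 2D image
# (lists of 0/1), with a structuring element given as (dy, dx) offsets.

def dilate(img, se):
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # a pixel is set if ANY structuring-element offset
            # lands on a foreground pixel
            out[y][x] = int(any(
                0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                for dy, dx in se))
    return out

def erode(img, se):
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # a pixel survives only if EVERY offset lands on foreground
            out[y][x] = int(all(
                0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                for dy, dx in se))
    return out
```

With a cross-shaped structuring element, dilating a single centre pixel produces a plus shape, and eroding that plus shape recovers the centre pixel.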
HAL is a multidisciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
In this paper we present a simulator for the reconfigurable mesh SIMD architecture. The purpose of the simulator is to assist in the analysis of algorithms and the visualisation of their behaviour. Furthermore, it can be used to demonstrate the potential of reconfiguration in an educational environment.
This paper investigates scalable implementations of out-of-core I/O-intensive Data Mining algorithms on affordable parallel architectures, such as clusters of workstations. To validate our approach, the K-means algorithm, a well-known data-mining clustering algorithm, was used as a test case.
and the functionality and data that reside in the storage subsystem. Recent technological trends, such as shared SAN or NAS storage and virtualization, have the potential to break this tight association between functionality and machines. We describe the design and implementation of ENCOMPASS, an image management system centered around a shared storage repository of "master system images", each representing different functionality. The functionality is provisioned by "cloning" master images, associating the resulting "clone images" with specified physical and/or virtual resources ("machines"), customizing the clone images for the specific environment and circumstances, and automatically performing the necessary operations to activate the clones. "Machines", physical or virtual, are merely computational resources that do not have any permanent association with functionality. ENCOMPASS supports the complete lifecycle of a system image, including reallocation and re-targeting of resources, maintenance, updates, etc. It separates image creation from image management from resource allocation policies, an emerging trend manifested in particular by the proliferation of turn-key "virtual appliances".
During the last few years, the concepts of cluster computing and heterogeneous networked systems have received increasing interest. The popularity of using Java for developing parallel and distributed applications that run on heterogeneous distributed systems has also ...
Most commercial routers designed for networks of workstations (NOWs) implement wormhole switching. However, wormhole switching is not well suited for NOWs. The long wires required in this environment lead to large buffers to prevent buffer overflow during flow control signaling. Moreover, wire length is limited by buffer size.
Computational grids provide computing power by sharing resources across administrative domains. This sharing, coupled with the need to execute untrusted code from arbitrary users, introduces security hazards. This paper addresses the security implications of making a computing resource available to untrusted applications via computational grids. It highlights the problems and limitations of current grid environments and proposes a technique that employs run-time monitoring and a restricted shell. The technique can be used for setting up an execution environment that supports the full legitimate use allowed by the security policy of a shared resource. Performance analysis shows up to 2.14 times execution overhead improvement for shell-based applications. The approach proves effective and provides a substrate for hybrid techniques that combine static and dynamic mechanisms to minimize monitoring overheads.
Monitoring the performance of the network by which a real-time distributed system is connected is very important. If the system is adaptive or dynamic, the resource manager can use this information to create or use new processes. We may be interested in determining how much load a host is placing on the network, or what the network load index is. In this paper, a simple technique for evaluating the current load of a network is proposed. If a computer is connected to several networks, we can obtain the load index of that host for each network. We can also measure the load index of the network as applied by all the hosts. The dynamic resource manager of DeSiDeRaTa should use this technique to achieve its requirements. We have verified the technique with two benchmarks, LoadSim and DynBench.
We present Protagoras, a new plug-in architecture for the GNU Compiler Collection that allows one to modify GCC's internal representation of the program under compilation. We illustrate the utility of Protagoras by presenting plug-ins for both compile-time and runtime software verification and monitoring. In the compile-time case, we have developed plug-ins that interpret the GIMPLE intermediate representation to verify properties statically. In the runtime case, we have developed plug-ins for GCC to perform memory leak detection, array bounds checking, and reference-count access monitoring.
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures, enabling applications to periodically save their state and restart computation after a failure. Although a variety of automated system-level checkpointing solutions are currently available to HPC users, manual application-level checkpointing remains more popular due to its superior performance. This paper improves performance of automated checkpointing via a compiler analysis for incremental checkpointing. This analysis, which works with both sequential and OpenMP applications, reduces checkpoint sizes by as much as 80% and enables asynchronous checkpointing.
LoOgGP allows an accurate characterization of MPI applications based on microbenchmark measurements. This new model is an extension of LogP for long messages in which both the overhead and gap parameters exhibit a linear dependency on message size. The LoOgGP model has been fully integrated into a modelling framework for obtaining statistical models of parallel applications, providing the analyst with an easy and automatic tool for assessing the LoOgGP parameter set to characterize communications. The use of the LoOgGP model to obtain a statistical performance model of an image deconvolution application is illustrated as a case study.
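The linear dependency described above can be sketched as follows. This is an illustrative reading of the model, not the paper's implementation: in LogGP a message of m bytes costs roughly L + o + (m − 1)G, and the summary above says LoOgGP makes both overhead and gap linear in m; parameter names (o0, o1, g0, g1) and the fitting routine are assumptions for illustration.

```python
# Hedged sketch of a LoOgGP-style point-to-point cost model in which
# overhead and gap are linear in message size m:
#   o(m) = o0 + o1*m,   g(m) = g0 + g1*m
# Parameters would be fitted from microbenchmark measurements,
# e.g. by least squares as sketched below.

def loggp_time(m, L, o0, o1, g0, g1):
    """Predicted one-way time for an m-byte message (illustrative)."""
    overhead = o0 + o1 * m      # linear sender/receiver overhead
    gap = g0 + g1 * m           # linear per-message gap
    return L + overhead + gap

def fit_line(xs, ys):
    """Least-squares line fit: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope
```

Fitting (message size, measured overhead) pairs with `fit_line` yields (o0, o1); the same is done for the gap parameters.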
In a Grid computing environment, resources are shared among a large number of applications. Brokers and schedulers find matching resources and schedule the execution of the applications by monitoring dynamic resource availability and employing policies such as first-come-first-served and back-filling. To support applications with timeliness requirements in such an environment, brokering and scheduling algorithms must address an additional problem: they must be able to estimate the execution time of the application on the currently available resources. In this paper, we present a modeling approach to estimating the execution time of long-running scientific applications. The modeling approach we propose is generic; models can be constructed by merely observing the application execution "externally" without using intrusive techniques such as code inspection or instrumentation. The model is cross-platform; it enables prediction without the need for the application to be profiled first on the target hardware. To show the feasibility and effectiveness of this approach, we developed a resource usage model that estimates the execution time of a weather forecasting application in a multi-cluster Grid computing environment. We validated the model through extensive benchmarking and profiling experiments and observed prediction errors that were within 10% of the measured values. Based on our initial experience, we believe that our approach can be used to model the execution time of other time-sensitive scientific applications, thereby enabling the development of more intelligent brokering and scheduling algorithms.
Today's computational science demands have resulted in ever larger parallel computers, and storage systems have grown to match these demands. Parallel file systems used in this environment are increasingly specialized to extract the highest possible performance for large I/O operations, at the expense of other potential workloads. While some applications have adapted to I/O best practices and can obtain good performance on these systems, the natural I/O patterns of many applications result in generation of many small files. These applications are not well served by current parallel file systems at very large scale.
The execution of a complex task in any environment requires planning. Planning is the process of constructing an activity graph given the current state of the system, a goal state, and a set of activities. If we wish to execute a complex computing task in a heterogeneous computing environment with autonomous resource providers, we should be able to adapt to changes in the environment. A possible solution is to construct a family of activity graphs beforehand and investigate means of switching from one member of the family to another when the execution of one activity graph fails. In this paper, we study the conditions under which plan switching is feasible. We then introduce an approach for plan switching and report simulation results for this approach.
Embedded media applications have to satisfy real-time, low power consumption and silicon area constraints. These applications spend most of the execution time in the iteration of a few kernels; such kernels are typically made of independent operations, which can be executed in parallel. Clustered architectures are a solution designed to exploit the high Instruction Level Parallelism (ILP) of the media kernels, to keep a good level of scalability and to match the strict constraints of the embedded domains. Within this category, architectures with reconfigurable connections between clusters are of particular interest. The enhanced flexibility allows them to handle several different data-paths effectively, hence multiple applications; this is a key economic factor in the semiconductor world, in which the cost of the masks significantly increases at every technological advance. This paper describes Hierarchical Cluster Assignment (HCA), a compilation technique that deals with the problem of mapping the computation of multimedia kernels onto the clusters of the target machine. HCA exploits the hierarchical structure of the clusters of the target architectures; it works by decomposing the problem of cluster assignment into a sequence of simpler sub-problems, each of them involving a subset of the kernel instructions and a subset of the machine clusters. A prototype of this methodology has been implemented in a flexible framework and tested on machine models based on the DSPFabric architecture.
Distributed computing systems are a viable and less expensive alternative to parallel computers. However, concurrent programming methods in distributed systems have not been studied as extensively as for parallel computers. Among the main research issues are the scheduling and load balancing of such systems, which may consist of heterogeneous computers. In the past, a variety of dynamic scheduling schemes suitable for parallel loops (with independent iterations) on heterogeneous computer clusters have been proposed and studied. However, no study of dynamic schemes for loops with iteration dependencies has been reported so far. In this work we study the problem of scheduling loops with iteration dependencies for heterogeneous (dedicated and non-dedicated) clusters. The presence of iteration dependencies incurs an extra degree of difficulty and makes the development of such schemes quite a challenge. We extend three well known dynamic schemes (CS, TSS and DTSS) by introducing synchronization points at certain intervals so that processors compute in pipelined fashion. Our scheme, called dynamic multi-phase scheduling (DMPS), is applied to loops with iteration dependencies. We implemented our new scheme on a network of heterogeneous computers and studied its performance. Through extensive testing on two real-life applications (heat equation and Floyd-Steinberg algorithm), we show that the proposed method is efficient for parallelizing nested loops with dependencies on heterogeneous systems.
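For readers unfamiliar with the schemes being extended, the sketch below shows the chunk sizes produced by trapezoid self-scheduling (TSS), one of the three dynamic schemes named above: chunk sizes decrease roughly linearly from a first size F to a last size L. The synchronization points that DMPS inserts inside each chunk (so that a processor can release dependent data to its successor and enable pipelined execution) are described only in a comment; the formulas for N and D follow the standard TSS scheme, and the parameter values are illustrative.

```python
# Illustrative sketch of TSS chunk sizes: N = 2*total/(F+L) chunks,
# decreasing by D = (F-L)/(N-1) each time. DMPS (as summarized above)
# would additionally place synchronization points every h iterations
# inside each chunk so successive processors compute in pipeline.

def tss_chunks(total, first, last):
    """Return the list of TSS chunk sizes covering `total` iterations."""
    n = (2 * total) // (first + last)          # number of chunks
    d = (first - last) // max(n - 1, 1)        # linear decrement
    chunks, remaining, size = [], total, first
    while remaining > 0:
        c = min(size, remaining)
        chunks.append(c)
        remaining -= c
        size = max(size - d, last)             # never shrink below `last`
    return chunks

chunks = tss_chunks(1000, 88, 12)              # sizes 88, 84, ..., 12
```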