This paper gives an overview of the implementation of NESL, a portable nested data-parallel language. This language and its implementation are the first to fully support nested data structures as well as nested data-parallel function calls. These features allow the concise description of parallel algorithms on irregular data. In addition, they maintain the advantages of data-parallel languages: a simple programming model and portability. The current NESL implementation is based on an intermediate language called VCODE and a library of vector routines called CVL. It runs on the Connection Machines CM-2, the Cray Y-MP C90, and serial machines. We compare initial benchmark results of NESL with those of machine-specific code on these machines for three algorithms: least-squares line-fitting, median finding, and a sparse-matrix vector product. These results show that NESL's performance is competitive with that of machine-specific code for regular dense data, and is often superior for irregular data.
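NESL code itself is not shown in this abstract; as an illustration only, the irregular sparse-matrix vector product benchmark can be sketched in Python, with nested comprehensions standing in as a serial analogue of NESL's nested data-parallel apply-to-each (the data layout here is an assumption, not NESL's representation):

```python
# Hypothetical sketch: sparse matrix-vector product, with nested
# comprehensions as a serial analogue of nested data-parallelism.
# Rows are lists of (column_index, value) pairs of varying length,
# so the inner "parallel" loop is irregular.

def sparse_matvec(rows, x):
    # Outer comprehension ~ parallel loop over rows;
    # inner generator ~ nested parallel loop over a row's nonzeros.
    return [sum(v * x[j] for j, v in row) for row in rows]

rows = [[(0, 2.0), (2, 1.0)],   # row 0: nonzeros at columns 0 and 2
        [(1, 3.0)],             # row 1: one nonzero
        []]                     # row 2: empty row
x = [1.0, 2.0, 3.0]
print(sparse_matvec(rows, x))   # -> [5.0, 6.0, 0]
```

The point of nested data-parallelism is that both loop levels can run in parallel even though the inner lengths differ from row to row.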
In this paper we evaluate the MPI environments currently available for Windows NT on the Intel IA32 and Compaq/DEC Alpha architectures. We present benchmark results for low-level communication and for the NAS Parallel Benchmarks to allow comparison to other systems, but our primary interest is determining real application performance and robustness in production cluster environments. For this we use PAFEC-FE, a large FORTRAN code for finite-element analysis. We present results from three MPI implementations, two architectures, and three networking technologies (10 Mbit/s and 100 Mbit/s Ethernet and 1 Gbit/s Myrinet).
This paper presents work in progress on a new method of implementing irregular divide-and-conquer algorithms in a nested data-parallel language model on distributed-memory multiprocessors. The main features discussed are the recursive subdivision of asynchronous processor groups to match the change from data-parallel to control-parallel behavior over the lifetime of an algorithm, switching from parallel code to serial code when the group size is one (with the opportunity to use a more efficient serial algorithm), and a simple manager-based run-time load-balancing system. Sample algorithms translated from the high-level nested data-parallel language NESL into C and MPI using this method are significantly faster than the current NESL system, and show the potential for further speedup.
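The group-subdivision idea can be sketched in a few lines; the following Python sketch is illustrative only (quicksort as the sample algorithm and the proportional split rule are assumptions, not the paper's code): a group of processors splits into subgroups sized in proportion to the subproblems, and falls back to a serial algorithm once the group size reaches one.

```python
# Illustrative sketch of recursive processor-group subdivision for
# irregular divide-and-conquer (hypothetical, not the paper's code).

def group_quicksort(data, procs):
    if procs <= 1 or len(data) <= 1:
        return sorted(data)          # serial algorithm at group size one
    pivot = data[len(data) // 2]
    lo = [x for x in data if x < pivot]
    eq = [x for x in data if x == pivot]
    hi = [x for x in data if x > pivot]
    # Split the processor group in proportion to subproblem sizes.
    total = len(lo) + len(hi)
    p_lo = max(1, round(procs * len(lo) / total)) if total else 1
    p_hi = max(1, procs - p_lo)
    return group_quicksort(lo, p_lo) + eq + group_quicksort(hi, p_hi)

print(group_quicksort([5, 3, 8, 1, 9, 2], 4))  # -> [1, 2, 3, 5, 8, 9]
```

In the real system the two recursive calls would execute concurrently on disjoint, asynchronous processor subgroups rather than sequentially as here.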
A common problem that sales consultants face in the field is the selection of an appropriate hardware and software configuration for web farms. Over-provisioning means that the tender will be expensive while under-provisioning will lead to a configuration that does not meet the customer criteria. Indy is a performance modeling environment which allows developers to create custom modeling applications. We have constructed an Indy-based application for defining web farm workloads and topologies. The paper presents an optimization framework that allows the consultant to easily find configurations that meet customers' criteria. The system searches the solution space creating possible configurations, using the web farm models to predict their performance. The optimization tool is then employed to select an optimal configuration. Rather than using a fixed algorithm, the framework provides an infrastructure for implementing multiple optimization algorithms. In this way, the appropriate algorithm can be selected to match the requirements of different types of problems. The framework incorporates a number of novel techniques, including caching results between problem runs, an XML-based configuration language, and an effective method of comparing configurations. We have applied the system to a typical web farm configuration problem and results have been obtained for three standard optimization algorithms.
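The search-with-caching loop described above can be sketched as follows; every name here is hypothetical (this is not Indy's API), and the cost and throughput models are toy assumptions:

```python
# Illustrative sketch (hypothetical names, not Indy's API): enumerate
# candidate configurations, predict each one's performance with a
# model, cache predictions between runs, and keep the cheapest
# configuration that meets the customer's criteria.

def cfg_cost(cfg):
    servers, _network = cfg
    return servers * 1000               # hypothetical price per server

def find_config(candidates, predict, meets_criteria, cache=None):
    cache = {} if cache is None else cache
    best = None
    for cfg in candidates:
        if cfg not in cache:
            cache[cfg] = predict(cfg)   # model evaluation is the costly step
        if meets_criteria(cache[cfg]) and (
                best is None or cfg_cost(cfg) < cfg_cost(best)):
            best = cfg
    return best

# Toy model: throughput scales linearly with server count.
candidates = [(n, "fast-net") for n in range(1, 9)]
predict = lambda cfg: cfg[0] * 250      # requests/sec
meets = lambda perf: perf >= 900        # customer criterion
print(find_config(candidates, predict, meets))  # -> (4, 'fast-net')
```

Passing the same cache dictionary to a later run with overlapping candidates avoids re-evaluating the model, which is the point of caching results between problem runs.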
These are the lecture notes for CS 15-840B, a hands-on class in programming parallel algorithms. The class was taught in the fall of 1992 by Guy Blelloch, using the programming language NESL. It stressed the clean and concise expression of a variety of parallel algorithms. About 35 graduate students attended the class, of whom 28 took it for credit. These notes were written by students in the class, and were then reviewed and organized by Guy Blelloch and Jonathan Hardwick. The sample NESL code has been converted from the older LISP-style syntax into the new ML-style syntax. These notes are not in a polished form, and probably contain several errors and omissions, particularly with respect to references in the literature. Corrections are welcome.
CVL is a library of low-level vector routines callable from C. This library includes a wide variety of vector operations such as elementwise function applications, scans, reduces and permutations. Most CVL routines are defined for segmented and unsegmented vectors. This paper is intended for CVL users and implementors, and assumes familiarity with vector operations and the scan-vector model of parallel computation.
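A segmented scan is representative of the operations in such a library. The Python sketch below is illustrative only (it is not the CVL API, whose routines are C functions): an exclusive plus-scan that restarts at each segment boundary, with segments given by a vector of lengths.

```python
# Illustrative sketch (not the CVL API): a segmented exclusive
# plus-scan. The value vector is partitioned into segments by a
# vector of segment lengths; the prefix sum restarts per segment.

def segmented_plus_scan(values, seg_lengths):
    result, i = [], 0
    for length in seg_lengths:
        running = 0
        for _ in range(length):
            result.append(running)   # exclusive scan: emit before adding
            running += values[i]
            i += 1
    return result

# Two segments: [1, 2] and [3, 4, 5].
print(segmented_plus_scan([1, 2, 3, 4, 5], [2, 3]))  # -> [0, 1, 0, 3, 7]
```

In the scan-vector model this whole operation counts as a single primitive with parallel depth logarithmic in the longest segment.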
This paper describes the design and implementation of a practical parallel algorithm for Delaunay triangulation that works well on general distributions. Although there have been many theoretical parallel algorithms for the problem, and some implementations based on bucketing that work well for uniform distributions, there has been little work on implementations for general distributions. We use the well-known reduction of 2D Delaunay triangulation to finding the 3D convex hull of points on a paraboloid. Based on this reduction we developed a variant of the Edelsbrunner and Shi 3D convex hull algorithm, specialized for the case when the point set lies on a paraboloid. This simplification reduces the work required by the algorithm (number of operations) from O(n log^2 n) to O(n log n). The depth (parallel time) is O(log^3 n) on a CREW PRAM. The algorithm is simpler than previous O(n log n) work parallel algorithms, leading to smaller constants. Initial experiments using a variety of distributions showed that our parallel algorithm was within a factor of 2 in work of the best sequential algorithm. Based on these promising results, the algorithm was implemented using C and an MPI-based toolkit. Compared with previous work, the resulting implementation achieves significantly better speedups over good sequential code, does not assume a uniform distribution of points, and is widely portable due to its use of MPI as a communication mechanism. Results are presented for the IBM SP2, Cray T3D, SGI Power Challenge, and DEC AlphaCluster.
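The lifting map behind the reduction is short enough to sketch. The following Python sketch shows only the lift and the plane-side test it enables (function names are illustrative; this is not the paper's implementation): each 2D point (x, y) is lifted to (x, y, x^2 + y^2), and a lifted point lying below the plane through three lifted points corresponds to the 2D point lying inside their circumcircle, which is exactly the Delaunay condition.

```python
# Illustrative sketch of the paraboloid lifting map (not the
# paper's code): lower-hull facets of the lifted points
# correspond to Delaunay triangles of the original 2D points.

def lift(p):
    x, y = p
    return (x, y, x * x + y * y)

def below_plane(a, b, c, d):
    # True iff lifted d lies below the plane through lifted a, b, c,
    # i.e. d is inside the circumcircle of CCW triangle abc.
    ax, ay, az = lift(a); bx, by, bz = lift(b)
    cx, cy, cz = lift(c); dx, dy, dz = lift(d)
    # Sign of the 3x3 determinant of rows (b-a, c-a, d-a).
    m = [(bx - ax, by - ay, bz - az),
         (cx - ax, cy - ay, cz - az),
         (dx - ax, dy - ay, dz - az)]
    det = (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    return det < 0

# (0.5, 0.5) is inside the circumcircle of the unit right triangle.
print(below_plane((0, 0), (1, 0), (0, 1), (0.5, 0.5)))  # -> True
```

Specializing the convex hull algorithm to point sets already known to lie on the paraboloid is what removes a log factor from the work bound.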
Several recent papers have proposed or analyzed optimal algorithms to route all-to-all personalized communication (AAPC) over communication networks such as meshes, hypercubes and omega switches. However, the constant factors of these algorithms are often an obscure function of system parameters such as link speed, processor clock rate, and memory access time. In this paper we investigate these architectural factors, showing the impact of the communication style, the network routing table, and most importantly, the local memory system, on AAPC performance and permutation routing on the Cray T3D.
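For concreteness, one classic contention-free AAPC schedule can be sketched as follows; this Python sketch is an assumption for illustration (a standard XOR pairing on a power-of-two machine), not the paper's T3D code, and it models only the data movement, not the timing:

```python
# Illustrative sketch (not the paper's code): an AAPC schedule in
# which, at step s, processor p exchanges its personalized block
# with partner p XOR s. On a power-of-two machine every ordered
# pair of processors communicates exactly once.

def aapc(blocks):
    # blocks[p][q] is the block processor p holds destined for q.
    p = len(blocks)
    received = [[None] * p for _ in range(p)]
    for me in range(p):
        received[me][me] = blocks[me][me]   # local copy, no communication
    for step in range(1, p):
        for me in range(p):
            partner = me ^ step             # pairwise exchange partner
            received[me][partner] = blocks[partner][me]
    return received

blocks = [[f"{src}->{dst}" for dst in range(4)] for src in range(4)]
out = aapc(blocks)
print(out[2])  # -> ['0->2', '1->2', '2->2', '3->2']
```

The paper's point is that even for such a schedule, the achieved bandwidth is dominated by per-step local costs (copying blocks through the memory system) as much as by link speed, which is why the constant factors are hard to predict from network parameters alone.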
This paper describes the derivation of an empirically efficient parallel two-dimensional Delaunay triangulation program from a theoretically efficient CREW PRAM algorithm. Compared to previous work, the resulting implementation is not limited to datasets with a uniform distribution of points, achieves significantly better speedups over good serial code, and is widely portable due to its use of MPI as a communication mechanism. Results are presented for a loosely-coupled cluster of workstations, a distributed-memory multicomputer, and a shared-memory multiprocessor. The Machiavelli toolkit used to transform the nested data parallelism inherent in the divide-and-conquer algorithm into achievable task and data parallelism is also described and compared to previous techniques.
We motivate and describe the design and implementation of a system for compiling the high-level programming language NESL into Java. As well as increasing the portability of NESL, this system has enabled us to make existing simulations and algorithm animations available in applet form on the Web. We present performance results showing that current Java virtual machines running the generated code achieve about half the performance of a native implementation of NESL. We conclude that the use of Java as an intermediate language is a viable way to improve the portability of existing high-level programming languages for scientific simulation and computation.
Indy is a new performance modeling framework for the creation of tools for many different classes of performance problems, including capacity planning, bottleneck analysis, etc. Users can plug in their own workload and hardware models while exploiting core shared services such as resource tracking and evaluation engines. We used Indy to create EMOD, a performance analysis tool for database-backed web sites. We validate EMOD using the predicted and observed performance of SVT, a sample e-commerce site.
This manual is a supplement to the language definition of NESL version 3.1. It describes how to use the NESL system interactively and covers features for accessing on-line help, debugging, profiling, executing programs on remote machines, using NESL with GNU Emacs, and installing and customizing the NESL system.
This paper describes the design and implementation in MPI of the parallel vector library CVL, which is used as the basis for implementing nested data-parallel languages such as NESL and Proteus. We outline the features of CVL, and compare the ease of writing and debugging the portable MPI implementation with our experiences writing previous versions in CM-2 Paris, CM-5 CMMD, and PVM 3.0. We give initial performance results for MPI CVL running on the SP-1, Paragon, and CM-5, and compare them with previous versions of CVL running on the CM-2, CM-5, and Cray C90. We discuss the features of MPI that helped and hindered the effort, and make a plea for better support for certain primitives. Finally, we discuss the design limitations of CVL when implemented on current RISC-based MPP architectures, and outline our plans to overcome them by using MPI as a compiler target. CVL and associated languages are available via FTP.
We present our experiences in using Java as an intermediate language for the high-level programming language NESL. First, we describe the design and implementation of a system for translating VCODE--the current intermediate language used by NESL--into Java. Second, we evaluate this translation by comparing the performance of the original VCODE implementation with several variants of the Java implementation. The translator was easy to build, and the generated Java code achieves reasonable performance when using a just-in-time compiler. We conclude that Java is attractive both as a compilation target for rapid prototyping of new programming languages and as a means of improving the portability of existing programming languages.