A new backward error analysis of LU factorization is presented. It yields a sharper upper bound for the forward error and a new definition of the growth factor, which we compare with the well-known Wilkinson growth factor for some classes of matrices. Numerical experiments show that the new growth factor is often of order approximately $\log_2 n$, whereas Wilkinson's growth factor is of order $n$ or $\sqrt n$.
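As a rough illustration of the quantity being compared, the classical Wilkinson growth factor of Gaussian elimination with partial pivoting can be computed directly. This is a minimal dense sketch, not the paper's new analysis; the matrix below is the standard worst-case example with growth $2^{n-1}$:

```python
import numpy as np

def growth_factor_partial_pivoting(A):
    """Wilkinson growth factor of Gaussian elimination with partial
    pivoting: the largest magnitude seen in any intermediate entry,
    divided by max |A_ij|. (Illustrative sketch only.)"""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    denom = np.abs(A).max()
    g = denom
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))   # partial pivoting: pick largest entry in column
        A[[k, p]] = A[[p, k]]                 # row swap
        A[k+1:, k] /= A[k, k]                 # multipliers, stored in place
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
        g = max(g, np.abs(A[k+1:, k+1:]).max())   # track intermediate growth
    return g / denom

# Wilkinson's classical worst case: 1 on the diagonal, -1 below,
# 1 in the last column; growth is exactly 2^(n-1).
n = 8
W = np.eye(n) - np.tril(np.ones((n, n)), -1)
W[:, -1] = 1.0
print(growth_factor_partial_pivoting(W))   # → 128.0
```

For most matrices arising in practice the observed growth is far smaller than this exponential worst case, which is what motivates comparing alternative growth-factor definitions empirically.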
We present parallel and sequential dense QR factorization algorithms that are both optimal (up to polylogarithmic factors) in the amount of communication they perform, and just as stable as Householder QR. Our first algorithm, Tall Skinny QR (TSQR), factors m-by-n matrices in a one-dimensional (1-D) block cyclic row layout, and is optimized for m >> n. Our second algorithm, CAQR (Communication-Avoiding QR), factors general rectangular matrices distributed in a two-dimensional block cyclic layout. It invokes TSQR for each block column factorization.
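The core TSQR reduction can be sketched sequentially: factor each row block independently, then apply QR again to the stacked small R factors. This is a simplified flat-tree simulation in NumPy, not the communication-optimal parallel implementation; the function name is illustrative:

```python
import numpy as np

def tsqr_r(A, n_blocks=4):
    """Flat-tree TSQR sketch: QR-factor each row block independently,
    stack the resulting n-by-n R factors, and QR the stack.
    Returns the final n-by-n R factor; assumes m >> n."""
    blocks = np.array_split(A, n_blocks, axis=0)
    Rs = [np.linalg.qr(B, mode='r') for B in blocks]   # local factorizations
    return np.linalg.qr(np.vstack(Rs), mode='r')       # reduction step

rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 5))
R = tsqr_r(A)
# R agrees with Householder QR of the full matrix up to row signs
R_ref = np.linalg.qr(A, mode='r')
print(np.allclose(np.abs(R), np.abs(R_ref)))   # → True
```

In the parallel setting each block lives on a different processor and the stacking step becomes a reduction tree, which is where the communication savings over column-by-column Householder QR come from.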
Restructuring compilers use dependence analysis to prove that the meaning of a program is not changed by a transformation. A well-known limitation of dependence analysis is that it examines only the memory locations read and written by a statement, and does not assume any particular interpretation for the operations in that statement. Exploiting the semantics of these operations enables a wider set of transformations to be used, and is critical for optimizing important codes such as LU factorization with pivoting. Symbolic execution of programs enables the exploitation of such semantic properties, but it is intractable for all but the simplest programs. In this paper, we propose a new form of symbolic analysis for use in restructuring compilers. Fractal symbolic analysis compares a program and its transformed version by repeatedly simplifying these programs until symbolic analysis becomes tractable, ensuring that equality of simplified programs is sufficient to guarantee equality of the original programs. We present a prototype implementation of fractal symbolic analysis, and show how it can be used to optimize the cache performance of LU factorization with pivoting.
The proliferation of high-performance workstations and the emergence of high-speed networks have attracted a lot of interest in workstation-based supercomputing. We project that workstation-based environments with supercomputing capabilities will be available in the not-so-distant future. However, a number of hardware and software issues have to be resolved before the full potential of these workstation-based supercomputing environments can be exploited. The presented research has two main objectives: (1) to investigate the limitations of the communication techniques used in current workstation-based systems and to identify a set of requirements that must be satisfied to achieve workstation-based supercomputing; (2) to use these requirements to develop software and hardware support that enables workstation-based supercomputing. The performance of two applications, the LU factorization of dense matrices and the calculation of the FFT, on two platforms, the iPSC/860 supercomputer and on a clu...
The use of multibody formulations based on Cartesian or natural coordinates leads to sets of differential-algebraic equations that have to be solved. The difficulty of providing compatible initial positions and velocities for a general spatial multibody model, and the finite precision of such data, result in initial errors that must be corrected during the forward dynamic solution of the system equations of motion. As the position and velocity constraint equations are not explicitly involved in the solution procedure, any integration error leads to the violation of these equations in the long run. Another problem that is very often impossible to avoid is the presence of redundant constraints. Even with no initial redundancy, it is possible for some systems to reach singular configurations in which kinematic constraints become temporarily redundant. In this work several procedures to stabilize the solution of the equations of motion and to handle redundant constraints are revisited. The Baumgarte stabilization, augmented Lagrangian and coordinate partitioning methods are discussed in terms of their efficiency and computational costs. The LU factorization with full pivoting of the Jacobian matrix directs the choice of the set of independent coordinates required by the coordinate partitioning method. Even when no particular stabilization method is used, a Newton–Raphson iterative procedure is still required in the initial time step to correct the initial positions and velocities, thus requiring the selection of the independent coordinates. However, this initial selection does not guarantee that other constraints do not become redundant during the motion of the system. Two procedures, based on the singular value decomposition and Gram–Schmidt orthogonalization, are revisited for this purpose. The advantages and drawbacks of the different procedures, used separately or in conjunction with each other, and their computational costs are finally discussed.
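A sketch of how LU factorization with full pivoting can direct the choice of independent coordinates from a constraint Jacobian (the function name and the dependent/independent split convention below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def partition_coordinates(J, tol=1e-10):
    """Split coordinates into dependent/independent sets via Gaussian
    elimination with full pivoting on the constraint Jacobian J (m x n).
    Columns chosen as pivots correspond to dependent coordinates; the
    remaining columns are the independent set. Redundant (numerically
    dependent) constraint rows simply stop contributing pivots."""
    J = np.array(J, dtype=float)
    m, n = J.shape
    col_perm = np.arange(n)
    rank = 0
    for k in range(min(m, n)):
        sub = np.abs(J[k:, k:])
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        if sub[i, j] <= tol:                     # remaining constraints are redundant
            break
        J[[k, k + i]] = J[[k + i, k]]            # row swap
        J[:, [k, k + j]] = J[:, [k + j, k]]      # column swap (full pivoting)
        col_perm[[k, k + j]] = col_perm[[k + j, k]]
        J[k + 1:, k] /= J[k, k]
        J[k + 1:, k + 1:] -= np.outer(J[k + 1:, k], J[k, k + 1:])
        rank += 1
    return col_perm[:rank], col_perm[rank:]      # (dependent, independent)

# two constraints on three coordinates: q0 + q1 = c1, q2 = c2
J = np.array([[1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
dep, ind = partition_coordinates(J)
print(dep, ind)   # → [0 2] [1]
```

Full pivoting picks the best-conditioned square subsystem to solve for the dependent coordinates, which is why it is preferred here over partial pivoting despite its higher cost.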
We propose an energy-balanced allocation of a real-time application onto a single-hop cluster of homogeneous sensor nodes connected with multiple wireless channels. An epoch-based application consisting of a set of communicating tasks is considered. Each sensor node is equipped with discrete dynamic voltage scaling (DVS). The time and energy costs of both computation and communication activities are considered. We propose both an Integer Linear Programming (ILP) formulation and a polynomial-time 3-phase heuristic. Our simulation results show that for small-scale problems (with ≤10 tasks), up to 5x lifetime improvement is achieved by the ILP-based approach, compared with the baseline where no DVS is used. Also, the 3-phase heuristic achieves up to 63% of the system lifetime obtained by the ILP-based approach. For large-scale problems (with 60–100 tasks), up to 3.5x lifetime improvement can be achieved by the 3-phase heuristic. We also incorporate techniques for exploring the energy-latency tradeoffs of communication activities (such as modulation scaling), which leads to 10x lifetime improvement in our simulations. Simulations were further conducted for two real-world problems – LU factorization and the Fast Fourier Transform (FFT). Compared with the baseline where neither DVS nor modulation scaling is used, we observed up to 8x lifetime improvement for the LU factorization algorithm and up to 9x improvement for FFT.
This paper extends the ideas behind Bareiss's fraction-free Gauss elimination algorithm in a number of directions. First, in the realm of linear algebra, algorithms are presented for fraction-free LU "factorization" of a matrix and for fraction-free forward and back substitution. These algorithms are valid not just for integer computation but also for any matrix system whose entries are taken from a unique factorization domain, such as a polynomial ring. The second part of the paper introduces the application of the fraction-free formulation to resultant algorithms for solving systems of polynomial equations. In particular, the use of fraction-free polynomial arithmetic and triangularization algorithms in computing the Dixon resultant of a polynomial system is discussed.
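The one-step Bareiss elimination underlying such fraction-free algorithms can be sketched for integer matrices; every division below is exact, so all intermediate entries stay in the ring. This is a determinant-only sketch of the classical scheme, not the paper's extended LU or resultant routines:

```python
def bareiss_det(M):
    """Fraction-free (Bareiss) elimination on a square integer matrix.
    The two-term update divides by the previous pivot; the division is
    always exact, so entries remain integers throughout, and the last
    pivot is det(M)."""
    A = [list(row) for row in M]
    n = len(A)
    prev = 1
    for k in range(n - 1):
        if A[k][k] == 0:
            for r in range(k + 1, n):          # find a nonzero pivot below
                if A[r][k] != 0:
                    A[k], A[r] = A[r], A[k]
                    A[k] = [-x for x in A[k]]  # negate to preserve det sign
                    break
            else:
                return 0                       # column of zeros: singular
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                # Bareiss update: exact integer division by previous pivot
                A[i][j] = (A[i][j] * A[k][k] - A[i][k] * A[k][j]) // prev
            A[i][k] = 0
        prev = A[k][k]
    return A[n - 1][n - 1]

print(bareiss_det([[1, 2, 3], [4, 5, 6], [7, 8, 10]]))   # → -3
```

The same update rule applied over a polynomial ring keeps entries as polynomials rather than rational functions, which is the property the resultant algorithms in the second part of the paper exploit.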