Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
SIAM J. SCI. COMPUT. Vol. 25, No. 6, pp. 1860–1879 c 2004 Society for Industrial and Applied Mathematics  PERMUTING SPARSE RECTANGULAR MATRICES INTO BLOCK-DIAGONAL FORM∗ CEVDET AYKANAT† , ALI PINAR‡ , AND ÜMIT V. ÇATALYÜREK§ Abstract. We investigate the problem of permuting a sparse rectangular matrix into blockdiagonal form. Block-diagonal form of a matrix grants an inherent parallelism for solving the deriving problem, as recently investigated in the context of mathematical programming, LU factorization, and QR factorization. To represent the nonzero structure of a matrix, we propose bipartite graph and hypergraph models that reduce the permutation problem to those of graph partitioning by vertex separator and hypergraph partitioning, respectively. Our experiments on a wide range of matrices, using the state-of-the-art graph and hypergraph partitioning tools MeTiS and PaToH, revealed that the proposed methods yield very effective solutions both in terms of solution quality and runtime. Key words. coarse-grain parallelism, sparse rectangular matrices, singly bordered blockdiagonal form, doubly bordered block-diagonal form, graph partitioning by vertex separator, hypergraph partitioning AMS subject classifications. 65F05, 65F50, 65F20, 65K05, 65Y05, 05C50, 05C65, 05C85, 05C90 DOI. 10.1137/S1064827502401953 1. Introduction. Block-diagonal structure of sparse matrices has been exploited for coarse-grain parallelization of various algorithms such as decomposition methods for linear programming, LU factorization, and QR factorization. In these methods, block diagonals give rise to subproblems that can be solved independently, whereas the border incurs a coordination task to combine the subproblem solutions into a solution of the original problem and is usually less amenable to parallelization. The objective of this work is to enhance these decomposition-based solution methods by transforming the underlying matrix into a block-diagonal form with small border size while maintaining a given balance condition on the sizes of the diagonal blocks. Our target problem is permuting rows and columns of an M × N sparse matrix A into a K-way singly bordered block-diagonal (SB) form: ⎤ ⎤ ⎡ ⎡ π B1 A11 . . . Aπ1K ⎥ ⎢ .. .. ⎥ ⎢ .. .. ⎥ ⎢ ⎢ . . . ⎥ (1.1) Aπ = P A Q = ⎢ . ⎥ = ASB , ⎥=⎢ π ⎦ ⎣ ⎣ Aπ . . . AKK BK ⎦ K1 AπS1 . . . AπSK R1 . . . RK where P and Q denote, respectively, the row and column permutation matrices to be determined. In (1.1), each row of the Mc × N border submatrix R = (R1 · · · RK ) ∗ Received by the editors February 4, 2002; accepted for publication (in revised form) August 25, 2003; published electronically May 25, 2004. http://www.siam.org/journals/sisc/25-6/40195.html † Computer Engineering Department, Bilkent University, Ankara, Turkey (aykanat@cs.bilkent. edu.tr). This author’s work was partially supported by the Scientific and Technical Research Council of Turkey (TÜBİTAK) under project EEEAG-199E013. ‡ Lawrence Berkeley Laboratory, Berkeley, CA 94720 (apinar@lbl.gov). This author’s work was supported by the Director, Office of Science, Division of Mathematical, Information, and Computational Sciences of the U.S. Department of Energy under contract DE-AC03-76SF00098. § Department of Biomedical Informatics, Ohio State University, Columbus, OH 43210 (catalyurek. 1@osu.edu). This author’s work was supported by the National Science Foundation under grant ACI0203846. 1860 PERMUTING A SPARSE MATRIX TO BLOCK-DIAGONAL FORM 1861 is called a column-coupling or simply coupling row. Each coupling row has nonzeros in the columns of at least two diagonal blocks. The objective is to permute matrix A into an SB form ASB such that the number (Mc ) of coupling rows is minimized while a given balance criterion is satisfied. The SB form in (1.1) is referred to here as the primal SB form, whereas in the dual SB form they are the columns that constitute the border. We also consider the problem of permuting rows and columns of a sparse matrix A into a K-way doubly bordered block-diagonal (DB) form: (1.2) Aπ11 ⎢ .. ⎢ Aπ = P A Q = ⎢ . ⎣ Aπ K1 AπS1 ⎡ . . . Aπ1K .. .. . . . . . AπKK . . . AπSK ⎤ ⎡ B1 Aπ1S .. ⎥ ⎢ ⎢ . ⎥ ⎥=⎢ π ⎦ ⎣ AKS π ASS R1 C1 .. . ... BK RK CK D ⎤ ⎥ ⎥ ⎥ = ADB . ⎦ In equation (1.2), each row and column of matrix R = (R1 · · · RK D) and C = T (C1T · · · CK DT )T is called a coupling row and a coupling column, respectively. The objective is to permute matrix A into a DB form ADB such that the sum of the number of coupling rows and columns is minimized while a given balance criterion is satisfied. The literature that addresses this problem is very rare and recent. Ferris and Horn [12] proposed a two-phase approach for A-to-ASB transformation. In the first phase, matrix A is transformed into a DB form ADB as an intermediate form. In the second phase, ADB is transformed into an SB form through column-splitting as discussed in section 3.3. Our initial results of this problem were presented in two conference papers [38, 39]. In [38], we proposed the basics of our hypergraph model and how to exploit this model to permute matrices to block-diagonal form. In our subsequent work [39] we proposed our graph models. Later Hu, Maguire, and Blake [24] independently investigated the same problem without spelling out the exact model to represent the sparsity structures of matrices or the details of their algorithm for permutation. In this paper, we present a complete work on the problem of permuting sparse matrices to block-diagonal form. We consider permutations to DB form as well as permutations to primal and dual SB forms. Our proposed graph and hypergraph models for sparse matrices reduce the problem of permuting a sparse matrix to block-diagonal form to the well-known problems of graph partitioning by vertex separator (GPVS) and hypergraph partitioning (HP). GPVS is widely used in nested-dissection-based low-fill orderings for factorization of symmetric, sparse matrices, whereas HP is widely used for solving the circuit partitioning and placement problems in VLSI layout design. Our models enable adoption of algorithms and tools for these well-studied problems to permute sparse matrices to block-diagonal form efficiently and effectively. In this work, we show that the A-to-ADB transformation problem can be described as a GPVS problem on the bipartite graph representation of A. The objective in the K-way GPVS problem is to find a subset of vertices (vertex separator) of minimum size that disconnects the K vertex parts while maintaining a given balance criterion on the vertex counts of K parts. In this model, minimizing the size of the vertex separator corresponds to minimizing the sum of the number of coupling rows and columns in ADB . We propose a one-phase approach for permuting A directly into an SB form. In this approach, a hypergraph model—proposed in an earlier version of this work [38]— is exploited to represent rectangular matrices. The proposed model reduces the 1862 CEVDET AYKANAT, ALI PINAR, AND ÜMIT V. ÇATALYÜREK A-to-ASB transformation problem into the HP problem. In this model, minimizing the size of the hyperedge separator directly corresponds to minimizing the number of coupling rows in ASB . The organization of the paper is as follows: In the next section we will discuss how block-diagonal structure can be exploited in parallelization of various applications. Some preliminary information on graph and hypergraph partitioning and ADB -to-ASB transformation are presented in section 3. Our proposed models for A-to-ADB and A-to-ASB transformations are explained in sections 4 and 5, respectively. Section 6 overviews recent graph and hypergraph partitioning algorithms and tools. Experimental evaluation of the proposed models is presented in section 7. And finally section 8 concludes the paper. 2. Applications. Block-diagonal structure of a matrix grants an inherent parallelism for the solution of the deriving problem. In this section, we will exemplify how to exploit this parallelism in three fundamental problems of linear algebra and optimization: linear programming, and LU and QR factorizations. 2.1. Linear programming. Exploiting the block-angular structure of linear programs (LPs) dates back to the work of Dantzig and Wolfe [11], when the motivation was solving large LPs with limited memory. Later studies investigated parallelization techniques [15, 23, 34]. The proposed techniques [11, 31, 35] led to iterative algorithms, where each iteration involves solving K independent LP subproblems corresponding to the block constraints followed by a coordination phase for coordinating the solutions of the subproblems according to the coupling constraints. These approaches have two nice properties. First, as the solution times of most LPs in practice increase as a quadratic or cubic function with the size of the problem, it is more efficient to solve a set of small problems than a single aggregate problem. Second, they give rise to a natural, coarse-grain parallelism that can be exploited by processing the subproblems concurrently. Coarse-grain parallelism inherent in these approaches has been exploited in several successful parallel implementations on distributed-memory multicomputers through the manager-worker scheme [12, 15, 23, 34]. At each iteration, the LP subproblems are solved concurrently by worker processors, whereas a serial master problem is solved by the manager processor in the coordination phase. As proposed in [12], these successful decomposition-based approaches can be exploited for coarse-grain parallel solution of general LP problems by transforming them into block-angular forms. In the matrix theoretical view, this transformation problem can be described as permuting the rectangular constraint matrix of the LP problem into an SB form, as shown in (1.1) with minimum border size, while maintaining a given balance criterion on the diagonal blocks. Note that row and column permutation correspond to reordering of the constraints and variables of the given LP problem. Here, minimizing the border size relates to minimizing the size of the master problem. The size of the master problem has been reported to be crucial for the parallel performance of these algorithms [12, 34]. First, it affects the convergence of the overall iterative algorithm. Second, in most algorithms the master problem is solved serially by the manager processor. Finally, it determines the communication requirement between phases. It is also important to have equal-sized blocks for load balancing in the parallel phase. It is worth noting that exploiting the block-angular structure of the constraint matrices is not restricted to LPs and can be applied in different optimization problems [36, 42]. PERMUTING A SPARSE MATRIX TO BLOCK-DIAGONAL FORM 1863 2.2. LU factorization. In most scientific computing applications, the core of the computation is solving a system of linear equations. Direct methods like LU factorization are commonly used for the solution of nonsymmetric systems for their numerical robustness. A coarse-grain parallel LU factorization scheme [24, 41] is to permute the square, nonsymmetric coefficient matrix to a DB form, as shown in (1.2). Notice that diagonal blocks of the permuted matrix constitute independent subproblems and can be factored concurrently. Pivots are chosen within the blocks for concurrency. Rows/columns that cannot be eliminated, including those that cannot be eliminated due to numerical reasons, are permuted to the end of the matrix to achieve a partially factored matrix in DB form as ⎡ ⎤ L1 U1 U1′ ⎢ .. ⎥ .. ⎢ . . ⎥ ⎢ ⎥. ′ ⎦ ⎣ LK UK UK F L′1 ... L′K In this matrix, Lk Uk constitutes the factored form of Aπk = Bk after the unfactored rows/columns are permuted to the end of the matrix. In a subsequent phase, the coupling rows and columns, along with unfactored columns and rows from the blocks, are factored. It is possible to parallelize this step with different (and usually less efficient) techniques. We stated two objectives during permutation to DB form. Our first objective is to minimize the number of coupling rows and columns, which relates to minimizing the work for the second phase, thus increasing concurrency. Our second objective of equal-sized blocks provides load balance during factorization of the blocks. 2.3. QR factorization. Least squares is one of the fundamental problems in numerical linear algebra and is defined as follows: min  Ax − b 2 , x where A is an M × N matrix with M ≥ N . QR factorization is a method commonly used to solve least-squares problems. In this method, matrix A is factored into an orthogonal M ×M matrix Q and an upper triangular N ×N matrix R with nonnegative diagonal elements so that   R A=Q . 0 Then we can solve for Rx = b′ to get a solution, where b′ is composed of the first N entries of vector b. Computationally, this problem is very similar to LU factorization; thus we can use the same scheme to parallelize QR factorization. Given a matrix in dual SB form, ⎤ ⎡ B1 C1 ⎢ B2 C2 ⎥ ⎥ ⎢ ⎢ .. ⎥ , . .. ⎣ . ⎦ BK CK the diagonal blocks of the matrix constitute the independent subblocks and can be factored independently. Thus, first phase is composed of factoring Bk and the associated 1864 CEVDET AYKANAT, ALI PINAR, AND ÜMIT V. ÇATALYÜREK coupling columns in Ck concurrently, so that [Bk Ck ] = Qk Rk 0 Sk Ck′ for k = 1, 2, . . . , K. T ′ In a subsequent phase, we factor C ′ = C1′ , . . . , CK [4]. So, in permuting a given matrix A into a dual SB form, minimizing the number of coupling columns minimizes the work on the second phase of the algorithm, and equal-sized blocks provide load balance for the first phase. 3. Preliminaries. In this section we will provide the basic definitions and techniques that will be adopted in the remainder of this paper. 3.1. Graph partitioning. An undirected graph G = (V, E) is defined as a set of vertices V and a set of edges E. Every edge eij ∈ E connects a pair of distinct vertices vi and vj . We use the notation Adj(vi ) to denote the set of vertices adjacent to vertex vi in graph G. We extend this operator to include the adjacency set of a vertex subset V ′ ⊂ V, i.e., Adj(V ′ ) = {vj ∈ V − V ′ : vj ∈ Adj(vi ) for some vi ∈ V ′ }. The degree di of a vertex vi is equal to the number of edges incident to vi , i.e., di = |Adj(vi )|. An edge subset ES is a K-way edge separator if its removal disconnects the graph into at least K connected components. A vertex subset VS is a K-way vertex separator if the subgraph induced by the vertices in V − VS has at least K connected components. The objective of graph partitioning is finding a separator, whose removal decomposes the graph into disconnected subgraphs with balanced sizes. The separator can be a set of edges or a set of vertices, and associated problems are called graph partitioning by edge separator (GPES) and graph partitioning by vertex separator (GPVS) problems, respectively. Both GPES and GPVS problems are known to be NP-hard [5]. Balance among subgraphs is usually defined by cumulative effect of weights assigned to vertices. Some alternatives have been studied recently [40]. We proceed with formal definitions. ΠES = {V1 , V2 , . . . , VK } is a K-way vertex partition of G by edge separator ES ⊂ E if the following conditions hold: Vk ⊂ V and Vk = ∅ for K 1 ≤ k ≤ K; Vk ∩ Vℓ = ∅ for 1 ≤ k < ℓ ≤ K; k=1 Vk = V. Edges between the vertices of different parts belong to ES and are called cut (external) edges, and all other edges are called uncut (internal) edges. Definition 3.1 (GPES problem). Given a graph G = (V, E), an integer K, and a balance criterion for subgraphs, the GPES problem is finding a K-way vertex partition ΠES = {V1 , V2 , . . . , VK } of G by edge separator ES that satisfies the balance criterion with minimum cost. The cost is defined as  cost(ΠES ) = (3.1) wij , eij ∈ES where wij is the weight of edge eij = (vi , vj ). The GPVS problem is similar, except that a subset of vertices, as opposed to edges, serve as the separator. ΠV S = {V1 , V2 , . . . , VK ; VS } is a K-way vertex partition of G by vertex separator VS ⊂ V if the following conditions hold: Vk ⊂ V and Vk = ∅ for 1 ≤ k ≤ K; Vk ∩ Vℓ = ∅ for 1 ≤ k < ℓ ≤ K and Vk ∩ VS = ∅ for 1 ≤ k < K; K k=1 Vk ∪ VS = V; removal of VS gives K disconnected parts V1 , V2 , . . . , VK (i.e., Adj(Vk ) ⊆ VS for 1 ≤ k ≤ K). A vertex vi ∈ Vk is said to be a boundary vertex of part Vk if it is adjacent to a vertex in VS . A vertex separator is said to be narrow if no subset of it forms a separator, and wide otherwise. PERMUTING A SPARSE MATRIX TO BLOCK-DIAGONAL FORM 1865 Definition 3.2 (GPVS problem). Given a graph G = (V, E), an integer K, and a balance criterion for subgraphs, the GPVS problem is finding a K-way vertex separator ΠV S = {V1 , V2 , . . . , VK ; VS } that satisfies the balance criterion, with minimum cost, where the cost is defined as (3.2) cost(ΠV S ) = |VS | . The techniques for solving GPES and GPVS problems are closely related, as will be further discussed in section 6. An indirect approach to solving the GPVS problem is to first find an edge separator through GPES, and then translate it to a vertex separator. After finding an edge separator, this approach takes vertices adjacent to separator edges as a wide separator to be refined to a narrow separator, with the assumption that a small edge separator yields a small vertex separator. The approach adopted by Ferris and Horn [12] falls into this class. The wide-to-narrow refinement problem is described as a minimum vertex cover problem on the bipartite graph induced by the cut edges. A minimum vertex cover can be taken as a narrow separator for the whole graph, because each cut edge will be adjacent to a vertex in the vertex cover. 3.2. Hypergraph partitioning. A hypergraph H = (U, N ) is defined as a set of nodes (vertices) U and a set of nets (hyperedges) N among those vertices. We refer to the vertices of H as nodes, to avoid the confusion between graphs and hypergraphs. Every net ni ∈ N is a subset of nodes, i.e., ni ⊆ U. The nodes in a net ni are called its pins and denoted as P ins(ni ). We extend this operator to include the pin list of a net subset N ′ ⊂ N , i.e., P ins(N ′ ) = ni ∈N ′ P ins(ni ). The size si of a net ni is equal to the number of its pins, i.e., si = |P ins(ni )|. The set of nets connected to a node uj is denoted as N ets(uj ). We also extend this operator to include the net list of a node subset U ′ ⊂ U, i.e., N ets(U ′ ) = uj ∈U ′ N ets(uj ). The degree dj of a node uj is equal to the number of nets it is connected to, |N ets(uj )|. The total  i.e., dj =  number p of pins denotes the size of H where p = ni ∈N si = uj ∈U dj . Graph is a special instance of hypergraph such that each net has exactly two pins. ΠHP = {U1 , U2 , . . . , UK } is a K-way node partition of H if the following conditions K hold: Uk ⊂ U and Uk = ∅ for 1 ≤ k ≤ K; Uk ∩Uℓ = ∅ for 1 ≤ k < ℓ ≤ K; k=1 Uk = U. In a partition ΠHP of H, a net that has at least one pin (node) in a part is said to connect that part. Connectivity set Λi of a net ni is defined as the set of parts connected by ni . Connectivity λi = |Λi | of a net ni denotes the number of parts connected by ni . A net ni is said to be cut (external) if it connects more than one part (i.e., λi > 1), and uncut (internal) otherwise (i.e., λj = 1). A net ni is said to be an internal net of a part Uk if it connects only part Uk , i.e., Λi = {Uk }, which also means P ins(ni ) ⊆ Uk . The set of internal nets of a part Uk is denoted as Nk for k = 1, . . . , K, and the set of external nets of a partition ΠHP is denoted as NS . So, although ΠHP is defined as a K-way partition on the node set of H, it can also be considered as inducing a (K + 1)-way partition {N1 , . . . , NK ; NS } on the net set. NS can be considered as a net separator whose removal gives K disconnected node parts U1 , . . . , UK as well as K disconnected net parts N1 , . . . , NK . Definition 3.3 (HP problem). Given a hypergraph H = (U, N ), an integer K, and a balance criterion for subhypergraphs, the HP problem is finding a K-way partitioning ΠHP = {U1 , U2 , . . . , UK } of H that satisfies the balance criterion, and minimizes the cost, which is defined as (3.3) cost(ΠHP ) = |NS | . 1866 CEVDET AYKANAT, ALI PINAR, AND ÜMIT V. ÇATALYÜREK 7 15 1 16 6 121 5 17 10 9 14 2 31 122 131 4 8 11 18 32 132 14 7 15 1 16 6 5 17 10 9 14 2 4 8 11 18 3 12 13 5 14 4 5 9 4 2 9 6 2 10 6 10 13 8 11 13 8 12 7 11 3 12 15 7 1 3 16 15 17 1 18 +1 -1 +1 -1 +1 -1 Fig. 3.1. Column-splitting process. The above metric of cost is often referred to as the cutsize metric in VLSI community. The connectivity metric is defined as  cost(ΠHP ) = (3.4) (λi − 1) ni ∈NS and is frequently used in VLSI [32] and scientific computing communities [8]. 3.3. Column-splitting method for ADB -to-ASB transformation. In the second phase of the Ferris–Horn (FH) algorithm [12], ADB is transformed into an SB form through the column-splitting technique used in stochastic programming to treat anticipativity [37]. In this technique, we consider the variables corresponding to the coupling columns. Consider a coupling column cj in submatrix C = T (C1T · · · CkT · · · CK DT )T of ADB , and let Λj denote the set of Ck ’s that have at least one nonzero in column cj . The nonzeros of a coupling column cj is split into |Λj | − 1 columns such that each new column includes nonzeros in rows of only one block. That is, we introduce one copy ckj of column cj for each block Ck ∈ Λj to decouple Ck from all other blocks in Λj on variable xj , so that ckj is permuted to be a column of Bk . Then we add |Λj | − 1 coupling constraints as coupling rows into ADB that force these variables {xkj : Ck ∈ Λj } all to be equal. Note that this splitting process for column cj increases both the row and column dimensions of matrix ASB by |Λj | − 1. Figure 3.1 depicts the column-splitting process on the ADB matrix obtained in Figure 4.2b. 4. Bipartite graph model for A-to-ADB transformation. In this section, we show that the A-to-ADB transformation problem can be described as a GPVS problem on the bipartite graph representation of A. In the bipartite graph model, M × N matrix A = (aij ) is represented as a bipartite graph BA = (V, E) on M + N vertices with the number of edges equal to the number of nonzeros in A. Each row and column of A is represented by a vertex in BA so that vertex sets R and C representing the rows and columns of A, respectively, form the vertex bipartition V = R ∪ C with |R| = M and |C| = N . There exists an edge between a row vertex ri ∈ R and a column vertex cj ∈ C if and only if the respective matrix entry aij is nonzero. So, Adj(ri ) and Adj(cj ) effectively represent the sets of columns and rows that have nonzeros in row i and column j, respectively. Figure 4.2a displays the bipartite graph representation of the sample matrix given in Figure 4.1. PERMUTING A SPARSE MATRIX TO BLOCK-DIAGONAL FORM 1 2 3 4 5 6 7 8 1867 9 10 11 12 13 14 15 16 17 18 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Fig. 4.1. A 15 × 18 sample matrix A. V1 c7 c15 r14 c16 c17 VS r4 r9 V2 7 15 1 16 6 r6 r1 c1 r5 r2 c12 c10 r15 c14 c13 r8 c9 c5 4 8 11 18 3 12 13 4 9 2 c2 c6 2 5 r10 r13 c3 5 17 10 9 14 14 6 10 13 8 11 12 r11 r3 r12 c18 7 r7 V3 c4 c8 (a) c11 3 15 1 (b) Fig. 4.2. (a) Bipartite graph representation BA of the sample A matrix given in Figure 4.1 and 3-way partitioning ΠV S of BA by vertex separator; (b) 3-way DB form ADB of A induced by ΠV S . Consider a K-way partition ΠV S = {V1 , . . . , VK ; VS } of BA , where Vk = Rk ∪ Ck for k = 1, . . . , K and VS = RS ∪ CS with Rk , RS ⊆ R and Ck , CS ⊆ C. ΠV S can be decoded as a partial permutation on the rows and columns of A to induce a permuted matrix Aπ . In this permutation, the rows and columns associated with the vertices in Rk+1 and Ck+1 are ordered after the rows and columns associated with the vertices in Rk and Ck for k = 1, . . . , K − 1, and the rows and columns associated with the vertices in RS and CS are ordered last as the coupling rows and columns, respectively. Theorem 4.1. Let BA = (V, E) be the bipartite graph representation of a given matrix A. A K-way vertex separator ΠV S = {V1 , V2 , . . . , VK ; VS } of BA gives a permutation of A to K-way DB form ADB , where row and column vertices in Vk constitute the rows and columns of the kth diagonal block of ADB , and row and column vertices in VS constitute the coupling rows and columns of ADB . Thus, • minimizing the size of the separator minimizes the border size; • balance among subgraphs infer balance among diagonal submatrices. Proof. Consider a row vertex ri ∈ Rk and a column vertex cj ∈ Ck of part Vk in a partition ΠV S of BA . Since Adj(ri ) ⊆ Ck ∪ CS , ri ∈ Rk corresponds to permuting 1868 CEVDET AYKANAT, ALI PINAR, AND ÜMIT V. ÇATALYÜREK all nonzeros of row i of A into either submatrix Aπkk or submatrices Aπkk and AπkS depending on ri being a nonboundary or a boundary vertex of Vk , respectively. So, all nonzeros in the kth row slice Aπk∗ of Aπ will be confined to the Aπkk and AπkS matrices. Since Adj(cj ) ⊆ Rk ∪ RS , cj ∈ Ck corresponds to permuting all nonzeros of column j of A into either submatrix Aπkk or submatrices Aπkk and AπSk of Aπ depending on cj being a nonboundary or a boundary vertex of Vk , respectively. So, all nonzeros in the kth column slice Aπ∗k of Aπ will be confined to the Aπkk and AπSk matrices. Hence, Aπ will be in a DB form, as shown in (1.2), with Aπkk = Bk , AπkS = Ck , and AπSk = Rk for k = 1, . . . , K, and AπSS = D. The number of coupling rows and columns in Aπ is equal to, respectively, the number of row and column vertices in the separator VS , i.e., Mc = |RS | and Nc = |CS |. So, in GPVS of BA , minimizing the separator size according to (3.2) corresponds to minimizing the sum of the number of coupling rows and columns in Aπ , since |VS | = |RS | + |CS | = Mc + Nc . The row and column dimensions of the kth diagonal block Bk of Aπ is equal to, respectively, the number of row and column vertices in part Vk , i.e., Mk = |Rk | and Nk = |Ck | for k = 1, . . . , K. So, the row-vertex and column-vertex counts of the parts {V1 , . . . , VK } can be used to maintain the required balance criterion on the dimensions of the diagonal blocks {B1 , . . . , BK } of Aπ . Figure 4.2a displays a 3-way GPVS of BA , and Figure 4.2b shows a corresponding partial permutation that transforms matrix A of Figure 4.1 into a 3-way DB form ADB . 5. Hypergraph model for A-to-ASB transformation. In this section, we show that A-to-ASB transformation can be described as an HP problem on a hypergraph representation of A. In our previous studies [7, 8, 38, 39], we proposed two hypergraph models, namely, row-net and column-net models, for representing rectangular as well as symmetric and nonsymmetric square matrices. These two models are duals: the row-net representation of a matrix is equal to the column-net representation of its transpose. Here, we describe and discuss only the row-net model for permuting a matrix A into a primal SB form, whereas the column-net model can be used for permuting A into a dual SB form. Because of the duality between the rownet and column-net models, permuting A into a dual SB form using the column-net model on A is the same as permuting AT into a primal SB form using the row-net model on AT . In the (row-net) hypergraph model, an M ×N matrix A = (aij ) is represented as a hypergraph HA = (U, N ) on N nodes and M nets with the number of pins equal to the number of nonzeros in matrix A. Node and net sets U and N correspond, respectively, to the columns and rows of A. There exist one net ni and one node uj for each row i and column j, respectively. Net ni ⊆ U contains the nodes corresponding to the columns that have a nonzero entry in row i, i.e., uj ∈ ni if and only if aij = 0. That is, P ins(ni ) represents the set of columns that have a nonzero in row i of A, and in a dual manner N ets(uj ) represents the set of rows that have a nonzero in column j of A. So, the size si of a net ni is equal to the number of nonzeros in row i of A, and the degree dj of a node uj is equal to the number of nonzeros in column j of A. Figure 5.1a displays the hypergraph representation of the 16 × 18 sample matrix in Figure 4.1. Recently, we exploited the proposed row-net (column-net) model for columnwise (rowwise) decomposition of sparse matrices for parallel matrix-vector multiplication [7, 8]. In that application, nodes represent units of computation and nets encode multiway data dependencies. In [7, 8], we showed that a one-dimensional matrix partitioning problem can be modeled as an HP problem in which the connectivity 1869 PERMUTING A SPARSE MATRIX TO BLOCK-DIAGONAL FORM U1 U2 17 7 2 15 14 7 15 1 16 12 10 1 16 5 2 5 18 13 4 3 8 11 14 5 14 15 6 6 17 10 9 14 9 6 1 10 2 4 9 4 9 5 13 12 2 6 8 10 13 3 13 18 11 3 12 12 11 3 7 4 8 11 U3 7 1 15 8 (a) (b) Fig. 5.1. (a) Row-net hypergraph representation HA of the sample A matrix shown in Figure 4.1 and 3-way partitioning ΠHP of HA ; (b) 3-way SB form ASB of A induced by ΠHP . metric in (3.4) is exactly equal to the parallel communication volume. The proposed HP model overcomes some flaws and limitations of the standard GPES models, which are also addressed by Hendrickson and Kolda [18, 19]. In this work, we show that the A-to-ASB transformation problem can be described as an HP problem in which the cutsize metric in (3.3) is exactly equal to the number of coupling rows in ASB . Theorem 5.1. Let HA = (U, N ) be the hypergraph representation of a given matrix A. A K-way partition ΠHP = {U1 , . . . , UK } = {N1 , . . . , NK ; NS } of HA gives a permutation of A to K-way SB form, where nodes in Uk and internals nets in Nk , respectively, constitute the columns and rows of the kth diagonal block of ASB , and external nets in NS constitute the coupling rows of ASB . Thus, • minimizing the cutsize minimizes the number of coupling rows; • balance among subhypergraphs infer balance among diagonal submatrices. Proof. Consider a K-way partition ΠHP = {U1 , . . . , UK } = {N1 , . . . , NK ; NS } of HA . ΠHP can be decoded as a partial permutation on the rows and columns of A to induce a permuted matrix Aπ . In this permutation, the columns associated with the nodes in Uk+1 are ordered after the columns associated with the nodes in Uk for k = 1, . . . , K − 1. The rows associated with the internal nets (Nk+1 ) of Uk+1 are ordered after the rows associated with the internal nets (Nk ) of Uk for k = 1, . . . , K −1, where the rows associated with the external nets (NS ) are ordered last as the coupling rows. That is, a node uj ∈ Uk corresponds to permuting column j of A to the kth T  column slice Aπ∗k = (Aπ1k )T · · · (AπKk )T (AπSk )T of Aπ . An internal net ni of Uk corresponds to permuting row i of A to the kth row slice Aπk∗ = (Aπk1 · · · AπkK ) of Aπ , and an external net ni corresponds to permuting row i of A to the border AπS = (AπS1 · · · AπSK ) of Aπ . Consider an internal net ni ∈ Nk of part Uk in a partition ΠHP of HA . Since P ins(ni ) ⊆ Uk , ni ∈ Nk corresponds to permuting all nonzeros of row i of A into submatrix Aπkk of Aπ . So, all nonzeros in the kth row slice Aπk∗ will be confined to the Aπkk submatrix. Consider a node uj of part Uk . Since N ets(uj ) ⊆ Nk ∪ NS , uj ∈ Uk corresponds to permuting all nonzeros of column j of A into either submatrix Aπkk or submatrices Aπkk and AπkS depending on whether uj is a nonboundary or a boundary 1870 CEVDET AYKANAT, ALI PINAR, AND ÜMIT V. ÇATALYÜREK node of Uk , respectively. So, all nonzeros in the kth column slice Aπ∗k will be confined to the Aπkk and AπSk matrices. Hence, Aπ will be in an SB form, as shown in (1.1), with Aπkk = Bk and AπSk = Rk for k = 1, . . . , K. The number of coupling rows in Aπ is equal to the number of external nets; thus minimizing the cutsize according to (3.3) corresponds to minimizing the number of coupling rows in Aπ . The row and column dimensions of the kth diagonal block Bk of Aπ is equal to, respectively, the number of internal nets and nodes in part Uk , i.e., Mk = |Nk | and Nk = |Uk | for k = 1, . . . , K. So, the node and internal-net counts of the parts {U1 , . . . , UK } can be used to maintain the required balance criterion on the dimensions of the diagonal blocks {B1 , . . . , BK } of Aπ . Figure 5.1a displays a 3-way partitioning ΠHP of HA and Figure 5.1b shows a corresponding partial permutation which transforms matrix A in Figure 4.1 directly into a 3-way SB form. 6. Graph and hypergraph partitioning algorithms and tools. Recently, multilevel GPES [6, 20] and HP [8, 17, 29] approaches have been proposed, leading to successful GPES tools such as Chaco [21], MeTiS [27], and WGPP [16] and HP tools hMeTiS [29] and PaToH [9]. These multilevel heuristics consist of 3 phases: coarsening, initial partitioning, and uncoarsening. In the first phase, a multilevel clustering is applied starting from the original graph/hypergraph by adopting various matching heuristics until the number of vertices in the coarsened graph/hypergraph decreases below a predetermined threshold value. Clustering corresponds to coalescing highly interacting vertices to supernodes. In the second phase, a partition is obtained on the coarsest graph/hypergraph using various heuristics including FM, which is an iterative refinement heuristic proposed for graph/hypergraph partitioning by Fiduccia and Mattheyses [13] as a faster implementation of the KL algorithm proposed by Kernighan and Lin [30]. In the third phase, the partition found in the second phase is successively projected back towards the original graph/hypergraph by refining the projected partitions on the intermediate level uncoarser graphs/hypergraphs using various heuristics including FM. In this work, we use the direct K-way GPES version of MeTiS [28] (kmetis option [27]) for indirect GPVS in the A-to-ADB transformation phase of the FH method and our multilevel HP tool PaToH [9] in our one-phase A-to-ASB transformation approach. One of the most important applications of GPVS is George’s nested-dissection algorithm [14], which has been widely used in fill-reducing orderings for sparse matrix factorizations. The basic idea in the nested-dissection algorithm is to reorder a symmetric matrix into a 2-way DB form so that no fill can occur in the off-diagonal blocks. The DB form of the given matrix is obtained through a symmetric row/column permutation induced by a 2-way GPVS. Then both diagonal blocks are reordered by applying the dissection strategy recursively. The performance of the nested-dissection reordering algorithm depends on finding small vertex separators at each dissection step. So, the nested-dissection implementations can easily be exploited for obtaining a K-way DB form of a matrix by terminating the dissection operation after lg2 K recursion levels and then gathering the vertex separators obtained at each dissection step to a single separator constituting a K-way vertex separator. So, we obtain a K-way DB form of matrix A in our two-phase approach by providing the bipartite graph model of A as input to a nested-dissection-based reordering tool. Note that we effectively perform a nonsymmetric nested dissection on the bipartite graph model of the rectangular A matrix. Direct 2-way GPVS approaches have been embedded into various multilevel nesteddissection implementations [16, 22, 27]. In these implementations, a 2-way GPVS PERMUTING A SPARSE MATRIX TO BLOCK-DIAGONAL FORM 1871 obtained on the coarsest graph is refined during the multilevel framework of the uncoarsening phase. Two distinct vertex-separator refinement schemes were proposed and used for the uncoarsening phase. The first one is the extension of the FM edgeseparator refinement approach to vertex-separator refinement as proposed by Ashcraft and Liu [1]. This scheme considers vertex moves from vertex separator VS to both V1 and V2 in ΠV S = {V1 , V2 ; VS }. This refinement scheme is adopted in the onmetis ordering code of MeTiS [27], the ordering code of WGPP [16], and the ordering code BEND [22]. The second scheme is based on Liu’s narrow separator refinement algorithm [33], which considers moving a set of vertices simultaneously from VS , in contrast to the FM-based refinement scheme [1], which moves only one vertex at a time. Liu’s refinement algorithm [33] can be considered as repeatedly running the maximum-matching-based vertex cover algorithm on the bipartite graphs induced by the edges between V1 and VS and between V2 and VS . That is, the wide vertex separator consisting of VS and the boundary vertices of V1 (V2 ) is refined as in the GPES-based wide-to-narrow separator refinement scheme. The network-flow-based minimum weighted vertex cover algorithms proposed by Ashcraft and Liu [2], and Hendrickson and Rothberg [22] enabled the use of Liu’s refinement approach [33] on the coarse graphs within the multilevel framework. In this work, we use the publicly available onmetis ordering code of MeTiS [27] for direct GPVS. 7. Experimental results. We tested the performance of the proposed models and associated solution approaches on a wide range of large LP constraint matrices obtained from [10] and [25]. Properties of these rectangular matrices are presented in Table 7.1, where the matrices are listed in the order of increasing number of rows. All experiments were performed on a workstation equipped with a 133 MHz PowerPC processor with 512 KB external cache and 64 MB of memory. We have tested K = 4-, 8-, and 16-way partitioning of every test matrix. For each K value, K-way partitioning of a test matrix constitutes a partitioning instance. Partitioning tools MeTiS [27] and PaToH [9] were run 50 times starting from different random seeds for each instance. We use averages of these runs for each instance in this section. Figure 7.1 displays K = 4-, 8-, and 16-way sample primal SB forms of the matrix GE obtained by PaToH. In this section, we first compare different solution techniques for a model. Tables 7.2–7.3 present only the averages over the 13 matrices. Breakdown of the results for each matrix can be found in [3]. Then we compare the effectiveness of the models for their best solution technique, both in terms of solution quality (Tables 7.4–7.5) and preprocessing times (Table 7.6). In these tables, %Mc denotes the percentage of the number of coupling rows in both DB and primal SB forms, i.e., %Mc = 100 × Mc /M . %Nc denotes the number of coupling columns in the DB forms as percents of the respective M values to enable the comparison of the Mc and Nc values under the same unit, i.e., %Nc = 100 × Nc /M . We measure the balance quality of the diagonal blocks in terms of percent row imbalance %RI = 100 × (Mmax /Mavg − 1) and percent column imbalance %CI = 100 × (Nmax /Navg − 1). Here, Mmax (Nmax ) denotes the row (column) count of the diagonal block with the maximum number of rows (columns) in both SB and DB forms. Mavg = (M −Mc )/K in both SB and DB forms, whereas Navg = (N − Nc )/K in DB forms and Navg = N/K in SB forms. It should be noted here that more complicated balancing criteria might need to be maintained in practical applications. For example, empirical relation T (M, N ) = cM 2.17 N 0.89 (where c is some constant) was reported in [34] for the solution time (with IMSL routine ZX0LP [26]) of an LP subproblem corresponding to an M × N diagonal block. 1872 CEVDET AYKANAT, ALI PINAR, AND ÜMIT V. ÇATALYÜREK Table 7.1 Properties of rectangular test matrices. Name NL CQ9 GE CO9 car4 fxm4-6 fome12 pltexpA4-6 kent world mod2 lpl1 fxm3-16 Number of rows cols M N 7039 9718 9278 13778 10099 11098 10789 14851 16384 33052 22400 30732 24284 48920 26894 70364 31300 16620 34506 32734 34774 31728 39951 125000 41340 64162 Number of nonzeros per row per col max avg max avg 149 5.89 15 4.26 390 9.58 24 6.45 47 3.92 36 3.56 440 9.41 28 6.84 111 3.89 109 1.93 57 11.12 24 8.10 228 5.87 14 2.91 30 5.32 8 2.03 960 5.90 18 11.11 341 4.77 16 5.02 310 4.75 16 5.20 177 9.54 16 3.05 57 8.97 36 5.78 Total 41428 88897 39554 101578 63724 248989 142528 143059 184710 164470 165129 381259 370839 0 0 1000 1000 2000 2000 3000 3000 4000 4000 5000 5000 6000 6000 7000 7000 8000 8000 9000 9000 10000 10000 0 2000 4000 6000 nz = 39554 8000 10000 0 2000 4000 (a) 6000 nz = 39554 8000 10000 8000 10000 (b) 0 0 1000 1000 2000 2000 3000 3000 4000 4000 5000 5000 6000 6000 7000 7000 8000 8000 9000 9000 10000 10000 0 2000 4000 6000 nz = 39554 (c) 8000 10000 0 2000 4000 6000 nz = 39554 (d) Fig. 7.1. Rectangular GE matrix with 10,099 rows and 11,098 columns: (a) original structure, (b) 4-way SB form, (c) 8-way SB form, (d) 16-way SB form. PERMUTING A SPARSE MATRIX TO BLOCK-DIAGONAL FORM 1873 Table 7.2 Performance of different techniques on the bipartite-graph (BG) model. K 4 8 16 avg Indirect GPVS BG-model (FH) ADB ASB %Mc %Nc %Mc 6.55 0.20 6.80 9.70 0.54 10.40 14.12 12.79 1.05 9.68 0.60 10.44 Direct GPVS BG-model (onmetis) ADB ASB %Mc %Nc %Mc 1.31 0.22 1.60 2.75 0.65 3.60 4.15 1.17 5.90 2.74 0.68 3.70 Table 7.3 Effect of different balancing criteria in the performance of PaToH. K 4 8 16 avg %Mc 1.62 3.15 4.79 3.19 R-PaToH %RI 9.1 15.6 23.5 16.1 %CI 15.0 26.3 37.1 26.2 (R+C)-PaToH %Mc %RI %CI 1.69 10.1 10.2 3.31 16.7 16.6 4.98 25.6 23.9 3.33 17.4 16.9 (R&C)-PaToH %Mc %RI %CI 1.72 8.2 10.1 3.43 14.5 17.2 5.17 21.3 24.6 3.44 14.7 17.3 Table 7.2 presents the results of our experiments on the bipartite graph (BG) model for both A-to-ADB transformation and two-phase A-to-ASB transformation. On the BG model, we experimented with the built-in GPES tool kmetis of MeTiS for indirect GPVS in the FH method and direct GPVS tool onmetis. Note that FH corresponds to our implementation of the algorithm proposed by Ferris and Horn [12], where we used kmetis to partition the bipartite graph. Since the GPES and GPVS solvers of MeTiS maintain balance on vertices, balance on the sum of the row and column counts of the diagonal blocks is explicitly maintained during partitioning. Both schemes produce DB forms with comparable row and column imbalance values. As seen in Table 7.2, the direct onmetis scheme produces substantially better DB forms than the indirect FH scheme. Table 7.2 also displays the effect of the columnsplitting process used in the second phase of two-phase approaches. In the table, (%McSB − %McDB )/%Nc = (McSB − McDB )/Nc shows the average number of coupling rows induced by a coupling column during the ADB -to-ASB transformation. It can easily be derived from the table that a coupling column induces 1.27 and 1.41 coupling rows in the FH and BG-onmetis schemes, respectively, on average. This means that vertex separators found by these two schemes contain column vertices with small degree, e.g., 2.27 and 2.41. It is interesting to note that both schemes produce DB forms with wide row borders and narrow column borders in general. For this work, we enhanced PaToH for maintaining different balance criteria that might be used in balancing diagonal blocks of the SB forms. Table 7.3 illustrates the effect of these different balancing criteria in the performance of PaToH. R-PaToH maintains balance on the number of internal nets of the parts during partitioning. (R+C)-PaToH maintains balance on the sum of internal net and vertex counts of the parts during partitioning. (R&C)-PaToH maintains balance on both the number of internal nets and vertices of the parts during partitioning. Note that, in the row-net hypergraph model, balancing the internal net and vertex counts of the parts correspond, respectively, to balancing the row and column counts of the diagonal blocks of the resulting SB form. As seen in Table 7.3, R-PaToH performs better than (R+C)-PaToH, which performs better than (R&C)-PaToH in terms of the number of coupling rows. This observation can be explained by the 1874 CEVDET AYKANAT, ALI PINAR, AND ÜMIT V. ÇATALYÜREK Table 7.4 Performance comparison of the hypergraph model (H-model) with the bipartite graph model (BG-model) in A-to-ASB transformation in terms of the border size ( %Mc ). Name NL CQ9 GE CO9 car4 fxm4-6 fome12 pltexpA4-6 kent world mod2 lpl1 fxm3-16 H-model BG-model K PaToH onmetis FH 4 5.02 5.22 27.71 8 6.02 6.59 32.57 16 7.19 8.31 36.72 4 2.87 2.92 23.06 8 4.10 4.03 27.76 16 5.40 5.28 30.50 4 3.01 2.53 4.71 8 4.37 4.39 8.06 16 5.63 5.97 10.81 4 2.72 2.78 21.27 8 3.78 3.85 26.12 16 5.10 5.03 30.26 4 0.00 0.00 0.00 8 0.00 0.52 1.29 16 0.00 1.83 1.29 4 0.64 0.41 0.49 8 1.17 0.80 1.70 16 2.13 1.42 2.28 4 0.00 0.00 0.00 8 9.43 12.27 17.04 16 15.39 21.23 29.02 4 1.62 0.79 1.08 8 3.02 2.15 1.98 16 5.32 4.42 4.77 4 0.34 0.15 0.66 8 0.70 0.56 2.11 1.33 3.47 16 1.26 4 1.08 0.80 1.53 8 2.25 2.25 3.79 16 5.25 5.94 9.29 4 0.86 0.78 0.88 8 2.12 2.05 3.42 16 5.10 5.64 8.75 4 3.27 4.08 6.37 8 5.40 6.58 9.03 16 6.17 8.76 15.96 4 0.52 0.33 0.56 8 0.66 0.73 0.34 16 0.86 1.51 0.39 Averages over K 4 1.69 1.60 6.80 8 3.31 3.60 10.40 16 4.98 5.90 14.12 all 3.33 3.70 10.44 reduced solution space with increasing complexity of the balancing criterion. Tables 7.4–7.6 present performance comparison of different schemes on A-to-ASB transformation. Tables 7.4 and 7.5 display the quality of SB forms in terms of border size (%Mc ) and diagonal-block imbalance (%RI and %CI), respectively, whereas Table 7.6 displays the runtime performance. The FH algorithm effectively maintains balance on the sum of the row and column counts of the diagonal blocks. The proposed two-phase BG-onmetis scheme also works according to the same balance criterion because of the limitation of the direct GPVS solver onmetis. Therefore, for the sake of PERMUTING A SPARSE MATRIX TO BLOCK-DIAGONAL FORM 1875 Table 7.5 Performance comparison of the hypergraph model (H-model) with the bipartite graph model (BG-model) in A-to-ASB transformation in terms of the diagonal-block imbalance size. Name NL CQ9 GE CO9 car4 fxm4-6 fome12 pltexpA4-6 kent world mod2 lpl1 fxm3-16 K 4 8 16 4 8 16 4 8 16 4 8 16 4 8 16 4 8 16 4 8 16 4 8 16 4 8 16 4 8 16 4 8 16 4 8 16 4 8 16 4 8 16 all H-model BG-model (R+C)-PaToH onmetis FH %RI %CI %RI %CI %RI %CI 8.6 6.6 11.8 12.2 15.5 14.0 13.0 11.3 17.6 18.5 23.4 19.2 23.7 24.5 28.9 22.2 18.3 15.3 17.0 22.6 17.8 17.6 19.5 17.8 26.6 31.0 24.4 25.3 22.9 24.0 37.3 38.6 36.7 29.5 29.5 24.8 14.8 11.8 15.5 15.4 13.5 12.1 21.5 19.8 19.3 20.0 19.0 19.4 29.9 27.6 27.0 27.7 28.2 22.9 10.9 19.3 14.7 12.1 18.8 17.1 14.4 27.5 21.2 16.9 20.7 24.4 27.9 33.0 30.1 22.2 26.7 25.4 0.6 0.9 3.3 5.9 22.6 25.0 0.6 2.0 12.8 18.7 0.9 2.9 0.7 4.3 23.7 36.1 0.9 6.1 10.0 9.5 2.6 2.4 8.0 7.8 14.7 13.8 10.9 11.0 14.8 14.3 23.2 22.5 19.1 20.2 15.1 15.1 0.0 0.0 0.0 0.0 0.0 0.0 9.6 8.3 12.8 10.7 16.6 13.3 19.4 13.8 24.9 22.3 25.9 16.8 5.9 4.8 2.9 3.2 11.4 11.7 12.9 10.4 10.2 10.9 11.9 12.5 19.7 17.0 16.2 18.2 15.5 16.7 12.2 16.1 12.9 23.8 18.3 21.8 19.3 24.6 21.3 35.0 22.9 18.8 26.7 41.7 31.8 48.5 28.8 32.9 9.8 11.5 10.3 10.0 10.5 10.0 17.8 20.6 17.8 19.8 15.4 17.5 30.9 28.3 31.0 30.4 22.6 20.0 9.6 10.3 10.6 10.6 11.8 10.7 17.2 18.3 18.4 20.4 15.0 17.1 30.2 26.7 28.7 29.6 22.0 20.4 18.4 6.8 11.7 5.5 13.4 12.3 31.3 12.0 15.8 11.7 24.1 17.3 40.5 16.0 26.0 18.9 35.8 20.3 13.6 12.9 0.6 0.5 7.8 7.6 17.6 16.6 1.3 1.2 2.4 1.8 27.7 26.4 2.9 2.6 4.6 3.7 Averages over K 10.1 10.2 8.8 9.2 13.2 12.9 16.7 16.6 15.7 16.9 16.2 15.6 25.6 23.9 24.8 25.4 21.9 19.0 17.4 16.9 16.4 17.2 17.1 15.8 a common experimental framework, the results of the (R+C)-PaToH, BG-onmetis, and FH schemes are displayed in Tables 7.4–7.6. As seen in Table 7.4, the proposed schemes perform significantly better than the FH algorithm. For example, the number of coupling rows of the SB forms produced by the FH algorithm are 3 times larger than those of PaToH, on overall average. The one-phase approach PaToH produces approximately 11% fewer coupling rows than the two-phase approach BG-onmetis, on average, which confirms the effectiveness of 1876 CEVDET AYKANAT, ALI PINAR, AND ÜMIT V. ÇATALYÜREK Table 7.6 Execution times of the partitioning algorithms given in Table 7.5 as percents of the solution times of the LP problems by LOQO. Values in parentheses are the LP solution times in seconds. Name NL CQ9 GE CO9 car4 fxm4-6 fome12 pltexpA4-6 kent world mod2 lpl1 fxm3-16 LOQO sol. time K 4 8 16 4 100 (554) 8 16 4 8 100 (403) 16 4 100 (708) 8 16 4 100 (56) 8 16 4 100 (191) 8 16 4 100 (62677) 8 16 4 100 (278) 8 16 4 100 (618) 8 16 4 100 (1163) 8 16 4 100 (1076) 8 16 4 100 (3800) 8 16 4 100 (449) 8 16 Averages 4 8 16 all 100 (804) H-model PaToH 0.211 0.244 0.271 0.459 0.571 0.672 0.220 0.294 0.392 0.390 0.484 0.545 3.562 5.168 6.704 1.978 2.941 3.884 0.015 0.024 0.028 1.576 2.328 3.029 0.756 1.117 1.385 0.427 0.612 0.786 0.453 0.632 0.843 0.833 1.086 1.221 1.365 2.087 2.737 over K 0.942 1.353 1.730 1.342 BG-model onmetis FH 0.140 0.090 0.179 0.090 0.199 0.106 0.339 0.213 0.447 0.229 0.538 0.263 0.273 0.136 0.387 0.154 0.449 0.169 0.305 0.189 0.393 0.205 0.472 0.233 45.958 46.603 50.529 54.329 52.429 58.326 1.976 0.944 2.931 0.975 3.817 0.986 0.007 0.004 0.014 0.005 0.018 0.007 1.470 0.782 2.277 0.785 2.810 0.811 0.898 0.451 1.333 0.487 1.662 0.534 0.317 0.169 0.478 0.178 0.667 0.214 0.334 0.178 0.509 0.186 0.710 0.221 0.341 0.169 0.482 0.178 0.662 0.198 1.387 0.719 2.026 0.690 2.652 0.659 4.134 4.768 5.160 4.688 3.896 4.499 4.825 4.407 the hypergraph model to permute rectangular matrices into SB forms. As seen in Table 7.4, the numbers of coupling rows of the SB forms produced by PaToH remain below 5% for 16-way partitionings, on average. As seen in Tables 7.4–7.5, our methods find balanced permutations, with very few coupling rows, which would lead to efficient parallel solutions. Table 7.6 displays execution times of the partitioning algorithms as percents of the solution times of the respective LP problems by LOQO [43]. As seen in this table, partitioning times are affordable when compared with the LP solution times. For PERMUTING A SPARSE MATRIX TO BLOCK-DIAGONAL FORM 1877 example, LOQO [43] solves the lpl 1 problem, which has the constraint matrix with the largest M × N product, in approximately 3800 seconds. As seen in Table 7.6, the 16-way partitioning times of all algorithms remain below 1.22% of the LOQO solution time of this LP problem. As also seen in the table, partitioning times of all algorithms remain well below 4% of the LOQO solution times of all LP problems except car 4. In two-phase approaches, hypergraph and bipartite representations of a rectangular matrix are of equal size: the number of nonzeros in the matrix. However, the clustering phase of an HP tool involves more costly operations than those of a GP tool. Hence, two-phase approaches using a GP tool are expected to run faster than the one-phase approach using an HP tool. As seen in Table 7.6, the two-phase approach BG-onmetis runs faster than PaToH in the partitioning of all test matrices except GE, car 4, and kent. 8. Conclusion. We investigated permuting a sparse rectangular matrix A into doubly bordered (DB) and singly bordered (SB) block-diagonal forms ADB and ASB with minimum border size while maintaining balance on the diagonal blocks. We showed that the A-to-ADB transformation problem can be described as a graph partitioning by vertex separator (GPVS) problem on the bipartite-graph representation of matrix A. We proposed a hypergraph model for representing the sparsity structure of A so that the A-to-ASB transformation problem can be formulated as a hypergraph partitioning (HP) problem. The performance of the proposed models and approaches depends on the performance of the tools used to solve the associated problems as well as the representation power of the models. We also overview solution techniques and tools for solving the stated problems. Experimental results on a wide range of sparse matrices were impressive and showed that our methods can effectively extract the underlying block-diagonal structure of a matrix. REFERENCES [1] C. Ashcraft and J. W. H. Liu, A Partition Improvement Algorithm for Generalized Nested Dissection, Tech. Report BCSTECH-94-020, Boeing Computer Services, Seattle, WA, 1994. [2] C. Ashcraft and J. W. H. Liu, Applications of the Dulmage–Mendelsohn decomposition and network flow to graph bisection improvement, SIAM J. Matrix Anal. Appl., 19 (1998), pp. 325–354. [3] C. Aykanat, A. Pinar, and U. V. Çatalyürek, Permuting Sparse Rectangular Matrices into Block-Diagonal Form, tech. report, Department of Computer Engineering, Bilkent University, Ankara, Turkey, 2002. [4] Å. Björck, Numerical Methods for Least Squares Problems, SIAM, Philadelphia, 1996. [5] T. N. Bui and C. Jones, Finding good approximate vertex and edge partitions is NP-hard, Inform. Process. Lett., 42 (1992), pp. 153–159. [6] T. N. Bui and C. Jones, A heuristic for reducing fill-in in sparse matrix factorization, in Proceedings of the 6th SIAM Conference on Parallel Processing for Scientific Computing, SIAM, Philadelphia, 1993, pp. 445–452. [7] U. V. Çatalyürek and C. Aykanat, Decomposing irregularly sparse matrices for parallel matrix-vector multiplications, in Proceedings of the 3rd International Symposium on Solving Irregularly Structured Problems in Parallel, Irregular’96, Lecture Notes in Comput. Sci. 1117, Springer-Verlag, Berlin, 1996, pp. 75–86. [8] U. V. Çatalyürek and C. Aykanat, Hypergraph-partitioning based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distrib. Systems, 10 (1999), pp. 673–693. [9] U. V. Çatalyürek and C. Aykanat, PaToH: A Multilevel Hypergraph Partitioning Tool, Version 3.0, Department of Computer Engineering, Bilkent University, Ankara, Turkey, 1999. [10] I. O. Center, Linear Programming Problems, ftp://col.biz.uiowa.edu:pub/testprob/lp/gondzio. 1878 CEVDET AYKANAT, ALI PINAR, AND ÜMIT V. ÇATALYÜREK [11] G. Dantzig and P. Wolfe, Decomposition principle for linear programs, Oper. Res., 8 (1960), pp. 101–111. [12] M. C. Ferris and J. D. Horn, Partitioning mathematical programs for parallel solution, Math. Programming, 80 (1998), pp. 35–61. [13] C. M. Fiduccia and R. M. Mattheyses, A linear-time heuristic for improving network partitions, in Proceedings of the 19th ACM/IEEE Design Automation Conference, IEEE Press, Piscataway, NJ, 1982, pp. 175–181. [14] A. George, Nested dissection of a regular finite element mesh, SIAM J. Numer. Anal., 10 (1973), pp. 345–363. [15] S. K. Gnanendran and J. K. Ho, Load balancing in the parallel optimization of block-angular linear programs, Math. Programming, 62 (1993), pp. 41–67. [16] A. Gupta, Watson Graph Partitioning Package, Tech. Report RC 20453, IBM T. J. Watson Research Center, Yorktown Heights, NY, 1996. [17] S. Hauck and G. Boriello, An evaluation of bipartitioning techniques, IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, 16 (1997), pp. 849–866. [18] B. Hendrickson, Graph partitioning and parallel solvers: Has the emperor no clothes?, in Proceedings of the 5th International Symposium on Solving Irregularly Structured Problems in Parallel, Irregular’98, Lecture Notes in Comput. Sci. 1457, Springer-Verlag, Berlin, 1998, pp. 218–225. [19] B. Hendrickson and T. G. Kolda, Graph partitioning models for parallel computing, Parallel Comput., 26 (2000), pp. 1519–1534. [20] B. Hendrickson and R. Leland, A Multilevel Algorithm for Partitioning Graphs, tech. report, Sandia National Laboratories, Albuquerque, NM, 1993. [21] B. Hendrickson and R. Leland, The Chaco User’s Guide, Version 2.0, Sandia National Laboratories, Albuquerque, NM, 1995. [22] B. Hendrickson and E. Rothberg, Improving the run time and quality of nested dissection ordering, SIAM J. Sci. Comput., 20 (1998), pp. 468–489. [23] J. K. Ho, T. C. Lee, and R. P. Sundarraj, Decomposition of linear programs using parallel computation, Math. Programming, 42 (1988), pp. 391–405. [24] Y. F. Hu, C. M. Maguire, and R. J. Blake, Ordering unsymmetric matrices into bordered block diagonal form for parallel processing, in Proceedings of Euro-Par ’99 Parallel Processing: 5th International Euro-Par Conference, Toulouse, France, 1999, P. Amestoy, P. Berger, M. J. Daydé, I. S. Duff, V. Frayssé, L. Giraud, and D. Ruiz, eds., Lecture Notes in Comput. Sci. 1685, Springer-Verlag, Berlin, 1999, pp. 295–302. [25] Hungarian Academy of Sciences: Computer and Automation Research Institute, LP Test Sets, ftp://ftp.sztaki.hu/pub/oplab/LPTESTSET/. [26] IMSL User’s Manual, Edition 9.2 (International Mathematical and Statistical Library), Houston, TX, 1984. [27] G. Karypis and V. Kumar, MeTiS. A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices Version 3.0, Department of Computer Science and Engineering/Army HPC Research Center, University of Minnesota, Minneapolis, MN, 1998. [28] G. Karypis and V. Kumar, Multilevel Algorithms for Multi-constraint Graph Partitioning, Tech. Report 98-019, Department of Computer Science/Army HPC Research Center, University of Minnesota, Minneapolis, MN, 1998. [29] G. Karypis, V. Kumar, R. Aggarwal, and S. Shekhar, hMeTiS. A Hypergraph Partitioning Package Version 1.0.1, Department of Computer Science and Engineering/Army HPC Research Center, University of Minnesota, Minneapolis, MN, 1998. [30] B. W. Kernighan and S. Lin, An efficient heuristic procedure for partitioning graphs, Bell System Tech. J., 49 (1970), pp. 291–307. [31] C. Lemarechal, A. Nemirovski, and Y. Nesterov, New variants of bundle methods, Math. Programming, 69 (1995), pp. 111–147. [32] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley–Teubner, Chichester, UK, 199. [33] J. W. H. Liu, The minimum degree ordering with constraints, SIAM J. Sci. Statist. Comput., 10 (1989), pp. 1136–1145. [34] D. Medhi, Parallel bundle-based decomposition for large-scale structured mathematical programming problems, Ann. Oper. Res., 22 (1990), pp. 101–127. [35] D. Medhi, Bundle-based decomposition for structured large-scale convex optimization: Error estimate and application to block-angular linear programs, Math. Programming, 66 (1994), pp. 79–101. [36] R. R. Meyer and G. Zakeri, Multicoordination methods for solving convex block-angular PERMUTING A SPARSE MATRIX TO BLOCK-DIAGONAL FORM 1879 programs, SIAM J. Optim., 10 (1999), pp. 121–131. [37] J. M. Mulvey and A. Ruszcynski, A diagonal quadratic approximation method for large scale linear programs, Oper. Res. Lett., 12 (1992), pp. 205–215. [38] A. Pinar, U. V. Çatalyürek, C. Aykanat, and M. Pınar, Decomposing linear programs for parallel solution, in Proceedings of the Second International Workshop on Applied Parallel Computing, PARA ’95, Lyngby, Denmark, 1995, Lecture Notes in Comput. Sci. 1041, Springer-Verlag, Berlin, 1996, pp. 473–482. [39] A. Pinar and C. Aykanat, An effective model to decompose linear programs for parallel solution, in Proceedings of the Third International Workshop on Applied Parallel Computing, PARA’96, Lecture Notes in Comput. Sci. 1184, Springer-Verlag, Berlin, 1997, pp. 592–601. [40] A. Pinar and B. Hendrickson, Partitioning for complex objectives, in Proceedings of the 15th International Parallel and Distributed Processing Symposium (IPDPS), San Francisco, CA, IEEE Computer Society Press, Los Alamitos, CA, 2001. [41] T. Rashid and T. A. Davis, An approach for parallelizing any general unsymmetric sparse matrix algorithm, in Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing, SIAM, Philadelphia, 1995, pp. 413–417. [42] J. M. Stern and S. A. Vavasis, Active set algorithms for problems in block angular form, Comput. Appl. Math., 12 (1994), pp. 199–226. [43] R. J. Vanderbei, LOQO User’s Manual, Version 4.01, Tech. Report SOR 97-08, Princeton University, Princeton, NJ, 1997.