Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Journal of Parallel and Distributed Computing 140 (2020) 99–109 Contents lists available at ScienceDirect J. Parallel Distrib. Comput. journal homepage: www.elsevier.com/locate/jpdc Reordering sparse matrices into block-diagonal column-overlapped form✩ Seher Acer a , Cevdet Aykanat b , a b ∗ Center for Computing Research, Sandia National Laboratories, Albuquerque, NM, USA Computer Engineering Department, Bilkent University, Ankara, Turkey article info Article history: Received 30 May 2018 Received in revised form 24 March 2019 Accepted 9 March 2020 Available online 14 March 2020 Keywords: Underdetermined systems Sparse matrices Block-diagonal form Column overlap Hypergraph partitioning a b s t r a c t Many scientific and engineering applications necessitate computing the minimum norm solution of a sparse underdetermined linear system of equations. The minimum 2-norm solution of such systems can be obtained by a recent parallel algorithm, whose numerical effectiveness and parallel scalability are validated in both shared- and distributed-memory architectures. This parallel algorithm assumes the coefficient matrix in a block-diagonal column-overlapped (BDCO) form, which is a variant of the block-diagonal form where the successive diagonal blocks may overlap along their columns. The total overlap size of the BDCO form is an important metric in the performance of the subject parallel algorithm since it determines the size of the reduced system, solution of which is a bottleneck operation in the parallel algorithm. In this work, we propose a hypergraph partitioning model for reordering sparse matrices into BDCO form with the objective of minimizing the total overlap size and the constraint of maintaining balance on the number of nonzeros of the diagonal blocks. Our model makes use of existing partitioning tools that support fixed vertices in the recursive bipartitioning paradigm. Experimental results validate the use of our model as it achieves small overlap size and balanced diagonal blocks. © 2020 Elsevier Inc. All rights reserved. 1. Introduction Many scientific and engineering applications [6,12,19,22,25, 27] require computing the minimum norm solution of an underdetermined system of equations of the form Ax = f , (1) where A is an m × n sparse matrix and m < n [18]. A common approach for obtaining the minimum 2-norm solution of an underdetermined linear least squares problem is the use of a QR factorization in a direct method. Various packages such as SuiteSparseQR [13,14], HSL MA49 [2], and qr_mumps [7] provide efficient parallel and sequential implementations for the general sparse QR algorithm. One recent approach [24] for obtaining the minimum 2-norm solution of an underdetermined linear least squares problem is ✩ Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525. The work was done when Seher Acer was with Bilkent University. ∗ Corresponding author. E-mail addresses: sacer@sandia.gov (S. Acer), aykanat@cs.bilkent.edu.tr (C. Aykanat). https://doi.org/10.1016/j.jpdc.2020.03.002 0743-7315/© 2020 Elsevier Inc. All rights reserved. the use of the Balance scheme [17,21,23], which is an effective and efficient parallel algorithm originally proposed for an ill-conditioned banded linear system of equations. This parallel algorithm [24] can also be considered as an extension of the general sparse QR factorization to distributed-memory systems for obtaining the minimum 2-norm solution. In this parallel algorithm, the coefficient matrix is assumed to be in block-diagonal column-overlapped (BDCO) form with K diagonal blocks, which is shown in Fig. 1. As seen in the figure, the kth diagonal block Ek of the BDCO form is given by Ek = [Ck Ak Bk ], for k = 2, 3, . . . , K − 1, whereas the first and last diagonal blocks E1 and EK are respectively given by E1 = [A1 B1 ] and EK = [CK AK ]. Here, successive diagonal blocks Ek−1 and Ek overlap along the columns of their submatrices Bk−1 and Ck . Columns of the overlapping submatrices are referred to as coupling columns. The parallel algorithm [24] exploits the BDCO form so that each of K processors independently solves a small linear least squares problem of the form Ek zk = fk , (2) 100 S. Acer and C. Aykanat / Journal of Parallel and Distributed Computing 140 (2020) 99–109 Fig. 1. A sample underdetermined system of equations Ax = f where matrix A is in a K -way BDCO form. where ] z1 = [ x1 e1 zK = [ êK −1 xK , zk = ] [ êk−1 xk ek ] for k = 2, 3, . . . , K − 1, and . This step is performed in parallel without any communication. For x to be a solution, ek and êk should be equal for k = 1, 2, . . . , K −1. This is ensured by a sequential step in which a small system called reduced system is solved. The size of the reduced system is determined by the total overlap size, i.e., the total number of coupling columns. When compared to the state-of-the-art parallel QR implementations, this approach is reported to achieve better scalability on both shared- and distributed-memory architectures [24]. In this paper, our aim is to find two permutations P and Q such that PAQ is in a K -way BDCO form. We refer to this permutation problem as the BDCO problem. In the BDCO problem, the objective is to minimize the total overlap size in PAQ since the performance of the above-mentioned parallel algorithm considerably deteriorates with increasing total overlap size, as reported in [24]. The constraint of the BDCO problem is to maintain balance on the number of nonzeros in the diagonal blocks of PAQ to ensure balanced workload for processors. To the best of our knowledge, literature lacks an algorithm for solving the BDCO problem. However, there are efforts for solving problems that resemble the BDCO problem. In [5] and [26], the problems of reordering sparse matrices into singly-bordered block diagonal (SBBD) form and separated block diagonal (SBD) form were addressed, respectively. In the SBBD form, all coupling rows are reordered to a border at the bottom or all coupling columns are reordered to a border on the right. In the SBD form, coupling rows are reordered in between two parts as they are encountered during the recursive bipartitioning steps. The SBBD and SBD forms differ from the BDCO form as they allow coupling rows/columns to span more than two diagonal blocks, which are not necessarily consecutive. In [1], the problem of reordering sparse square matrices into block diagonal form with overlap (BDO form) was addressed. The differences between the BDCO and BDO forms are two-fold: (i) the BDO form is obtained on structurally symmetric matrices via symmetric row/column permutation, whereas the BDCO form is obtained on non-square matrices via row and column permutations that are different from each other, (ii) consecutive diagonal blocks overlap along both rows and columns in the BDO form, whereas they overlap along only columns in the BDCO form. In order to solve the BDCO problem, we propose the HPBDCO algorithm, which utilizes the column-net hypergraph model [8] of the given coefficient matrix A. First, we investigate the feasibility of the K -way BDCO permutation for matrix A by considering the vertex-to-vertex adjacency topology of the column-net hypergraph of A. If the respective permutation is feasible, then the corresponding hypergraph is partitioned by the proposed HPBDCO algorithm so that the partitioning objective corresponds to minimizing the total overlap size, whereas the partitioning constraint corresponds to maintaining balance on the number of nonzeros of the diagonal blocks. The HPBDCO algorithm utilizes fixed vertices within the recursive bipartitioning paradigm in order to ensure that only the successive diagonal blocks may share columns. The proposed algorithm is flexible in the sense that any hypergraph partitioning tool that supports fixed vertices can be used in the implementation. The effectiveness of the proposed HPBDCO algorithm is validated by reordering a wide range of matrices into BDCO form with small overlap size and balanced diagonal blocks as well as by running the parallel code [24] on the reordered matrices. The rest of the paper is organized as follows. Section 2 gives background on hypergraph partitioning. The proposed HPBDCO algorithm is given in Section 3. Section 4 gives experimental results and Section 5 concludes. 2. Background 2.1. Hypergraphs A hypergraph H = (V , N ) is defined as a set V of vertices and a set N of nets. Each net n ∈ N connects a subset of vertices in V , which is denoted by Pins(n). We use phrase ‘‘pins of n’’ to refer to the vertices connected by net n. In a dual manner, each vertex v ∈ V is connected by a subset of nets in N , which is denoted by Nets(v ). We adapt the following graph terminology to hypergraphs. In a hypergraph, two vertices are said to be adjacent if they are connected by at least one common net. Let Adj(v ) denote the set of vertices adjacent to vertex v . Then, Adj(v ) = {u : Nets(v ) ∩ Nets(u) ̸ = ∅} − {v}. In a hypergraph, a path is defined as a sequence ⟨v1 , v2 , . . . , vp ⟩ of vertices such that each two consecutive vertices are adjacent. The distance between vertices vi and vj , which is denoted by d(vi , vj ), is defined as the number of vertex-to-vertex hops in a shortest path from vi to vj . For example, if ⟨vi , vk , vj ⟩ is a shortest path, then the distance between vi and vj is equal to 2, i.e., d(vi , vj ) = 2. Trivially, d(v, v ) = 0. The distance between a vertex u and a set of vertices V is defined as the minimum of distances between u and vertices in V , that is, d(u, V ) = minv∈V d(u, v ). The diameter of hypergraph H = (V , N ) is then defined as the maximum distance between two vertices, that is, diameter(H) = max d(vi , vj ). vi ,vj ∈V For a hypergraph with diameter D, a vertex v is said to be a peripheral vertex if there exists a vertex u such that d(v, u) = D. S. Acer and C. Aykanat / Journal of Parallel and Distributed Computing 140 (2020) 99–109 vertices that represent the rows with a nonzero entry in column j, that is, 2.2. Hypergraph partitioning (HP) problem In a given K -way partition Π = {V1 , V2 , . . . , VK } of hypergraph H = (V , N ), net n is said to connect part Vk if it connects at least one vertex in Vk , i.e., Pins(n) ∩ Vk ̸ = ∅. Let Λ(n) denote the set of the parts connected by n, that is, Λ(n) = {Vk : Pins(n) ∩ Vk ̸ = ∅}. The number of parts connected by n is denoted by λ(n), hence, λ(n) = |Λ(n)|. A net n is said to be cut if it connects more than one part, i.e., λ(n) > 1. Otherwise, n is said to be uncut. If the part that is connected by uncut net n is Vk , then n is said to be internal to Vk . A vertex v is said to be boundary if it is connected by at least one cut net. In H, each vertex v ∈ V is assigned a non-negative weight denoted by w (v ), whereas each net is assigned a non-negative cost denoted by c(nj ). Then, the cutsize of Π is defined as the sum of the costs of the cut nets in Π , that is, cutsize(Π ) = ∑ c(n). (3) n:λ(n)>1 Pins(nj ) = {vi : aij ̸ = 0}. Each net nj is assigned a unit cost, i.e., c(nj ) = 1, whereas each vertex vi is assigned a weight equal to the number of nonzeros in row i of A, i.e., w (vi ) = |{aij : aij ̸ = 0}|. Consider a K -way partition Π = {V1 , V2 , . . . , VK } of H. In Section 3.1, we introduce a new partitioning constraint, which we refer to as the BDCO constraint, for Π to be utilized for reordering A into K -way BDCO form. We refer to Π as a BDCO partition if it satisfies the BDCO constraint. In Section 3.2, we investigate the feasibility of a K -way BDCO partition for hypergraph H. Finally in Section 3.3, we propose the RB-based HPBDCO algorithm which obtains a K -way BDCO partition of a given hypergraph H, if it is feasible. 3.1. BDCO constraint We define the BDCO constraint as restricting each cut net in Part weight Wk of Vk is defined as the sum of the weights of the ∑ vertices in Vk , i.e., Wk = w v∈Vk (v ). For given K and ϵ values and a hypergraph H, the HP problem is defined as finding a partition Π = {V1 , V2 , . . . , VK } of H with the objective of minimizing the cutsize (3) and the constraint of maintaining the balance on the part weights, which is given by Wk < Wav g (1 + ϵ ) 101 for k = 1, 2, . . . , K . (4) Here, Wav g and ϵ denote the average part weight in Π and a predetermined threshold for maximum allowable imbalance ratio, respectively. The HP problem with fixed vertices is a variant of the HP problem which has been successfully used for solving the repartitioning/remapping problem in the context of parallelizing irregular applications [3,4,9,11]. In this problem, vertices in Fk ⊆ V are pre-assigned to part Vk in a K -way partition so that Fk ⊆ Vk for k = 1, 2, . . . , K after the partitioning. A vertex v is referred to as ‘‘fixed to Vk ’’ if v ∈ Fk . Vertices can be fixed to at most one part, that is, Fk ∩ Fℓ = ∅ holds for any fixed-vertex set pair (Fk , Fℓ ).⋃ A vertex v is referred to as ‘‘free’’ if it is not fixed, K i.e., v ∈ V − k=1 Fk . 2.3. Recursive bipartitioning (RB) paradigm RB is a commonly-used paradigm for obtaining K -way hypergraph partitioning. In RB, a given hypergraph is recursively bipartitioned, i.e., partitioned into two parts, until K parts are obtained. At each RB step, a bipartition Π = {V1 , V2 } of the current hypergraph H is obtained and two new hypergraphs H1 and H2 are formed using vertex parts V1 and V2 . This procedure forms a hypothetical full binary tree in which each node represents a hypergraph formed/bipartitioned. The vertex set of each leaf hypergraph in the RB tree corresponds to a part in the resulting K -way partition Π = {V1 , V2 , . . . , VK }. 3. Proposed model Suppose that we are given an m × n sparse matrix A and an integer K > 1. For reordering A into K -way BDCO form, we propose a hypergraph partitioning algorithm, HPBDCO , which utilizes the column-net hypergraph model [8]. Let H = (V , N ) denote the column-net hypergraph of A with vertex and net sets V = {v1 , v2 , . . . , vm } and N = {n1 , n2 , . . . , nn }. Vertex vi represents row i of A for i = 1, 2, . . . , m, whereas net nj represents column j of A for j = 1, 2, . . . , n. Net nj connects Π to connect two consecutive parts. That is, for each n ∈ N with λ(n) > 1, Λ(n) = {Vk , Vk+1 } for some k ∈ {1, 2, . . . , K − 1}. A K -way partition that satisfies the BDCO constraint, a K -way BDCO partition, induces partial row and column orderings for permuting A into K -way BDCO form as follows. The rows that are represented by the vertices in Vk are ordered after the rows that are represented by the vertices in Vk−1 . The columns that are represented by the internal nets of Vk are ordered after the columns that are represented by the cut nets connecting Vk−1 and Vk and before the columns that are represented by the cut nets connecting Vk and Vk+1 . In a dual manner, the columns that are represented by the cut nets connecting Vk−1 and Vk are ordered after the columns that are represented by the internal nets of Vk−1 and before the columns that are represented by the internal nets of Vk . Internal orderings of the above-mentioned row/column groups can be arbitrary. Consequently, the vertices and internal nets of Vk respectively correspond to the rows and columns of Ak in Fig. 1, whereas the cut nets connecting Vk and Vk+1 correspond to the coupling columns shared by diagonal blocks Ek and Ek+1 in Fig. 1. Fig. 2 displays a sample 10 × 12 sparse matrix A, a 4-way BDCO partition Π of the column-net hypergraph of A, and matrix ABDCO in a 4-way BDCO form which is obtained via decoding Π as described above. For example, consider vertex part V3 = {v2 , v5 } in Π and nets connecting V3 , namely, n1 , n6 , n8 , n11 , and n12 . Then, diagonal block E3 = [C3 A3 B3 ] in ABDCO consists of rows 5 and 2 and columns 1, 6, 8, 11 and 12, where C3 contains column 1, A3 contains columns 6 and 8, and B3 contains columns 11 and 12. Note that column 1 is a coupling column and it is shared by E2 and E3 in ABDCO since cut net n1 connects both V2 and V3 in Π . Similarly, columns 11 and 12 are coupling columns and they are shared by E3 and E4 since cut nets n11 and n12 connect both V3 and V4 . The BDCO constraint ensures that each coupling column is shared by two consecutive diagonal blocks in the resulting BDCO form. Consider a column j in A and the corresponding net nj in H. Note that the rows that have at least one nonzero in column j are those represented by the vertices in Pins(nj ), by the construction of the column-net hypergraph model. First, assume that nj is internal to part Vk for some k, that is, Pins(nj ) ⊆ Vk . Since the vertices in Vk correspond to the rows of Ak (or equivalently, Ek ), each row that contains at least one nonzero in column j is a row of Ak . Hence, column j only links diagonal block Ek . Now, assume that nj is a cut net, that is, nj connects parts Vk and Vk+1 for some k. Then, the vertices in Pins(nj ) are either in Vk or in Vk+1 . Then, each row that contains at least one nonzero in column j is either a row of Ak or a row of Ak+1 . Hence, column j links consecutive diagonal blocks Ek and Ek+1 . 102 S. Acer and C. Aykanat / Journal of Parallel and Distributed Computing 140 (2020) 99–109 Fig. 2. A sample matrix A, a 4-way BDCO partition Π of the column-net hypergraph of A, and the permuted matrix ABDCO in 4-way BDCO form, where the corresponding permutation is induced by Π . 3.2. Feasibility of a K -way BDCO partition Consider a K -way BDCO partition Π of a given hypergraph H. Note that the BDCO constraint implies that the adjacent vertices are assigned either to the same part or to the consecutive parts. In other words, no two adjacent vertices can be assigned to parts that have one or more parts in between. This statement relates the maximum number of parts that one can have at a BDCO partition of H to the length of the longest shortest path in H, i.e., the diameter of H. In this section, we investigate the relation between the existence of a K -way BDCO partition of H and the diameter of H. Assume that vi and vj are two arbitrary vertices in H such that d(vi , vj ) = δ . Consider a K -way BDCO partition Π of H. Assume that vi and vj are assigned to the Ith part VI and the Jth part VJ in Π , respectively. We first claim that |I − J | ≤ δ , that is, the parts containing vi and vj can have at most δ − 1 parts in between. Assume to the contrary that |I − J | = δ ′ > δ and consider the case I < J. Then, there exists a path ⟨vI , vI +1 , . . . , vI +δ ′ =J ⟩ with end vertices vi = vI and vj = vJ such that vI +k ∈ VI +k for k = 0, 1, . . . , δ ′ . This path is a shortest path since no two vertices vI +k and vI +k′ in this path can be adjacent if |k − k′ | > 1 by the BDCO constraint. Then d(vi , vj ) > δ , which is a contradiction. This implies that for a hypergraph H with diameter D, at most D − 1 parts can exist between any two parts in a BDCO partition. Consequently, a K -way BDCO partition of H is not feasible if the diameter of H is less than K − 1. 3.3. HPBDCO algorithm The HPBDCO algorithm utilizes the RB paradigm to obtain a K way BDCO partition of a given hypergraph H = (V , N ). Before invoking the HPBDCO algorithm, we first find a pseudo-peripheral vertex u in H by utilizing the algorithm in [16], which is originally proposed for finding pseudo-peripheral vertices in graphs. This algorithm easily extends to hypergraphs thanks to the definition of adjacent vertices in hypergraphs, which is introduced in Section 2.1. After a pseudo-peripheral vertex u is obtained, a vertex v which is one of the furthest vertices from u is found. If the distance between u and v is less than K − 1, we conclude that a K -way BDCO partition of H is not feasible by the discussion given in Section 3.2. Otherwise, the recursive HPBDCO algorithm is invoked to obtain a K -way BDCO partition of H. The basic outline of the recursive step of the HPBDCO algorithm is given as follows: 1. Obtain a bipartition {V1 , V2 } of hypergraph H = (V , N ). 2. Construct two hypergraphs H1 and H2 respectively on the vertex sets V1 and V2 . 3. Invoke the HPBDCO algorithm on H1 , which returns a K /2way BDCO partition Π1 = ⟨V11 , . . . , VK1 /2 ⟩ of H1 . 4. Invoke the HPBDCO algorithm on H2 , which returns a K /2way BDCO partition Π2 = ⟨V12 , . . . , VK2 /2 ⟩ of H2 . 5. Concatenate Π1 and Π2 to obtain a K -way BDCO partition Π of H and return Π . This basic outline does not involve the details that are essential to satisfying the constraint and feasibility of a BDCO partition. In the following paragraphs, we construct the proposed BDCO algorithm on top of this basic outline by stressing the points where this basic outline fails and describing techniques to resolve them. One problem with the given outline is that Π1 and Π2 being BDCO partitions does not automatically guarantee that Π too is a BDCO partition. In order for Π to satisfy the BDCO requirement, the cross-partition adjacencies between Π1 and Π2 should be confined to be between the last part of Π1 and the first part of Π2 . S. Acer and C. Aykanat / Journal of Parallel and Distributed Computing 140 (2020) 99–109 In other words, the vertices in V1 that are adjacent to vertices in V2 should be assigned to VK1 /2 in Π1 . In a dual manner, the vertices in V2 that are adjacent to vertices in V1 should be assigned to V12 in Π2 . The HPBDCO algorithm confines the cross-partition adjacencies to be between VK1 /2 and V12 by fixing the end vertices of these adjacencies to respective parts before each bipartitioning. How we achieve this will be clear after we introduce the vertex fixation scheme in the proposed algorithm. Another problem with the given outline is that the feasibility of BDCO partitions Π1 and Π2 is not guaranteed. That is, depending on the bipartition {V1 , V2 } of H, the diameters of hypergraphs H1 or H2 may be less than K /2 − 1 even though the diameter of H is at least K − 1. Assume that two pseudo-peripheral vertices u and v in H are given, where the distance between u and v is at least K − 1, i.e., d(u, v ) ≥ K − 1. The HPBDCO algorithm ensures that the diameters of both H1 and H2 are at least K /2 − 1 by • fixing vertex u′ to V1 for each vertex u′ ∈ V with d(u′ , u) < K /2 and • fixing vertex v ′ to V2 for each vertex v ′ ∈ V with d(v ′ , v ) < K /2. Note that the vertices that are fixed to V1 and the vertices that are fixed to V2 are disjoint due to d(u, v ) ≥ K − 1. Having the above-mentioned fixed vertices during the bipartitioning, H1 and H2 are guaranteed to have vertex pairs (u, u′ ) and (v, v ′ ) with d(u, u′ ) ≥ K /2 − 1 and d(v, v ′ ) ≥ K /2 − 1, respectively. Therefore, by the described vertex fixation scheme, the existence of BDCO partitions Π1 and Π2 is guaranteed by the discussion given in Section 3.2. Algorithm 1 displays the proposed recursive HPBDCO algorithm, which we build upon the basic outline by adding a vertex fixation scheme to resolve the two problems mentioned above. As input, it takes a hypergraph H = (V , N ), the desired number of parts K , and two vertex subsets P1 ⊆ V and P2 ⊆ V that are respectively referred to as the left and the right boundary vertex sets of H. At the initial invocation of this recursive algorithm, the input boundary vertex sets P1 and P2 are respectively set to {v} and {u}, where v and u are two pseudo-peripheral vertices of H with d(u, v ) = diameter(H). In Algorithm 1, the first step is to perform the proposed vertex fixation scheme (lines 1–2). We obtain sets of vertices fixed to V1 and V2 , i.e., F1 and F2 , by invoking the FIX function on H and boundary vertex sets P1 and P2 , respectively. Algorithm 2 displays the basic steps of the FIX function. This algorithm performs multi-source BFS sourced at a given boundary vertex set P in order to find vertices in H whose distances to P are smaller than D = K /2. In other words, for each vertex v ∈ V with d(v, P1 ) < K /2, the first invocation of FIX includes v in F1 , whereas for each vertex v ∈ V with d(v, P2 ) < K /2, the second invocation of FIX includes v in F2 . Note that P1 ⊆ F1 and P2 ⊆ F2 , that is, the vertices in the left and right boundary vertex sets are always fixed to the left and right parts, respectively. This ensures that the cross-partition adjacencies between Π1 and Π2 are confined to be between the last part of Π1 and the first part of Π2 , which is required for maintaining the BDCO structure throughout the overall RB process. After fixing vertices, Algorithm 1 obtains a bipartition {V1 , V2 } of hypergraph H with fixed vertices assigned to their respective parts, i.e., F1 ⊆ V1 and F2 ⊆ V2 (line 3). Here, any hypergraph partitioning tool that supports fixed vertices can be utilized. If K is equal to 2, which is the base case, then bipartition Π = ⟨V1 , V2 ⟩ is returned (lines 5 and 22). Note that any bipartition is trivially a 2-way BDCO partition. If K is larger than 2, the recursive step of the HPBDCO algorithm is performed (lines 6–21). In the recursive step of Algorithm 1, new hypergraphs H1 = (V1 , N1 ) and H2 = (V2 , N2 ) are formed on vertex sets V1 and V2 103 Algorithm 1 HPBDCO Require: Hypergraph H = (V , N ), integer K , boundary vertex sets P1 and P2 ▷ Fix vertices F1 ← FIX(H, P1 , K /2) 2: F2 ← FIX(H, P2 , K /2) 1: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: ▷ fix some vertices to V1 ▷ fix some vertices to V2 ▷ Bipartition H with fixed vertices {V1 , V2 } ← PARTITIONER(H, F1 , F2 , 2) if K = 2 then ▷ base case Π ← ⟨V1 , V2 ⟩ ▷ Π is a 2-way BDCO partition ▷ form H1 and H2 and recurse on them else H1 ← (V1 , N1 = ∅) H2 ← (V2 , N2 = ∅) ▷ Form N1 and N2 using cut-net removal technique for each n ∈ N do if n is internal to V1 then N 1 ← N 1 ∪ { n} else if n is internal to V2 then N 2 ← N 2 ∪ { n} 18: ▷ Form boundary vertex sets of H1 and H2 P11 ← P1 , P22 ← P2 ▷ inherited boundary vertex sets P12 ← P21 ← ∅ ▷ new boundary vertex sets for each cut-net n ∈ N do P12 ← P12 ∪ (Pins(n) ∩ V1 ) P21 ← P21 ∪ (Pins(n) ∩ V2 ) 19: ▷ Recursive invocations on H1 and H2 Π1 = ⟨V11 , . . . , VK1 /2 ⟩ ← HPBDCO (H1 , K /2, P11 , P12 ) 20: Π2 = ⟨V12 , . . . , VK2 /2 ⟩ ← HPBDCO (H2 , K /2, P21 , P22 ) 14: 15: 16: 17: ▷ Concatenate Π1 and Π2 to obtain Π 21: 22: Π ← ⟨Π1 , Π2 ⟩ return Π and the HPBDCO algorithm is invoked on H1 and H2 to obtain K /2way BDCO partitions Π1 and Π2 , respectively. Net sets N1 and N2 are obtained via cut-net removal scheme which is described as follows. They are both initialized as empty sets (lines 7–8) and each internal net of V1 is included in N1 , whereas each internal net of V2 is included in N2 (lines 9–13). Note that none of the net sets include cut nets. Also note that the pin set of each net included in N1 and N2 remains unchanged. Before performing the recursive invocations on newly formed hypergraphs H1 and H2 , Algorithm 1 forms boundary vertex sets for H1 and H2 . H1 inherits its left boundary vertex set P11 from H, i.e., P11 = P1 . In a similar manner, H2 inherits its right boundary vertex set P22 from H, i.e., P22 = P2 . The right boundary vertex set of H1 and the left boundary vertex set of H2 , which are respectively denoted by P12 and P21 , are newly formed by utilizing bipartition Π (lines 15–18). Each boundary vertex in V1 that is connected by at least one cut net in Π is included in P12 . In a dual manner, each boundary vertex in V2 that is connected by at least one cut net in Π is included in P21 . Note that except for the boundary vertex sets P1 = {v} and P2 = {u} of the initial invocation, all vertices in boundary vertex sets are indeed boundary vertices of some bipartition obtained during the overall partitioning procedure. After obtaining the left and right boundary vertex sets for both H1 and H2 , the recursive invocations of the HPBDCO algorithm on H1 and H2 are performed to obtain K /2-way BDCO partitions Π1 = ⟨V11 , . . . , VK1 /2 ⟩ and Π2 = ⟨V12 , . . . , VK2 /2 ⟩ (lines 19–20). These BDCO partitions are then concatenated to obtain a K -way 104 S. Acer and C. Aykanat / Journal of Parallel and Distributed Computing 140 (2020) 99–109 Algorithm 2 FIX Require: H = (V , N ), boundary vertex set P , integer D 1: ▷ Initialize fixed vertex set F ←∅ ▷ Initialize distance and mark arrays for each vertex v ∈ V do 3: dist [v] ← ∞ 4: mark[v] ← false 2: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: ▷ Initialize the queue for multi-source BFS Q←∅ for each vertex v ∈ P do ENQUEUE(Q, v ) dist [v] ← 0 mark[v] ← true ▷ Run BFS while Q ̸ = ∅ do v ← DEQUEUE(Q ) F ← F ∪ {v} ▷ Add v to fixed vertex set if dist [v] < D − 1 then for each vertex u ∈ Adj(v ) do if mark[u] = false then ENQUEUE(Q, u) dist [u] ← dist [v] + 1 mark[u] ← true return F Recall that the HPBDCO algorithm utilizes the cut-net removal technique while forming the net sets of the new hypergraphs at each RB step (lines 9–13 of Algorithm 1). Minimizing the cutsize at all RB steps together with utilizing the cut-net removal technique corresponds to minimizing the cutsize metric (3) of the K -way partition obtained at the end of the overall RB process [8]. Also recall from Section 3.1 that the cut nets in a BDCO partition of the column-net hypergraph H of a given matrix A correspond to the coupling columns in the resulting BDCO form. Then, the partitioning objective of minimizing the sum of the cutsizes of the RB steps corresponds to the permutation objective of minimizing the total overlap size of the BDCO form. Recall that for each vertex vi in the column-net hypergraph H of A, we set the weight of vi to the number of nonzeros in row i of A. Then, the weight of a part Vk in a K -way BDCO partition of H corresponds to the total number of nonzeros in diagonal block Ek in the permuted matrix ABDCO . So, maintaining balance on the part weights in (4) corresponds to maintaining balance on the number of nonzeros in diagonal blocks of the BDCO form. Since maintaining balance on the part weights of each bipartition in the proposed algorithm corresponds to maintaining balance on the part weights of the resulting K -way BDCO partition, it also corresponds to the permutation constraint of maintaining balance on the number of nonzeros of diagonal blocks of the BDCO form. The proposed algorithm introduces an additional overhead of finding fixed vertices to the partitioning cost of the standard recursive hypergraph bipartitioning approach. The runtime of the FIX algorithm on a particular hypergraph H = (V , N ) is O(V + E), where ∑E denotes the number of adjacency relations in H, i.e., E = v∈V |Adj(v )|. Since the vertices considered on the two invocations of the FIX algorithm in lines 1 and 2 of Algorithm 1 cannot overlap, each adjacency relation in H is considered only once during the invocation of the HPBDCO algorithm on H. Similarly, the vertices processed in separate HPBDCO invocations at the same level of the RB tree cannot overlap. Therefore, the proposed vertex fixation scheme introduces an O(V + E)-time overhead to each level of the RB-tree, where V and E denote the number of vertices and the number of adjacency relations in the initial hypergraph, respectively. For K being a power of 2, which leads to log K levels in the RB tree, the total overhead introduced by the proposed algorithm is O(log K (V + E)). 4. Experiments Fig. 3. The proposed vertex fixation scheme illustrated on the first three levels of the RB tree. BDCO partition Π = ⟨Π1 , Π2 ⟩ = ⟨V1 , . . . , VK ⟩ (line 21). Finally, partition Π is returned. Fig. 3 illustrates the proposed vertex fixation scheme on the first three levels of the corresponding RB tree. In the figure, boundary vertices are shown in dark gray, whereas the fixed vertices are shown in light gray. As seen in the figure, the left and right boundary vertex sets at the initial invocation of the HPBDCO algorithm consist of pseudo-peripheral vertices u and v , respectively. Note that {u} and {v} remain to constitute the left and right boundary vertex sets of each of the further invocations on the leftmost and rightmost hypergraphs formed, respectively. As seen in lines 16–18 of Algorithm 1, each of the remaining boundary vertex sets utilized throughout the proposed algorithm is introduced by a bipartition. Recall that the HPBDCO algorithm is the first algorithm in the literature that addresses the BDCO problem. For evaluating the performance of HPBDCO , we consider a baseline method that first permutes the given matrix into a banded form and then applies block partitioning on it to obtain K diagonal blocks with column overlap. The details of this baseline algorithm, which we refer to as the RCMBDCO algorithm, are described in Section 4.1. We performed two different sets of experiments. The first set of experiments, which are described in Section 4.2, were performed on 45 semi-real sparse matrices, for which we already know nice BDCO forms exist. The second set of experiments, which are described in Section 4.3, were performed on 8 real sparse matrices, for which no prior BDCO-form knowledge exists. In both sets, we first compare RCMBDCO and HPBDCO in terms of total overlap size and the imbalance ratio on the number of nonzeros in diagonal blocks. Then we compare these two algorithms in terms of the runtime of the target parallel code [24] on a subset of matrices on which RCMBDCO performs relatively close to HPBDCO in terms of overlap size. We used the parallel code [24] with default settings and measured the total time for solving the least squares problems in diagonal blocks, forming the reduced system and solving it, and S. Acer and C. Aykanat / Journal of Parallel and Distributed Computing 140 (2020) 99–109 105 4.2. Reordering semi-real matrices into BDCO form Fig. 4. Obtaining a BDCO form on a banded matrix. finally broadcasting the solution of the reduced system. We conducted the parallel experiments on up to 64 processors on a Linux workstation equipped with four 18-core CPUs (Intel Xeon Processor E7-8860 v4) and 256 GB of memory. For the hypergraph bipartitioning in the proposed algorithm (line 3 in Algorithm 1), we used PaToH [10] with the quality setting. We used 10% as the maximum allowable imbalance ratio for the resulting K -way partition. 4.1. Baseline algorithm: RCMBDCO In the baseline RCMBDCO algorithm, we first permute A into a banded form and then apply block partitioning on the banded matrix. Note that K -way block row partitioning on a banded matrix results in a K -way BDCO form, which is illustrated in Fig. 4. For permuting A into a banded form, we use one of the bandwidth minimization algorithms proposed for nonsymmetric matrices in [20]. While it is cumbersome to define the bandwidth of a rectangular matrix, we expect this algorithm to place the nonzeros of A in a hopefully narrow band that starts from the top left corner and ends at the bottom right corner, as illustrated in Fig. 4. In this algorithm, first a symmetric B matrix of size (m + n) × (m + n) is obtained using A as follows: B= [ 0 AT A 0 ] . Row i of B represents row i of A if i ≤ m, and it represents column i − m of A, otherwise. It is important to note that the standard graph representation of B corresponds to the bipartite graph representation of A. Then, the standard RCM algorithm (symrcm routine in MATLAB) is run on B for finding a row/column permutation that minimizes the bandwidth of B. This is effectively equivalent to running RCM on the bipartite graph of A. To reorder the rows and columns of A, we consider rows/ columns of B in the order provided by the above-mentioned permutation. Suppose that row/column j of B is being considered at the moment. If j represents a row of A (i.e., j ≤ m), then row j of A is ordered as the next row. If j represents a column of A (i.e., j > m), then column j − m of A is ordered as the next column. After obtaining the reordered matrix A, which is assumed to be in a banded form, we obtain K row blocks on it by simple block partitioning where we balance the number of nonzeros across the blocks. While forming the dataset in this section, we follow an approach similar to the one utilized for obtaining ‘‘Realistic Dataset’’ in [24]. In the SuiteSparse collection [15], there were a total of 38 non-square matrices in categories ‘‘least squares problem’’, ‘‘computer graphics/vision problem’’, and ‘‘power network problem’’ at the time of experimentation. To eliminate small matrices, we excluded those with the sum of the number of rows and number of columns smaller than 3000. Due to the memory limitation, we excluded the matrices with more than 1,000,000 nonzeros. For each of the remaining matrices in the dataset, we linked together 64 copies of it to obtain a semi-real matrix in 64-way BDCO form so that each diagonal block of the semi-real matrix corresponds to the original matrix. For tall and skinny matrices in the dataset, we utilized their transposes in order to have underdetermined systems of linear equations. While obtaining a semi-real matrix by linking together 64 copies of an original matrix, we used a fixed size for the column overlap, denoted by o, between each pair of consecutive diagonal blocks. For each original matrix, we obtained five different semireal matrices in 64-way BDCO forms obtained by utilizing overlap sizes o ∈ {5, 10, 20, 50, 100}. Among the obtained semi-real matrices, we excluded matrices with disconnected hypergraphs. In order to prevent the RCMBDCO and HPBDCO algorithms from being biased with the existing BDCO forms, we destroyed those BDCO forms by randomly reordering rows and columns of each semi-real matrix. The resulting dataset consists of 45 semi-real matrices derived from a total of 9 original matrices. Table 1 displays the properties of the test matrices obtained by the above-mentioned procedure along its first four columns. In the table, matrices are displayed in increasing order of their number of nonzeros. Columns 1, 2, 3, and 4 display matrix name, number of rows, number of columns, and number of nonzeros, respectively. Note that for each original matrix M, five semi-real matrices obtained from M are displayed consecutively. The semireal matrix obtained from M with column overlap o is denoted by M_o. Note that for each matrix M, the number of rows in semireal matrices M_5, M_10, M_20, M_50, and M_100 is the same as well as the number of nonzeros. On the other hand, the number of columns differs for each semi-real matrix as the overlap size also differs. Since larger overlap size implies smaller number of columns, the number of columns decreases as overlap size o increases. Table 1 displays the results obtained by the RCMBDCO and HPBDCO algorithms along columns 5–7 and 8–11, respectively. The RCMBDCO algorithm was run on each matrix M_o once as described in Section 4.1. In the HPBDCO algorithm, the columnnet hypergraph of matrix M_o was partitioned into 64 parts and then the resulting partition was utilized to permute M_o into a 64-way BDCO form. This was repeated for ten times and the average results of these ten runs are reported in columns 8–10. Columns 5 and 8 display the resulting imbalance values, which are computed as the ratio of the maximum to the average number of nonzeros in diagonal blocks minus one. Column 6 and 9 display the total overlap sizes, which are computed as the sum of the number of coupling columns in the resulting BDCO form. Column 7 and 10 display the ratio of the total overlap size to the total number of columns in the corresponding matrix. Column 11 displays the number of HPBDCO runs that resulted in an ideal BDCO permutation on the corresponding matrix in terms of total overlap size out of ten runs. For a matrix M_o, we define an ideal BDCO permutation as a permutation having a total overlap size smaller than 63 × o × 1.1. That is, for o = 5, 10, 20, 50, and 100, an ideal 64-way BDCO permutation has a total overlap size smaller than 347, 693, 1386, 3465, and 6930, respectively. The table does not 106 S. Acer and C. Aykanat / Journal of Parallel and Distributed Computing 140 (2020) 99–109 Table 1 Properties of the semi-real test matrices and permutation results obtained by the RCMBDCO and HPBDCO algorithms. Matrix #rows #columns #nonzeros RCMBDCO imbalance (%) photogrammetry2_5 photogrammetry2_10 photogrammetry2_20 photogrammetry2_50 photogrammetry2_100 gemat1_5 gemat1_10 gemat1_20 gemat1_50 gemat1_100 psse1_5 psse1_10 psse1_20 psse1_50 psse1_100 Kemelmacher_5 Kemelmacher_10 Kemelmacher_20 Kemelmacher_50 Kemelmacher_100 psse0_5 psse0_10 psse0_20 psse0_50 psse0_100 psse2_5 psse2_10 psse2_20 psse2_50 psse2_100 graphics_5 graphics_10 graphics_20 graphics_50 graphics_100 image_interp_5 image_interp_10 image_interp_20 image_interp_50 image_interp_100 mesh_deform_5 mesh_deform_10 mesh_deform_20 mesh_deform_50 mesh_deform_100 Average HPBDCO overlap† overlap/ #columns imbalance (%) overlap† overlap/ #columns #ideal runs* 59,904 285,893 2,371,584 0.1 199,801 0.6989 0.0 315 0.0011 10 59,904 285,578 2,371,584 0.0 190,726 0.6679 0.0 630 0.0022 10 59,904 284,948 2,371,584 0.0 191,730 0.6729 0.0 8,929 0.0313 1 59,904 283,058 2,371,584 0.0 193,401 0.6833 0.0 3,605 0.0127 9 59,904 279,908 2,371,584 0.4 219,117 0.7828 9.1 20,446 0.0730 0 315,456 677,765 3,031,616 0.0 92,146 0.1360 0.0 5,298 0.0078 3 315,456 677,450 3,031,616 0.0 536,989 0.7927 2.8 14,783 0.0218 0 315,456 676,820 3,031,616 0.0 157,881 0.2333 4.5 25,890 0.0383 0 315,456 674,930 3,031,616 0.0 577,555 0.8557 0.0 13,294 0.0197 3 315,456 671,780 3,031,616 10.3 319,351 0.4754 0.0 29,304 0.0436 3 705,792 916,037 3,672,064 0.0 106,617 0.1164 0.0 632 0.0007 1 705,792 915,722 3,672,064 0.0 104,075 0.1137 3.7 1,666 0.0018 0 705,792 915,092 3,672,064 0.0 184,804 0.2020 3.0 2,138 0.0023 0 705,792 913,202 3,672,064 0.0 290,077 0.3176 2.8 2,233 0.0024 10 10 705,792 910,052 3,672,064 0.0 413,962 0.4549 2.9 2,168 0.0024 620,352 1,820,613 6,456,000 0.0 843,697 0.4634 0.0 355 0.0002 8 620,352 1,820,298 6,456,000 0.0 427,100 0.2346 0.0 630 0.0003 10 620,352 1,819,668 6,456,000 0.0 834,819 0.4588 0.0 1,374 0.0008 6 620,352 1,817,778 6,456,000 0.0 567,986 0.3125 0.0 2,960 0.0016 10 620,352 1,814,628 6,456,000 0.0 864,040 0.4762 1.3 10,919 0.0060 0 705,792 1,709,893 6,555,648 0.0 79,383 0.0464 0.0 369 0.0002 7 705,792 1,709,578 6,555,648 0.0 1,459,255 0.8536 0.2 712 0.0004 7 705,792 1,708,948 6,555,648 0.0 319,190 0.1868 1.7 2,265 0.0013 1 705,792 1,707,058 6,555,648 0.0 313,960 0.1839 1.0 2,735 0.0016 8 705,792 1,703,908 6,555,648 0.0 414,302 0.2431 4.0 6,313 0.0037 9 705,792 1,832,261 7,376,768 0.0 240,840 0.1314 0.0 575 0.0003 4 705,792 1,831,946 7,376,768 0.0 297,952 0.1626 0.0 990 0.0005 4 705,792 1,831,316 7,376,768 0.1 458,841 0.2506 0.0 4,048 0.0022 1 705,792 1,829,426 7,376,768 0.0 483,611 0.2644 3.1 4,968 0.0027 3 705,792 1,826,276 7,376,768 0.0 723,853 0.3964 2.6 4,918 0.0027 10 756,608 1,887,237 7,549,056 0.0 22,908 0.0121 0.0 315 0.0002 10 756,608 1,886,922 7,549,056 0.0 4,780 0.0025 0.0 1,068 0.0006 8 756,608 1,886,292 7,549,056 0.0 6,406 0.0034 0.0 1,559 0.0008 9 756,608 1,884,402 7,549,056 0.0 8,275 0.0044 0.1 3,629 0.0019 8 756,608 1,881,252 7,549,056 0.0 44,322 0.0236 0.3 5,657 0.0030 8 7,680,000 15,359,685 45,547,712 0.0 95,484 0.0062 0.0 315 0.0000 10 7,680,000 15,359,370 45,547,712 0.0 92,576 0.0060 0.0 630 0.0000 10 7,680,000 15,358,740 45,547,712 0.0 87,563 0.0057 0.0 1,259 0.0001 10 7,680,000 15,356,850 45,547,712 0.0 73,343 0.0048 0.0 3,149 0.0002 10 7,680,000 15,353,700 45,547,712 0.0 49,695 0.0032 0.0 6,173 0.0004 10 601,152 14,977,157 54,645,056 0.0 1,007,329 0.0673 0.0 315 0.0000 10 – 601,152 14,976,842 54,645,056 0.0 921,691 0.0615 0.0 630 0.0000 10 601,152 14,976,212 54,645,056 0.0 755,161 0.0504 0.0 1,259 0.0001 10 601,152 14,974,322 54,645,056 0.0 2,222,376 0.1484 0.0 3,149 0.0002 10 601,152 14,971,172 54,645,056 0.0 1,714,018 0.1145 0.0 6,299 0.0004 10 – – 0.2 – 0.2752 1.0 – 0.0065 († ): boldface denotes achieving an ideal BDCO permutation on average. (∗ ): #ideal runs denotes the number of runs that HPBDCO achieves an ideal BDCO permutation out of ten runs. have such a column for RCMBDCO , because it is not able achieve any ideal BDCO permutations. As seen in Table 1, compared to the RCMBDCO algorithm, the HPBDCO algorithm always obtains smaller total overlap size while usually obtaining slightly larger imbalance. On average, RCMBDCO places 27.5% of the columns in overlaps, while this ratio is only 0.6% for HPBDCO . The overlap size obtained by RCMBDCO gets closest to those obtained by HPBDCO on matrices graphic_10, graphic_20, and graphic_50, still being 4.48x, 4.11x, and 2.28x worse, respectively. For RCMBDCO and HPBDCO , imbalance ratios on the number of nonzeros in diagonal blocks are 0.2% and 1.0% on average, respectively. As seen in Table 1, the HPBDCO algorithm obtains an ideal BDCO permutation in each of the 10 runs on 18 test matrices, which correspond to the 40% of the dataset. It obtains 9, 8, 7, and 6 ideal BDCO permutations out of 10 runs on 3, 5, 2, and 1 test matrices, respectively. So, the HPBDCO algorithm obtains an ideal permutation in at least 5 of the 10 runs on 29 out of 45 test matrices, which correspond to 64% of the dataset. When all runs of all test matrices are considered, the HPBDCO algorithm obtains S. Acer and C. Aykanat / Journal of Parallel and Distributed Computing 140 (2020) 99–109 107 Table 2 Runtimes of the parallel code [24] on three semi-real matrices for K = 64 (in seconds). Matrix RCMBDCO HPBDCO graphics_10 graphics_20 graphics_50 1.70 2.54 3.00 0.87 0.88 1.16 Fig. 5. Matrix mri1. an ideal permutation in 291 runs, which correspond to the 64% of the 450 runs. Although the HPBDCO algorithm could not obtain any ideal permutations on some test matrices, the overlap size normalized with respect to the number of columns is still small on almost all test matrices. For example on each of matrices psse1_10, psse1_20, and Kemelmacher_100, the normalized overlap size is smaller than 0.01 although no ideal permutations are obtained on them. Note that the HPBDCO algorithm obtains a normalized overlap size smaller than 0.05 for each test matrix. For 38 out of 45 matrices, the normalized overlap size obtained by the HPBDCO algorithm is smaller than 0.05. It is smaller than 0.001 and 0.0001 for 21 and 4 matrices, respectively. As seen in Table 1, the performance of the HPBDCO algorithm in total overlap size usually increases with increasing matrix size. This can be attributed to the increasing solution space of the respective partitioning problem. For example, consider the largest 15 matrices in the dataset. The HPBDCO algorithm almost always finds an ideal BDCO partition on these large matrices. As seen in Table 1, the HPBDCO algorithm obtains an imbalance value smaller than 5% on each test matrix, except for photogrammetry_2_100 which has an imbalance value of 9.1%. Note that the HPBDCO algorithm obtains an imbalance value smaller than 1% on 32 matrices, which corresponds to the 71% of the 45 matrices. Table 2 displays the runtimes of the parallel code [24] on 64 processors when the input coefficient matrix is reordered into 64-way BDCO form by RCMBDCO and HPBDCO . In these experiments, we only consider the three matrices on which RCMBDCO performs relatively close to HPBDCO in terms of total overlap size. As seen in the table, the parallel runtimes obtained by RCMBDCO on graphics_10, graphics_20, and graphics_50 are 1.70, 2.54, and 3.00 seconds, whereas these runtimes are 0.87, 0.88, and 1.16 seconds for RCMBDCO , respectively. On these matrices, the 1.95x, 2.88x, and 2.58x speedups in parallel runtime are due to the 4.48x, 4.11x, and 2.28x reductions in total overlap size. 4.3. Reordering real matrices into BDCO form In this section, we performed experiments on eight real matrices obtained from the SuiteSparse collection [15]. Unlike the case Fig. 6. 4-, 8-, 16-, and 32-way BDCO forms obtained on matrix mri1 by the HPBDCO algorithm. 108 S. Acer and C. Aykanat / Journal of Parallel and Distributed Computing 140 (2020) 99–109 Table 3 Properties of real matrices and permutation results obtained by RCMBDCO and HPBDCO . Matrix #rows #columns #nonzeros K RCMBDCO HPBDCO imbalance (%) watson_1 watson_2 lp_ganges model8 l30 lp_truss progas overlap overlap/ #columns imbalance (%) overlap overlap/ #columns 201,155 386,992 1,055,093 2 0.0 98,866 0.2555 0.0 26 352,013 677,224 1,846,391 2 0.0 160,097 0.2364 4.8 16 0.0000 1,309 1,706 6,937 4 4.7 628 0.3681 4.3 212 0.1243 2,896 6,464 25,277 8 0.2 3,459 0.5351 6.7 2,005 0.3102 2,701 16,281 52,070 8 0.3 5,052 0.3103 2.2 434 0.0267 1,000 8,806 27,836 16 1.6 6,487 0.7367 6.3 3,410 0.3872 1,650 1,900 8,897 16 2.0 1,303 0.6858 0.4 737 0.3879 in Section 4.2, we do not know if a K -way BDCO form exists on these matrices for a given K value, until we run the RCMBDCO and HPBDCO algorithms. In the first part of these experiments, we used seven matrices on which the HPBDCO algorithm is able to obtain a 16-way BDCO form. The properties of these matrices and the permutation results are given in Table 3. The RCMBDCO algorithm, however, fails to obtain a 16-way BDCO form on most of these matrices. It fails when more than two diagonal blocks overlap along a column, which happens when the band obtained by RCM is not narrow enough. Hence, on each of these matrices, we ran the RCMBDCO algorithm for K ∈ {2, 4, 8, 16} values and report the results for the largest K value where RCM does not fail to obtain a K -way BDCO form. As seen in Table 3, the overlap size obtained by HPBDCO is smaller than that obtained by HPBDCO on each of the seven matrices. On watson_1 and watson_2, the total overlap sizes obtained by RCMBDCO are 98,866 and 160,097 for K = 2, whereas they are only 26 and 16 for HPBDCO , respectively. On lp_ganges, the overlap sizes obtained by RCMBDCO and HPBDCO for K = 4 are 628 and 212, which correspond to 36.8% and 12.4% of all columns, respectively. On model8 and l30, RCMBDCO places 53.5% and 31.0% of columns in overlaps for K = 8, while these ratios are 31.0% and 2.7% for HPBDCO , respectively. Finally on lp_truss and progas, the respective ratios for K = 16 are 73.7% and 68.6% for RCMBDCO and 38.7% and 38.8% for HPBDCO , respectively. On these matrices, the imbalance values obtained by RCMBDCO are generally smaller than those obtained by HPBDCO . In the second part of the experiments, we show how the performances of the RCMBDCO and HPBDCO algorithms vary on different K values. These experiments were performed for K ∈ {4, 8, 16, 32, 64} values on a real matrix, mri1, which is also obtained from the SuiteSparse collection [15]. This matrix was obtained from MRI reconstruction data and contains 65,536 rows, 147,456 columns, and 589,824 nonzeros. We removed 32,819 empty columns, so, the resulting matrix contains 114,637 columns. Fig. 5 displays the sparsity pattern of the resulting mri1 matrix. Table 4 displays the resulting imbalance values and total overlap sizes obtained by RCMBDCO and HPBDCO on mri1. Fig. 6 displays the 4-, 8-, 16-, and 32-way BDCO forms of matrix mri1 obtained by HPBDCO . As seen in Table 4, compared to the RCMBDCO algorithm, the HPBDCO algorithm obtains smaller total overlap size while obtaining slightly larger imbalance for each K value. For K being 4, 8, 16, 32, and 64, the total overlap sizes obtained by RCMBDCO are 2381, 5342, 11,103, 22,485, and 45,247, whereas those obtained by HPBDCO are 2174, 5066, 9648, 19,687, and 41,067, respectively. The performance degradation in the overlap size with increasing K values is expected for both algorithms as the problem gets harder. Note that the performance gap between RCMBDCO and HPBDCO on mri1 is small compared to the other matrices in this work. We attribute this small performance gap to the fact that the graph/hypergraph corresponding to mri1 has a more regular 0.0001 Table 4 Permutation results obtained on matrix mri1. K 4 8 16 32 64 RCMBDCO HPBDCO imb. (%) overlap overlap/ #columns imb. (%) overlap overlap/ #columns 0.0 0.0 0.0 0.0 0.1 2,381 5,342 11,103 22,485 45,247 0.0208 0.0466 0.0969 0.1961 0.3947 0.6 5.7 2.0 2.1 2.1 2,174 5,066 9,648 19,687 41,067 0.0190 0.0442 0.0842 0.1717 0.3582 Table 5 Runtimes of the parallel code [24] on real matrices (in seconds). Matrix K RCMBDCO HPBDCO lp_ganges model8 l30 lp_truss progas mri1 4 8 8 16 16 64 0.11 3.03 9.52 7.58 0.10 100.83 0.03 1.69 0.57 3.78 0.06 75.40 structure compared to the other matrices, so, RCMBDCO is able to achieve a much narrower band on this matrix. In the last part of the experiments, we ran the parallel code [24] on the real matrices. Table 5 displays the respective results. In these experiments, we considered all real matrices except for watson_1 and watson_2. The reason why we excluded watson_1 and watson_2 is because the reduced systems obtained by the RCMBDCO algorithm for these matrices for K = 2 were too large to be solved by the parallel code. For each remaining matrix, we performed the parallel experiments for the largest K value for which both RCMBDCO and HPBDCO are able to find solutions. As seen in the table, the parallel runtimes obtained by RCMBDCO on lp_ganges, model8, l30, lp_truss, progas, and mri1 are 0.11, 3.03, 9.52, 7.58, 0.10, and 100.83 seconds, respectively. The parallel runtimes obtained by HPBDCO on lp_ganges, model8, l30, lp_truss, progas, and mri1 are 0.03, 1.69, 0.57, 3.78, 0.06, and 75.40 seconds, respectively. Note that the largest reduction rate obtained by HPBDCO is observed on l30, where the parallel runtime reduces from 9.52 seconds to 0.57 seconds. The 16x speedup in the parallel runtime for this matrix is explained by the 12x reduction in total overlap size. 5. Conclusion In this paper, we target the problem of reordering a given sparse matrix into block-diagonal column-overlapped (BDCO) form, which arises in a recent parallel algorithm proposed for obtaining the minimum norm solution of a given underdetermined system of equations. We first defined an equivalent hypergraph partitioning problem with an additional constraint tailored to the BDCO form and investigated the feasibility of its solutions. S. Acer and C. Aykanat / Journal of Parallel and Distributed Computing 140 (2020) 99–109 Then we proposed a recursive-bipartitioning-based partitioning algorithm utilizing fixed vertices in order to solve this problem. The experimental results prove the effectiveness of the proposed algorithm in obtaining a BDCO form of a given sparse matrix. Declaration of competing interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. References [1] S. Acer, E. Kayaaslan, C. Aykanat, A recursive bipartitioning algorithm for permuting sparse square matrices into block diagonal form with overlap, SIAM J. Sci. Comput. 35 (1) (2013) C99–C121, http://dx.doi.org/10.1137/ 120861242, arXiv:https://doi.org/10.1137/120861242. [2] P.R. Amestoy, I.S. Duff, C. Puglisi, Multifrontal QR factorization in a multiprocessor environment, Numer. Linear Algebra Appl. 3 (4) (1996) 275–300. [3] C. Aykanat, B.B. Cambazoglu, F. Findik, T. Kurc, Adaptive decomposition and remapping algorithms for object-space-parallel direct volume rendering of unstructured grids, J. Parallel Distrib. Comput. 67 (1) (2007) 77–99, http:// dx.doi.org/10.1016/j.jpdc.2006.05.005, URL http://www.sciencedirect.com/ science/article/pii/S0743731506001122. [4] C. Aykanat, B.B. Cambazoglu, B. Uçar, Multi-level direct K-way hypergraph partitioning with multiple constraints and fixed vertices, J. Parallel Distrib. Comput. 68 (5) (2008) 609–625, http://dx.doi.org/10.1016/ j.jpdc.2007.09.006, URL http://www.sciencedirect.com/science/article/pii/ S0743731507001724. [5] C. Aykanat, A. Pinar, U.V. Catalyurek, Permuting sparse rectangular matrices into block-diagonal form, SIAM J. Sci. Comput. 25 (6) (2004) 1860– 1879, http://dx.doi.org/10.1137/S1064827502401953, arXiv:http://dx.doi. org/10.1137/S1064827502401953. [6] A.M. Bruckstein, D.L. Donoho, M. Elad, From sparse solutions of systems of equations to sparse modeling of signals and images, SIAM Rev. 51 (1) (2009) 34–81, http://dx.doi.org/10.1137/060657704, arXiv:http://dx.doi. org/10.1137/060657704. [7] A. Buttari, Fine-grained multithreading for the multifrontal QR factorization of sparse matrices, SIAM J. Sci. Comput. 35 (4) (2013) C323–C345, http:// dx.doi.org/10.1137/110846427, arXiv:http://dx.doi.org/10.1137/110846427. [8] U.V. Catalyurek, C. Aykanat, Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication, IEEE Trans. Parallel Distrib. Syst. 10 (7) (1999) 673–693, http://dx.doi.org/10.1109/71.780863. [9] U.V. Catalyurek, E.G. Boman, K.D. Devine, D. Bozdağ, R.T. Heaphy, L.A. Riesen, A repartitioning hypergraph model for dynamic load balancing, J. Parallel Distrib. Comput. 69 (8) (2009) 711–724, http://dx.doi.org/10.1016/ j.jpdc.2009.04.011, URL http://www.sciencedirect.com/science/article/pii/ S0743731509000847. [10] U.V. Çatalyürek, C. Aykanat, PaToH: Partitioning Tool for Hypergraphs, Tech. Rep., Department of Computer Engineering, Bilkent University, 1999. [11] A. Cevahir, C. Aykanat, A. Turk, B.B. Cambazoglu, Site-based partitioning and repartitioning techniques for parallel Page Rank computation, IEEE Trans. Parallel Distrib. Syst. 22 (5) (2011) 786–802, http://dx.doi.org/10. 1109/TPDS.2010.119. [12] S.F. Cotter, B.D. Rao, K. Engan, K. Kreutz-Delgado, Sparse solutions to linear inverse problems with multiple measurement vectors, IEEE Trans. Signal Process. 53 (7) (2005) 2477–2488, http://dx.doi.org/10.1109/TSP. 2005.849172. [13] T.A. Davis, User’s guide for SuiteSparseQR, a multifrontal multithreaded sparse QR factorization package, 2009. [14] T.A. Davis, Algorithm 915, SuiteSparseQR: multifrontal multithreaded rankrevealing sparse QR factorization, ACM Trans. Math. Software 38 (1) (2011) 8:1–8:22, http://dx.doi.org/10.1145/2049662.2049670, URL http://doi.acm. org/10.1145/2049662.2049670. 109 [15] T.A. Davis, Y. Hu, The University of Florida sparse matrix collection, ACM Trans. Math. Softw. 38 (1) (2011) http://dx.doi.org/10.1145/2049662. 2049663, https://doi.org/10.1145/2049662.2049663. [16] A. George, J.W.H. Liu, An implementation of a pseudoperipheral node finder, ACM Trans. Math. Software 5 (3) (1979) 284–295, http://dx.doi.org/ 10.1145/355841.355845, URL http://doi.acm.org/10.1145/355841.355845. [17] G.H. Golub, A.H. Sameh, V. Sarin, A parallel balance scheme for banded linear systems, Numer. Linear Algebra Appl. 8 (5) (2001) 297–316. [18] C.L. Lawson, R.J. Hanson, Solving Least Squares Problems, SIAM, 1995. [19] K. Matsuura, Y. Okabe, Selective minimum-norm solution of the biomagnetic inverse problem, IEEE Trans. Biomed. Eng. 42 (6) (1995) 608–615, http://dx.doi.org/10.1109/10.387200. [20] J. Reid, J. Scott, Reducing the total bandwidth of a sparse unsymmetric matrix, SIAM J. Matrix Anal. Appl. 28 (3) (2006) 805–821, http://dx.doi. org/10.1137/050629938, arXiv:https://doi.org/10.1137/050629938. [21] A.H. Sameh, V. Sarin, Parallel algorithms for indefinite linear systems, Parallel Comput. 28 (2) (2002) 285–299, http://dx.doi.org/10. 1016/S0167-8191(01)00140-5, URL http://www.sciencedirect.com/science/ article/pii/S0167819101001405. [22] M.K. Sen, P.L. Stoffa, Global Optimization Methods in Geophysical Inversion, Cambridge University Press, 2013. [23] T.E. Tezduyar, A. Sameh, Parallel finite element computations in fluid mechanics, Comput. Methods Appl. Mech. Engrg. 195 (13–16) (2006) 1872– 1884, http://dx.doi.org/10.1016/j.cma.2005.05.038, A Tribute to Thomas J.R. Hughes on the Occasion of his 60th Birthday, URL http://www. sciencedirect.com/science/article/pii/S0045782505003105. [24] F.S. Torun, M. Manguoglu, C. Aykanat, Parallel minimum norm solution of sparse block diagonal column overlapped underdetermined systems, ACM Trans. Math. Software 43 (4) (2017) 31:1–31:21, http://dx.doi.org/10.1145/ 3004280, URL http://doi.acm.org/10.1145/3004280. [25] J.Z. Wang, S.J. Williamson, L. Kaufman, Magnetic source images determined by a lead-field analysis: the unique minimum-norm least-squares estimation, IEEE Trans. Biomed. Eng. 39 (7) (1992) 665–675, http://dx.doi.org/10. 1109/10.142641. [26] A. Yzelman, R. Bisseling, Cache-oblivious sparse matrix–vector multiplication by using sparse matrix partitioning methods, SIAM J. Sci. Comput. 31 (4) (2009) 3128–3154, http://dx.doi.org/10.1137/080733243, arXiv:https: //doi.org/10.1137/080733243. [27] M. Zhdanov, P. Wannamaker, in: M.S. Zhdanov, P.E. Wannamaker (Eds.), Geophysical Inverse Theory and Regularization Problems, in: Methods in Geochemistry and Geophysics, vol. 35, Elsevier, 2002, pp. v–vi. Seher Acer received her B.S., M.S. and Ph.D. degrees in Computer Engineering from Bilkent University, Turkey. She is currently a postdoctoral researcher at Center for Computing Research, Sandia National Laboratories, Albuquerque, New Mexico, USA. Her research interests include parallel computing and combinatorial scientific computing with a focus on partitioning sparse irregular computations. Cevdet Aykanat received the B.S. and M.S. degrees from Middle East Technical University, Ankara, Turkey, both in Electrical Engineering, and the Ph.D. degree from Ohio State University, Columbus, in Electrical and Computer Engineering. He worked at the Intel Supercomputer Systems Division, Beaverton, Oregon, as a research associate. Since 1989, he has been affiliated with Computer Engineering Department, Bilkent University, Ankara, Turkey, where he is currently a professor. His research interests mainly include parallel computing and its combinatorial aspects. He is the recipient of the 1995 Investigator Award of The Scientific and Technological Research Council of Turkey and 2007 Parlar Science Award. He has served as an Associate Editor of IEEE Transactions of Parallel and Distributed Systems between 2008 and 2012.