Sparse vertex-star relaxation for high-order FEMP. D. Brubeck and P. E. Farrell
A scalable and robust vertex-star relaxation for high-order FEM††thanks: Submitted to the editors August 31, 2021. \funding PDB was supported by the University of Oxford Mathematical Institute Graduate Scholarship. PEF was supported by EPSRC grants EP/V001493/1 and EP/R029423/1. We would also like to thank Lawrence Mitchell for his helpful advice for the implementation in Firedrake.
Abstract
Pavarino proved that the additive Schwarz method with vertex patches and a low-order coarse space gives a -robust solver for symmetric and coercive problems [42]. However, for very high polynomial degree it is not feasible to assemble or factorize the matrices for each patch. In this work we introduce a direct solver for separable patch problems that scales to very high polynomial degree on tensor product cells. The solver constructs a tensor product basis that diagonalizes the blocks in the stiffness matrix for the internal degrees of freedom of each individual cell. As a result, the non-zero structure of the cell matrices is that of the graph connecting internal degrees of freedom to their projection onto the facets. In the new basis, the patch problem is as sparse as a low-order finite difference discretization, while having a sparser Cholesky factorization. We can thus afford to assemble and factorize the matrices for the vertex-patch problems, even for very high polynomial degree. In the non-separable case, the method can be applied as a preconditioner by approximating the problem with a separable surrogate. We demonstrate the approach by solving the Poisson equation and a -conforming interior penalty discretization of linear elasticity in three dimensions at .
keywords:
preconditioning, high-order, tensor product, additive Schwarz, sparse Cholesky65F08, 65N35, 65N55
10.1137/21M1444187
1 Introduction
For problems with smooth solutions, high-order finite element methods offer very good convergence properties, and in some cases they do not exhibit locking phenomena found in low-order methods. Moreover, there exist optimal matrix-free algorithms for operator evaluation with high arithmetic intensity, arising from data locality, that make efficient use of modern parallel hardware architectures. Unfortunately, the conditioning of the stiffness matrix is severely affected by the polynomial degree of the approximation. In order to obtain practical iterative solvers, we require good preconditioners.
Optimal solvers are often obtained from a multiplicative multigrid V-cycle where the smoother consists of a domain decomposition method, such as additive Schwarz with a particular space decomposition. The multigrid algorithm is then accelerated by a Krylov subspace method, such as preconditioned conjugate gradients (PCG). The choice of space decomposition in the relaxation is crucial for robustness with respect to the cell size , the polynomial degree , and parameters in the equation.
One of the cheapest relaxations, with computational cost, is diagonal scaling, also known as point-Jacobi. The diagonally preconditioned Laplacian has a condition number of [36]. This implies that the number of PCG iterations, and therefore the number of residual evaluations, is , incurring a total cost of . In order to minimize the time to solution, it is reasonable to consider more expensive relaxation methods that converge in fewer iterations. Ideally, we wish to balance the cost of applying the relaxation with that of updating the residual. On tensor product elements, the latter operation can be done quickly in operations via the sum-factorization. Sum-factorization breaks down the residual evaluation into products of one-dimensional operators and diagonal scalings [40].
In 1993, Pavarino proved that the additive Schwarz method with a vertex-centered space decomposition and an additive coarse space of lowest-order () gives a robust solver with respect to and for symmetric and coercive problems [42]. This type of space decomposition is often referred to as generous overlap and is illustrated in Figure 1. We use the terminology of [19] and refer to the subdomains in this space decomposition as vertex-star patches, as they are constructed by taking all the degrees of freedom (DOFs) on the topological entities in the star of each vertex.
The most straightforward implementation of a vertex-star solver involves the assembly and direct factorization of the patch matrices (which are dense for Lagrange shape functions). This becomes prohibitively expensive at very high polynomial degrees, with the Cholesky factorization of such a matrix requiring operations. However, there exist bases for which the element matrices are sparse on affine cells, such as the hierarchical Lobatto shape functions [53]. In this basis, the stiffness matrix has a 5-point stencil in 2D, and a much larger 13-point stencil in 3D.
Efficient relaxation methods that are -robust may arise from the discretization of an auxiliary problem for which fast inversion techniques are available. For a more general approach to auxiliary space techniques, we refer to the work of Xu [56]. In our context, the underlying PDE and/or the domain can be replaced by those of a problem which is solvable by the method of separation of variables. The fast diagonalization method (FDM) [34] is a direct factorization that breaks the problem down into a sequence of one-dimensional subproblems.
For the Poisson equation discretized on meshes with all cells Cartesian (all internal angles are right angles), the vertex-star problems can be solved directly with the FDM [55]. Huismann et al. [25] introduced a remarkably fast solver with scaling on such meshes. The linear system is statically condensed by elimination of the cell DOFs, and the reduced system on the interface is solved with -multigrid and a restricted variant of the FDM onto the interface DOFs of a vertex-star. Since the statically condensed operator requires the exact inversion of the cell matrices, their approach has no obvious extension to the unstructured, non-Cartesian case.
The FDM can be applied as a relaxation by means of an auxiliary problem that is separable, but this requires a tensor product grid discretization of the patch, which is only possible when the cells are laid out in a tensor product grid [55]. When the cells are not Cartesian, the method of Witte et al. [55] approximates the whole patch as a single Cartesian domain and converges slowly even when the cells are slightly distorted. On general meshes, the patches may not have this structure, thus the FDM cannot be directly applied on such patches. An example of a vertex-star patch to which the FDM cannot be applied as a relaxation is shown in Figure 2.
A popular alternative in the literature has been to use cell-centered patches with minimal overlap by including a few layers of DOFs of the neighboring cells [20, 48, 52]. This can be done in such a way that every patch remains structured. This kind of space decomposition is more amenable to fast implementation, but does not give a -robust solver. If the number of layers is fixed, then the measure of the overlap region will decrease as is increased. Pavarino also proved that when the overlap is not generous, the rate of convergence of the additive Schwarz method will depend inversely on the overlap size [43]. To overcome this, Fischer and Lottes [33] applied a hybrid -multigrid/Schwarz method, in the context of a Poisson problem. They implemented several levels of -multigrid to overcome the lack of -robustness of the cell-centered patches with minimal overlap. The use of multiple levels increases the overlap at the coarser levels with a minor impact on the overall computational cost. Cell-centered patches without overlap have also been employed for non-symmetric problems [45, 46, 17].
Instead of replacing the vertex-star patches with cell-centered ones, the alternative FEM-SEM preconditioner [40] rediscretizes the problem on each vertex-star patch with on a GLL grid, a mesh with vertices at the DOFs of the high-order space. The theory behind this guarantees the spectral equivalence between the differential operator discretized on the two spaces [13]. Since low-order methods are naturally sparse, this approach is not constrained to Cartesian cells and can deal properly with mixed first derivatives that the FDM cannot handle. A downside of this approach is that the Cholesky factors of the patch matrices are quite dense, limiting its scalability to very high polynomial degree. Computationally advantageous approaches involve incomplete factorizations of the patch FEM-SEM matrices [44], or the use of algebraic multigrid on the global FEM-SEM operator [9].
In this work we develop a solver for vertex-star patches that scales to very high polynomial degree. Our approach does not rely on a particular structure of the patch. In particular, it applies to the patches shown in Figure 2. The key idea is to use the FDM to numerically construct a basis of functions on an interval that diagonalizes the interior blocks of both the stiffness and mass matrices in one dimension. When the problem is assembled with respect to the associated tensor product basis, the resulting stiffness and mass matrices are sparse, in the Cartesian case. In particular, the total number of non-zeros is the same as that of a low-order finite difference discretization of the Laplacian. Moreover, fill-in in the Cholesky factorization is only introduced for the interface DOFs, resulting in very sparse Cholesky factors. The factorization requires operations, while forward and back-substitution steps have a cost of operations that is optimal for , in contrast with the and costs of the naïve approach. A disadvantage of the approach is that the memory required scales like , instead of the optimal required for storing the solution. In the non-Cartesian case, we approximate the form with one that is separable in the reference problem. Robustness with respect to and should follow from the spectral equivalence between the forms, and numerical experiments indicate that the approach is effective when the cells are moderately deformed.
We demonstrate the effectiveness of our approach by applying it to a conforming discretization of a mixed formulation of incompressible linear elasticity. We present a sequence of problems of increasing complexity building up to this. In Section 2 we present the standard -FEM formulation for the Poisson problem and construct a solver based on a sparse discretization of an auxiliary locally separable PDE that employs the numerically computed FDM basis. In Section 3 we consider the application of our solver to linear elasticity. In the primal formulation, although our approach can be applied to patch problems for the individual components of displacement, we explain why it cannot be applied to the coupled vector-valued problem, which is necessary for parameter-robustness in the incompressible regime. We therefore consider a mixed formulation instead. Developing a -robust solver requires both a -robust preconditioner and a -robust discretization, and for the latter we choose a conforming approach. We then extend the method to symmetric interior-penalty discontinuous Galerkin discretizations, required for the displacement block of the mixed problem. We apply our relaxation to the displacement block of the incompressible elasticity system in conjunction with block-preconditioned Krylov methods. We end with conclusions in Section 4.
2 Sparse Poisson solver
We will first describe a solver for the Poisson equation on Cartesian cells, which will subsequently be extended as a preconditioner for more general symmetric coercive problems on non-Cartesian cells.
2.1 Continuous Galerkin formulation
We start from the standard weak formulation of the Poisson equation. Consider a bounded domain , , and let be the part of the boundary where the Dirichlet boundary condition is prescribed. The problem is to find in such that
(2.1) |
where
(2.2) |
The standard FEM discretization employs a mesh of . In this work we consider quadrilateral and hexahedral cells, so that each cell can be mapped with a diffeomorphism from the reference hypercube , where is the reference interval. The approximate solution is sought in the space of piecewise continuous tensor product polynomials on each cell, i.e. . We first define the space of shape functions on
(2.3) |
and via composition with , we define
(2.4) |
Once we fix a basis for , the approximate solution is expanded as . The resulting system of linear equations is
(2.5) |
where is the stiffness matrix, is the vector of coefficients, and is the load vector.
We recall the standard construction of the basis [26]. The basis is defined in terms of shape functions on . Given shape functions for , a tensor product basis for can be constructed as
(2.6) |
where we have expanded as a multi-index.
The interval shape functions are decomposed into interface and interior modes. The interface modes have non-zero support on either endpoint of , while the interior modes vanish at the boundary of . In multiple dimensions, the shape functions decompose into interior, facet, edge, and vertex modes, depending on how many 1D interface functions are multiplied together. To generate a basis, we simply match the shape of individual interface modes. Hence, will be block sparse, since when and correspond to interior modes supported on different cells.
For the interval shape functions, one standard choice is the set of Lagrange polynomials on the Gauß–Lobatto–Legendre (GLL) nodes . These nodes are the roots of , where is the Legendre polynomial of degree . The Lagrange polynomials satisfy by construction,
(2.7) |
Another useful basis is formed by the hierarchical Lobatto shape functions , which are constructed by augmenting the so-called bubble functions (integrated Legendre polynomials) with linear Lagrange functions,
(2.8) |
These two choices of shape functions are plotted in Figure 3.
The assembly of the stiffness matrix is described as follows. On each cell we define the cell stiffness matrix in terms of the basis functions that are supported on , which are obtained from the reference shape functions via . Then, the cell stiffness matrices are
(2.9) |
The global stiffness matrix is then assembled via direct stiffness summation:
(2.10) |
where is the Boolean restriction matrix from the global DOFs to those local to cell .
2.2 Tensor product structure on Cartesian cells
If and is an affine mapping, then the cell stiffness matrices are
(2.11) |
Here is the stiffness matrix on the reference interval, and , where denotes the measure of the cell .
For , we will first consider the case where can be tessellated with a mesh consisting of Cartesian cells, i.e. the cells are rectangular quadrilaterals or hexahedra (all internal angles are right angles). In this idealized setting, the cell stiffness matrices are separable in the reference coordinates
(2.12) |
where
(2.13) |
is the mass matrix on the reference interval, , and is the length of along the -th axis divided by . The symbol denotes the Kronecker product, which for matrices , is defined as the block matrix
(2.14) |
It follows that if and are sparse, then is also sparse.
For the GLL basis , both and are dense, but these are sparse in the hierarchical basis . This is illustrated in Figure 3. On affine cells, the hierarchical basis yields a sparse stiffness matrix. The bubble functions , satisfy , and due to the orthogonality of the Legendre polynomials, the interior block of is diagonal. The only off-diagonal non-zeros in are due to the coupling between the interface modes , . Nevertheless, in order for this sparsity to propagate to higher dimensions, we would additionally wish that is also as sparse as possible. This is not quite the case for , as has two interior blocks with tri-diagonal structure, in the even-odd decomposition. Therefore, on a typical row, will have the structure of the 5-point stencil for and that of a 13-point stencil for .
2.3 The fast diagonalization method
Linear systems involving structured matrices such as that defined in (2.12) can be solved efficiently using the fast diagonalization method (FDM) [34]. This method reduces the computation into a sequence of eigenvalue problems on the interval in a similar fashion as the method of separation of variables. It requires a separable PDE and a tensor product basis; therefore it can only be applied on meshes or mesh patches with tensor product structure.
To illustrate the FDM, we consider solving a problem on the interior of a Cartesian cell, , where and is the Boolean restriction matrix onto the interior DOFs of . We may first solve the generalized eigenvalue problem on the interior of the reference interval
(2.15) |
in conjunction with the normalization condition . Here are the interior blocks of and , respectively, is the diagonal matrix of eigenvalues, and is the matrix of eigenvectors. The generalized eigenproblem (2.15) may be equivalently rewritten as
(2.16) |
The corresponding continuous problem is to find and , such that
(2.17) |
with solution , . Thus, when and are discretized with the GLL basis, approximates a discrete sine transform on the interior GLL nodes.
If is given by (2.12), then its inverse has the following diagonal factorization
(2.18) |
where
(2.19) |
Therefore, the solution of a system can be obtained with computational work.
The main limitation of this approach is that it does not generalize to terms that contain first derivatives, ruling out the possible extension to advection problems. Mixed first derivative terms are also very common in symmetric coercive problems, for instance, when the cells have non-orthotropic deformations, or for vector-valued operators that mix first derivatives of distinct vector components, such as ).
2.4 Sparse FDM discretization
Our key idea is to construct a new finite element basis on the interval, inspired by the FDM, which yields a sparse stiffness matrix. To construct this new basis, we solve (2.16) with and in the GLL basis. Then, we interpolate the eigenvectors of the FDM with polynomials that satisfy , for . The unisolvent dual basis of our proposed FDM discretization of consists of point evaluation at the vertices and integral moments against the orthogonal interior shape functions , . This construction ensures that and become diagonal under this basis. As an additional consequence, the 1D interface shape functions are also orthogonal to the interior ones, but not to each other. Hence becomes as sparse as possible in this new FDM basis. Figure 4 shows the shape functions and the non-zero structure of and for the FDM basis.
The stiffness matrix for a Cartesian cell in this FDM basis is also given by (2.12), and has structured sparsity, since is sparse when and are sparse. Consequently, the global matrix is also sparse. We can thus apply sparse direct factorization methods or other preconditioners such as the additive Schwarz method.
For problems that are separable in Cartesian coordinates, we obtain a solver that is reminiscent of the FDM. The multiplication times the Kronecker product of the matrices of eigenvectors is incorporated in the definition of the FDM shape functions. If for some reason the problem needs to be solved with any other basis, one can transform the linear system back and forth to the FDM basis via interpolation and restriction at each application of the method. We also replace the diagonal matrix of the traditional FDM with a sparse matrix, which can be assembled even on 3D meshes with a very large number of cells and for very high degree. This approach removes the requirement of a global tensor product grid while exploiting the separability of the PDE and the local structure of , at the expense of replacing the diagonal factor by a sparse matrix.
2.5 Hybrid -multigrid/Schwarz method
The solver of Pavarino is fully additive, across both the coarse grid and the vertex-star patches. In our work we consider a small variation of this, with the solver multiplicative between the two levels while remaining additive among the vertex-star patches. This improves the convergence at essentially no cost. The method can be interpreted as a hybrid multiplicative two-level -cycle with the additive Schwarz method [18] with vertex-star patches as the fine grid relaxation and the lowest-order discretization on the same mesh as the coarse space. The sparse matrix for the coarse space may be assembled and factorized, or other preconditioners such as geometric or algebraic multigrid may be applied instead. The vertex-star patch includes the degrees of freedom associated with vertex of and all cells, facets, and edges adjacent to (the topological entities in the star of the vertex, a standard concept in algebraic topology [38, §2]).
We may write the multigrid relaxation as
(2.20) |
with the Boolean restriction matrix onto , and are the sparse patch matrices for which we may explicitly compute a Cholesky decomposition. The relaxation is scaled by the damping coefficient
(2.21) |
where are the extremal eigenvalues of estimated via the CG-Lanczos procedure [30], and is chosen to tackle the high frequency error, also ensuring that the error iteration matrix is contractive.
To illustrate the direct solver on the Cartesian vertex-star patch shown in Figure 1a, we show in Figure 5 the non-zero structure for the patch matrix and its Cholesky factor. The sparsity pattern of the global matrix connects the interior DOFs to their projections onto the facets, hence a typical interior row of will have non-zeros. For the patch matrix , an interior row will only have non-zeros, as the patch only includes one facet per dimension on each cell. Moreover, the total number of non-zeros of is the same as that of a low-order finite difference or finite element discretization with the 5-point or 7-point stencil on the same grid.
2.6 Computational complexity
Here we discuss the computational cost of the solution of the patch problem using the Cholesky factorization. Once the factorization has been computed, it may be applied in cost, which is optimal for . Unfortunately, the factorization phase is suboptimal, requiring operations to compute. The memory required to store the Cholesky factor is , which for is one factor of higher than that required to store the solution.
Consider a stiffness matrix discretized with the FDM basis on any mesh with all cells Cartesian. The number of floating point operations (flops) needed to solve a linear system using a sparse Cholesky factorization is roughly four times the number of non-zero entries in [14]. To maximize the sparsity in , it is crucial to reorder the DOFs, such that interior DOFs are followed by the interface DOFs. This ensures that the fill-in is introduced only on the bottom-left block. To analyze the cost of factorization, we first introduce the block decomposition
(2.22) |
where is the interface Schur complement. By construction, the top-left block is diagonal with positive entries, with Cholesky factor . If we decompose and in the second matrix on the RHS of (2.22) into their Cholesky factors, and distribute each factor onto the other two matrices, we obtain the Cholesky decomposition of :
(2.23) |
where the Schur complement is factorized as . Since is diagonal, the off-diagonal block will preserve the non-zero structure of , and similarly for its transpose. Thus, fill-in is only introduced on the interface block through .
An ordering strategy that minimizes fill-in consists of applying nested dissection [21] on the adjacency graph that connects topological entities. Each node in this graph represents a cell, face, edge or vertex. The ordering of the entities is then used to permute the corresponding blocks in .
Assuming for the worst case that is dense, the memory required to store the Cholesky factor is . This represents a significant increase from the traditional FDM, which is kept at the optimal . However, the DOF ordering does lead to some structured sparsity in , as can be seen in Figure 5. Nevertheless, we still observe dense blocks. The fact that contains these dense blocks indicates that operations are required in the factorization phase. However, the forward and back-substitution steps have a computational cost of operations, which is optimal for .
Compared to the FEM-SEM approach, our approach with the FDM basis has a sparser Cholesky factorization. In Figure 6 we present the ratio of the number of non-zeros in the Cholesky factors of our approach and the FEM-SEM preconditioner ordered with nested dissection for . The fact that the ratio is always below 1 confirms that our approach is sparser, with a substantial gain at higher degrees. Practical FEM-SEM solvers use AMG [9] or patchwise multigrid with ILU smoothers [44] to avoid the cost of the Cholesky factorizations of the patch matrices.
2.7 Extension to non-Cartesian cells
For arbitrarily deformed cells, the local stiffness matrices cannot be expressed in terms of tensor products of and as in (2.12), and is not sparse in the FDM basis. The preconditioning techniques found in [16, 20, 55] introduce an auxiliary Cartesian domain to construct a separable problem for which the FDM is a direct solver. The method described by Fischer [20] constructs a preconditioner by replacing with its nearest rectangular approximation, whose dimensions are computed as the mean separation between the mapped GLL nodes from opposite facets of . Witte et al. [55] obtain the lengths from the average arclength of opposite sides of , but it is not clear how this extends to the 3D case. To the best of our knowledge, no theory underpins these choices.
Our approach to construct the separable surrogate is based on the theory of equivalent operator preconditioning [6]. We work with the bilinear form in terms of the reference coordinates. We discard the mixed derivative terms that prevent separability, and we replace the coefficients with piecewise constants in the reference coordinates111Recall that piecewise constant coefficients in the physical coordinates will not yield piecewise constant coefficients in the reference coordinates.. We will prove that this choice yields a spectrally equivalent operator.
The bilinear form can be expressed as a sum of cell contributions where integration and differentiation are with respect to . The measure is replaced by and the gradient is computed via the chain rule, since the arguments of the form become functions of after being composed with . Hence,
(2.24) |
where is the gradient with respect to , and is the inverse metric of the coordinate transformation weighted by the Jacobian determinant,
(2.25) |
This tensor encapsulates all of the geometry-dependent information in the form; it is spatially dependent for generally-deformed elements, and constant in the case of affine transformations. For a separable geometry, is diagonal, and thus for a Cartesian cell it is diagonal and constant. To construct an auxiliary problem that is separable by the FDM in the reference coordinates, we replace in with a constant diagonal approximation . Each is given by the cell-wise average of the diagonal entry ,
(2.26) |
where summation over the index is not implied. As the approximation is local to each cell, it is still possible to assemble a sparse stiffness matrix discretizing the auxiliary problem on meshes where cells are not structured in a tensor product grid.
We now establish the spectral equivalence between the original bilinear form and the auxiliary separable one.
Theorem 2.1.
Let be the constant diagonal approximation of , and define the auxiliary bilinear form
(2.27) |
Then, there exist -independent constants , that depend on through such that
(2.28) |
Proof 2.2.
Let be lower and upper bounds for the spectrum of the diagonally scaled metric, so that for all . We claim that
(2.29) |
This result is obtained by first rewriting with instead of , and then replacing with or to find the lower or upper bounds, respectively. It then follows that
(2.30) |
Let be the stiffness matrix associated with the auxiliary form . By Theorem 2.1, the condition number is bounded by independently of . Numerical experiments also indicate that is independent of under uniform refinements. Now consider a preconditioner where the auxiliary form is used additively in both the coarse solve and the vertex-star patches. In this case, Theorem 1 of [42] guarantees that is bounded from above independently of and . Hence we may conclude that is bounded independently of and . In practice, we expect that using multiplicative coarse grid correction with the original form can only improve the preconditioner.
To gain useful insight, we consider the case where and is an affine transformation, that is when is a parallelogram. Without loss of generality, suppose that one of the sides of has length and is aligned with the first reference coordinate axis, and the other side of length is at an angle with respect to the same axis. The Jacobian of the coordinate transformation is
(2.31) |
to which corresponds the Jacobian-weighted inverse metric
(2.32) |
Since is constant, is simply the diagonal part of . The spectrum of the diagonally scaled metric will be independent of and , but still depend on ,
(2.33) |
This spectrum is desirable because it is centered at 1 and bounded above for all . If we follow the geometric approaches of [20, 55], we would have to choose a rectangle with side lengths and as the auxiliary domain for the Poisson problem. Then, the previous bounds (2.33) would become scaled by . In this case, the spectrum is unbounded from above in the limit .
2.8 Numerical experiments
We provide an implementation of the element with the FDM shape functions on the interval in the FIAT [28] package. The extension to quadrilaterals and hexahedra is achieved by taking tensor products of the one-dimensional element with FInAT [23]. Code for the sum-factorized evaluation of the residual is automatically generated by Firedrake [47, 24], implementing a Gauß–Lobatto quadrature rule with points along each direction. The sparse preconditioner discretizing the auxiliary form is implemented in firedrake.FDMPC using PETSc [8]. The Cholesky factorization of the patch matrices is computed using CHOLMOD [14]. Most of our computations were performed using an Intel Xeon CPU E5-4627 v2 3.30GHz with 32 cores and 67.6 GB of RAM storage.
The hybrid -multigrid/Schwarz solver employing the FDM/sparse relaxation is illustrated in Figure 7. To achieve scalability with respect to the mesh parameter , on the -coarse problem we employ geometric multigrid with damped point-Jacobi relaxation and a Cholesky factorization on the coarsest level using MUMPS [2]. We test the effectiveness of this approach on a hierarchy of meshes obtained by uniform refinements of the base meshes shown in Figure 8.
We present results for the Poisson equation in discretized on the three hierarchies of Cartesian, unstructured, and structured but deformed (Kershaw) [27] meshes. The coordinate field of the Kershaw mesh is in , with a cell aspect ratio of near the corners of the domain. We impose homogeneous Dirichlet BCs on and a constant forcing . In Table 1 we present PCG iteration counts required to reduce the Euclidean norm of the residual by a factor of starting from a zero initial guess. In Table 2 we show the condition number estimated by CG-Lanczos. The results show almost complete - and -robustness in the Cartesian case, and very slow growth of iteration counts in the unstructured case. Given the lack of shape regularity, the Kershaw mesh is significantly more challenging; even with exact patch solvers, we do not expect - or -robustness.
Cartesian | Unstructured | Kershaw | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 0 | 1 | 2 | 0 | 1 | 2 | ||
2 | 3 | 7 | 8 | 9 | 12 | 13 | 14 | 27 | 35 | 54 |
7 | 8 | 8 | 9 | 16 | 16 | 17 | 44 | 56 | 78 | |
15 | 8 | 9 | 9 | 19 | 19 | 19 | 58 | 69 | 90 | |
31 | 8 | 9 | 9 | 21 | 20 | 21 | 67 | 80 | 97 | |
3 | 3 | 12 | 12 | 12 | 17 | 17 | 18 | 54 | 66 | 102 |
7 | 12 | 12 | 12 | 22 | 21 | 21 | 98 | 106 | 158 | |
15 | 12 | 13 | 25 | 24 | 131 | 132 |
Cartesian | Unstructured | Kershaw | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 0 | 1 | 2 | 0 | 1 | 2 | ||
2 | 3 | 1.44 | 1.49 | 1.50 | 2.14 | 2.37 | 2.81 | 9.34 | 15.6 | 34.5 |
7 | 1.48 | 1.48 | 1.50 | 3.23 | 3.27 | 3.79 | 19.6 | 30.3 | 57.6 | |
15 | 1.51 | 1.51 | 1.52 | 4.06 | 3.78 | 4.13 | 30.5 | 45.8 | 69.0 | |
31 | 1.54 | 1.52 | 1.52 | 4.45 | 4.06 | 4.36 | 40.4 | 57.1 | 73.3 | |
3 | 3 | 2.87 | 2.49 | 2.45 | 4.16 | 4.21 | 4.55 | 34.8 | 46.1 | 117 |
7 | 2.79 | 2.70 | 2.67 | 5.88 | 5.54 | 5.47 | 100 | 110 | 266 | |
15 | 2.83 | 2.79 | 7.12 | 6.44 | 165 | 151 |
To assess the computational performance of our approach, we solve the three-dimensional Poisson equation on a Cartesian mesh with cells with a single core of an Intel Core i7-10875H CPU 2.30GHz. We plot in Figure 9 the runtimes, flop counts, and achieved arithmetic performance for the Cholesky factorization of the patch matrices, the solution of the patch problems using this factorization (per application of the relaxation), and the matrix-free evaluation of the residual (per Krylov iteration, excluding the application of the global to local map) as functions of . The dotted lines are to indicate powers of , which is the number of DOFs along each side of typical vertex-star patch not intersecting the mesh boundary.
Despite the computational cost of the Cholesky factorization, these results show scaling for runtime up to . This speedup can be explained mainly by data locality. The sparse Cholesky factorization is obtained by recursively applying the block decomposition up to the point where the Schur complement is sufficiently dense. The computation of this Schur complement via dense matrix-matrix multiplication (BLAS-3 dgemm) dominates the computational cost. As is increased, the utilization of arithmetic units increases in proportion to the dimension of the Schur complement, which explains the scaling of the achieved arithmetic performance. As the arithmetic capabilities become saturated for , the scaling in the factorization runtime should become apparent.
Most of the time in the relaxation step is spent in accessing the factor matrix from memory, given the sub-optimal storage per patch. The relaxation is therefore limited by memory bandwidth and not arithmetically intense, which explains the poor arithmetic performance. This is in contrast to the sum-factorized residual evaluation, which has a memory footprint and presents better arithmetic utilization [29]. Nevertheless, the results indicate that the runtime for the solution of the patch problems with the sparse Cholesky factorization remains very close to that of the matrix-free residual evaluation for moderate , being slightly faster for , mainly due to lower operation count.
3 Application to linear elasticity problems
3.1 Primal formulation of linear elasticity
We now consider how these ideas may be applied in the more complex setting of a nonseparable, vector-valued PDE. The equations of linear elasticity describe the displacement of a solid body with a reference configuration . The primal formulation is to find such that
(3.1) |
where
(3.2) |
Here, we assume that the material is homogeneous and isotropic, and can thus be described by the Lamé parameters ; is the linearized strain tensor; is Dirichlet data prescribed on ; and is a body force. The Poisson ratio measures the compressibility of the material. In the incompressible limit (i.e. ), the problem becomes ill-conditioned, as becomes insensitive to divergence-free perturbations in the arguments.
Consider the partitioning of the stiffness matrix into blocks that act on each displacement component,
(3.3) |
The diagonal block discretizes the bilinear form
(3.4) |
where summation is not implied, and and are components of and , respectively. The off-diagonal blocks , , discretize
(3.5) |
The diagonal blocks can be diagonalized by the FDM on the interior of a Cartesian cell when the reference axes are aligned with the physical coordinates. The same statement does not hold true for the off-diagonal blocks, as they couple together different displacement components. This is because they discretize products of different first derivatives on the different components and hence are not separable.
The separate displacement components (SDC) preconditioner [11, 22] is defined as the block diagonal matrix . In other words, this approach is also described as block-Jacobi in the displacement components. The SDC preconditioner discretized with the FDM basis is sparse for Cartesian cells aligned with the coordinate axes. On arbitrary cells, for each separate component, we obtain an auxiliary form that is separable in the reference coordinates by selecting constant diagonal coefficients .
It is shown in [11] that for a homogeneous isotropic material with principal axes parallel to the axes of the reference coordinate system, the condition number of the preconditioned matrix will depend on the Poisson ratio:
(3.6) |
where is the constant appearing in Korn’s inequality,
(3.7) |
Thus, the convergence rate of the SDC preconditioner will deteriorate for approaching , the so-called nearly incompressible case.
We consider the reference configuration discretized on a Cartesian mesh with 8 cells along each direction. We specify , a uniform downwards body force , and homogeneous Dirichlet BCs on . In Table 3 we present the PCG iteration counts required to reduce the Euclidean norm of the residual by a factor of starting from a zero initial guess. As the preconditioner, we employ the hybrid -multigrid/Schwarz method with vertex-star patches and the SDC/FDM/sparse relaxation and a coarse space with . As expected from (3.6), the results confirm that the approach is reasonably -robust, but that robustness with respect to cannot be achieved with SDC relaxation on vertex-star patches.
2 | 3 | 13 | 14 | 24 | 70 | 199 |
---|---|---|---|---|---|---|
7 | 17 | 17 | 28 | 76 | 236 | |
15 | 18 | 19 | 30 | 81 | 249 | |
31 | 20 | 20 | 32 | 84 | 258 | |
3 | 3 | 20 | 22 | 39 | 114 | 362 |
7 | 25 | 28 | 48 | 123 | 381 | |
15 | 27 | 29 | 51 | 125 | 373 |
3.2 Mixed FEM formulations of linear elasticity
In order to avoid locking in nearly incompressible continua, or impose the incompressibility constraint, the standard approach is to introduce a pressure-like variable and discretize with a mixed FEM. This is expressed by the weak formulation: find such that
(3.8) | |||||
(3.9) |
where
(3.10) |
and for and , or otherwise.
In order for this problem to have a unique solution, we require the well-known Brezzi conditions: the solution for is unique if is coercive on the kernel of , and the solution for is unique if there exists a right inverse for . This is expressed in the so-called inf–sup condition or LBB condition [7, 12]: there exists , which might depend on , such that
(3.11) |
After selecting suitable finite dimensional subspaces , , we obtain a system of linear equations with the saddle point structure
(3.12) |
We require the analogous Brezzi conditions for the discrete problem: that is coercive on the discrete kernel of , and that there exists a discrete inf-sup constant independent of the mesh but possibly depending on such that
(3.13) |
The discretization must be chosen carefully to satisfy these conditions; the discrete inf-sup condition will not be satisfied by arbitrary discretizations. If they are, we have the well-known error estimates
(3.14a) | |||
(3.14b) |
where are generic constants independent of the mesh parameter . For the use of high-order discretizations, it is desirable to choose an element pair where does not decrease as the polynomial degree of the approximation is increased. Such a discretization is referred to as -stable.
In fact, -stability is important for solvers also. Approaches based on block-Gaußian elimination, such as the Uzawa algorithm [5] and block-preconditioned MINRES [41], require preconditioners for the negative pressure Schur complement . It is well known that for the Stokes system, the continuous analogue of , , is well approximated by the identity operator [51]. It follows that is spectrally equivalent to the pressure mass matrix, ,
(3.15) |
The rate of convergence of block-preconditioned MINRES will be determined by the ratio . For Stokes flows with pure Dirichlet BCs, , and otherwise. In general, we have . Since is spectrally equivalent to the vector Laplacian, these results also hold for linear elasticity. We may expect solvers based on such techniques to degrade with -refinement if the discretization is not -stable.
If we choose to work with the -conforming space , some standard inf–sup stable choices for are , and . denotes discontinuous piecewise polynomials of degree at most in each direction, while denotes discontinuous piecewise polynomials of total degree at most . The choice gives rise to the high-order generalization of the Taylor–Hood mixed element [54]. Here, will not be block diagonal, and hence more expensive preconditioning techniques will be required. Moreover, it is shown numerically in [1] that the Taylor–Hood element is not -stable. The choice exhibits an asymptotic decay of as [10], and thus is not -stable. In practice, it is observed that this is quite a pessimistic bound for moderate [35]. The choice is -stable, but numerical experiments reveal that the stability is severely affected by the cell aspect ratio, unlike the previous two choices [50]. Moreover this last space does not have tensor product shape functions, so its efficient implementation becomes challenging.
To construct -stable discretizations that are also robust to cell aspect ratio, we turn to nonconforming schemes with [15, 31]. In particular we consider the use of Raviart–Thomas elements [4] of degree for for the displacement, paired with . This pair satisfies , which enforces the incompressibility constraint (3.9) exactly in the numerical approximation for . The Raviart–Thomas elements are defined on the reference quadrilateral as
(3.16) |
The analogous element in three dimensions is referred to as the Nédélec face element [39]. The definition can be extended to curvilinear cells via the contravariant Piola transform: for a function , we define as
(3.17) |
and set
(3.18) |
These elements have superb properties, but their nonconforming nature must be suitably addressed in the discretization. They only impose continuity of the normal components of across cell facets, and we therefore weakly enforce the tangential continuity via the symmetric interior penalty (SIPG) method [3]. The use of SIPG for the displacement requires further extension of the FDM/sparse relaxation; in particular, we must consider the additional facet integrals arising in the method, and show that the stiffness remains sparse.
3.3 Extension to interior penalty DG methods
Interior penalty discontinuous Galerkin (IP-DG) methods relax the continuity requirement of the discretization space. For instance, instead of , we consider a larger function space with weaker continuity, such as or . As previously mentioned, in order to deal with the non-conformity, -continuity is weakly enforced via the introduction of a penalty term on the set of interior facets of the mesh that vanishes for -continuous functions. Similarly, the weak prescription of the Dirichlet BC on is achieved by introducing a penalty term on .
We consider the following SIPG formulation:
(3.19) |
(3.20) |
Here is a linear viscous flux. For the vector Poisson equation, the viscous flux is given by . For the primal formulation of linear elasticity, the viscous flux corresponds to the stress tensor . For the mixed formulation of linear elasticity, the -block of the system has viscous flux .
From left to right, the terms in the surface integral in (3.19) are referred to as the penalty, consistency, and adjoint consistency terms. The quantity is known as the homogeneity tensor,
(3.21) |
for which we define the adjoint product with
(3.22) |
The average and jump operators are defined for scalar, vector, and tensor arguments as follows. Let be a facet of the mesh. For an interior facet, let and be the two mesh cells that share it, and let and be the traces of a function on from and , respectively. On each facet we define
(3.23) |
In order to ensure coercivity of as we do or refinement, the penalty term must be sufficiently large. The penalty coefficient must be chosen inversely proportional to the normal spacing of GLL nodes near the facet, i.e. [37]. For the reciprocal length scale in the direction normal to facet we use
(3.24) |
The stiffness matrix that corresponds to in the SIPG formulation is obtained via direct stiffness summation over the cells and facets:
(3.25) |
where is the cell matrix discretizing the volume integral in , is the Boolean restriction onto the DOFs of , is the facet matrix discretizing the surface integral on , and is Boolean restriction onto the DOFs of the cells sharing facet .
To illustrate the extension of our approach to the SIPG discretization, we consider again the scalar Poisson equation. The discrete problem is to find . On Cartesian cells, both and have a tensor product structure of the form (2.12), with matrices of operators on the interval that can be sparsified by the FDM. To illustrate this, suppose that, for , reorders the DOFs such that the cells and share along the -th reference coordinate axis, while leaving the other axes consistently oriented on both cells. The facet matrices are
(3.26) |
where the interval facet matrix is defined in terms of the coefficients appearing in (2.12), the 1D shape functions , and their normal derivatives on (the usual derivative with a sign). When , is given by
(3.27) |
Here , where is the reference coordinate normal to , and describes the facet as the image of the plane under . When , is a block matrix with blocks , , given by
(3.28) |
Here , where and are the reference directions normal to on and , respectively. Similarly, the facet is the image of under and that of under .
Some implementations of do not feature an interior-interface decomposition and use the Gauß–Legendre (GL) nodal shape functions. The GL nodes do not include the endpoints, thus all shape functions have non-zero support at the facets, causing to be dense. The matrices are sparse for a basis with an interior-interface decomposition, such as the GLL Lagrange polynomials , the hierarchical Lobatto polynomials , and the FDM polynomials . Since is non-zero for a single , each term in (3.28) and (3.27) corresponds to a non-zero entry, a non-zero row, and a non-zero column of , respectively, as seen in Figure 10.
Instead of diagonalizing the SIPG patch matrix as in [55], our FDM-based approach produces a sparse matrix with diagonal interior blocks on (possibly) unstructured vertex-star patches. Figure 11 shows the sparsity pattern of the matrix for the SIPG formulation of the Poisson equation on a Cartesian vertex-star patch in the FDM basis, along with its Cholesky factor. Here the matrix size is increased from DOFs in the CG case to . At low polynomial degrees, the interface DOFs form a large fraction of the total number, but the proportion decreases as increases. The computational complexity analysis of Section 2.6 carries over to the SIPG case.
For a general viscous flux, we construct an auxiliary separable form by expressing the cell integrals in terms of the reference coordinates
(3.29) |
where is the homogeneity tensor in the reference coordinates,
(3.30) |
The auxiliary form is constructed by approximating with a piecewise constant tensor that discards the entries where or . The corresponding cell stiffness matrices become sparse in the FDM basis, and have a similar form as (2.12), except that the coefficients are diagonal matrices that multiply each term through an additional Kronecker product. Hence, we expect that preconditioners based on the auxiliary form to be limited by the coupling between vector components, by the mesh geometry, and by how varies within .
This approach carries over to non-Cartesian cells for DG discretizations of the Poisson equation in the same way as the CG case. Unfortunately, the extension of our approach to vector-valued problems in on non-Cartesian cells does not yield a good preconditioner. In this setting, we construct a block diagonal preconditioner separating the components of the DOFs, which are in the reference coordinates. For the vector Poisson problem on non-Cartesian cells, the Piola transform introduces off-diagonal contributions from the volume and surface terms, which does not occur on non-Piola-mapped elements, such as and . The excluded terms are required for the surface integral terms to vanish for arguments with continuity, and without them the preconditioner might become indefinite on non-Cartesian cells.
3.4 Results for mixed formulations of linear elasticity
We consider the same problem as in Table 3, with both a conforming discretization and a non-conforming discretization. For the -conforming discretization, the normal components of the Dirichlet BCs are enforced strongly, while the tangential components of the BCs are weakly enforced with SIPG. Enforcing the normal conditions strongly is crucial for achieving a divergence-free solution in the Stokes limit . For the penalty coefficient, we use . We restrict our experiment to Cartesian cells, so that the FDM/sparse relaxation is applicable to the -conforming discretization.
We iteratively solve the discrete system via MINRES with a symmetric positive definite block diagonal preconditioner,
(3.31) |
Here is a preconditioner for the displacement block , and is a preconditioner for the scaled pressure mass matrix . For we employ the hybrid -multigrid/Schwarz method with the SDC/FDM/sparse relaxation and as the coarse space. In our tests, we discretize the pressure space with the GL basis, and employ point-Jacobi on the pressure mass matrix, i.e. . When consists of Cartesian cells, in the GL basis. The solver is illustrated in Figure 12.
In Table 4 we present MINRES iteration counts for the same configuration considered in Table 3 in Section 3.1, using the and elements, respectively. Both discretizations yield robust iteration counts with respect to ; the iterations grow with the former discretization much more quickly than the latter, especially in 3D.
2 | 3 | 28 | 40 | 43 | 43 | 43 | 25 | 36 | 39 | 40 | 40 |
---|---|---|---|---|---|---|---|---|---|---|---|
7 | 31 | 45 | 50 | 51 | 51 | 28 | 40 | 43 | 45 | 45 | |
15 | 34 | 50 | 57 | 57 | 57 | 30 | 43 | 48 | 48 | 48 | |
31 | 36 | 53 | 64 | 65 | 65 | 31 | 45 | 51 | 51 | 51 | |
3 | 3 | 44 | 67 | 75 | 76 | 76 | 34 | 50 | 55 | 56 | 56 |
7 | 50 | 83 | 96 | 97 | 98 | 39 | 58 | 63 | 65 | 65 | |
15 | 53 | 88 | 111 | 118 | 119 | 41 | 63 | 70 | 70 | 70 |
The solver configuration shown in Figure 12 is optimized for memory usage, employing a block diagonal preconditioner so that the short-term recurrences of MINRES may be exploited. If one is willing to trade memory for time, one may consider an alternative configuration shown in Figure 13 employing right-preconditioned GMRES [49] with a block upper triangular preconditioner,
(3.32) |
which requires a single application each of , , and per GMRES iteration. The GMRES iteration counts are presented in Table 5.
2 | 3 | 17 | 23 | 25 | 26 | 26 | 14 | 20 | 22 | 22 | 22 |
---|---|---|---|---|---|---|---|---|---|---|---|
7 | 18 | 25 | 27 | 28 | 28 | 16 | 22 | 24 | 24 | 24 | |
15 | 20 | 27 | 33 | 34 | 34 | 17 | 23 | 26 | 26 | 26 | |
31 | 21 | 30 | 38 | 38 | 39 | 18 | 24 | 28 | 28 | 28 | |
3 | 3 | 24 | 32 | 33 | 34 | 34 | 17 | 23 | 25 | 25 | 25 |
7 | 27 | 35 | 38 | 38 | 38 | 21 | 26 | 29 | 29 | 29 | |
15 | 28 | 38 | 44 | 46 | 46 | 22 | 28 | 31 | 31 | 32 |
In Table 6 we study the performance of our solver on an unstructured mesh. We consider the discretization of incompressible linear elasticity (). We prescribe , a uniform downwards body force , and homogeneous Dirichlet BCs on the displacement on the holes of the domain. The three-dimensional mesh is obtained via extrusion by 16 layers of the two-dimensional mesh. The iteration counts follow the same pattern as before for this element: they are not -robust as expected, but they remain modest even at very high degrees.
# DOFs | Iter. | ||
---|---|---|---|
2 | 3 | 17 466 | 26 |
7 | 104 250 | 29 | |
15 | 499 002 | 35 | |
31 | 2 173 242 | 44 | |
3 | 3 | 588 927 | 39 |
7 | 7 876 575 | 47 | |
11 | 31 236 927 | 51 |
4 Conclusion
We have introduced a fast relaxation method required for unstructured vertex-centered patch problems arising in Pavarino’s approach, extending its practicality to much higher polynomial degrees. Our method relies on a spectrally equivalent form, constructed such that it is separable in the reference coordinates. We show promising results for the Poisson equation and mixed formulations of linear elasticity. A downside of the approach is its narrow applicability; it will not be effective on more general problems, especially for those where the dominant terms include mixed derivatives and mixed vector components. In addition, our method relies on having a good quality mesh, with its performance depending on the minimal angle; however, mesh generators with guarantees on the minimal angle are available in two dimensions [32]. So far, we have only considered constant-coefficient problems, but the theory of [6] suggests that our approach would remain effective for spatially varying coefficients. Work in progress shows that the memory bandwidth limitations and the suboptimal complexity of the sparse Cholesky factorization can be overcome by iterative patch solvers, such as incomplete factorizations and algebraic multigrid.
References
- [1] M. Ainsworth, P. Coggins, and B. Senior, Mixed -finite element methods for incompressible flow, Chapman and Hall CRC Research Notes in Mathematics, (2000), pp. 1–20.
- [2] P. R. Amestoy, I. S. Duff, J.-Y. L’Excellent, and J. Koster, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM J. Matrix Anal. Appl., 23 (2001), pp. 15–41.
- [3] D. N. Arnold, An interior penalty finite element method with discontinuous elements, SIAM J. Numer. Anal., 19 (1982), pp. 742–760.
- [4] D. N. Arnold, D. Boffi, and R. S. Falk, Quadrilateral (div) finite elements, SIAM J. Numer. Anal., 42 (2005), pp. 2429–2451.
- [5] K. J. Arrow, L. Hurwicz, and H. Uzawa, Studies in Non-linear Programming, Stanford University Press, 1958.
- [6] O. Axelsson and J. Karátson, Equivalent operator preconditioning for elliptic problems, Numer. Algorithms, 50 (2009), pp. 297–380.
- [7] I. Babuška, The finite element method with Lagrangian multipliers, Numer. Math., 20 (1973), pp. 179–192.
- [8] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. D. Gropp, D. Karpeyev, D. Kaushik, M. G. Knepley, D. A. May, L. C. McInnes, R. T. Mills, T. Munson, K. Rupp, P. Sanan, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang, PETSc users manual, Tech. Report ANL-95/11 - Revision 3.11, Argonne National Laboratory, 2019.
- [9] P. D. Bello-Maldonado and P. F. Fischer, Scalable low-order finite element preconditioners for high-order spectral element Poisson solvers, SIAM J. Sci. Comput., 41 (2019), pp. S2–S18.
- [10] C. Bernardi, Y. Maday, and B. Métivet, Spectral approximation of the periodic-nonperiodic Navier-Stokes equations, Numer. Math., 51 (1987), pp. 655–700.
- [11] R. Blaheta, Displacement decomposition—incomplete factorization preconditioning techniques for linear elasticity problems, Numer. Lin. Algebra Appl., 1 (1994), pp. 107–128.
- [12] F. Brezzi, On the existence, uniqueness and approximation of saddle-point problems arising from Lagrangian multipliers, ESAIM Math. Model. Num., 8 (1974), pp. 129–151.
- [13] C. Canuto, Stabilization of spectral methods by finite element bubble functions, Comput. Methods Appl. Mech. Engrg., 116 (1994), pp. 13–26.
- [14] Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam, Algorithm 887: CHOLMOD, Supernodal Sparse Cholesky Factorization and Update/Downdate, ACM Trans. Math. Softw., 35 (2008).
- [15] B. Cockburn, G. Kanschat, and D. Schötzau, A note on discontinuous Galerkin divergence-free solutions of the Navier–Stokes equations, J. Sci. Comput., 31 (2007), pp. 61–73.
- [16] W. Couzy and M. O. Deville, A fast Schur complement method for the spectral element discretization of the incompressible Navier-Stokes equations, J. Comp. Phys., 116 (1995), pp. 135 – 142.
- [17] L. T. Diosady and S. M. Murman, Scalable tensor-product preconditioners for high-order finite-element methods: Scalar equations, J. Comp. Phys., 394 (2019), pp. 759–776.
- [18] M. Dryja and O. Widlund, An additive variant of the Schwarz alternating method for the case of many subregions, Technical Report 339, Ultracomputer Note 131, Department of Computer Science, Courant Institute, 1987.
- [19] P. E. Farrell, M. G. Knepley, L. Mitchell, and F. Wechsung, PCPATCH: Software for the Topological Construction of Multigrid Relaxation Methods, ACM Trans. Math. Softw., 47 (2021).
- [20] P. F. Fischer, H. M. Tufo, and N. I. Miller, An overlapping Schwarz method for spectral element simulation of three-dimensional incompressible flows, in Parallel Solution of Partial Differential Equations, Springer, 2000, pp. 159–180.
- [21] A. George, Nested dissection of a regular finite element mesh, SIAM J. Numer. Anal., 10 (1973), pp. 345–363.
- [22] I. Gustafsson and G. Lindskog, On parallel solution of linear elasticity problems: Part I: theory, Numer. Lin. Algebra Appl., 5 (1998), pp. 123–139.
- [23] M. Homolya, R. C. Kirby, and D. A. Ham, Exposing and exploiting structure: optimal code generation for high-order finite element methods, arXiv preprint arXiv:1711.02473, (2017).
- [24] M. Homolya, L. Mitchell, F. Luporini, and D. A. Ham, TSFC: a structure-preserving form compiler, SIAM J. Sci. Comput., 40 (2018), pp. C401–C428.
- [25] I. Huismann, J. Stiller, and J. Fröhlich, Scaling to the stars–a linearly scaling elliptic solver for -multigrid, J. Comp. Phys., 398 (2019), p. 108868.
- [26] G. Karniadakis and S. Sherwin, Spectral/ element methods for computational fluid dynamics, Oxford University Press, 2013.
- [27] D. S. Kershaw, Differencing of the diffusion equation in Lagrangian hydrodynamic codes, J. Comp. Phys., 39 (1981), pp. 375–395.
- [28] R. C. Kirby, Algorithm 839: FIAT, a new paradigm for computing finite element basis functions, ACM Trans. Math. Soft., 30 (2004), pp. 502–516.
- [29] M. Kronbichler and W. A. Wall, A Performance Comparison of Continuous and Discontinuous Galerkin Methods with Fast Multigrid Solvers, SIAM J. Sci. Comput., 40 (2018), pp. A3423–A3448.
- [30] C. Lanczos, An iteration method for the solution of the eigenvalue problem of linear differential and integral operators, (1950).
- [31] P. L. Lederer and J. Schöberl, Polynomial robust stability analysis for -conforming finite elements for the Stokes equations, IMA J. Numer. Anal., 38 (2018), pp. 1832–1860.
- [32] X. Liang and Y. Zhang, Hexagon-based all-quadrilateral mesh generation with guaranteed angle bounds, Comput. Methods Appl. Mech. Engrg., 200 (2011), pp. 2005–2020.
- [33] J. W. Lottes and P. F. Fischer, Hybrid multigrid/Schwarz algorithms for the spectral element method, J. Sci. Comput., 24 (2005), pp. 45–78.
- [34] R. E. Lynch, J. R. Rice, and D. H. Thomas, Direct solution of partial difference equations by tensor product methods, Numer. Math., 6 (1964), pp. 185–199.
- [35] Y. Maday, D. Meiron, A. T. Patera, and E. M. Rønquist, Analysis of iterative methods for the steady and unsteady Stokes problem: Application to spectral element discretizations, SIAM J. Sci. Comput., 14 (1993), pp. 310–337.
- [36] J.-F. Maitre and O. Pourquier, Condition number and diagonal preconditioning: comparison of the -version and the spectral element methods, Numer. Math., 74 (1996), pp. 69–84.
- [37] J. Manzanero, A. M. Rueda-Ramírez, G. Rubio, and E. Ferrer, The Bassi Rebay 1 scheme is a special case of the symmetric interior penalty formulation for discontinuous Galerkin discretisations with Gauss–Lobatto points, J. Comp. Phys., 363 (2018), pp. 1–10.
- [38] J. R. Munkres, Elements of Algebraic Topology, CRC Press, 1984.
- [39] J.-C. Nédélec, Mixed finite elements in , Numer. Math., 35 (1980), pp. 315–341.
- [40] S. A. Orszag, Spectral methods for problems in complex geometries, J. Comp. Phys., 37 (1980), pp. 70 – 92.
- [41] C. C. Paige and M. A. Saunders, Solution of sparse indefinite systems of linear equations, SIAM J. Numer. Anal., 12 (1975), pp. 617–629.
- [42] L. F. Pavarino, Additive Schwarz methods for the -version finite element method, Numer. Math., 66 (1993), pp. 493–515.
- [43] L. F. Pavarino, E. Zampieri, R. Pasquetti, and F. Rapetti, Overlapping Schwarz methods for Fekete and Gauss–Lobatto spectral elements, 29 (2007), pp. 1073–1092.
- [44] W. Pazner, Efficient low-order refined preconditioners for high-order matrix-free continuous and discontinuous Galerkin methods, SIAM J. Sci. Comput., 42 (2020), pp. A3055–A3083.
- [45] W. Pazner and P.-O. Persson, Approximate tensor-product preconditioners for very high order discontinuous Galerkin methods, J. Comp. Phys., 354 (2018), pp. 344–369.
- [46] W. Pazner and P.-O. Persson, Interior penalty tensor-product preconditioners for high-order discontinuous Galerkin discretizations, in 2018 AIAA Aerospace Sciences Meeting, 2018, p. 1093.
- [47] F. Rathgeber, D. A. Ham, L. Mitchell, M. Lange, F. Luporini, A. T. T. McRae, G.-T. Bercea, G. R. Markall, and P. H. J. Kelly, Firedrake: automating the finite element method by composing abstractions, ACM Trans. Math. Soft., 43 (2016), pp. 24:1–24:27.
- [48] J.-F. Remacle, R. Gandham, and T. Warburton, GPU accelerated spectral finite elements on all-hex meshes, J. Comp. Phys., 324 (2016), pp. 246–257.
- [49] Y. Saad and M. H. Schultz, GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM J. Sci. Stat. Comput., 7 (1986), pp. 856–869.
- [50] D. Schötzau and C. Schwab, Mixed -FEM on anisotropic meshes, Math. Models Methods Appl. Sci., 8 (1998), pp. 787–820.
- [51] D. Silvester and A. Wathen, Fast iterative solution of stabilised Stokes systems Part II: using general block preconditioners, SIAM J. Numer. Anal., 31 (1994), pp. 1352–1367.
- [52] J. Stiller, Robust multigrid for Cartesian interior penalty DG formulations of the Poisson equation in 3D, in Spectral and High Order Methods for Partial Differential Equations ICOSAHOM 2016, M. L. Bittencourt, N. A. Dumont, and J. S. Hesthaven, eds., Cham, 2017, Springer International Publishing, pp. 189–201.
- [53] B. Szabó and I. Babuška, Finite element analysis, John Wiley & Sons, 1991.
- [54] C. Taylor and P. Hood, A numerical solution of the Navier-Stokes equations using the finite element technique, Comput. Fluids, 1 (1973), pp. 73–100.
- [55] J. Witte, D. Arndt, and G. Kanschat, Fast tensor product Schwarz smoothers for high-order discontinuous Galerkin methods, J. Comput. Methods Appl. Math., (2020).
- [56] J. Xu, The auxiliary space method and optimal multigrid preconditioning techniques for unstructured grids, Computing, 56 (1996), pp. 215–235.