\headers

Sparse vertex-star relaxation for high-order FEMP. D. Brubeck and P. E. Farrell

A scalable and robust vertex-star relaxation for high-order FEM^†^†thanks: Submitted to the editors August 31, 2021. \funding PDB was supported by the University of Oxford Mathematical Institute Graduate Scholarship. PEF was supported by EPSRC grants EP/V001493/1 and EP/R029423/1. We would also like to thank Lawrence Mitchell for his helpful advice for the implementation in Firedrake.

Pablo D. Brubeck Mathematical Institute, University of Oxford, Oxford UK () brubeckmarti@maths.ox.ac.uk Patrick E. Farrell Mathematical Institute, University of Oxford, Oxford UK () patrick.farrell@maths.ox.ac.uk

Abstract

Pavarino proved that the additive Schwarz method with vertex patches and a low-order coarse space gives a $p$ -robust solver for symmetric and coercive problems [42]. However, for very high polynomial degree it is not feasible to assemble or factorize the matrices for each patch. In this work we introduce a direct solver for separable patch problems that scales to very high polynomial degree on tensor product cells. The solver constructs a tensor product basis that diagonalizes the blocks in the stiffness matrix for the internal degrees of freedom of each individual cell. As a result, the non-zero structure of the cell matrices is that of the graph connecting internal degrees of freedom to their projection onto the facets. In the new basis, the patch problem is as sparse as a low-order finite difference discretization, while having a sparser Cholesky factorization. We can thus afford to assemble and factorize the matrices for the vertex-patch problems, even for very high polynomial degree. In the non-separable case, the method can be applied as a preconditioner by approximating the problem with a separable surrogate. We demonstrate the approach by solving the Poisson equation and a $H(\mathrm{div})$ -conforming interior penalty discretization of linear elasticity in three dimensions at $p=15$ .

keywords:

preconditioning, high-order, tensor product, additive Schwarz, sparse Cholesky

{AMS}

65F08, 65N35, 65N55

{DOI}

10.1137/21M1444187

1 Introduction

For problems with smooth solutions, high-order finite element methods offer very good convergence properties, and in some cases they do not exhibit locking phenomena found in low-order methods. Moreover, there exist optimal matrix-free algorithms for operator evaluation with high arithmetic intensity, arising from data locality, that make efficient use of modern parallel hardware architectures. Unfortunately, the conditioning of the stiffness matrix is severely affected by the polynomial degree of the approximation. In order to obtain practical iterative solvers, we require good preconditioners.

Optimal solvers are often obtained from a multiplicative multigrid V-cycle where the smoother consists of a domain decomposition method, such as additive Schwarz with a particular space decomposition. The multigrid algorithm is then accelerated by a Krylov subspace method, such as preconditioned conjugate gradients (PCG). The choice of space decomposition in the relaxation is crucial for robustness with respect to the cell size $h$ , the polynomial degree $p$ , and parameters in the equation.

One of the cheapest relaxations, with $\mathcal{O}(p^{d})$ computational cost, is diagonal scaling, also known as point-Jacobi. The diagonally preconditioned Laplacian has a condition number of $\mathcal{O}(p^{2d-2})$ [36]. This implies that the number of PCG iterations, and therefore the number of residual evaluations, is $\mathcal{O}(p^{d-1})$ , incurring a total cost of $\mathcal{O}(p^{2d})$ . In order to minimize the time to solution, it is reasonable to consider more expensive relaxation methods that converge in fewer iterations. Ideally, we wish to balance the cost of applying the relaxation with that of updating the residual. On tensor product elements, the latter operation can be done quickly in $\mathcal{O}(p^{d+1})$ operations via the sum-factorization. Sum-factorization breaks down the residual evaluation into products of one-dimensional operators and diagonal scalings [40].

In 1993, Pavarino proved that the additive Schwarz method with a vertex-centered space decomposition and an additive coarse space of lowest-order ( $p=1$ ) gives a robust solver with respect to $h$ and $p$ for symmetric and coercive problems [42]. This type of space decomposition is often referred to as generous overlap and is illustrated in Figure 1. We use the terminology of [19] and refer to the subdomains in this space decomposition as vertex-star patches, as they are constructed by taking all the degrees of freedom (DOFs) on the topological entities in the star of each vertex.

The most straightforward implementation of a vertex-star solver involves the assembly and direct factorization of the $\mathcal{O}(p^{d})\times\mathcal{O}(p^{d})$ patch matrices (which are dense for Lagrange shape functions). This becomes prohibitively expensive at very high polynomial degrees, with the Cholesky factorization of such a matrix requiring $\mathcal{O}(p^{3d})$ operations. However, there exist bases for which the element matrices are sparse on affine cells, such as the hierarchical Lobatto shape functions [53]. In this basis, the stiffness matrix has a 5-point stencil in 2D, and a much larger 13-point stencil in 3D.

Efficient relaxation methods that are $p$ -robust may arise from the discretization of an auxiliary problem for which fast inversion techniques are available. For a more general approach to auxiliary space techniques, we refer to the work of Xu [56]. In our context, the underlying PDE and/or the domain can be replaced by those of a problem which is solvable by the method of separation of variables. The fast diagonalization method (FDM) [34] is a $\mathcal{O}(p^{d+1})$ direct factorization that breaks the problem down into a sequence of one-dimensional subproblems.

For the Poisson equation discretized on meshes with all cells Cartesian (all internal angles are right angles), the vertex-star problems can be solved directly with the FDM [55]. Huismann et al. [25] introduced a remarkably fast solver with $\mathcal{O}(p^{d})$ scaling on such meshes. The linear system is statically condensed by elimination of the cell DOFs, and the reduced system on the interface is solved with $p$ -multigrid and a restricted variant of the FDM onto the interface DOFs of a vertex-star. Since the statically condensed operator requires the exact inversion of the cell matrices, their approach has no obvious extension to the unstructured, non-Cartesian case.

The FDM can be applied as a relaxation by means of an auxiliary problem that is separable, but this requires a tensor product grid discretization of the patch, which is only possible when the cells are laid out in a tensor product grid [55]. When the cells are not Cartesian, the method of Witte et al. [55] approximates the whole patch as a single Cartesian domain and converges slowly even when the cells are slightly distorted. On general meshes, the patches may not have this structure, thus the FDM cannot be directly applied on such patches. An example of a vertex-star patch to which the FDM cannot be applied as a relaxation is shown in Figure 2.

Refer to caption — (a) Vertex-star patch

A popular alternative in the literature has been to use cell-centered patches with minimal overlap by including a few layers of DOFs of the neighboring cells [20, 48, 52]. This can be done in such a way that every patch remains structured. This kind of space decomposition is more amenable to fast implementation, but does not give a $p$ -robust solver. If the number of layers is fixed, then the measure of the overlap region will decrease as $p$ is increased. Pavarino also proved that when the overlap is not generous, the rate of convergence of the additive Schwarz method will depend inversely on the overlap size [43]. To overcome this, Fischer and Lottes [33] applied a hybrid $p$ -multigrid/Schwarz method, in the context of a Poisson problem. They implemented several levels of $p$ -multigrid to overcome the lack of $p$ -robustness of the cell-centered patches with minimal overlap. The use of multiple levels increases the overlap at the coarser levels with a minor impact on the overall computational cost. Cell-centered patches without overlap have also been employed for non-symmetric problems [45, 46, 17].

Instead of replacing the vertex-star patches with cell-centered ones, the alternative FEM-SEM preconditioner [40] rediscretizes the problem on each vertex-star patch with $p=1$ on a GLL grid, a mesh with vertices at the DOFs of the high-order space. The theory behind this guarantees the spectral equivalence between the differential operator discretized on the two spaces [13]. Since low-order methods are naturally sparse, this approach is not constrained to Cartesian cells and can deal properly with mixed first derivatives that the FDM cannot handle. A downside of this approach is that the Cholesky factors of the patch matrices are quite dense, limiting its scalability to very high polynomial degree. Computationally advantageous approaches involve incomplete factorizations of the patch FEM-SEM matrices [44], or the use of algebraic multigrid on the global FEM-SEM operator [9].

In this work we develop a solver for vertex-star patches that scales to very high polynomial degree. Our approach does not rely on a particular structure of the patch. In particular, it applies to the patches shown in Figure 2. The key idea is to use the FDM to numerically construct a basis of functions on an interval that diagonalizes the interior blocks of both the stiffness and mass matrices in one dimension. When the problem is assembled with respect to the associated tensor product basis, the resulting stiffness and mass matrices are sparse, in the Cartesian case. In particular, the total number of non-zeros is the same as that of a low-order finite difference discretization of the Laplacian. Moreover, fill-in in the Cholesky factorization is only introduced for the interface DOFs, resulting in very sparse Cholesky factors. The factorization requires $\mathcal{O}(p^{3(d-1)})$ operations, while forward and back-substitution steps have a cost of $\mathcal{O}(p^{2(d-1)})$ operations that is optimal for $d\in\{2,3\}$ , in contrast with the $\mathcal{O}(p^{3d})$ and $\mathcal{O}(p^{2d})$ costs of the naïve approach. A disadvantage of the approach is that the memory required scales like $\mathcal{O}(p^{2(d-1)})$ , instead of the optimal $\mathcal{O}(p^{d})$ required for storing the solution. In the non-Cartesian case, we approximate the form with one that is separable in the reference problem. Robustness with respect to $h$ and $p$ should follow from the spectral equivalence between the forms, and numerical experiments indicate that the approach is effective when the cells are moderately deformed.

We demonstrate the effectiveness of our approach by applying it to a $H(\mathrm{div})\times L^{2}$ conforming discretization of a mixed formulation of incompressible linear elasticity. We present a sequence of problems of increasing complexity building up to this. In Section 2 we present the standard $hp$ -FEM formulation for the Poisson problem and construct a solver based on a sparse discretization of an auxiliary locally separable PDE that employs the numerically computed FDM basis. In Section 3 we consider the application of our solver to linear elasticity. In the primal formulation, although our approach can be applied to patch problems for the individual components of displacement, we explain why it cannot be applied to the coupled vector-valued problem, which is necessary for parameter-robustness in the incompressible regime. We therefore consider a mixed formulation instead. Developing a $p$ -robust solver requires both a $p$ -robust preconditioner and a $p$ -robust discretization, and for the latter we choose a $H(\mathrm{div})\times L^{2}$ conforming approach. We then extend the method to symmetric interior-penalty discontinuous Galerkin discretizations, required for the displacement block of the mixed problem. We apply our relaxation to the displacement block of the incompressible elasticity system in conjunction with block-preconditioned Krylov methods. We end with conclusions in Section 4.

2 Sparse Poisson solver

We will first describe a solver for the Poisson equation on Cartesian cells, which will subsequently be extended as a preconditioner for more general symmetric coercive problems on non-Cartesian cells.

2.1 Continuous Galerkin formulation

We start from the standard weak formulation of the Poisson equation. Consider a bounded domain $\Omega\subset\mathbb{R}^{d}$ , $d\in\{1,2,3\}$ , and let $\Gamma_{D}\subseteq\partial\Omega$ be the part of the boundary where the Dirichlet boundary condition $\left.u\right|_{\Gamma_{D}}=u_{0}$ is prescribed. The problem is to find $u-u_{0}$ in $V\coloneqq H^{1}_{0}(\Omega)=\{v\in H^{1}(\Omega),\left.v\right|_{\Gamma_{D}}=0\}$ such that

a(v,u)=L(v)\quad\forall\,v\in V,

(2.1)

where

a(v,u)\coloneqq\int_{\Omega}\nabla v\cdot\nabla u\leavevmode\nobreak\ \mathrm{% d}\mathbf{x},\quad L(v)\coloneqq\int_{\Omega}vf\leavevmode\nobreak\ \mathrm{d}% \mathbf{x}.

(2.2)

The standard FEM discretization employs a mesh $\mathcal{T}_{h}=\{K\}$ of $\Omega$ . In this work we consider quadrilateral and hexahedral cells, so that each cell $K$ can be mapped with a diffeomorphism $F_{K}:\hat{K}\to K$ from the reference hypercube $\hat{K}=\hat{\mathcal{I}}^{d}$ , where $\hat{\mathcal{I}}=[-1,1]$ is the reference interval. The approximate solution $u_{h}\in V_{h}$ is sought in the space of piecewise continuous tensor product polynomials on each cell, i.e. $V_{h}\coloneqq\mathrm{Q}_{p}(\Omega)\subset V$ . We first define the space of shape functions on $\hat{K}$

\mathrm{Q}_{p}(\hat{K})\coloneqq\bigotimes_{j=1}^{d}\mathrm{P}_{p}(\hat{% \mathcal{I}}),\quad\mathrm{P}_{p}(\hat{\mathcal{I}})\coloneqq\operatorname{% span}\left\{\hat{x}^{j},0\leq j\leq p\right\},

(2.3)

and via composition with $F_{K}^{-1}$ , we define

\mathrm{Q}_{p}(\Omega)\coloneqq\left\{v\in H^{1}_{0}(\Omega):\forall\,K\in% \mathcal{T}_{h}\ \exists\ \hat{v}\in\mathrm{Q}_{p}(\hat{K})\text{ s.t. }\left.% v\right|_{K}=\hat{v}\circ F_{K}^{-1}\right\}.

(2.4)

Once we fix a basis $\{\phi_{j}\}_{j=1}^{n}$ for $V_{h}$ , the approximate solution is expanded as $u_{h}=\sum_{j=1}^{n}u_{j}\phi_{j}$ . The resulting $n\times n$ system of linear equations is

A\underline{u}=\underline{f},

(2.5)

where $[A]_{ij}=a(\phi_{i},\phi_{j})$ is the stiffness matrix, $\underline{u}=(u_{1},\ldots,u_{n})^{\top}$ is the vector of coefficients, and $\underline{f}=(L(\phi_{1}),\ldots,L(\phi_{n}))^{\top}$ is the load vector.

We recall the standard construction of the basis $\{\phi_{j}\}$ [26]. The basis is defined in terms of shape functions $\{\hat{\phi}_{j}\}$ on $\hat{K}$ . Given shape functions $\{\hat{\phi}^{\mathrm{1D}}_{j}\}_{j=0}^{p}$ for $\mathrm{P}_{p}(\hat{\mathcal{I}})$ , a tensor product basis $\{\hat{\phi}_{j}\}$ for $\mathrm{Q}_{p}(\hat{K})$ can be constructed as

\hat{\phi}_{j}(\hat{\mathbf{x}})=\prod_{k=1}^{d}\hat{\phi}^{\mathrm{1D}}_{j_{k% }}(\hat{x}_{k}),

(2.6)

where we have expanded $j=(j_{1},\ldots,j_{d})\in[0,p]^{d}$ as a multi-index.

The interval shape functions are decomposed into interface and interior modes. The interface modes have non-zero support on either endpoint of $\hat{\mathcal{I}}$ , while the interior modes vanish at the boundary of $\hat{\mathcal{I}}$ . In multiple dimensions, the shape functions decompose into interior, facet, edge, and vertex modes, depending on how many 1D interface functions are multiplied together. To generate a $C^{0}$ basis, we simply match the shape of individual interface modes. Hence, $A$ will be block sparse, since $[A]_{ij}=0$ when $i$ and $j$ correspond to interior modes supported on different cells.

For the interval shape functions, one standard choice is the set of Lagrange polynomials on the Gauß–Lobatto–Legendre (GLL) nodes $\{\hat{\xi}_{i}\}_{i=0}^{p}\subset[-1,1]$ . These nodes are the roots of $(1-\hat{\xi}^{2})P^{\prime}_{p}(\hat{\xi})$ , where $P_{k}(\hat{\xi})$ is the Legendre polynomial of degree $k$ . The Lagrange polynomials $\{\ell_{j}(\hat{x})\}$ satisfy $\ell_{j}(\hat{\xi}_{i})=\delta_{ij}$ by construction,

\ell_{j}(\hat{x})=\prod_{k=0,k\neq j}^{p}\frac{\hat{x}-\hat{\xi}_{k}}{\hat{\xi% }_{j}-\hat{\xi}_{k}},\quad j=0,\ldots,p.

(2.7)

Another useful basis is formed by the hierarchical Lobatto shape functions $\{l_{j}\}$ , which are constructed by augmenting the so-called bubble functions (integrated Legendre polynomials) with linear Lagrange functions,

l_{j}(\hat{x})=\begin{cases}(1-\hat{x})/2&\mbox{for }j=0,\\ (1+\hat{x})/2&\mbox{for }j=p,\\ \int_{-1}^{\hat{x}}P_{j}(z)\leavevmode\nobreak\ \mathrm{d}z&\mbox{for }j=1,% \ldots,p-1.\end{cases}

(2.8)

These two choices of shape functions are plotted in Figure 3.

The assembly of the stiffness matrix $A$ is described as follows. On each cell $K\in\mathcal{T}_{h}$ we define the cell stiffness matrix $A^{K}\in\mathbb{R}^{(p+1)^{d}\times(p+1)^{d}}$ in terms of the basis functions $\{\phi^{K}_{j}\}$ that are supported on $K$ , which are obtained from the reference shape functions $\{\hat{\phi}_{j}\}$ via $\phi^{K}_{j}=\hat{\phi}_{j}\circ F_{K}^{-1}$ . Then, the cell stiffness matrices are

[A^{K}]_{ij}=\int_{K}\nabla\phi^{K}_{i}\cdot\nabla\phi^{K}_{j}\leavevmode% \nobreak\ \mathrm{d}\mathbf{x}.

(2.9)

The global stiffness matrix is then assembled via direct stiffness summation:

A=\sum_{K\in\mathcal{T}_{h}}R_{K}^{\top}A^{K}R_{K},

(2.10)

where $R_{K}\in\mathbb{R}^{(p+1)^{d}\times n}$ is the Boolean restriction matrix from the global DOFs to those local to cell $K$ .

2.2 Tensor product structure on Cartesian cells

If $d=1$ and $F_{K}$ is an affine mapping, then the cell stiffness matrices are

[A^{K}]_{ij}=\frac{1}{L^{K}}\int_{\hat{\mathcal{I}}}\hat{\phi}^{\prime}_{i}% \hat{\phi}^{\prime}_{j}\leavevmode\nobreak\ \mathrm{d}\hat{x}=\frac{1}{L^{K}}[% \hat{A}]_{ij}.

(2.11)

Here $\hat{A}\in\mathbb{R}^{(p+1)\times(p+1)}$ is the stiffness matrix on the reference interval, and $L^{K}=\absolutevalue{K}/|\hat{\mathcal{I}}|$ , where $\absolutevalue{K}$ denotes the measure of the cell $K$ .

For $d=\{2,3\}$ , we will first consider the case where $\Omega$ can be tessellated with a mesh $\mathcal{T}_{h}$ consisting of Cartesian cells, i.e. the cells are rectangular quadrilaterals or hexahedra (all internal angles are right angles). In this idealized setting, the cell stiffness matrices are separable in the reference coordinates

A^{K}=\begin{cases}\mu^{K}_{1}\hat{B}\otimes\hat{A}+\mu^{K}_{2}\hat{A}\otimes% \hat{B}&\mbox{if\leavevmode\nobreak\ }d=2,\\ \mu^{K}_{1}\hat{B}\otimes\hat{B}\otimes\hat{A}+\mu^{K}_{2}\hat{B}\otimes\hat{A% }\otimes\hat{B}+\mu^{K}_{3}\hat{A}\otimes\hat{B}\otimes\hat{B}&\mbox{if% \leavevmode\nobreak\ }d=3,\end{cases}

(2.12)

where

[\hat{B}]_{ij}=\int_{\hat{\mathcal{I}}}\hat{\phi}_{i}\leavevmode\nobreak\ \hat% {\phi}_{j}\leavevmode\nobreak\ \mathrm{d}\hat{x},

(2.13)

is the mass matrix on the reference interval, $\mu^{K}_{j}=(L^{K}_{j})^{-2}\prod_{i=1}^{d}L^{K}_{i}$ , and $L^{K}_{j}$ is the length of $K$ along the $j$ -th axis divided by $|\hat{\mathcal{I}}|$ . The symbol $\otimes$ denotes the Kronecker product, which for matrices $A\in\mathbb{R}^{m\times n},B\in\mathbb{R}^{r\times s}$ , is defined as the block matrix

A\otimes B=\begin{bmatrix}a_{11}B&\cdots&a_{1n}B\\ \vdots&\ddots&\vdots\\ a_{m1}B&\cdots&a_{mn}B\end{bmatrix}\in\mathbb{R}^{rm\times sn}.

(2.14)

It follows that if $A$ and $B$ are sparse, then $A\otimes B$ is also sparse.

For the GLL basis $\{\ell_{j}\}$ , both $\hat{A}$ and $\hat{B}$ are dense, but these are sparse in the hierarchical basis $\{l_{j}\}$ . This is illustrated in Figure 3. On affine cells, the hierarchical basis yields a sparse stiffness matrix. The bubble functions $\{l_{j}\}_{j=1}^{p-1}$ , satisfy $l^{\prime}_{j}(\hat{x})=P_{j}(\hat{x})$ , and due to the orthogonality of the Legendre polynomials, the interior block of $\hat{A}$ is diagonal. The only off-diagonal non-zeros in $\hat{A}$ are due to the coupling between the interface modes $l_{0}$ , $l_{p}$ . Nevertheless, in order for this sparsity to propagate to higher dimensions, we would additionally wish that $\hat{B}$ is also as sparse as possible. This is not quite the case for $\{l_{j}\}$ , as $\hat{B}$ has two interior blocks with tri-diagonal structure, in the even-odd decomposition. Therefore, on a typical row, $A$ will have the structure of the 5-point stencil for $d=2$ and that of a 13-point stencil for $d=3$ .

2.3 The fast diagonalization method

Linear systems involving structured matrices such as that defined in (2.12) can be solved efficiently using the fast diagonalization method (FDM) [34]. This method reduces the computation into a sequence of eigenvalue problems on the interval in a similar fashion as the method of separation of variables. It requires a separable PDE and a tensor product basis; therefore it can only be applied on meshes or mesh patches with tensor product structure.

To illustrate the FDM, we consider solving a problem on the interior of a Cartesian cell, $A^{K}_{II}\underline{u}^{K}_{I}=\underline{r}^{K}_{I}$ , where $A^{K}_{II}=R^{K}_{I}AR^{K}_{I}{}^{\top}$ and $R^{K}_{I}\in\mathbb{R}^{(p-1)^{d}\times n}$ is the Boolean restriction matrix onto the interior DOFs of $K$ . We may first solve the generalized eigenvalue problem on the interior of the reference interval

\hat{A}_{II}\hat{S}_{II}=\hat{B}_{II}\hat{S}_{II}\hat{\Lambda}_{II},

(2.15)

in conjunction with the normalization condition $\hat{S}_{II}^{\top}\hat{B}_{II}\hat{S}_{II}={\mathds{1}}$ . Here $\hat{A}_{II},\hat{B}_{II}\in\mathbb{R}^{(p-1)\times(p-1)}$ are the interior blocks of $\hat{A}$ and $\hat{B}$ , respectively, $\hat{\Lambda}_{II}\in\mathbb{R}^{(p-1)\times(p-1)}$ is the diagonal matrix of eigenvalues, and $\hat{S}_{II}\in\mathbb{R}^{(p-1)\times(p-1)}$ is the matrix of eigenvectors. The generalized eigenproblem (2.15) may be equivalently rewritten as

\hat{S}_{II}^{\top}\hat{A}_{II}\hat{S}_{II}=\hat{\Lambda}_{II},\quad\hat{S}_{% II}^{\top}\hat{B}_{II}\hat{S}_{II}={\mathds{1}}.

(2.16)

The corresponding continuous problem is to find $s_{j}(\hat{x})$ and $\lambda_{j}$ , $j=1,\dots,p-1,$ such that

\int_{\hat{\mathcal{I}}}s^{\prime}_{i}(\hat{x})s^{\prime}_{j}(\hat{x})% \leavevmode\nobreak\ \mathrm{d}\hat{x}=\lambda_{j}\delta_{ij},\quad\int_{\hat{% \mathcal{I}}}s_{i}(\hat{x})s_{j}(\hat{x})\leavevmode\nobreak\ \mathrm{d}\hat{x% }=\delta_{ij},\quad s_{j}(\pm 1)=0,

(2.17)

with solution $s_{j}(\hat{x})=\sin(j\pi(1+\hat{x})/2)$ , $\lambda_{j}=j^{2}\pi^{2}/4$ . Thus, when $\hat{A}$ and $\hat{B}$ are discretized with the GLL basis, $\hat{S}_{II}$ approximates a discrete sine transform on the interior GLL nodes.

If $A^{K}$ is given by (2.12), then its inverse has the following diagonal factorization

\left(A^{K}_{II}\right)^{-1}=\left\{\bigotimes_{k=1}^{d}\hat{S}_{II}\right\}% \left(\Lambda^{K}_{II}\right)^{-1}\left\{\bigotimes_{k=1}^{d}\hat{S}_{II}^{% \top}\right\},

(2.18)

where

\Lambda^{K}_{II}=\begin{cases}\mu^{K}_{1}{\mathds{1}}\otimes\hat{\Lambda}_{II}% +\mu^{K}_{2}\hat{\Lambda}_{II}\otimes{\mathds{1}}&\mbox{if\leavevmode\nobreak% \ }d=2,\\ \mu^{K}_{1}{\mathds{1}}\otimes{\mathds{1}}\otimes\hat{\Lambda}_{II}+\mu^{K}_{2% }{\mathds{1}}\otimes\hat{\Lambda}_{II}\otimes{\mathds{1}}+\mu^{K}_{3}\hat{% \Lambda}_{II}\otimes{\mathds{1}}\otimes{\mathds{1}}&\mbox{if\leavevmode% \nobreak\ }d=3.\\ \end{cases}

(2.19)

Therefore, the solution of a system $A^{K}_{II}\underline{u}^{K}_{I}=\underline{r}^{K}_{I}$ can be obtained with $\mathcal{O}(p^{d+1})$ computational work.

The main limitation of this approach is that it does not generalize to terms that contain first derivatives, ruling out the possible extension to advection problems. Mixed first derivative terms are also very common in symmetric coercive problems, for instance, when the cells have non-orthotropic deformations, or for vector-valued operators that mix first derivatives of distinct vector components, such as $\nabla(\nabla\cdot\mathbf{u}$ ).

2.4 Sparse FDM discretization

Our key idea is to construct a new finite element basis on the interval, inspired by the FDM, which yields a sparse stiffness matrix. To construct this new basis, we solve (2.16) with $\hat{A}$ and $\hat{B}$ in the GLL basis. Then, we interpolate the eigenvectors of the FDM with polynomials $\{\hat{s}_{j}\}_{j=1}^{p-1}\subset\mathrm{P}_{p}(\hat{\mathcal{I}})$ that satisfy $\hat{s}_{j}(\pm 1)=0$ , $\hat{s}_{j}(\hat{\xi}_{i})=[\hat{S}_{II}]_{ij}$ for $i,j\in[1,p-1]$ . The unisolvent dual basis of our proposed FDM discretization of $\mathrm{P}_{p}(\hat{\mathcal{I}})$ consists of point evaluation at the vertices and integral moments against the orthogonal interior shape functions $\hat{s}_{j}$ , $j=1,\ldots,p-1$ . This construction ensures that $\hat{A}_{II}$ and $\hat{B}_{II}$ become diagonal under this basis. As an additional consequence, the 1D interface shape functions are also orthogonal to the interior ones, but not to each other. Hence $\hat{B}$ becomes as sparse as possible in this new FDM basis. Figure 4 shows the shape functions and the non-zero structure of $\hat{A}$ and $\hat{B}$ for the FDM basis.

The stiffness matrix for a Cartesian cell $A^{K}$ in this FDM basis is also given by (2.12), and has structured sparsity, since $A\otimes B$ is sparse when $A$ and $B$ are sparse. Consequently, the global matrix $A$ is also sparse. We can thus apply sparse direct factorization methods or other preconditioners such as the additive Schwarz method.

For problems that are separable in Cartesian coordinates, we obtain a solver that is reminiscent of the FDM. The multiplication times the Kronecker product of the matrices of eigenvectors is incorporated in the definition of the FDM shape functions. If for some reason the problem needs to be solved with any other basis, one can transform the linear system back and forth to the FDM basis via interpolation and restriction at each application of the method. We also replace the diagonal matrix of the traditional FDM with a sparse matrix, which can be assembled even on 3D meshes with a very large number of cells and for very high degree. This approach removes the requirement of a global tensor product grid while exploiting the separability of the PDE and the local structure of $\mathrm{Q}_{p}(\Omega)$ , at the expense of replacing the diagonal factor by a sparse matrix.

2.5 Hybrid $p$ -multigrid/Schwarz method

The solver of Pavarino is fully additive, across both the coarse grid and the vertex-star patches. In our work we consider a small variation of this, with the solver multiplicative between the two levels while remaining additive among the vertex-star patches. This improves the convergence at essentially no cost. The method can be interpreted as a hybrid multiplicative two-level $\mathrm{V}(1,1)$ -cycle with the additive Schwarz method [18] with vertex-star patches as the fine grid relaxation and the lowest-order discretization on the same mesh as the coarse space. The sparse matrix for the coarse space may be assembled and factorized, or other preconditioners such as geometric or algebraic multigrid may be applied instead. The vertex-star patch $V_{j}$ includes the degrees of freedom associated with vertex $\mathbf{v}_{j}$ of $\mathcal{T}_{h}$ and all cells, facets, and edges adjacent to $\mathbf{v}_{j}$ (the topological entities in the star of the vertex, a standard concept in algebraic topology [38, §2]).

We may write the multigrid relaxation as

P_{\mathrm{ASM}}^{-1}=\sum_{j=1}^{J}\bar{R}_{j}^{\top}A_{j}^{-1}\bar{R}_{j},

(2.20)

with $\bar{R}_{j}$ the Boolean restriction matrix onto $V_{j}$ , and $A_{j}=\bar{R}_{j}A\bar{R}_{j}^{\top}$ are the sparse patch matrices for which we may explicitly compute a Cholesky decomposition. The relaxation is scaled by the damping coefficient

\omega=2\left[(1+\alpha)\lambda_{\mathrm{max}}+(1-\alpha)\lambda_{\mathrm{min}% }\right]^{-1},

(2.21)

where $\lambda_{\mathrm{\min}},\lambda_{\mathrm{\max}}$ are the extremal eigenvalues of $P_{\mathrm{ASM}}^{-1}A$ estimated via the CG-Lanczos procedure [30], and $\alpha=0.25$ is chosen to tackle the high frequency error, also ensuring that the error iteration matrix ${\mathds{1}}-\omega P_{\mathrm{ASM}}^{-1}A$ is contractive.

To illustrate the direct solver on the Cartesian vertex-star patch shown in Figure 1a, we show in Figure 5 the non-zero structure for the patch matrix $A_{j}$ and its Cholesky factor. The sparsity pattern of the global matrix $A$ connects the interior DOFs to their projections onto the facets, hence a typical interior row of $A$ will have $2d+1$ non-zeros. For the patch matrix $A_{j}$ , an interior row will only have $d+1$ non-zeros, as the patch only includes one facet per dimension on each cell. Moreover, the total number of non-zeros of $A_{j}$ is the same as that of a low-order finite difference or finite element discretization with the 5-point or 7-point stencil on the same grid.

2.6 Computational complexity

Here we discuss the computational cost of the solution of the patch problem using the Cholesky factorization. Once the factorization has been computed, it may be applied in $\mathcal{O}(p^{2(d-1)})$ cost, which is optimal for $d\in\{2,3\}$ . Unfortunately, the factorization phase is suboptimal, requiring $\mathcal{O}(p^{3(d-1)})$ operations to compute. The memory required to store the Cholesky factor is $\mathcal{O}(p^{2(d-1)})$ , which for $d=3$ is one factor of $p$ higher than that required to store the solution.

Consider a stiffness matrix $A$ discretized with the FDM basis on any mesh with all cells Cartesian. The number of floating point operations (flops) needed to solve a linear system using a sparse Cholesky factorization $A=LL^{\top}$ is roughly four times the number of non-zero entries in $L$ [14]. To maximize the sparsity in $L$ , it is crucial to reorder the DOFs, such that interior DOFs are followed by the interface DOFs. This ensures that the fill-in is introduced only on the bottom-left block. To analyze the cost of factorization, we first introduce the block $LDL^{\top}$ decomposition

A=\begin{bmatrix}A_{II}&A_{I\Gamma}\\ A_{\Gamma I}&A_{\Gamma\Gamma}\end{bmatrix}=\begin{bmatrix}{\mathds{1}}&0\\ A_{\Gamma I}A_{II}^{-1}&{\mathds{1}}\end{bmatrix}\begin{bmatrix}A_{II}&0\\ 0&S_{\Gamma}\end{bmatrix}\begin{bmatrix}{\mathds{1}}&A_{II}^{-1}A_{I\Gamma}\\ 0&{\mathds{1}}\end{bmatrix},

(2.22)

where $S_{\Gamma}=A_{\Gamma\Gamma}-A_{\Gamma I}A_{II}^{-1}A_{I\Gamma}$ is the interface Schur complement. By construction, the top-left block $A_{II}$ is diagonal with positive entries, with Cholesky factor $A_{II}^{1/2}$ . If we decompose $A_{II}$ and $S_{\Gamma}$ in the second matrix on the RHS of (2.22) into their Cholesky factors, and distribute each factor onto the other two matrices, we obtain the Cholesky decomposition of $A$ :

A=LL^{\top}=\begin{bmatrix}A_{II}^{1/2}&0\\ A_{\Gamma I}A_{II}^{-1/2}&L_{\Gamma}\end{bmatrix}\begin{bmatrix}A_{II}^{1/2}&A% _{II}^{-1/2}A_{I\Gamma}\\ 0&L_{\Gamma}^{\top}\end{bmatrix},

(2.23)

where the Schur complement is factorized as $S_{\Gamma}=L_{\Gamma}L_{\Gamma}^{\top}$ . Since $A_{II}$ is diagonal, the off-diagonal block $A_{\Gamma I}A_{II}^{-1/2}$ will preserve the non-zero structure of $A_{\Gamma I}$ , and similarly for its transpose. Thus, fill-in is only introduced on the interface block through $L_{\Gamma}$ .

An ordering strategy that minimizes fill-in consists of applying nested dissection [21] on the adjacency graph that connects topological entities. Each node in this graph represents a cell, face, edge or vertex. The ordering of the entities is then used to permute the corresponding blocks in $A$ .

Assuming for the worst case that $L_{\Gamma}^{\top}$ is dense, the memory required to store the Cholesky factor is $\mathcal{O}(p^{2(d-1)})$ . This represents a significant increase from the traditional FDM, which is kept at the optimal $\mathcal{O}(p^{d})$ . However, the DOF ordering does lead to some structured sparsity in $L_{\Gamma}^{\top}$ , as can be seen in Figure 5. Nevertheless, we still observe dense $\mathcal{O}(p^{d-1})\times\mathcal{O}(p^{d-1})$ blocks. The fact that $L_{\Gamma}^{\top}$ contains these dense blocks indicates that $\mathcal{O}(p^{3(d-1)})$ operations are required in the factorization phase. However, the forward and back-substitution steps have a computational cost of $\mathcal{O}(p^{2(d-1)})$ operations, which is optimal for $d\leq 3$ .

Compared to the FEM-SEM approach, our approach with the FDM basis has a sparser Cholesky factorization. In Figure 6 we present the ratio of the number of non-zeros in the Cholesky factors of our approach and the FEM-SEM preconditioner ordered with nested dissection for $d\in\{2,3\}$ . The fact that the ratio is always below 1 confirms that our approach is sparser, with a substantial gain at higher degrees. Practical FEM-SEM solvers use AMG [9] or patchwise multigrid with ILU smoothers [44] to avoid the cost of the Cholesky factorizations of the patch matrices.

Figure 6: Relative number of non-zero entries in the Cholesky factors of the stiffness matrix with the FDM approach and the FEM-SEM preconditioner, for a Cartesian vertex-star patch of

2^{d}

cells. The FDM approach is sparser, with substantial gains at higher degrees.

2.7 Extension to non-Cartesian cells

For arbitrarily deformed cells, the local stiffness matrices $A^{K}$ cannot be expressed in terms of tensor products of $\hat{A}$ and $\hat{B}$ as in (2.12), and $A^{K}$ is not sparse in the FDM basis. The preconditioning techniques found in [16, 20, 55] introduce an auxiliary Cartesian domain to construct a separable problem for which the FDM is a direct solver. The method described by Fischer [20] constructs a preconditioner by replacing $K$ with its nearest rectangular approximation, whose dimensions are computed as the mean separation between the mapped GLL nodes from opposite facets of $K$ . Witte et al. [55] obtain the lengths from the average arclength of opposite sides of $K$ , but it is not clear how this extends to the 3D case. To the best of our knowledge, no theory underpins these choices.

Our approach to construct the separable surrogate is based on the theory of equivalent operator preconditioning [6]. We work with the bilinear form $a(\cdot,\cdot)$ in terms of the reference coordinates. We discard the mixed derivative terms that prevent separability, and we replace the coefficients with piecewise constants in the reference coordinates¹¹1Recall that piecewise constant coefficients in the physical coordinates will not yield piecewise constant coefficients in the reference coordinates.. We will prove that this choice yields a spectrally equivalent operator.

The bilinear form $a(\cdot,\cdot)$ can be expressed as a sum of cell contributions $a_{K}(\cdot,\cdot)$ where integration and differentiation are with respect to $\hat{\mathbf{x}}$ . The measure $\mathrm{d}\mathbf{x}$ is replaced by $\absolutevalue{\det(\mathrm{D}F_{K})}\mathrm{d}\hat{\mathbf{x}}$ and the gradient is computed via the chain rule, since the arguments of the form become functions of $\hat{\mathbf{x}}$ after being composed with $F_{K}$ . Hence,

a(v,u)=\sum_{K\in\mathcal{T}_{h}}a_{K}(v,u)=\sum_{K\in\mathcal{T}_{h}}\int_{% \hat{K}}\hat{\nabla}v\circ F_{K}\cdot\hat{G}^{K}\hat{\nabla}u\circ F_{K}% \leavevmode\nobreak\ \mathrm{d}\hat{\mathbf{x}},

(2.24)

where $\hat{\nabla}$ is the gradient with respect to $\hat{\mathbf{x}}$ , and $\hat{G}^{K}:\hat{K}\to\mathbb{R}^{d\times d}$ is the inverse metric of the coordinate transformation weighted by the Jacobian determinant,

\hat{G}^{K}=\absolutevalue{\det(\mathrm{D}F_{K})}\mathrm{D}F_{K}^{-1}\mathrm{D% }F_{K}^{-\top}.

(2.25)

This tensor encapsulates all of the geometry-dependent information in the form; it is spatially dependent for generally-deformed elements, and constant in the case of affine transformations. For a separable geometry, $\hat{G}^{K}$ is diagonal, and thus for a Cartesian cell it is diagonal and constant. To construct an auxiliary problem that is separable by the FDM in the reference coordinates, we replace $\hat{G}^{K}$ in $a_{K}(\cdot,\cdot)$ with a constant diagonal approximation $\operatorname{diag}(\mu^{K}_{j})$ . Each $\mu^{K}_{j}$ is given by the cell-wise average of the diagonal entry $\hat{G}^{K}_{jj}$ ,

\mu^{K}_{j}\coloneqq\frac{1}{|\hat{K}|}\int_{\hat{K}}\hat{G}^{K}_{jj}% \leavevmode\nobreak\ \mathrm{d}\hat{\mathbf{x}},

(2.26)

where summation over the index $j$ is not implied. As the approximation is local to each cell, it is still possible to assemble a sparse stiffness matrix discretizing the auxiliary problem on meshes where cells are not structured in a tensor product grid.

We now establish the spectral equivalence between the original bilinear form and the auxiliary separable one.

Theorem 2.1.

Let $\hat{\mu}_{K}\coloneqq\operatorname{diag}(\mu^{K}_{j})$ be the constant diagonal approximation of $\hat{G}^{K}$ , and define the auxiliary bilinear form

\tilde{a}(v,u)\coloneqq\sum_{K\in\mathcal{T}_{h}}\tilde{a}_{K}(v,u)\coloneqq% \sum_{K\in\mathcal{T}_{h}}\int_{\hat{K}}\hat{\nabla}v\circ F_{K}\cdot\hat{\mu}% _{K}\hat{\nabla}u\circ F_{K}\leavevmode\nobreak\ \mathrm{d}\hat{\mathbf{x}}.

(2.27)

Then, there exist $p$ -independent constants $c$ , $C>0$ that depend on $\mathcal{T}_{h}$ through $\hat{G}^{K}$ such that

c\leq\frac{a(v,v)}{\tilde{a}(v,v)}\leq C\quad\forall\,v\in V\setminus\{0\}.

(2.28)

Proof 2.2.

Let $c_{K},C_{K}$ be lower and upper bounds for the spectrum of the diagonally scaled metric, so that $\sigma(\hat{\mu}_{K}^{-1/2}\hat{G}^{K}\hat{\mu}_{K}^{-1/2})\in[c_{K},C_{K}]$ for all $\hat{\mathbf{x}}\in\hat{K}$ . We claim that

c_{K}\leq\frac{a_{K}(v,v)}{\tilde{a}_{K}(v,v)}\leq C_{K}\quad\forall\,v\in\{v% \in V:\left.v\right|_{K}\neq 0\}.

(2.29)

This result is obtained by first rewriting $a_{K}(v,v)$ with $\hat{\mu}_{K}^{1/2}\hat{\mu}_{K}^{-1/2}\hat{G}^{K}\hat{\mu}_{K}^{-1/2}\hat{\mu% }_{K}^{1/2}$ instead of $\hat{G}^{K}$ , and then replacing $\hat{\mu}^{-1/2}_{K}\hat{G}^{K}\hat{\mu}^{-1/2}_{K}$ with $c_{K}{\mathds{1}}$ or $C_{K}{\mathds{1}}$ to find the lower or upper bounds, respectively. It then follows that

c\coloneqq\min_{K\in\mathcal{T}_{h}}c_{K}\leq\frac{a(v,v)}{\tilde{a}(v,v)}\leq% \max_{K\in\mathcal{T}_{h}}C_{K}\eqqcolon C\quad\forall\,v\in V\setminus\{0\}.

(2.30)

Let $\tilde{A}$ be the stiffness matrix associated with the auxiliary form $\tilde{a}(\cdot,\cdot)$ . By Theorem 2.1, the condition number $\kappa(\tilde{A}^{-1}A)$ is bounded by $C/c$ independently of $p$ . Numerical experiments also indicate that $\kappa(\tilde{A}^{-1}A)$ is independent of $h$ under uniform refinements. Now consider a preconditioner $P$ where the auxiliary form $\tilde{a}(\cdot,\cdot)$ is used additively in both the coarse solve and the vertex-star patches. In this case, Theorem 1 of [42] guarantees that $\kappa(P^{-1}\tilde{A})$ is bounded from above independently of $h$ and $p$ . Hence we may conclude that $\kappa(P^{-1}A)\leq\kappa(P^{-1}\tilde{A})\kappa(\tilde{A}^{-1}A)$ is bounded independently of $h$ and $p$ . In practice, we expect that using multiplicative coarse grid correction with the original form $a(\cdot,\cdot)$ can only improve the preconditioner.

To gain useful insight, we consider the case where $d=2$ and $F_{K}$ is an affine transformation, that is when $K$ is a parallelogram. Without loss of generality, suppose that one of the sides of $K$ has length $2L_{1}$ and is aligned with the first reference coordinate axis, and the other side of length $2L_{2}$ is at an angle $\theta$ with respect to the same axis. The Jacobian of the coordinate transformation is

\mathrm{D}F_{K}=\begin{bmatrix}L_{1}&L_{2}\cos{\theta}\\ 0&L_{2}\sin{\theta}\end{bmatrix},

(2.31)

to which corresponds the Jacobian-weighted inverse metric

\hat{G}^{K}=\frac{1}{L_{1}L_{2}\absolutevalue{\sin{\theta}}}\begin{bmatrix}L_{% 2}^{2}&-L_{1}L_{2}\cos{\theta}\\ -L_{1}L_{2}\cos{\theta}&L_{1}^{2}\end{bmatrix}.

(2.32)

Since $\hat{G}^{K}$ is constant, $\hat{\mu}_{K}$ is simply the diagonal part of $\hat{G}^{K}$ . The spectrum of the diagonally scaled metric will be independent of $L_{1}$ and $L_{2}$ , but still depend on $\theta$ ,

\sigma\left(\hat{\mu}_{K}^{-1/2}\hat{G}^{K}\hat{\mu}_{K}^{-1/2}\right)=\left[1% -\absolutevalue{\cos{\theta}},1+\absolutevalue{\cos{\theta}}\right].

(2.33)

This spectrum is desirable because it is centered at 1 and bounded above for all $\theta$ . If we follow the geometric approaches of [20, 55], we would have to choose a rectangle with side lengths $2L_{1}$ and $2L_{2}$ as the auxiliary domain for the Poisson problem. Then, the previous bounds (2.33) would become scaled by $\absolutevalue{\sin{\theta}}^{-1}$ . In this case, the spectrum is unbounded from above in the limit $\theta\to 0$ .

2.8 Numerical experiments

We provide an implementation of the $\mathrm{P}_{p}$ element with the FDM shape functions on the interval in the FIAT [28] package. The extension to quadrilaterals and hexahedra is achieved by taking tensor products of the one-dimensional element with FInAT [23]. Code for the sum-factorized evaluation of the residual is automatically generated by Firedrake [47, 24], implementing a Gauß–Lobatto quadrature rule with $3(p+1)/2$ points along each direction. The sparse preconditioner discretizing the auxiliary form is implemented in firedrake.FDMPC using PETSc [8]. The Cholesky factorization of the patch matrices is computed using CHOLMOD [14]. Most of our computations were performed using an Intel Xeon CPU E5-4627 v2 $@$ 3.30GHz with 32 cores and 67.6 GB of RAM storage.

The hybrid $p$ -multigrid/Schwarz solver employing the FDM/sparse relaxation is illustrated in Figure 7. To achieve scalability with respect to the mesh parameter $h$ , on the $p$ -coarse problem we employ geometric multigrid with damped point-Jacobi relaxation and a Cholesky factorization on the coarsest level using MUMPS [2]. We test the effectiveness of this approach on a hierarchy of meshes obtained by $l\geq 0$ uniform refinements of the base meshes shown in Figure 8.

Figure 7: Solver diagram for the Poisson problem.

We present results for the Poisson equation in $\Omega=(0,1)^{d}$ discretized on the three hierarchies of Cartesian, unstructured, and structured but deformed (Kershaw) [27] meshes. The coordinate field of the Kershaw mesh is in $[\mathrm{Q}_{3}(\Omega)]^{d}\cap C^{1}(\Omega)$ , with a cell aspect ratio of $\varepsilon_{y}=\varepsilon_{z}=0.3$ near the corners of the domain. We impose homogeneous Dirichlet BCs on $\Gamma_{D}=\partial\Omega$ and a constant forcing $f=1$ . In Table 1 we present PCG iteration counts required to reduce the Euclidean norm of the residual by a factor of $10^{8}$ starting from a zero initial guess. In Table 2 we show the condition number $\kappa(P^{-1}A)$ estimated by CG-Lanczos. The results show almost complete $p$ - and $h$ -robustness in the Cartesian case, and very slow growth of iteration counts in the unstructured case. Given the lack of shape regularity, the Kershaw mesh is significantly more challenging; even with exact patch solvers, we do not expect $h$ - or $p$ -robustness.

Table 1: PCG iteration counts for the hybrid

p

-multigrid/Schwarz solver with the FDM/sparse relaxation. The patch problems are solved exactly on the Cartesian mesh.

		Cartesian			Unstructured			Kershaw
$d$	$p\setminus l$	0	1	2	0	1	2	0	1	2
2	3	7	8	9	12	13	14	27	35	54
	7	8	8	9	16	16	17	44	56	78
	15	8	9	9	19	19	19	58	69	90
	31	8	9	9	21	20	21	67	80	97
3	3	12	12	12	17	17	18	54	66	102
	7	12	12	12	22	21	21	98	106	158
	15	12	13		25	24		131	132

Table 2: Estimated condition numbers for the preconditioned operator

\kappa(P^{-1}A)

using the hybrid

p

-multigrid/Schwarz solver with the FDM/sparse relaxation.

		Cartesian			Unstructured			Kershaw
$d$	$p\setminus l$	0	1	2	0	1	2	0	1	2
2	3	1.44	1.49	1.50	2.14	2.37	2.81	9.34	15.6	34.5
	7	1.48	1.48	1.50	3.23	3.27	3.79	19.6	30.3	57.6
	15	1.51	1.51	1.52	4.06	3.78	4.13	30.5	45.8	69.0
	31	1.54	1.52	1.52	4.45	4.06	4.36	40.4	57.1	73.3
3	3	2.87	2.49	2.45	4.16	4.21	4.55	34.8	46.1	117
	7	2.79	2.70	2.67	5.88	5.54	5.47	100	110	266
	15	2.83	2.79		7.12	6.44		165	151

To assess the computational performance of our approach, we solve the three-dimensional Poisson equation on a Cartesian mesh with $3\times 3\times 3$ cells with a single core of an Intel Core i7-10875H CPU $@$ 2.30GHz. We plot in Figure 9 the runtimes, flop counts, and achieved arithmetic performance for the Cholesky factorization of the patch matrices, the solution of the patch problems using this factorization (per application of the relaxation), and the matrix-free evaluation of the residual (per Krylov iteration, excluding the application of the global to local map) as functions of $p$ . The dotted lines are to indicate powers of $2p-1$ , which is the number of DOFs along each side of typical vertex-star patch not intersecting the mesh boundary.

Despite the $\mathcal{O}(p^{3(d-1)})$ computational cost of the Cholesky factorization, these results show $\mathcal{O}(p^{2(d-1)})$ scaling for runtime up to $p=15$ . This speedup can be explained mainly by data locality. The sparse Cholesky factorization is obtained by recursively applying the block $LDL^{\top}$ decomposition up to the point where the Schur complement is sufficiently dense. The computation of this Schur complement via dense matrix-matrix multiplication (BLAS-3 dgemm) dominates the computational cost. As $p$ is increased, the utilization of arithmetic units increases in proportion to the dimension of the Schur complement, which explains the $\mathcal{O}(p^{d-1})$ scaling of the achieved arithmetic performance. As the arithmetic capabilities become saturated for $p>15$ , the $\mathcal{O}(p^{3(d-1)})$ scaling in the factorization runtime should become apparent.

Most of the time in the relaxation step is spent in accessing the factor matrix from memory, given the $\mathcal{O}(p^{2(d-1)})$ sub-optimal storage per patch. The relaxation is therefore limited by memory bandwidth and not arithmetically intense, which explains the poor arithmetic performance. This is in contrast to the sum-factorized residual evaluation, which has a $\mathcal{O}(p^{d})$ memory footprint and presents better arithmetic utilization [29]. Nevertheless, the results indicate that the runtime for the solution of the patch problems with the sparse Cholesky factorization remains very close to that of the matrix-free residual evaluation for moderate $p$ , being slightly faster for $p\leq 7$ , mainly due to lower operation count.

(a) Runtime

(b) Flop count

Figure 9: Runtimes, flop counts, and achieved arithmetic performance for the Cholesky factorization (symbolic and numeric), solution of the patch problems, and residual evaluation for a Cartesian mesh with

3\times 3\times 3

cells on a single CPU core. We observe that the factorization runtime scales better than expected, close to the optimal

\mathcal{O}(p^{d+1})

complexity.

3 Application to linear elasticity problems

3.1 Primal formulation of linear elasticity

We now consider how these ideas may be applied in the more complex setting of a nonseparable, vector-valued PDE. The equations of linear elasticity describe the displacement $\mathbf{u}:\Omega\to\mathbb{R}^{d}$ of a solid body with a reference configuration $\Omega\subset\mathbb{R}^{d}$ . The primal formulation is to find $\mathbf{u}-\mathbf{u}_{0}\in V\coloneqq[H^{1}_{0}(\Omega)]^{d}$ such that

a(\mathbf{v},\mathbf{u})=L(\mathbf{v})\quad\forall\,\mathbf{v}\in V,

(3.1)

where

a(\mathbf{v},\mathbf{u})=\int_{\Omega}2\mu\varepsilon(\mathbf{v}):\varepsilon(% \mathbf{u})+\lambda\nabla\cdot\mathbf{v}\nabla\cdot\mathbf{u}\leavevmode% \nobreak\ \mathrm{d}\mathbf{x},\quad L(\mathbf{v})=\int_{\Omega}\mathbf{v}% \cdot\mathbf{B}\leavevmode\nobreak\ \mathrm{d}\mathbf{x}.

(3.2)

Here, we assume that the material is homogeneous and isotropic, and can thus be described by the Lamé parameters $\mu,\lambda>0$ ; $\varepsilon(\mathbf{u})=(\nabla\mathbf{u}+\nabla\mathbf{u}^{\top})/2$ is the linearized strain tensor; $\mathbf{u}_{0}$ is Dirichlet data prescribed on $\Gamma_{D}\subseteq\partial\Omega$ ; and $\mathbf{B}\in[L^{2}(\Omega)]^{d}$ is a body force. The Poisson ratio $\nu=\lambda/(2\mu+2\lambda)$ measures the compressibility of the material. In the incompressible limit $\lambda\to\infty$ (i.e. $\nu\to 1/2$ ), the problem becomes ill-conditioned, as $a(\cdot,\cdot)$ becomes insensitive to divergence-free perturbations in the arguments.

Consider the partitioning of the stiffness matrix $A$ into blocks that act on each displacement component,

A=\begin{bmatrix}A_{11}&\cdots&A_{1d}\\ \vdots&\ddots&\vdots\\ A_{d1}&\cdots&A_{dd}\end{bmatrix}.

(3.3)

The diagonal block $A_{jj}$ discretizes the bilinear form

\int_{\Omega}\mu\nabla v_{j}\cdot\nabla u_{j}+(\mu+\lambda)\frac{\partial v_{j% }}{\partial x_{j}}\frac{\partial u_{j}}{\partial x_{j}}\leavevmode\nobreak\ % \mathrm{d}\mathbf{x},

(3.4)

where summation is not implied, and $u_{j}$ and $v_{j}$ are components of $\mathbf{u}$ and $\mathbf{v}$ , respectively. The off-diagonal blocks $A_{ij}$ , $i\neq j$ , discretize

\int_{\Omega}\mu\frac{\partial v_{i}}{\partial x_{j}}\frac{\partial u_{j}}{% \partial x_{i}}+\lambda\frac{\partial v_{i}}{\partial x_{i}}\frac{\partial u_{% j}}{\partial x_{j}}\leavevmode\nobreak\ \mathrm{d}\mathbf{x}.

(3.5)

The diagonal blocks can be diagonalized by the FDM on the interior of a Cartesian cell when the reference axes are aligned with the physical coordinates. The same statement does not hold true for the off-diagonal blocks, as they couple together different displacement components. This is because they discretize products of different first derivatives on the different components and hence are not separable.

The separate displacement components (SDC) preconditioner [11, 22] is defined as the block diagonal matrix $A_{\mathrm{SDC}}=\operatorname{diag}(A_{11},\ldots,A_{dd})$ . In other words, this approach is also described as block-Jacobi in the displacement components. The SDC preconditioner discretized with the FDM basis is sparse for Cartesian cells aligned with the coordinate axes. On arbitrary cells, for each separate component, we obtain an auxiliary form that is separable in the reference coordinates by selecting constant diagonal coefficients $\hat{\mu}_{K}$ .

It is shown in [11] that for a homogeneous isotropic material with principal axes parallel to the axes of the reference coordinate system, the condition number of the preconditioned matrix will depend on the Poisson ratio:

\kappa(A_{\mathrm{SDC}}^{-1}A)\leq\frac{d-1}{\gamma}\frac{1-\nu}{1-2\nu},

(3.6)

where $\gamma$ is the constant appearing in Korn’s inequality,

\norm{\mathbf{u}}_{H^{1}(\Omega)^{d}}^{2}\leq\gamma\int_{\Omega}\mathbf{u}% \cdot\mathbf{u}+\varepsilon(\mathbf{u}):\varepsilon(\mathbf{u})\leavevmode% \nobreak\ \mathrm{d}\mathbf{x}\quad\forall\,\mathbf{u}\in V.

(3.7)

Thus, the convergence rate of the SDC preconditioner will deteriorate for $\nu$ approaching $1/2$ , the so-called nearly incompressible case.

We consider the reference configuration $\Omega=(0,1)^{d}$ discretized on a Cartesian mesh with 8 cells along each direction. We specify $\mu=1$ , a uniform downwards body force $\mathbf{B}=-0.02\mathbf{e}_{2}$ , and homogeneous Dirichlet BCs on $\Gamma_{D}=\{\mathbf{x}\in\partial\Omega,x_{1}=0\}$ . In Table 3 we present the PCG iteration counts required to reduce the Euclidean norm of the residual by a factor of $10^{8}$ starting from a zero initial guess. As the preconditioner, we employ the hybrid $p$ -multigrid/Schwarz method with vertex-star patches and the SDC/FDM/sparse relaxation and a coarse space with $p=1$ . As expected from (3.6), the results confirm that the approach is reasonably $p$ -robust, but that robustness with respect to $\nu$ cannot be achieved with SDC relaxation on vertex-star patches.

Table 3: PCG iteration counts for the primal formulation of the linear elasticity problem using the SDC/FDM/sparse relaxation.

$d$	$p\setminus\lambda$	$0$	$10^{0}$	$10^{1}$	$10^{2}$	$10^{3}$
2	3	13	14	24	70	199
	7	17	17	28	76	236
	15	18	19	30	81	249
	31	20	20	32	84	258
3	3	20	22	39	114	362
	7	25	28	48	123	381
	15	27	29	51	125	373

3.2 Mixed FEM formulations of linear elasticity

In order to avoid locking in nearly incompressible continua, or impose the incompressibility constraint, the standard approach is to introduce a pressure-like variable and discretize with a mixed FEM. This is expressed by the weak formulation: find $(\mathbf{u}-\mathbf{u}_{0},p)\in V\times Q$ such that

	$\displaystyle a(\mathbf{v},\mathbf{u})+b(p,\mathbf{v})$	$\displaystyle=L(\mathbf{v})$	$\displaystyle\quad\forall\,\mathbf{v}\in V,$		(3.8)
	$\displaystyle b(q,\mathbf{u})-c(q,p)$	$\displaystyle=0$	$\displaystyle\quad\forall\,q\in Q,$		(3.9)

where

a(\mathbf{v},\mathbf{u})=\int_{\Omega}2\mu\varepsilon(\mathbf{v}):\varepsilon(% \mathbf{u})\leavevmode\nobreak\ \mathrm{d}\mathbf{x},\quad b(q,\mathbf{u})=% \int_{\Omega}q\leavevmode\nobreak\ \mathrm{div}(\mathbf{u})\leavevmode\nobreak% \ \mathrm{d}\mathbf{x},\quad c(q,p)=\int_{\Omega}\lambda^{-1}qp\leavevmode% \nobreak\ \mathrm{d}\mathbf{x},

(3.10)

and $Q=L^{2}_{0}(\Omega)$ for $\lambda=\infty$ and $\Gamma_{D}=\partial\Omega$ , or $Q=L^{2}(\Omega)$ otherwise.

In order for this problem to have a unique solution, we require the well-known Brezzi conditions: the solution for $\mathbf{u}$ is unique if $a(\cdot,\cdot)$ is coercive on the kernel of $b(\cdot,\cdot)$ , and the solution for $p$ is unique if there exists a right inverse for $b(\cdot,\cdot)$ . This is expressed in the so-called inf–sup condition or LBB condition [7, 12]: there exists $\beta$ , which might depend on $\Omega$ , such that

0<\beta\coloneqq\adjustlimits{\inf}_{q\in Q}{\sup}_{\mathbf{v}\in V}\frac{b(q,% \mathbf{v})}{a(\mathbf{v},\mathbf{v})^{1/2}\norm{q}_{Q}}.

(3.11)

After selecting suitable finite dimensional subspaces $V_{h}\subset V$ , $Q_{h}\subset Q$ , we obtain a system of linear equations with the saddle point structure

\begin{bmatrix}A&B^{\top}\\ B&-C\end{bmatrix}\begin{bmatrix}\underline{\mathbf{u}}\\ \underline{p}\end{bmatrix}=\begin{bmatrix}\underline{\mathbf{f}}\\ \underline{g}\end{bmatrix}.

(3.12)

We require the analogous Brezzi conditions for the discrete problem: that $a(\cdot,\cdot)$ is coercive on the discrete kernel of $b(\cdot,\cdot)$ , and that there exists a discrete inf-sup constant $\tilde{\beta}$ independent of the mesh but possibly depending on $p$ such that

0<\tilde{\beta}\coloneqq\adjustlimits{\inf}_{q_{h}\in Q_{h}}{\sup}_{\mathbf{v}% _{h}\in V_{h}}\frac{b(q_{h},\mathbf{v}_{h})}{a(\mathbf{v}_{h},\mathbf{v}_{h})^% {1/2}\norm{q_{h}}_{Q_{h}}}.

(3.13)

The discretization $V_{h}\times Q_{h}$ must be chosen carefully to satisfy these conditions; the discrete inf-sup condition will not be satisfied by arbitrary discretizations. If they are, we have the well-known error estimates

	$\displaystyle\norm{\mathbf{u}_{h}-\mathbf{u}}_{V}\leq C_{1}\left\{\inf_{% \mathbf{v}_{h}\in V_{h}}\norm{\mathbf{v}_{h}-\mathbf{u}}_{V}+\inf_{q_{h}\in Q_% {h}}\norm{q_{h}-p}_{Q}\right\},$		(3.14a)
	$\displaystyle\norm{p_{h}-p}_{Q}\leq\tilde{\beta}^{-1}C_{2}\left\{\inf_{\mathbf% {v}_{h}\in V_{h}}\norm{\mathbf{v}_{h}-\mathbf{u}}_{V}+\inf_{q_{h}\in Q_{h}}% \norm{q_{h}-p}_{Q}\right\},$		(3.14b)

where $C_{1},C_{2}>0$ are generic constants independent of the mesh parameter $h$ . For the use of high-order discretizations, it is desirable to choose an element pair where $\tilde{\beta}$ does not decrease as the polynomial degree of the approximation is increased. Such a discretization is referred to as $p$ -stable.

In fact, $p$ -stability is important for solvers also. Approaches based on block-Gaußian elimination, such as the Uzawa algorithm [5] and block-preconditioned MINRES [41], require preconditioners for the negative pressure Schur complement $S=C+BA^{-1}B^{\top}$ . It is well known that for the Stokes system, the continuous analogue of $S$ , $\nabla\cdot(-\nabla^{2})^{-1}\nabla$ , is well approximated by the identity operator [51]. It follows that $S$ is spectrally equivalent to the pressure mass matrix, $M_{p}$ ,

\beta_{0}^{2}\leq\frac{\underline{q}^{\top}S\underline{q}}{\underline{q}^{\top% }M_{p}\underline{q}}\leq\beta_{1}^{2}\quad\forall\,\underline{q}\in\mathbb{R}^% {\dim(Q_{h})}\setminus\{0\}.

(3.15)

The rate of convergence of block-preconditioned MINRES will be determined by the ratio $\beta_{1}/\beta_{0}$ . For Stokes flows with pure Dirichlet BCs, $\beta_{1}=1$ , and $\beta_{1}=\sqrt{d}$ otherwise. In general, we have $\beta_{0}=\tilde{\beta}$ . Since $A$ is spectrally equivalent to the vector Laplacian, these results also hold for linear elasticity. We may expect solvers based on such techniques to degrade with $p$ -refinement if the discretization is not $p$ -stable.

If we choose to work with the $[H^{1}(\Omega)]^{d}$ -conforming space $V_{h}=[\mathrm{Q}_{p}(\Omega)]^{d}$ , some standard inf–sup stable choices for $Q_{h}$ are $\mathrm{Q}_{p-1}(\Omega)$ , $\mathrm{DQ}_{p-2}(\Omega)$ and $\mathrm{DP}_{p-1}(\Omega)$ . $\mathrm{DQ}_{p-2}$ denotes discontinuous piecewise polynomials of degree at most $p-2$ in each direction, while $\mathrm{DP}_{p-1}$ denotes discontinuous piecewise polynomials of total degree at most $p-1$ . The choice $Q_{h}=\mathrm{Q}_{p-1}$ gives rise to the high-order generalization of the Taylor–Hood mixed element [54]. Here, $M_{p}$ will not be block diagonal, and hence more expensive preconditioning techniques will be required. Moreover, it is shown numerically in [1] that the Taylor–Hood element is not $p$ -stable. The choice $Q_{h}=\mathrm{DQ}_{p-2}$ exhibits an asymptotic decay of $\tilde{\beta}\leq Cp^{(1-d)/2}$ as $p\to\infty$ [10], and thus is not $p$ -stable. In practice, it is observed that this is quite a pessimistic bound for moderate $p$ [35]. The choice $Q_{h}=\mathrm{DP}_{p-1}$ is $p$ -stable, but numerical experiments reveal that the stability is severely affected by the cell aspect ratio, unlike the previous two choices [50]. Moreover this last space does not have tensor product shape functions, so its efficient implementation becomes challenging.

To construct $p$ -stable discretizations that are also robust to cell aspect ratio, we turn to nonconforming schemes with $V_{h}\subset H(\mathrm{div},\Omega)$ [15, 31]. In particular we consider the use of Raviart–Thomas elements [4] of degree $p$ for $V_{h}$ for the displacement, paired with $Q_{h}=\mathrm{DQ}_{p-1}$ . This pair satisfies $\mathrm{div}(V_{h})=Q_{h}$ , which enforces the incompressibility constraint (3.9) exactly in the numerical approximation for $\lambda=\infty$ . The Raviart–Thomas elements are defined on the reference quadrilateral as

\mathrm{RT}_{p}(\hat{K})=\mathrm{P}_{p}(\hat{\mathcal{I}})\otimes\mathrm{DP}_{% p-1}(\hat{\mathcal{I}})\oplus\mathrm{DP}_{p-1}(\hat{\mathcal{I}})\otimes% \mathrm{P}_{p}(\hat{\mathcal{I}}).

(3.16)

The analogous element in three dimensions is referred to as the Nédélec face element [39]. The definition can be extended to curvilinear cells via the contravariant Piola transform: for a function $\hat{\mathbf{u}}:\hat{K}\to\mathbb{R}^{d}$ , we define $\mathbf{u}:K\to\mathbb{R}^{d}$ as

\mathbf{u}=\mathcal{P}_{K}(\hat{\mathbf{u}})\coloneqq\frac{1}{\absolutevalue{% \mathrm{D}F_{K}}}\mathrm{D}F_{K}\left(\hat{\mathbf{u}}\circ F_{K}^{-1}\right),

(3.17)

and set

\mathrm{RT}_{p}(K)=\mathcal{P}_{K}\left(\mathrm{RT}_{p}(\hat{K})\right).

(3.18)

These elements have superb properties, but their nonconforming nature must be suitably addressed in the discretization. They only impose continuity of the normal components of $\mathbf{u}$ across cell facets, and we therefore weakly enforce the tangential continuity via the symmetric interior penalty (SIPG) method [3]. The use of SIPG for the displacement requires further extension of the FDM/sparse relaxation; in particular, we must consider the additional facet integrals arising in the method, and show that the stiffness remains sparse.

3.3 Extension to interior penalty DG methods

Interior penalty discontinuous Galerkin (IP-DG) methods relax the continuity requirement of the discretization space. For instance, instead of $[H^{1}(\Omega)]^{d}$ , we consider a larger function space with weaker continuity, such as $[L^{2}(\Omega)]^{d}$ or $H(\mathrm{div},\Omega)$ . As previously mentioned, in order to deal with the non-conformity, $C^{0}$ -continuity is weakly enforced via the introduction of a penalty term on the set of interior facets $\Gamma_{I}$ of the mesh $\mathcal{T}_{h}$ that vanishes for $C^{0}$ -continuous functions. Similarly, the weak prescription of the Dirichlet BC $\mathbf{u}=\mathbf{u}_{0}$ on $\Gamma_{D}$ is achieved by introducing a penalty term on $\Gamma_{D}$ .

We consider the following SIPG formulation:

\begin{split}&a(\mathbf{v},\mathbf{u})=\sum_{K\in\mathcal{T}_{h}}\int_{K}% \nabla\mathbf{v}:\mathcal{F}^{v}(\nabla\mathbf{u})\leavevmode\nobreak\ \mathrm% {d}\mathbf{x}\\ &+\sum_{e\in\Gamma_{I}\cup\Gamma_{D}}\int_{e}\eta h_{e}^{-1}\left\{G^{\top}% \right\}\left\llbracket\mathbf{v}\right\rrbracket:\left\llbracket\mathbf{u}% \right\rrbracket-\left\llbracket\mathbf{v}\right\rrbracket:\left\{\mathcal{F}^% {v}(\nabla\mathbf{u})\right\}-\left\{G^{\top}\nabla\mathbf{v}\right\}:\left% \llbracket\mathbf{u}\right\rrbracket\leavevmode\nobreak\ \mathrm{d}s,\end{split}

(3.19)

L(\mathbf{v})=\int_{\Omega}\mathbf{v}\cdot\mathbf{B}\leavevmode\nobreak\ % \mathrm{d}\mathbf{x}+\int_{\Gamma_{D}}\eta h_{e}^{-1}G^{\top}(\mathbf{v}% \otimes\mathbf{n}):(\mathbf{u}_{0}\otimes\mathbf{n})-G^{\top}\nabla\mathbf{v}:% (\mathbf{u}_{0}\otimes\mathbf{n})\leavevmode\nobreak\ \mathrm{d}s.

(3.20)

Here $\mathcal{F}^{v}(\nabla\mathbf{u})$ is a linear viscous flux. For the vector Poisson equation, the viscous flux is given by $\mathcal{F}^{v}(\nabla\mathbf{u})=\nabla\mathbf{u}$ . For the primal formulation of linear elasticity, the viscous flux corresponds to the stress tensor $\mathcal{F}^{v}(\nabla\mathbf{u})=\mu(\nabla\mathbf{u}+\nabla\mathbf{u}^{\top}% )+\lambda\nabla\cdot\mathbf{u}{\mathds{1}}$ . For the mixed formulation of linear elasticity, the $(1,1)$ -block of the system has viscous flux $\mathcal{F}^{v}(\nabla\mathbf{u})=\mu(\nabla\mathbf{u}+\nabla\mathbf{u}^{\top})$ .

From left to right, the terms in the surface integral in (3.19) are referred to as the penalty, consistency, and adjoint consistency terms. The quantity $G$ is known as the homogeneity tensor,

G_{ijkl}=\frac{\partial}{\partial u_{k,l}}[\mathcal{F}^{v}(\nabla\mathbf{u})]_% {ij},

(3.21)

for which we define the adjoint product with $\nabla\mathbf{v}$

[G^{\top}\nabla\mathbf{v}]_{kl}=G_{ijkl}v_{i,j}.

(3.22)

The average $\left\{\cdot\right\}$ and jump $\left\llbracket\cdot\right\rrbracket$ operators are defined for scalar, vector, and tensor arguments as follows. Let $e$ be a facet of the mesh. For an interior facet, let $K^{-}$ and $K^{+}$ be the two mesh cells that share it, and let $\mathbf{w}^{-}$ and $\mathbf{w}^{+}$ be the traces of a function $\mathbf{w}$ on $e$ from $K^{-}$ and $K^{+}$ , respectively. On each facet we define

\left\{\mathbf{w}\right\}_{e}=\begin{cases}\frac{1}{2}(\mathbf{w}^{-}+\mathbf{% w}^{+})&e\in\Gamma_{I},\\ \mathbf{w}&\mbox{otherwise},\end{cases}\quad\left\llbracket\mathbf{w}\right% \rrbracket_{e}=\begin{cases}\mathbf{w}^{-}\otimes\mathbf{n}^{-}+\mathbf{w}^{+}% \otimes\mathbf{n}^{+}&e\in\Gamma_{I},\\ \mathbf{w}\otimes\mathbf{n}&\mbox{otherwise}.\end{cases}

(3.23)

In order to ensure coercivity of $a(\cdot,\cdot)$ as we do $h$ or $p$ refinement, the penalty term must be sufficiently large. The penalty coefficient $\eta h_{e}^{-1}$ must be chosen inversely proportional to the normal spacing of GLL nodes near the facet, i.e. $\eta=\mathcal{O}(p(p+1))$ [37]. For the reciprocal length scale in the direction normal to facet $e$ we use

h_{e}^{-1}\coloneqq\absolutevalue{e}\left\{\absolutevalue{K}^{-1}\right\}_{e}.

(3.24)

The stiffness matrix that corresponds to $a(\cdot,\cdot)$ in the SIPG formulation is obtained via direct stiffness summation over the cells and facets:

A=\sum_{K\in\mathcal{T}_{h}}R_{K}^{\top}A^{K}R_{K}+\sum_{e\in\Gamma_{I}\cup% \Gamma_{D}}R_{e}^{\top}A^{e}R_{e},

(3.25)

where $A^{K}$ is the cell matrix discretizing the volume integral in $K$ , $R_{K}$ is the Boolean restriction onto the DOFs of $K$ , $A^{e}$ is the facet matrix discretizing the surface integral on $e$ , and $R_{e}$ is Boolean restriction onto the DOFs of the cells sharing facet $e$ .

To illustrate the extension of our approach to the SIPG discretization, we consider again the scalar Poisson equation. The discrete problem is to find $u_{h}\in V_{h}=\mathrm{DQ}_{p}(\Omega)\subset L^{2}(\Omega)$ . On Cartesian cells, both $A^{K}$ and $A^{e}$ have a tensor product structure of the form (2.12), with matrices of operators on the interval that can be sparsified by the FDM. To illustrate this, suppose that, for $e\in\Gamma_{I}$ , $R_{e}$ reorders the DOFs such that the cells $K^{-}$ and $K^{+}$ share $e$ along the $d$ -th reference coordinate axis, while leaving the other axes consistently oriented on both cells. The facet matrices are

A^{e}=\begin{cases}E^{e}\otimes\hat{B}&\mbox{if\leavevmode\nobreak\ }d=2,\\ E^{e}\otimes\hat{B}\otimes\hat{B}&\mbox{if\leavevmode\nobreak\ }d=3,\end{cases}

(3.26)

where the interval facet matrix $E^{e}$ is defined in terms of the coefficients $\mu^{K}_{j}$ appearing in (2.12), the 1D shape functions $\{\hat{\phi}_{j}\}$ , and their normal derivatives $\frac{\partial}{\partial n}\hat{\phi}_{j}$ on $\partial\hat{\mathcal{I}}$ (the usual derivative with a sign). When $e\in\Gamma_{D}$ , $E^{e}\in\mathbb{R}^{(p+1)\times(p+1)}$ is given by

[E^{e}]_{ij}=\mu^{e}\left(\eta\hat{\phi}_{i}(\hat{x}^{e})\hat{\phi}_{j}(\hat{x% }^{e})-\hat{\phi}_{i}(\hat{x}^{e})\tfrac{\partial}{\partial n}\hat{\phi}_{j}(% \hat{x}^{e})-\tfrac{\partial}{\partial n}\hat{\phi}_{i}(\hat{x}^{e})\hat{\phi}% _{j}(\hat{x}^{e})\right).

(3.27)

Here $\mu^{e}=\mu^{K}_{l}$ , where $\hat{x}_{l}$ is the reference coordinate normal to $e$ , and $\hat{x}^{e}\in\partial\hat{\mathcal{I}}$ describes the facet $e$ as the image of the plane $\hat{x}_{l}=\hat{x}^{e}$ under $F_{K}$ . When $e\in\Gamma_{I}$ , $E^{e}$ is a $2\times 2$ block matrix with blocks $E^{e}_{rs}\in\mathbb{R}^{(p+1)\times(p+1)}$ , $r,s\in\{0,1\}$ , given by

[E^{e}_{rs}]_{ij}=\tfrac{(-1)^{r-s}}{2}\left(\eta(\mu^{e}_{0}+\mu^{e}_{1})\hat% {\phi}_{i}(\hat{x}^{e}_{r})\hat{\phi}_{j}(\hat{x}^{e}_{s})-\mu^{e}_{s}\hat{% \phi}_{i}(\hat{x}^{e}_{r})\tfrac{\partial}{\partial n}\hat{\phi}_{j}(\hat{x}^{% e}_{s})-\mu^{e}_{r}\tfrac{\partial}{\partial n}\hat{\phi}_{i}(\hat{x}^{e}_{r})% \hat{\phi}_{j}(\hat{x}^{e}_{s})\right).

(3.28)

Here $\mu^{e}_{0}=\mu^{K^{-}}_{l},\mu^{e}_{1}=\mu^{K^{+}}_{m}$ , where $\hat{x}_{l}$ and $\hat{x}_{m}$ are the reference directions normal to $e$ on $K^{-}$ and $K^{+}$ , respectively. Similarly, the facet $e$ is the image of $\hat{x}_{l}=\hat{x}^{e}_{0}$ under $F_{K^{-}}$ and that of $\hat{x}_{m}=\hat{x}^{e}_{1}$ under $F_{K^{+}}$ .

Some implementations of $\mathrm{DQ}_{p}$ do not feature an interior-interface decomposition and use the Gauß–Legendre (GL) nodal shape functions. The GL nodes do not include the endpoints, thus all shape functions have non-zero support at the facets, causing $E^{e}$ to be dense. The matrices $E^{e}$ are sparse for a basis with an interior-interface decomposition, such as the GLL Lagrange polynomials $\{\ell_{j}\}$ , the hierarchical Lobatto polynomials $\{l_{j}\}$ , and the FDM polynomials $\{\hat{s}_{j}\}$ . Since $\hat{\phi}_{j}(\pm 1)$ is non-zero for a single $j\in\{0,p\}$ , each term in (3.28) and (3.27) corresponds to a non-zero entry, a non-zero row, and a non-zero column of $E^{e}$ , respectively, as seen in Figure 10.

Instead of diagonalizing the SIPG patch matrix as in [55], our FDM-based approach produces a sparse matrix with diagonal interior blocks on (possibly) unstructured vertex-star patches. Figure 11 shows the sparsity pattern of the matrix for the SIPG formulation of the Poisson equation on a Cartesian vertex-star patch in the FDM basis, along with its Cholesky factor. Here the matrix size is increased from $(2p-1)^{d}$ DOFs in the CG case to $(2p+2)^{d}$ . At low polynomial degrees, the interface DOFs form a large fraction of the total number, but the proportion decreases as $p$ increases. The computational complexity analysis of Section 2.6 carries over to the SIPG case.

For a general viscous flux, we construct an auxiliary separable form by expressing the cell integrals in terms of the reference coordinates

a_{K}(\mathbf{v},\mathbf{u})=\int_{K}G_{ijkl}^{K}\frac{\partial v_{i}}{% \partial x_{j}}\frac{\partial u_{k}}{\partial x_{l}}\leavevmode\nobreak\ % \mathrm{d}\mathbf{x}=\int_{\hat{K}}\hat{G}_{ijkl}^{K}\frac{\partial}{\partial% \hat{x}_{j}}\left(v_{i}\circ F_{K}\right)\frac{\partial}{\partial\hat{x}_{l}}% \left(u_{k}\circ F_{K}\right)\leavevmode\nobreak\ \mathrm{d}\hat{\mathbf{x}},

(3.29)

where $\hat{G}^{K}$ is the homogeneity tensor in the reference coordinates,

\hat{G}_{ijkl}^{K}=\absolutevalue{\det(\mathrm{D}F_{K})}[\mathrm{D}F_{K}^{-1}]% _{jm}[\mathrm{D}F_{K}^{-1}]_{ln}G_{imkn}^{K}.

(3.30)

The auxiliary form $\tilde{a}(\cdot,\cdot)$ is constructed by approximating $\hat{G}_{ijkl}^{K}$ with a piecewise constant tensor that discards the entries where $i\neq k$ or $j\neq l$ . The corresponding cell stiffness matrices become sparse in the FDM basis, and have a similar form as (2.12), except that the coefficients $\mu^{K}_{j}$ are diagonal matrices that multiply each term through an additional Kronecker product. Hence, we expect that preconditioners based on the auxiliary form to be limited by the coupling between vector components, by the mesh geometry, and by how $G^{K}$ varies within $K$ .

This approach carries over to non-Cartesian cells for DG discretizations of the Poisson equation in the same way as the CG case. Unfortunately, the extension of our approach to vector-valued problems in $H(\mathrm{div})$ on non-Cartesian cells does not yield a good preconditioner. In this setting, we construct a block diagonal preconditioner separating the components of the DOFs, which are in the reference coordinates. For the vector Poisson problem on non-Cartesian cells, the Piola transform introduces off-diagonal contributions from the volume and surface terms, which does not occur on non-Piola-mapped elements, such as $[\mathrm{Q}_{p}]^{d}$ and $[\mathrm{DQ}_{p}]^{d}$ . The excluded terms are required for the surface integral terms to vanish for arguments with $C^{0}$ continuity, and without them the preconditioner might become indefinite on non-Cartesian cells.

3.4 Results for mixed formulations of linear elasticity

We consider the same problem as in Table 3, with both a conforming $[\mathrm{Q}_{p}]^{d}\times\mathrm{DQ}_{p-2}$ discretization and a non-conforming $\mathrm{RT}_{p}\times\mathrm{DQ}_{p-1}$ discretization. For the $H(\mathrm{div})$ -conforming discretization, the normal components of the Dirichlet BCs are enforced strongly, while the tangential components of the BCs are weakly enforced with SIPG. Enforcing the normal conditions strongly is crucial for achieving a divergence-free solution in the Stokes limit $\lambda=\infty$ . For the penalty coefficient, we use $\eta=p(p+1)$ . We restrict our experiment to Cartesian cells, so that the FDM/sparse relaxation is applicable to the $H(\mathrm{div})$ -conforming discretization.

We iteratively solve the discrete system $\eqref{eq:saddle}$ via MINRES with a symmetric positive definite block diagonal preconditioner,

\mathcal{P}_{\mathrm{diag}}=\begin{bmatrix}P_{1}&0\\ 0&P_{2}\end{bmatrix}.

(3.31)

Here $P_{1}$ is a preconditioner for the displacement block $A$ , and $P_{2}$ is a preconditioner for the scaled pressure mass matrix $(\mu^{-1}+\lambda^{-1})M_{p}$ . For $P_{1}$ we employ the hybrid $p$ -multigrid/Schwarz method with the SDC/FDM/sparse relaxation and $[\mathrm{Q}_{1}]^{d}$ as the coarse space. In our tests, we discretize the pressure space with the GL basis, and employ point-Jacobi on the pressure mass matrix, i.e. $P_{2}=(\mu^{-1}+\lambda^{-1})\operatorname{diag}(M_{p})$ . When $\mathcal{T}_{h}$ consists of Cartesian cells, $M_{p}=\operatorname{diag}(M_{p})$ in the GL basis. The solver is illustrated in Figure 12.

Figure 12: Solver diagram for the mixed linear elasticity problem.

In Table 4 we present MINRES iteration counts for the same configuration considered in Table 3 in Section 3.1, using the $[\mathrm{Q}_{p}]^{d}\times\mathrm{DQ}_{p-2}$ and $\mathrm{RT}_{p}\times\mathrm{DQ}_{p-1}$ elements, respectively. Both discretizations yield robust iteration counts with respect to $\lambda$ ; the iterations grow with the former discretization much more quickly than the latter, especially in 3D.

Table 4: MINRES iteration counts for the mixed linear elasticity problem, using the solver in Figure 12.

		$[\mathrm{Q}_{p}]^{d}\times\mathrm{DQ}_{p-2}$					$\mathrm{RT}_{p}\times\mathrm{DQ}_{p-1}$
$d$	$p\setminus\lambda$	$10^{0}$	$10^{1}$	$10^{2}$	$10^{3}$	$\infty$	$10^{0}$	$10^{1}$	$10^{2}$	$10^{3}$	$\infty$
2	3	28	40	43	43	43	25	36	39	40	40
	7	31	45	50	51	51	28	40	43	45	45
	15	34	50	57	57	57	30	43	48	48	48
	31	36	53	64	65	65	31	45	51	51	51
3	3	44	67	75	76	76	34	50	55	56	56
	7	50	83	96	97	98	39	58	63	65	65
	15	53	88	111	118	119	41	63	70	70	70

The solver configuration shown in Figure 12 is optimized for memory usage, employing a block diagonal preconditioner so that the short-term recurrences of MINRES may be exploited. If one is willing to trade memory for time, one may consider an alternative configuration shown in Figure 13 employing right-preconditioned GMRES [49] with a block upper triangular preconditioner,

\mathcal{P}_{\mathrm{upper}}=\begin{bmatrix}P_{1}&B^{\top}\\ 0&-P_{2}\end{bmatrix},

(3.32)

which requires a single application each of $P_{1}^{-1}$ , $P_{2}^{-1}$ , and $B^{\top}$ per GMRES iteration. The GMRES iteration counts are presented in Table 5.

Figure 13: Solver diagram for the mixed linear elasticity problem that trades memory for iteration counts.

Table 5: GMRES iteration counts for the mixed linear elasticity problem, using the solver in Figure 13.

		$[\mathrm{Q}_{p}]^{d}\times\mathrm{DQ}_{p-2}$					$\mathrm{RT}_{p}\times\mathrm{DQ}_{p-1}$
$d$	$p\setminus\lambda$	$10^{0}$	$10^{1}$	$10^{2}$	$10^{3}$	$\infty$	$10^{0}$	$10^{1}$	$10^{2}$	$10^{3}$	$\infty$
2	3	17	23	25	26	26	14	20	22	22	22
	7	18	25	27	28	28	16	22	24	24	24
	15	20	27	33	34	34	17	23	26	26	26
	31	21	30	38	38	39	18	24	28	28	28
3	3	24	32	33	34	34	17	23	25	25	25
	7	27	35	38	38	38	21	26	29	29	29
	15	28	38	44	46	46	22	28	31	31	32

In Table 6 we study the performance of our solver on an unstructured mesh. We consider the $[\mathrm{Q}_{p}]^{d}\times\mathrm{DQ}_{p-2}$ discretization of incompressible linear elasticity ( $\lambda=\infty$ ). We prescribe $\mu=1$ , a uniform downwards body force $\mathbf{B}=-0.02\mathbf{e}_{2}$ , and homogeneous Dirichlet BCs on the displacement on the holes of the domain. The three-dimensional mesh is obtained via extrusion by 16 layers of the two-dimensional mesh. The iteration counts follow the same pattern as before for this element: they are not $p$ -robust as expected, but they remain modest even at very high degrees.

[Uncaptioned image] — Table 6: GMRES iteration counts for the mixed formulation of the incompressible linear elasticity problem on the unstructured mesh shown here, using the solver in Figure 13.

4 Conclusion

We have introduced a fast relaxation method required for unstructured vertex-centered patch problems arising in Pavarino’s approach, extending its practicality to much higher polynomial degrees. Our method relies on a spectrally equivalent form, constructed such that it is separable in the reference coordinates. We show promising results for the Poisson equation and mixed formulations of linear elasticity. A downside of the approach is its narrow applicability; it will not be effective on more general problems, especially for those where the dominant terms include mixed derivatives and mixed vector components. In addition, our method relies on having a good quality mesh, with its performance depending on the minimal angle; however, mesh generators with guarantees on the minimal angle are available in two dimensions [32]. So far, we have only considered constant-coefficient problems, but the theory of [6] suggests that our approach would remain effective for spatially varying coefficients. Work in progress shows that the memory bandwidth limitations and the suboptimal complexity of the sparse Cholesky factorization can be overcome by iterative patch solvers, such as incomplete factorizations and algebraic multigrid.

References

[1] M. Ainsworth, P. Coggins, and B. Senior, Mixed $hp$ -finite element methods for incompressible flow, Chapman and Hall CRC Research Notes in Mathematics, (2000), pp. 1–20.
[2] P. R. Amestoy, I. S. Duff, J.-Y. L’Excellent, and J. Koster, A fully asynchronous multifrontal solver using distributed dynamic scheduling, SIAM J. Matrix Anal. Appl., 23 (2001), pp. 15–41.
[3] D. N. Arnold, An interior penalty finite element method with discontinuous elements, SIAM J. Numer. Anal., 19 (1982), pp. 742–760.
[4] D. N. Arnold, D. Boffi, and R. S. Falk, Quadrilateral $H$ (div) finite elements, SIAM J. Numer. Anal., 42 (2005), pp. 2429–2451.
[5] K. J. Arrow, L. Hurwicz, and H. Uzawa, Studies in Non-linear Programming, Stanford University Press, 1958.
[6] O. Axelsson and J. Karátson, Equivalent operator preconditioning for elliptic problems, Numer. Algorithms, 50 (2009), pp. 297–380.
[7] I. Babuška, The finite element method with Lagrangian multipliers, Numer. Math., 20 (1973), pp. 179–192.
[8] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. D. Gropp, D. Karpeyev, D. Kaushik, M. G. Knepley, D. A. May, L. C. McInnes, R. T. Mills, T. Munson, K. Rupp, P. Sanan, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang, PETSc users manual, Tech. Report ANL-95/11 - Revision 3.11, Argonne National Laboratory, 2019.
[9] P. D. Bello-Maldonado and P. F. Fischer, Scalable low-order finite element preconditioners for high-order spectral element Poisson solvers, SIAM J. Sci. Comput., 41 (2019), pp. S2–S18.
[10] C. Bernardi, Y. Maday, and B. Métivet, Spectral approximation of the periodic-nonperiodic Navier-Stokes equations, Numer. Math., 51 (1987), pp. 655–700.
[11] R. Blaheta, Displacement decomposition—incomplete factorization preconditioning techniques for linear elasticity problems, Numer. Lin. Algebra Appl., 1 (1994), pp. 107–128.
[12] F. Brezzi, On the existence, uniqueness and approximation of saddle-point problems arising from Lagrangian multipliers, ESAIM Math. Model. Num., 8 (1974), pp. 129–151.
[13] C. Canuto, Stabilization of spectral methods by finite element bubble functions, Comput. Methods Appl. Mech. Engrg., 116 (1994), pp. 13–26.
[14] Y. Chen, T. A. Davis, W. W. Hager, and S. Rajamanickam, Algorithm 887: CHOLMOD, Supernodal Sparse Cholesky Factorization and Update/Downdate, ACM Trans. Math. Softw., 35 (2008).
[15] B. Cockburn, G. Kanschat, and D. Schötzau, A note on discontinuous Galerkin divergence-free solutions of the Navier–Stokes equations, J. Sci. Comput., 31 (2007), pp. 61–73.
[16] W. Couzy and M. O. Deville, A fast Schur complement method for the spectral element discretization of the incompressible Navier-Stokes equations, J. Comp. Phys., 116 (1995), pp. 135 – 142.
[17] L. T. Diosady and S. M. Murman, Scalable tensor-product preconditioners for high-order finite-element methods: Scalar equations, J. Comp. Phys., 394 (2019), pp. 759–776.
[18] M. Dryja and O. Widlund, An additive variant of the Schwarz alternating method for the case of many subregions, Technical Report 339, Ultracomputer Note 131, Department of Computer Science, Courant Institute, 1987.
[19] P. E. Farrell, M. G. Knepley, L. Mitchell, and F. Wechsung, PCPATCH: Software for the Topological Construction of Multigrid Relaxation Methods, ACM Trans. Math. Softw., 47 (2021).
[20] P. F. Fischer, H. M. Tufo, and N. I. Miller, An overlapping Schwarz method for spectral element simulation of three-dimensional incompressible flows, in Parallel Solution of Partial Differential Equations, Springer, 2000, pp. 159–180.
[21] A. George, Nested dissection of a regular finite element mesh, SIAM J. Numer. Anal., 10 (1973), pp. 345–363.
[22] I. Gustafsson and G. Lindskog, On parallel solution of linear elasticity problems: Part I: theory, Numer. Lin. Algebra Appl., 5 (1998), pp. 123–139.
[23] M. Homolya, R. C. Kirby, and D. A. Ham, Exposing and exploiting structure: optimal code generation for high-order finite element methods, arXiv preprint arXiv:1711.02473, (2017).
[24] M. Homolya, L. Mitchell, F. Luporini, and D. A. Ham, TSFC: a structure-preserving form compiler, SIAM J. Sci. Comput., 40 (2018), pp. C401–C428.
[25] I. Huismann, J. Stiller, and J. Fröhlich, Scaling to the stars–a linearly scaling elliptic solver for $p$ -multigrid, J. Comp. Phys., 398 (2019), p. 108868.
[26] G. Karniadakis and S. Sherwin, Spectral/ $hp$ element methods for computational fluid dynamics, Oxford University Press, 2013.
[27] D. S. Kershaw, Differencing of the diffusion equation in Lagrangian hydrodynamic codes, J. Comp. Phys., 39 (1981), pp. 375–395.
[28] R. C. Kirby, Algorithm 839: FIAT, a new paradigm for computing finite element basis functions, ACM Trans. Math. Soft., 30 (2004), pp. 502–516.
[29] M. Kronbichler and W. A. Wall, A Performance Comparison of Continuous and Discontinuous Galerkin Methods with Fast Multigrid Solvers, SIAM J. Sci. Comput., 40 (2018), pp. A3423–A3448.
[30] C. Lanczos, An iteration method for the solution of the eigenvalue problem of linear differential and integral operators, (1950).
[31] P. L. Lederer and J. Schöberl, Polynomial robust stability analysis for $H(\mathrm{div})$ -conforming finite elements for the Stokes equations, IMA J. Numer. Anal., 38 (2018), pp. 1832–1860.
[32] X. Liang and Y. Zhang, Hexagon-based all-quadrilateral mesh generation with guaranteed angle bounds, Comput. Methods Appl. Mech. Engrg., 200 (2011), pp. 2005–2020.
[33] J. W. Lottes and P. F. Fischer, Hybrid multigrid/Schwarz algorithms for the spectral element method, J. Sci. Comput., 24 (2005), pp. 45–78.
[34] R. E. Lynch, J. R. Rice, and D. H. Thomas, Direct solution of partial difference equations by tensor product methods, Numer. Math., 6 (1964), pp. 185–199.
[35] Y. Maday, D. Meiron, A. T. Patera, and E. M. Rønquist, Analysis of iterative methods for the steady and unsteady Stokes problem: Application to spectral element discretizations, SIAM J. Sci. Comput., 14 (1993), pp. 310–337.
[36] J.-F. Maitre and O. Pourquier, Condition number and diagonal preconditioning: comparison of the $p$ -version and the spectral element methods, Numer. Math., 74 (1996), pp. 69–84.
[37] J. Manzanero, A. M. Rueda-Ramírez, G. Rubio, and E. Ferrer, The Bassi Rebay 1 scheme is a special case of the symmetric interior penalty formulation for discontinuous Galerkin discretisations with Gauss–Lobatto points, J. Comp. Phys., 363 (2018), pp. 1–10.
[38] J. R. Munkres, Elements of Algebraic Topology, CRC Press, 1984.
[39] J.-C. Nédélec, Mixed finite elements in $\mathbb{R}^{3}$ , Numer. Math., 35 (1980), pp. 315–341.
[40] S. A. Orszag, Spectral methods for problems in complex geometries, J. Comp. Phys., 37 (1980), pp. 70 – 92.
[41] C. C. Paige and M. A. Saunders, Solution of sparse indefinite systems of linear equations, SIAM J. Numer. Anal., 12 (1975), pp. 617–629.
[42] L. F. Pavarino, Additive Schwarz methods for the $p$ -version finite element method, Numer. Math., 66 (1993), pp. 493–515.
[43] L. F. Pavarino, E. Zampieri, R. Pasquetti, and F. Rapetti, Overlapping Schwarz methods for Fekete and Gauss–Lobatto spectral elements, 29 (2007), pp. 1073–1092.
[44] W. Pazner, Efficient low-order refined preconditioners for high-order matrix-free continuous and discontinuous Galerkin methods, SIAM J. Sci. Comput., 42 (2020), pp. A3055–A3083.
[45] W. Pazner and P.-O. Persson, Approximate tensor-product preconditioners for very high order discontinuous Galerkin methods, J. Comp. Phys., 354 (2018), pp. 344–369.
[46] W. Pazner and P.-O. Persson, Interior penalty tensor-product preconditioners for high-order discontinuous Galerkin discretizations, in 2018 AIAA Aerospace Sciences Meeting, 2018, p. 1093.
[47] F. Rathgeber, D. A. Ham, L. Mitchell, M. Lange, F. Luporini, A. T. T. McRae, G.-T. Bercea, G. R. Markall, and P. H. J. Kelly, Firedrake: automating the finite element method by composing abstractions, ACM Trans. Math. Soft., 43 (2016), pp. 24:1–24:27.
[48] J.-F. Remacle, R. Gandham, and T. Warburton, GPU accelerated spectral finite elements on all-hex meshes, J. Comp. Phys., 324 (2016), pp. 246–257.
[49] Y. Saad and M. H. Schultz, GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems, SIAM J. Sci. Stat. Comput., 7 (1986), pp. 856–869.
[50] D. Schötzau and C. Schwab, Mixed $hp$ -FEM on anisotropic meshes, Math. Models Methods Appl. Sci., 8 (1998), pp. 787–820.
[51] D. Silvester and A. Wathen, Fast iterative solution of stabilised Stokes systems Part II: using general block preconditioners, SIAM J. Numer. Anal., 31 (1994), pp. 1352–1367.
[52] J. Stiller, Robust multigrid for Cartesian interior penalty DG formulations of the Poisson equation in 3D, in Spectral and High Order Methods for Partial Differential Equations ICOSAHOM 2016, M. L. Bittencourt, N. A. Dumont, and J. S. Hesthaven, eds., Cham, 2017, Springer International Publishing, pp. 189–201.
[53] B. Szabó and I. Babuška, Finite element analysis, John Wiley & Sons, 1991.
[54] C. Taylor and P. Hood, A numerical solution of the Navier-Stokes equations using the finite element technique, Comput. Fluids, 1 (1973), pp. 73–100.
[55] J. Witte, D. Arndt, and G. Kanschat, Fast tensor product Schwarz smoothers for high-order discontinuous Galerkin methods, J. Comput. Methods Appl. Math., (2020).
[56] J. Xu, The auxiliary space method and optimal multigrid preconditioning techniques for unstructured grids, Computing, 56 (1996), pp. 215–235.

$d$	$p$	# DOFs	Iter.
2	3	17 466	26
	7	104 250	29
	15	499 002	35
	31	2 173 242	44
3	3	588 927	39
	7	7 876 575	47
	11	31 236 927	51

Abstract

keywords:

1 Introduction

2 Sparse Poisson solver

2.1 Continuous Galerkin formulation

2.2 Tensor product structure on Cartesian cells

2.3 The fast diagonalization method

2.4 Sparse FDM discretization

2.5 Hybrid p𝑝pitalic_p-multigrid/Schwarz method

2.6 Computational complexity

2.7 Extension to non-Cartesian cells

Theorem 2.1.

Proof 2.2.

2.8 Numerical experiments

3 Application to linear elasticity problems

3.1 Primal formulation of linear elasticity

3.2 Mixed FEM formulations of linear elasticity

3.3 Extension to interior penalty DG methods

3.4 Results for mixed formulations of linear elasticity

4 Conclusion

References

2.5 Hybrid $p$ -multigrid/Schwarz method