
Numerical Analysis

Lecture Notes 2024/25


Master Degree in Civil Engineering

Politecnico di Milano

Luca Dede’

October 20, 2024


Contents

1 Introduction to Numerical Analysis and Scientific Computing
  1.1 Mathematical Models and Scientific Computing
    1.1.1 Mathematical models
    1.1.2 Analytical methods
    1.1.3 Numerical methods
  1.2 Representation of Real Numbers and Computer Operations
    1.2.1 Floating–point numbers
    1.2.2 Floating–point arithmetic
  1.3 From the Mathematical Problem to the Numerical Problem
    1.3.1 Properties of the mathematical problem
    1.3.2 Properties of the numerical problem

2 Numerical Solution of Linear Systems
  2.1 Goals, Examples, and Methods Classification
  2.2 Direct Methods
    2.2.1 "Simple" linear systems
    2.2.2 LU factorization method
    2.2.3 Cholesky factorization method
    2.2.4 Thomas algorithm
    2.2.5 Accuracy of the numerical solution obtained by direct methods
    2.2.6 The Matlab® command \
  2.3 Iterative Methods
    2.3.1 The general algorithm
    2.3.2 Splitting methods
    2.3.3 Jacobi and Gauss–Seidel methods
    2.3.4 Richardson methods
    2.3.5 Gradient methods
    2.3.6 Conjugate gradient methods
    2.3.7 Computational error and stopping criteria
  2.4 A (Brief) Comparison Between Direct and Iterative Methods

3 Approximation of Zeros of Nonlinear Equations and Systems
  3.1 Newton Methods
    3.1.1 Newton method
    3.1.2 Modified Newton method
    3.1.3 Stopping criteria for Newton methods
    3.1.4 Inexact and quasi–Newton methods
    3.1.5 Newton methods for systems of nonlinear equations
  3.2 Fixed Point Iterations
    3.2.1 Fixed point iterations for scalar functions
    3.2.2 Stopping criterion for fixed point iterations
    3.2.3 Newton methods as fixed point iterations methods
    3.2.4 Fixed point iterations for vector–valued functions

4 Approximation of Functions and Data
  4.1 Motivations and Examples
  4.2 Interpolation
    4.2.1 Lagrange polynomial interpolation
    4.2.2 Piecewise polynomial interpolation
  4.3 Least Squares Method

5 Numerical Differentiation and Integration
  5.1 Numerical Differentiation
    5.1.1 Finite difference schemes for the first derivative
    5.1.2 Finite difference scheme for the second derivative
  5.2 Numerical Integration
    5.2.1 Midpoint quadrature formulas
    5.2.2 Trapezoidal quadrature formulas
    5.2.3 Simpson's quadrature formulas
    5.2.4 Interpolatory quadrature formulas
    5.2.5 Gaussian quadrature formulas
    5.2.6 Numerical integration in dimension d > 1

Bibliography

Copyright © Luca Dede’ 2024


Chapter 1

Introduction to Numerical Analysis and Scientific Computing

The course introduces fundamental concepts of Numerical Analysis, focusing on Scientific Comput-
ing, numerical simulation, and numerical methods for solving Partial Differential Equations, particularly
through the Finite Element method. It covers both theoretical and practical aspects, aiming to equip stu-
dents with the skills necessary to numerically solve mathematical problems that are relevant to Civil En-
gineering. The objectives of the course include developing competencies in the following key topics:
numerical linear algebra, numerical solution of nonlinear equations, approximations of functions and data,
numerical differentiation and integration, numerical solution of Ordinary Differential Equations (ODEs),
boundary value problems, and Partial Differential Equations (PDEs) in 1D, particularly using the Finite
Element (FE) method.
Most of the content of these lecture notes refers to [1, 2, 3]. The course will feature the use of the
software library Matlab® [4], a multi–paradigm programming language and numerical computing environment
developed by MathWorks, to (i) solve mathematical and numerical problems; (ii) perform numerical sim-
ulations; and (iii) handle postprocessing, as well as visualization of the numerical solution and scientific
computing data.

1.1 Mathematical Models and Scientific Computing


A mathematical model is a set of (algebraic, differential, or integral) equations that is able to represent the
features of a complex (physical) system or process. Models are developed to describe, forecast, and control
the behavior or evolution of such systems and processes. Mathematical modeling assumes nowadays a
crucial role in the description of several phenomena in Engineering and applied Sciences. The computa-
tional power available through calculators enables a widespread use of numerical methods for simulating
mathematical models. In Engineering, for example, computational tools are a key factor for the design of
products, risk assessment, and technological developments.
Physics–based models indicate those mathematical models that are obtained from physical principles
– like conservation laws of mass, momentum, energy, etc. – and that encode natural laws often leading
to differential equations whose solutions are typically represented in the form of functions. When the
mathematical model is in the form of differential equations, its analytical solution is rarely available in
closed form, for which numerical approximation methods are instead employed. Numerical modeling
indicates sets of numerical methods that determine an approximate solution of the original (often infinite-
dimensional) mathematical model, by turning it into a discrete problem (algebraic, finite-dimensional),
whose dimension (size) is typically very large. Numerical Analysis is a branch of Mathematics focused on
approximating mathematical models, constructing numerical solutions, and analyzing the theoretical prop-
erties of these numerical methods. Scientific Computing can be viewed as a branch of Numerical Analysis
that extensively uses computers to solve mathematical models numerically. For large-scale numerical
problems, parallel architectures and High Performance Computing frameworks are typically employed.
See the sketches in Figs. 1.1 and 1.2.

Figure 1.1: Numerical Analysis: physical, mathematical, and numerical problems and solutions
Mathematical models are conventionally used together with theoretical analyses and
experimental tests. In several cases, however, theoretical results are not available – like in Computational
Mechanics and Fluid Dynamics – or experimental tests are not meaningful or cannot be performed (for
example, for nuclear testing, extreme natural events such as earthquakes, etc.). Physics–based models have
witnessed an increasing role in modern society by virtue of the massive developments of Scientific
Computing and computational tools starting from the 1950s and 1960s. More recently, we have entered a new
era characterized by Data Science. Indeed, a large amount of data is becoming available from multiple
sources nowadays (digital measures, experiments, etc.). Data–driven models are still mathematical models,
but built from meaningful data that do not rely on physical principles – because the latter are not available
or are not reliable – and whose construction calls for statistical learning methods.
In any case, data are crucial for physics–based models: as a matter of fact, despite the different
paradigms, data–driven and physics–based models can be and must be used in a synergistic way. Physics–
based models require input data (for example in the form of physical and geometrical parameters for differ-
ential equations) and validation through experimental tests and measurements. More recently, new ways
of balancing data–driven and physics–based models have emerged and are increasingly used as numerical
models. An example is represented by Scientific Machine Learning, an evolution of Scientific Comput-
ing that exploits statistical learning based on Machine and Deep Learning algorithms. This can be used
to accelerate or empower standard numerical methods in Scientific Computing, to reduce complexity and
computational costs of the numerical simulations, or to build digital twins. A digital twin is a virtual
replica of a physical system or process, serving as its digital counterpart for practical applications such as
simulation, testing, monitoring, maintenance, prediction, and forecasting. Real–time interaction between
the physical system and its digital twin is essential since it must guarantee effective observation and control.
Physics–based mathematical models (mathematical problems) are a fundamental pillar in the under-
standing and prediction of several physical phenomena and processes (physical problems). However,
these mathematical models may lead to problems that cannot be solved analytically, or in an exact way
(thus yielding the exact solution), especially for differential problems. In these cases, one is unable to
write their solution explicitly.
Numerical methods and numerical approximation techniques (numerical problems) serve the purpose
to determine an (approximate) numerical solution of a mathematical model. When the calculator is used to
determine such (approximate) numerical solution, the latter is called numerical solution at the calculator;
see Fig. 1.2.


Figure 1.2: Scientific Computing: physical, mathematical, and numerical problems and solutions

1.1.1 Mathematical models


We provide some examples of mathematical models. Even if the course will feature a wide array of math-
ematical problems – e.g., linear systems, nonlinear equations, definite integrals, etc. – we focus in this
introduction on differential models, such as Ordinary Differential Equations (ODEs) and Partial Differ-
ential Equations (PDEs).
A (physics–based) mathematical model generally stems from two main ingredients: general physical
laws and constitutive relations. General physical laws come from Continuum Mechanics, in the form of
conservation, balance, or equilibrium laws among physical quantities (e.g. of mass, momentum, energy,
etc.). The constitutive relations are of an experimental nature and strongly depend on the characteristics of
the phenomena under consideration. Examples of constitutive relations are Fourier’s law of heat conduction
or Fick’s law for the diffusion of a substance. The result of combining the two ingredients is usually a PDE
or a system of PDEs.
An algebraic model is a mathematical model that uses algebraic equations to represent relationships
among variables. These models typically involve polynomials, linear equations, or systems of nonlinear
equations to describe how different quantities relate to one another.
An integral model uses integral equations to describe relationships among variables, often capturing
cumulative phenomena or total quantities, that is, the accumulated effect of rates of change over time or space.
A differential model is an equation that involves one or more derivatives of an unknown function. In
an ODE, every derivative of the unknown solution is with respect to a single independent variable. If,
instead, the derivatives are partial, then we have a PDE. Differential models are common examples of physics-
based models and represent one of the most widely used domains in numerical simulations and Scientific
Computing.
The Mathematical Problem (MP) model can be indicated in an abstract manner as follows:

P(u; g) = 0 Mathematical Problem (MP)

Here, P generically indicates the model, u is the exact solution – a vector or a function of one or
more independent variables (space and/or time variables) – and g indicates the set of data.
In the following, we report some examples of MPs in the form of differential models, involving ODEs
and PDEs; we consider specifically linear models for the PDEs with a scalar unknown solution. The goal is
to illustrate some challenging models that we will address in this course, which are relevant and meaningful
for applications in Civil Engineering.


ODEs
ODEs are also known as initial value problems. A first order ODE with an initial condition – a Cauchy
problem – is a differential problem whose solution u = u(t) is a function of a single independent variable t,
often interpreted as time. A single condition is assigned to the solution, at a point (usually, the left end of
the integration interval).
Its form is the following: find u : I ⊂ R → R such that

\begin{cases}
\dfrac{du}{dt}(t) = f(t, u(t)) & t \in I,\\
u(t_0) = u_0,
\end{cases}
\qquad (1.1)

where I = (t0 , tf ] ⊂ R is a time interval, u0 is the initial value assigned at t = t0 , and f : I × R → R.


The equation describes the evolution of a scalar quantity u over time t, without distribution in space.
In vectorial problems, the unknown is a vector–valued function u = u(t), where u = (u1, . . . , um) ∈ Rm,
with m ≥ 1. The first order Cauchy problem reads: find u : I ⊂ R → Rm such that

\begin{cases}
\dfrac{du}{dt}(t) = f(t, u(t)) & t \in I,\\
u(t_0) = u_0,
\end{cases}

where u0 ∈ Rm is the initial datum and f : I × Rm → Rm .


A second order Cauchy problem involves second order time derivatives and two initial conditions. It reads
as: find u : I ⊂ R → R such that

\begin{cases}
\dfrac{d^2 u}{dt^2}(t) = f\!\left(t, u(t), \dfrac{du}{dt}(t)\right) & t \in I,\\[4pt]
\dfrac{du}{dt}(t_0) = v_0,\\[4pt]
u(t_0) = u_0,
\end{cases}
\qquad (1.2)

where the initial data are u0 and v0, while f : I × R × R → R.

Boundary value problem (1D)


It is a stationary differential model with a single independent variable x, representing the space coordinate
in an interval Ω = (a, b) ⊂ R (1D). The problem involves second order derivatives of the unknown solution
u = u(x) with respect to x. The value of u, or the value of its first derivative, is set at the two boundaries
of the domain (interval) Ω, that is at x = a and x = b (the domain boundary is ∂Ω = {a, b}).
Let us consider the following Poisson problem with (homogeneous) Dirichlet boundary conditions: find
u : Ω ⊂ R → R such that

\begin{cases}
-\dfrac{d^2 u}{dx^2}(x) = f(x) & x \in \Omega = (a, b),\\
u(a) = u(b) = 0.
\end{cases}
\qquad (1.3)

This equation models a stationary phenomenon – the time variable does not appear in fact – and represents
a diffusion model. For example, this models the diffusion of a pollutant along a 1D channel Ω = (a, b) or
the vertical displacement of an elastic string (thread) fixed at its ends. In the first case, f = f (x) indicates
the source of the pollutant along the flow, while in the second case, f is the transverse force acting on the
elastic string, in the hypothesis of negligible mass and small displacements of the string.
We remark that the boundary value problem in 1D is a particular case of PDEs, even if it involves only
derivatives with respect to a single independent variable x. Indeed, even if apparently similar to a second
order ODE, the boundary value problem is in reality substantially different from an ODE: in Eq. (1.2), two
conditions are set at t = t0, while in Eq. (1.3), one condition is set at x = a and the other one at x = b. The
conditions in the boundary value problem give rise to the so–called global nature of the model.


PDE, initial and boundary value problems in 1D

These problems concern equations that depend on space and time: the unknown solution u = u(x, t) depends
both on the space coordinate x ∈ Ω ⊂ R in 1D and on the time variable t ∈ I ⊂ R. In this case, the initial
conditions at t = t0 must be prescribed, as well as the boundary conditions at the ends of the interval in 1D.
The heat equation – also known as diffusion equation – with Dirichlet boundary conditions assumes
the following form: find u : Ω × I → R such that

\begin{cases}
\dfrac{\partial u}{\partial t}(x, t) - \mu \dfrac{\partial^2 u}{\partial x^2}(x, t) = f(x, t) & x \in \Omega = (a, b),\ t \in I,\\[4pt]
u(a, t) = u(b, t) = 0 & t \in I,\\
u(x, t_0) = u_0(x) & x \in \Omega = (a, b).
\end{cases}
\qquad (1.4)

For example, the unknown function u(x, t) describes the temperature in a point x ∈ Ω = (a, b) and time
t ∈ I of a metallic bar covering the space interval Ω. The diffusion coefficient µ represents the thermal
response of the material and it is related to its thermal conductivity. The Dirichlet boundary conditions
express the fact that the ends of the bar are kept at a reference temperature (zero degrees in this case), while
at the time t = t0 , the temperature is assigned in each point x ∈ Ω through the initial function u0 (x).
Finally, the bar is subjected to a heat source of linear density f (x, t).

Another example is the wave equation, which involves second order derivatives with respect to both
space and time variables. The model is used to describe wave phenomena, including standing wave fields,
such as mechanical waves (e.g., water waves, sound waves, and seismic waves) or electromagnetic waves
(e.g., light waves). It finds application in fields such as acoustics, seismology, electromagnetism, and fluid
dynamics. The wave equation reads: find u : Ω × I → R such that

\begin{cases}
\dfrac{\partial^2 u}{\partial t^2}(x, t) - c^2 \dfrac{\partial^2 u}{\partial x^2}(x, t) = f(x, t) & x \in \Omega = (a, b),\ t \in I,\\[4pt]
u(a, t) = u(b, t) = 0 & t \in I,\\[4pt]
\dfrac{\partial u}{\partial t}(x, t_0) = v_0(x) & x \in \Omega = (a, b),\\[4pt]
u(x, t_0) = u_0(x) & x \in \Omega = (a, b).
\end{cases}
\qquad (1.5)

For example, the unknown function u(x, t) describes the displacement in a point x ∈ Ω = (a, b) and time
t ∈ I of an elastic string covering the space interval Ω. The parameter c > 0 is a fixed real coefficient
representing the propagation speed of the wave. At the initial time t = t0, both the fields u and ∂u/∂t need
to be assigned in each point x ∈ Ω through the functions u0(x) and v0(x), respectively.

PDE, boundary value problem in multidimensional domains Ω ⊂ Rd , with d = 2, 3

Problem (1.3) can be extended to multidimensional domains Ω ⊂ Rd, with d = 2, 3; the solution is
u = u(x), where x = (x1, . . . , xd)T ∈ Rd. This leads to the following Poisson problem with (homogeneous)
Dirichlet boundary conditions: find u : Ω ⊂ Rd → R such that

\begin{cases}
-\Delta u = f & \text{in } \Omega \ \ (\text{i.e. } x \in \Omega),\\
u = 0 & \text{on } \partial\Omega \ \ (\text{i.e. } x \in \partial\Omega),
\end{cases}
\qquad (1.6)

where

\Delta u(x) := \sum_{i=1}^{d} \frac{\partial^2 u}{\partial x_i^2}(x)

is the Laplace operator, the domain Ω ⊂ Rd is endowed with boundary ∂Ω, and f = f (x) is the external
forcing term. This equation is used for example to model the vertical displacement of an elastic membrane
fixed at the boundaries.


1.1.2 Analytical methods


Analytical solutions of the differential problems presented in Section 1.1.1 are rarely available, that is, u
cannot be explicitly expressed as a function of the independent variables.
For example, the Cauchy problem (1.1), under suitable regularity hypotheses on the data, leads to the
solution

u(t) = u(t_0) + \int_{t_0}^{t} f(\tau, u(\tau))\, d\tau = u_0 + \int_{t_0}^{t} f(\tau, u(\tau))\, d\tau,

if u ∈ C1(I). The expression of u(t) can be determined only if f(t, u) assumes specific forms.
For the 1D Poisson problem (1.3), if we consider for simplicity the case in which Ω = (a, b) = (0, 1)
and f ∈ C 0 ([0, 1]), then there exists a unique solution u ∈ C 2 ([0, 1]), which is expressed as:
u(x) = x \int_0^1 (1 - s)\, f(s)\, ds - \int_0^x (x - s)\, f(s)\, ds. \qquad (1.7)

We remark that it is possible to find the solution u(x) explicitly as long as the primitives of the functions
under integrals are available.
For the heat equation (1.4), in the case Ω = (a, b) = (0, 1) and f = 0, the solution can be expressed as
a Fourier expansion:
u(x, t) = \sum_{j=1}^{+\infty} u_{0,j}\, e^{-\mu (j\pi)^2 t} \sin(j\pi x),

where u_{0,j} is the j-th Fourier coefficient

u_{0,j} = 2 \int_0^1 u_0(x) \sin(j\pi x)\, dx, \qquad j = 1, 2, \ldots.

Besides calling for the calculation of integrals, the solution is often approximated – i.e. the Fourier
series is truncated – when the initial datum is represented through a Fourier expansion with infinitely many terms.
Analytical methods for differential models can be used to find the solution of some simple problems,
for example through the method of separation of variables for some PDEs. These methods are useful
for determining the analytical expression of the solution, as well as investigating some of its qualitative
properties that can be connected to the physical problem from which the mathematical problem under
examination originates.
In addition, analytical methods can be used to establish if the correct data have been assigned to the
problem – like boundary conditions for PDEs – and therefore if the problem is well–posed. It is indeed
crucial to determine if the solution of a differential model exists and is unique, even if an expression for
the analytical solution can not be determined.

1.1.3 Numerical methods


As anticipated, we cannot analytically solve an ODE or a PDE in most cases of practical
interest. This highlights the importance of using numerical methods that allow us to construct an approx-
imation uh of the exact solution u, for which the corresponding error (u − uh) can be quantified and/or
estimated. Here, the subscript h indicates a discretization parameter that characterizes the numerical ap-
proximation. Conventionally, the smaller h is, the better uh approximates u; that is,
the error (u − uh) tends to zero as h gets smaller and smaller. In general, we can provide the following
representation:

P(u; g) = 0 Mathematical Problem (MP)


↓ numerical method
Ph (uh ; gh ) = 0 Numerical Problem (NP)


where gh is an approximation of the data g, while Ph is a characterization of the approximate problem. In
this course, we will specifically introduce the FE method to build the numerical approximation of PDEs.
As an example, we present in the following a numerical method based on finite differences for the
approximation of the boundary value problem (1.3) for Ω = (0, 1). Let us assume that the differential
equation must be satisfied at every node xj internal to the domain Ω = (0, 1), obtained by subdividing
[0, 1] through N + 2 nodes xj = j h, for j = 0, . . . , N + 1, into subintervals of size h = 1/(N + 1). In
practice, we ask that

-\frac{d^2 u}{dx^2}(x_j) = f(x_j) \quad \text{for all } j = 1, \ldots, N.
We can approximate this set of N equations by replacing the second derivative with the following finite
difference at a general point x̄ ∈ Ω:
\delta^2 u(\bar{x}) = \frac{u(\bar{x} + h) - 2\,u(\bar{x}) + u(\bar{x} - h)}{h^2},

for h > 0. If u : Ω → R is “sufficiently” regular in a neighborhood of x̄ ∈ (0, 1), then δ²u(x̄) approximates
d²u/dx²(x̄) with accuracy of order 2 with respect to h, that is, the error (d²u/dx²(x̄) − δ²u(x̄)) behaves as h².
For the problem (1.3), instead of searching for a function u(x) that satisfies this boundary value problem,
we can look for a set of N real values {uj}, j = 1, . . . , N, such that

\begin{cases}
-\dfrac{u_{j+1} - 2u_j + u_{j-1}}{h^2} = f(x_j) & \text{for all } j = 1, \ldots, N,\\
u_0 = u_{N+1} = 0,
\end{cases}
by means of the finite difference method. In this manner, uj represents an approximation of u(xj ). The
former system of equations leads to a linear system
Au = f ,
where f = (f(x1), f(x2), . . . , f(xN−1), f(xN))T, u = (u1, . . . , uN)T is the vector of unknowns, and
A ∈ RN×N is the following tridiagonal matrix:

A = \frac{1}{h^2}\,\mathrm{tridiag}(-1, 2, -1) = \frac{1}{h^2}
\begin{pmatrix}
2 & -1 & 0 & \cdots & 0\\
-1 & 2 & -1 & \ddots & \vdots\\
0 & \ddots & \ddots & \ddots & 0\\
\vdots & \ddots & -1 & 2 & -1\\
0 & \cdots & 0 & -1 & 2
\end{pmatrix},
which is symmetric and positive definite. As N is typically large, a computer is used to solve the linear
system Au = f by means of a suitable method. Here, let us stress the fact that the linear system Au = f is
another MP arising from the NP, which we are called to solve or approximate by means of another suitable
numerical method. For example, if A is a tridiagonal matrix, the Thomas algorithm – a direct method – is
usually used.
If f ∈ C 2 ([0, 1]), the following error estimate can be derived:
\max_{j=0,\ldots,N+1} |u(x_j) - u_j| \le \frac{h^2}{96} \max_{x \in [0,1]} \left| \frac{d^2 f}{dx^2}(x) \right|.
We deduce that the error decays with a rate proportional to h2 (the smaller is h < 1, the smaller is the
error).
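To make the construction above concrete, the following Matlab sketch assembles the tridiagonal matrix A and the right–hand side f and checks the h^2 decay of the error; the manufactured solution u(x) = sin(πx) (so that f(x) = π^2 sin(πx)) and the values of N are illustrative assumptions, not data from these notes.

  % Finite difference approximation of -u'' = f on (0,1), with u(0) = u(1) = 0.
  % Assumed manufactured solution: u(x) = sin(pi*x), hence f(x) = pi^2*sin(pi*x).
  f    = @(x) pi^2*sin(pi*x);
  u_ex = @(x) sin(pi*x);
  for N = [9 19 39 79]                              % so that h = 1/10, 1/20, 1/40, 1/80
      h   = 1/(N+1);
      x   = (1:N)'*h;                               % internal nodes x_1, ..., x_N
      e   = ones(N,1);
      A   = spdiags([-e 2*e -e], -1:1, N, N)/h^2;   % tridiag(-1,2,-1)/h^2
      u   = A\f(x);                                 % solve the linear system A u = f
      err = max(abs(u - u_ex(x)));                  % maximum nodal error
      fprintf('N = %3d, h = %.4f, error = %.3e\n', N, h, err);
  end
  % The printed errors decrease roughly by a factor 4 each time h is halved.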
The former example based on finite differences illustrates that the numerical approximation of a PDE
calls in reality for multiple numerical methods. In this course, we will consider the FE method for the nu-
merical approximation of this boundary value problem, leading to numerical solutions uh to be determined
by means of the computer.


1.2 Representation of Real Numbers and Computer Operations


A calculator can handle only a finite set of numbers and perform operations within this limited range.
Indeed, the set of real numbers R is represented at the calculator by a set of numbers F, called floating–
point numbers. The set F is of finite size, unlike R.
This section briefly illustrates floating–point numbers and arithmetic, that is the set of mathematical
operations performed by the calculator with floating–point numbers.

1.2.1 Floating–point numbers

Definition 1.1. The set of floating–point numbers F is the subset of real numbers that can be represented
by the calculator, that is F ⊂ R with dim (F) < +∞. In general, F = F0 ∪ {0}, where the set F0 includes
all the floating–point numbers excluding the zero.

For the sake of simplicity, from now on, we will indicate with F the floating–point numbers excluding zero.
The set F = F(β, t, L, U ) is characterized by four parameters: β, t, L, and U . Every real number x ∈ F
can be written as:
x = (-1)^s\, m\, \beta^{e-t} = (-1)^s\, (a_1 a_2 \cdots a_t)_\beta\, \beta^{e-t},
where:
• β ≥ 2 is the basis, a natural number determining the numeral system;
• m = (a1 a2 · · · at)β is the mantissa (0 < m ≤ β^t − 1), where t is the number of digits (significant
figures) such that 0 < a1 ≤ β − 1 and 0 ≤ ai ≤ β − 1 for i = 2, . . . , t;
• e ∈ Z is the exponent such that L ≤ e ≤ U , for L < 0 and U > 0;
• s ∈ {0, 1} is the sign.
The exponent e defines the range of machine numbers, while the number of digits t in the mantissa m
defines its precision. Given the floating–point set F(β, t, L, U ) of the calculator, a number x ∈ F is fully
defined by the values assumed by s, m, and e.

Example 1.1. For x = (4.25)10, we have the representation x = (100.01)2 = 1.0001 · 2^2 = 0.10001 · 2^3, that is
s = 0, β = 2, m = 10001, e = 3 = (11)2, t = 5.

Given a number x ∈ R, fl(x) ∈ F indicates the representation of the real number x as a floating–point
number. We observe that, in general, fl(x) ≠ x, unless x ∈ F.
The set of floating–point numbers F(β, t, L, U ) is associated with the following properties.
• Machine epsilon, which is the value

ε_M := β^{1-t},

representing the distance between 1 and the smallest floating–point number larger than 1, that is, the
smallest (positive) real number such that fl(1 + ε_M) > 1.

• Round–off error, the relative error between x ∈ R \ {0} and its floating–point representation fl(x) ∈ F.
This error is bounded as:

\frac{|x - fl(x)|}{|x|} \le \frac{1}{2}\,\varepsilon_M, \qquad x \neq 0,

where (1/2) ε_M is the round–off unit; this is the largest relative error introduced by the calculator in
representing any real number x. Even if (1/2) ε_M is small, the absolute error |x − fl(x)| can be very
large, especially if |x| is large.


• The smallest and largest (positive) numbers represented in F are x_min = β^(L−1) and
x_max = β^U (1 − β^(−t)), respectively. Aside from zero, numbers smaller than x_min (in modulus), or larger
than x_max, cannot be represented.

• The cardinality of F(β, t, L, U) is card(F) = 2 (β − 1) β^(t−1) (U − L + 1).

• The larger the values |fl(x)|, the less dense the numbers in F.

Example 1.2. Let us consider the set F(2, 2, −1, 2), for which β = 2 (numeral system in base 2), t = 2, L = −1,
and U = 2. Then, we have ε_M = β^(1−t) = 1/2, x_min = β^(L−1) = 1/4, and x_max = β^U (1 − β^(−t)) = 3; the cardinality
of F is 2 (β − 1) β^(t−1) (U − L + 1) = 2 · 1 · 2 · 4 = 16. The exponent e can assume the values −1, 0, 1, or 2. The
mantissa is m = (a1 a2)β, as t = 2; then, as β = 2, we have a1 = 1 and a2 = 0 or a2 = 1. The possible values of m are
m = (10)2 = 2 or (11)2 = 3. If s = 0, the positive real numbers in F0 are x = m β^(e−t) = m 2^(e−2) and are reported
in the following table.

                    e = −1    e = 0    e = 1    e = 2
  m = (10)2 = 2      1/4       1/2       1        2
  m = (11)2 = 3      3/8       3/4      3/2       3

The floating–point standard IEEE double precision uses a string of N = 64 bits to represent real numbers
at the calculator, in base β = 2: F = F(2, 52, −1022, 1023). The 64 bits are partitioned as follows:

  s (1 bit) | e (11 bits) | m (52 bits)

Example 1.3. The real number x = 1/10 can not be represented exactly in numeral systems with base β = 2; indeed,

x = \frac{1}{2^4} + \frac{1}{2^5} + \frac{0}{2^6} + \frac{0}{2^7} + \frac{1}{2^8} + \frac{1}{2^9} + \cdots,

which is an infinite series. Its floating–point representation in double precision is

fl(x) = \frac{1}{2^4} + \frac{1}{2^5} + \frac{0}{2^6} + \frac{0}{2^7} + \frac{1}{2^8} + \frac{1}{2^9} + \cdots + \frac{0}{2^{51}} + \frac{1}{2^{52}},

that is, an approximation of x.

For 64 bit (double precision) CPUs, numbers are represented in base β = 2, for which t = 52.
However, as 0 < a1 ≤ β − 1 = 1, we always have a1 = 1. In this case, the number of digits t used for the
mantissa m is in fact 52 + 1 = 53: since the first digit of the mantissa a1 is always 1, it is not necessary
to store it explicitly. All numbers in Matlab® (and other programming languages such as C++) are represented
in base β = 2, with an equivalent number of digits t = 53 in the double precision representation (64 bits). It
follows that in Matlab® (double precision format):

• ε_M = 2^(−52) ≈ 2.22 · 10^(−16);

• x_min = (1.000 . . . 000)2 · 2^(−1022) = 2.225073858507201 · 10^(−308) (realmin command) and
x_max = (1.111 . . . 111)2 · 2^(1023) = 2^(1023) (2 − 2^(−52)) = 1.797693134862316 · 10^(+308) (realmax
command); numbers larger than x_max and smaller than x_min (in modulus) produce overflow and
underflow, respectively;

• the number of significant figures in base 10 is equal to 15.

These values can be verified directly in Matlab®; see the sketch below.
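A minimal check of these quantities at the Matlab® prompt (the commented values are those reported in the bullets above):

  format long
  eps              % 2.220446049250313e-16, i.e. 2^(-52)
  realmin          % 2.225073858507201e-308
  realmax          % 1.797693134862316e+308
  realmax*10       % overflow: returns Inf
  1 + eps/4 == 1   % returns logical 1 (true): eps/4 is lost when added to 1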


1.2.2 Floating–point arithmetic


Algebraic operations involving floating–point numbers in F do not benefit from the same properties as real
numbers in R. In fact, round–off errors propagate and eventually grow depending on the number and type
of algebraic operations involved in the calculation.
The term flops is commonly used to indicate the number of algebraic operations performed in floating–
point arithmetic.
Since F is strictly contained in R, elementary algebraic operations on floating–point numbers do not
satisfy all properties of analogous operations in R. For example, the commutative property holds for
additions and multiplications, but other properties such as associative or distributive can be violated as
seen in the following example.

Example 1.4. We report some examples of round–off errors in Matlab® for the double precision standard.
• The assignment a = 1 - 3 * ( 4 / 3 - 1 ) returns a = 2.2204e-16 (not zero).
• The assignment b = sqrt(1e-16 + 1) - 1 returns b = 0.
• The assignments c = 1e-16 - 1e-16 + 1 and d = 1e-16 + 1 - 1e-16, together with the oper-
ation f = c - d, return f = 1.1102e-16 (not zero).

The associative property is violated whenever an overflow or underflow situation occurs. Issues with
floating–point operations appear when two numbers with the same sign and similar value are subtracted
(see Example 1.7 later): we actually obtain the so called loss (or cancellation) of significant figures. Fur-
thermore, zero is no longer unique: in fact, there exists at least one number b ≠ 0 such that, in floating–point
arithmetic, a + b = a; indeed, a + b is always equal to a if |b| is smaller than the machine epsilon relative to a
(roughly, (1/2) ε_M |a|).

Example 1.5. For every x ∈ R \ {0}, one has ((1 + x) − 1)/x ≡ 1. However, in floating–point arithmetic, one has

fl\left( \frac{fl(1 + fl(x)) - 1}{fl(x)} \right) = y,

where y is, in general, a real number different from 1. If we verify the first identity in Matlab®, we obtain a number
y ≠ 1, with errors depending on the value of the chosen real number x.

  x                10^(−10)       10^(−14)       10^(−15)    10^(−16)
  relative error   8 · 10^(−6) %  8 · 10^(−2) %  11%         100%
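The table can be reproduced with a few lines of Matlab®; a sketch, using the same values of x as in the table:

  % Cancellation in ((1+x)-1)/x: the exact value is 1 for every x ~= 0.
  x = [1e-10 1e-14 1e-15 1e-16];
  y = ((1 + x) - 1)./x;            % evaluated in floating-point arithmetic
  rel_err = abs(y - 1)*100;        % relative error in percent
  disp([x; rel_err])
  % For x = 1e-16 one obtains y = 0, i.e. a 100% relative error: 1 + x rounds to 1.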

1.3 From the Mathematical Problem to the Numerical Problem


We introduce a general framework for the analysis of a numerical method and important concepts in Nu-
merical Analysis and Scientific Computing. In particular, we introduce key concepts such as the well–
posedness of a problem, the consistency, stability, and convergence of a numerical method, as well as the
condition number.

1.3.1 Properties of the mathematical problem


Let us consider a physical problem (PP) endowed with a physical solution, say uph, and dependent on
data indicated with g. The mathematical problem (MP) is represented by the mathematical formulation of
the PP and possesses a mathematical solution u. We indicate the MP as:

P(u; g) = 0, (1.8)

where u ∈ U and g ∈ G, with U and G two suitable sets or spaces; G is the set or space of admissible data.
The error between the physical and mathematical solutions is called model error, say em := uph − u. This
source of error takes into account all those characteristics of the PP that are not represented or captured by
the MP.


Example 1.6. Let us consider as PP the elastic string of Eqs. (1.3) and (1.7), whose mathematical solution u is a
function representing the vertical displacement. Then, the MP can be written as:
P(u; g) = u - x \int_0^1 (1 - s) f(s)\, ds + \int_0^x (x - s) f(s)\, ds = 0,

where data are g = {(0, 1), 0, 0, f }, representing the domain Ω, the homogeneous Dirichlet data, and the forcing term
f , respectively. Here, G = R2 × R × R × C 0 ([0, 1]) and U = C 2 ([0, 1]).

Before solving a MP, it is necessary to ensure that it is well–posed.

Definition 1.2. The MP P(u; g) = 0 is well–posed (stable) if and only if there exists a unique solution
u ∈ U that continuously depends on the data g ∈ G.

G is the set of admissible data, i.e. those for which the MP (1.8) admits a unique solution. Roughly
speaking, “continuous dependence on data” means that “small” perturbations on data g ∈ G lead to “small”
changes on the solution u ∈ U of the MP.
MPs that are well–posed can exhibit “large” variations of the solution u even for “small” variations of
the data values g. A measure of this sensitivity is given by the condition number of the MP.

Definition 1.3. The (relative) condition number of the MP P(u; g) = 0 for data g ∈ G is defined as

K(g) := \sup_{\substack{\delta g\,:\,(g + \delta g) \in G \\ \|\delta g\| \neq 0}} \frac{\|\delta u\| / \|u\|}{\|\delta g\| / \|g\|}.

The norm ‖ · ‖ indicates a measure of the data or of the solution. By construction, we have K(g) ≥ 1. If K(g) is
“small”, then the MP is said to be well–conditioned; if K(g) is “large”, then the MP is ill–conditioned.
For a well–conditioned MP, the solution (u+δu) obtained with slightly perturbed data (g+δg) does not
differ much from the solution u of the MP with unperturbed data g. For ill–conditioned MPs, the solution
is instead very sensitive to small perturbations of the data. The condition number of a MP is independent
of the numerical method used to solve it. However, even for a simple MP, the notion of condition number
can explain how the propagation of small perturbations can lead to rather large errors in the result.

Example 1.7. Let us consider g = {g1, g2}, where g1, g2 ∈ R, and the MP P(u; g) = u − (g1 − g2) = 0, that is, the
problem of subtracting two real numbers. By indicating ‖g‖ = |g1| + |g2|, we have that the condition number of the MP
is:

K(g) = \frac{|g_1| + |g_2|}{|g_1 - g_2|}.

If g1 and g2 have opposite signs, then the MP is well–conditioned, indeed K(g) = 1. However, if g1 and g2 have the
same sign and g1 ≈ g2, then K(g) can be very large, that is, the MP is ill–conditioned. For example, for g1 = 1/51
and g2 = 1/52, we have u = 1/2652 = 3.770739064856699 · 10^(−4). By truncating the representations of the
data (perturbed data) as (g1 + δg1) = 1.96 · 10^(−2) and (g2 + δg2) = 1.92 · 10^(−2), we obtain the perturbed solution
(u + δu) = 4 · 10^(−4), which is significantly different from u. As a matter of fact, in this case, we have K(g) = 103 ≫ 1.
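The numbers of this example can be checked directly; a small Matlab® sketch, where the perturbed data are the truncated values used above:

  % Ill-conditioning of the subtraction g1 - g2 when g1 is close to g2.
  g1 = 1/51;  g2 = 1/52;
  u  = g1 - g2                              % 3.770739064856699e-04
  K  = (abs(g1) + abs(g2))/abs(g1 - g2)     % condition number: 103
  % Perturbed (truncated) data:
  g1p = 1.96e-2;  g2p = 1.92e-2;
  up  = g1p - g2p                           % 4e-04, significantly different from u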

1.3.2 Properties of the numerical problem


The numerical problem (NP) is an approximation of the MP (1.8). We indicate its numerical solution as
uh, where h stands for a suitable discretization parameter¹. We state the NP as:

Ph(uh; gh) = 0,     (1.9)

¹ If the NP is an iterative method, h is often replaced by the natural number n that refers to the number of iterations; in this case,
we have h ∼ 1/n.


Figure 1.3: Physical (PP), mathematical (MP), and numerical (NP) problems. Corresponding solutions
(uph, u, uh, and ûh) and errors (model em = uph − u, truncation eh = u − uh, round–off er = uh − ûh,
and computational ec = eh + er errors)

where uh ∈ Uh and gh ∈ Gh , with Uh and Gh two suitable sets or spaces; gh is the representation of the
data in the NP. The error between the mathematical and numerical solutions is called truncation error, say
eh := u − uh , as depicted in Fig. 1.3. This can be regarded as the error that stems from the discretization
of the MP.
Example 1.8. Let us consider the following MP: P(u; g) = u − \int_0^T f(t)\, dt = 0, where the data are g = {T, f(t)}.
A possible NP to approximate the integral in the MP is Ph(uh; gh) = uh − h \sum_{i=0}^{N-1} f(t_i) = 0, where ti = i h for
i = 0, . . . , N, with h = T/N for some N ∈ N. Here, gh = g. We can also say that the size of the NP is N. The larger
N is, the smaller h is, and the more accurate the NP.
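A minimal Matlab® sketch of this NP, assuming as an illustration T = 1 and f(t) = e^t (so that the exact integral is e − 1); these data are not part of the example above:

  % Left-endpoint (rectangle) approximation of u = int_0^T f(t) dt.
  f = @(t) exp(t);  T = 1;  u_exact = exp(1) - 1;
  for N = [10 100 1000]
      h   = T/N;
      ti  = (0:N-1)*h;            % nodes t_i = i*h, i = 0, ..., N-1
      u_h = h*sum(f(ti));         % the NP: u_h = h * sum_i f(t_i)
      fprintf('N = %5d, error = %.3e\n', N, abs(u_exact - u_h));
  end
  % The error decreases linearly with h = T/N.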

If the numerical solution is computed by executing the algorithm at the calculator, then the final solution
is indicated by ûh and is affected by round–off error, say er := uh − ûh. Such round-off errors depend on
the machine architecture, on the representation of the numbers at the calculator, and on the operations made in
floating–point arithmetic. Both the truncation and round-off errors combine to determine the computational
error, say ec := u − ûh = eh + er. For some NPs, we can have |er| ≪ |eh|, for which ec ≈ eh.
As for the MP, we have to ensure that the NP is well–posed and we have to assess the condition number
of the NP.

Definition 1.4. The NP Ph (uh ; gh ) = 0 of Eq. (1.9) is well–posed (stable) if and only if there exists a
unique solution uh ∈ Uh that continuously depends on the data gh ∈ Gh .

Definition 1.5. The (relative) condition number of the NP Ph(uh; gh) = 0 with data gh ∈ Gh is defined as:

K_h(g_h) := \sup_{\substack{\delta g_h\,:\,(g_h + \delta g_h) \in G_h \\ \|\delta g_h\| \neq 0}} \frac{\|\delta u_h\| / \|u_h\|}{\|\delta g_h\| / \|g_h\|}.


Figure 1.4: Graphical estimation of the convergence order p of a NP: computational errors |ec | vs. h

As before, δgh is the perturbation on the data, while δuh the corresponding perturbation on the solution of
the NP. If Kh (gh ) is small, then the NP (1.9) is well–conditioned. Otherwise, if Kh (gh ) is large, then the
NP is ill–conditioned.
The reason for which we are interested in the condition number is related to the Wilkinson principle.
According to this principle, the result of a numerical operation on the computer – that is in floating–point
arithmetic – is equivalent to the result of the same operation in exact arithmetic carried out on data affected
by a (small) perturbation. This principle therefore provides a tool to quantify the effect of the propagation
of round-off errors in the computational process.
We are interested in NPs that yield computational errors that tend to zero as the numerical
method “improves”, namely when the discretization parameter h goes to zero. This concept is related to
the accuracy of the NP and it is encoded in the definition of convergence.

Definition 1.6. The NP is convergent when the computational error tends to zero for h tending to zero,
that is:
\lim_{h \to 0} e_c = 0.

A crucial aspect is to qualify the convergence of the NP, that is determining the convergence order of
the NP.

Definition 1.7. If |e_c| ≤ C h^p, with C a positive constant independent of h and p, then the NP is convergent
with order p.

If there exists a constant C̃ ≤ C, independent of h and p, such that C̃ h^p ≤ |ec| ≤ C h^p, then we can
write |ec| ≃ C h^p and we can estimate the convergence order p of the NP by using the known solution u
of the MP. A first approach to estimate p is algebraic. First, we compute the errors ec1 and ec2 for the NP
corresponding to two different values of h that are “sufficiently” small, say h1 and h2, respectively; then, by
writing |ec1| ≃ C h1^p, |ec2| ≃ C h2^p and by noticing that |ec1|/|ec2| = (h1/h2)^p, we estimate the order p as:

p = \frac{\log(|e_{c1}|/|e_{c2}|)}{\log(h_1/h_2)}.

An alternative approach is based on a graphical estimate. We represent the errors |ec| vs. h on a plot in
log–log scale. As log|ec| = log(C h^p) = log C + p log h, the curve (h, |ec|) is a straight line in log–log scale,
and p = tan(θ), where θ is the angle that this line forms with the horizontal axis. Instead of computing θ, it
is possible to verify that the curves (h, |ec|) and (h, h^p) are parallel in log–log scale; see Fig. 1.4.
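Both approaches can be coded in a few lines; a sketch, assuming that the errors |ec| have already been computed for a vector of values of h (the data below are placeholders behaving like C h^2):

  % Estimate of the convergence order p from computed errors.
  h  = [0.1 0.05 0.025 0.0125];
  ec = 3*h.^2;                    % placeholder for the measured errors |e_c|
  % Algebraic estimate from consecutive pairs (h_1, h_2):
  p  = log(ec(1:end-1)./ec(2:end))./log(h(1:end-1)./h(2:end))
  % Graphical estimate: (h, |e_c|) and (h, h^p) are parallel in log-log scale.
  loglog(h, ec, 'o-', h, h.^2, '--');
  xlabel('h'); legend('|e_c|', 'h^2');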
We remark that a well–posed NP is not necessarily convergent. To ensure convergence, the NP is also
required to satisfy the consistency property: roughly speaking, the NP must be a “faithful copy” of the
original MP.

Definition 1.8. The NP (1.9) is consistent if and only if \lim_{h \to 0} P_h(u; g) = P(u; g) = 0, with g ∈ Gh.
The NP (1.9) is strongly consistent if and only if Ph(u; g) ≡ P(u; g) = 0 for all h > 0, with g ∈ Gh.

It is clear that the NP must be well–posed (possibly, well-conditioned), consistent, and convergent. The
concepts of well–posedness (stability) and convergence are indeed strongly connected and encoded in the
following theoretical result.

Theorem 1.1 (Lax–Richtmyer, equivalence). If the NP Ph (uh ; gh ) = 0, with uh ∈ Uh and gh ∈ Gh , is


consistent, then it is well–posed if and only if it is also convergent.

It follows that, if the NP is well–posed and consistent, then the NP is also convergent. Equivalently,
if the NP is consistent and convergent, then the NP is also well–posed. The equivalence theorem is very
useful as it allows us to verify only two of the properties of a NP to obtain the third. In general, it is easier
to show the consistency of a NP, while it is harder to show well–posedness and/or convergence.
The choice of a numerical method (NP) to approximate the solution u of a MP must take into account:
the mathematical properties of the MP; the computational efficiency, in terms of the expected convergence
order of the error, the flops involved in the calculation, the performance of the CPU installed on the
computer, and the access modes and availability of the memory.
Let us indicate with N the size of the NP. Then, the number of floating–point operations required to
calculate the numerical solution uh depends on the size N of the NP. Different kinds of dependence are
depicted in the following table.
  flops            O(1)          O(N)     O(N^γ)       O(γ^N)        O(N!)
  dependence on N  independent   linear   polynomial   exponential   factorial
The following example highlights the role of selecting efficient numerical methods (NP) in relation
with available computing resources.

Example 1.9. If the matrix A ∈ R^{N×N} is non–singular, the solution x ∈ R^N of the linear system Ax = b (MP) can
be obtained by applying the Cramer rule:

x_i = \frac{\det(B_i)}{\det(A)} \quad \text{for } i = 1, \ldots, N,

where Bi ∈ R^{N×N} is the matrix obtained from A by replacing its i-th column by the vector b ∈ R^N:

B_i = \begin{pmatrix}
a_{11} & \ldots & b_1 & \ldots & a_{1N}\\
a_{21} & \ldots & b_2 & \ldots & a_{2N}\\
\vdots &        & \vdots &     & \vdots\\
a_{N1} & \ldots & b_N & \ldots & a_{NN}
\end{pmatrix}.

However, the solution of the linear system with this algorithm requires O(e (N + 1)!) floating–point operations. If
N = 100, then about 101! e ≈ 2.56 · 10^160 operations are required. A calculator able to perform 10^12 floating–
point operations (flops) per second – actually a supercomputer with a computing capability of 1 TeraFlop – yields the
numerical solution in 2.56 · 10^160 / 10^12 = 2.56 · 10^148 seconds, that is, in about 8.11 · 10^140 years! More efficient numerical
methods are available to solve the linear system, as for example the LU factorization method, which calls for O( (2/3) N^3 )
flops. If N = 100, then the former supercomputer will provide the solution in only 10^(−6) seconds. A standard laptop
will achieve the result in less than a second.



Chapter 2

Numerical Solution of Linear Systems

We consider the numerical solution of linear systems by means of direct and iterative methods, specifically
in the case of linear systems with real–valued matrix and vectors.

2.1 Goals, Examples, and Methods Classification


Let us consider the square matrix A ∈ Rn×n with n ≥ 1, the vector b ∈ Rn , and the solution vector
x ∈ Rn of the following linear system:
A x = b. (2.1)
The goal is to numerically approximate the solution x ∈ Rn of this linear system. Let us recall the
following.

Definition 2.1. The matrix A ∈ Rn×n with n ≥ 1 is non-singular if and only if det(A) ≠ 0.

Proposition 2.1. If A ∈ Rn×n is non-singular, then there exists a unique solution x ∈ Rn to the linear
system (2.1).

With reference to the linear system (2.1), we use the following notation to represent the elements of the
matrix A ∈ Rn×n , that is, (A)ij = aij for i, j = 1, . . . , n, and the vectors x ∈ Rn and b ∈ Rn , that is,
respectively (x)i = xi and (b)i = bi for i = 1, . . . , n. Furthermore, we will use the following notation to
express the linear system in terms of the elements of A, b, and x:

\begin{cases}
a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = b_1\\
a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n = b_2\\
\quad\vdots\\
a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nn} x_n = b_n
\end{cases}
\quad \text{or} \quad
\begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1n}\\
a_{21} & a_{22} & \cdots & a_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{pmatrix}
\begin{pmatrix} x_1\\ x_2\\ \vdots\\ x_n \end{pmatrix}
=
\begin{pmatrix} b_1\\ b_2\\ \vdots\\ b_n \end{pmatrix}.

We observe that linear systems can be directly interpreted as mathematical problems that model physi-
cal problems. However, in many cases, linear systems are obtained as numerical problems associated with
the discretization of mathematical problems; some of the examples shown in the previous chapter follow this
direction. This is, for example, the case for differential problems such as PDEs or ODEs; in such cases,
the approximation of the mathematical problem often leads to the numerical solution of linear systems. An
example is the Finite Element method for the spatial approximation of PDEs. In these cases, the larger
the dimension n of the linear system, the more accurate the approximation of the mathematical problem
that generated the linear system; for such problems, it is very common to solve linear systems of sizes
O(n) = 10^5, 10^6, or even 10^7.


Example 2.1. We determine the flows in a hydraulic circuit through the solution of a linear system.

The problem consists of finding the flow rates qj for j =


1, . . . , n distributed in the circuit, where n = 7, given the
applied pressure ∆p of the pump and the resistances Rj
of the pipes of the circuit. The closure of the problem is
achieved through the pressure drop balance (∆p = ∆p1 +
∆p2 +∆p5 +∆p7 , ∆p3 = ∆p2 +∆p4 , and ∆p5 = ∆p4 +
∆p6 ), mass conservation at nodes (q1 = q2 + q3 , q2 =
q4 +q5 , q3 +q4 = q6 , and q5 +q6 = q7 ), and the constitutive
equations (∆pj = Rj qj for each j = 1, . . . , n).

We obtain the following linear system, whose solution provides the distribution of flows {qj}, j = 1, . . . , n, in the circuit:

\begin{pmatrix}
R_1 & R_2 & 0 & 0 & R_5 & 0 & R_7\\
0 & R_2 & -R_3 & R_4 & 0 & 0 & 0\\
0 & 0 & 0 & R_4 & -R_5 & R_6 & 0\\
1 & -1 & -1 & 0 & 0 & 0 & 0\\
0 & 1 & 0 & -1 & -1 & 0 & 0\\
0 & 0 & 1 & 1 & 0 & -1 & 0\\
0 & 0 & 0 & 0 & 1 & 1 & -1
\end{pmatrix}
\begin{pmatrix} q_1\\ q_2\\ q_3\\ q_4\\ q_5\\ q_6\\ q_7 \end{pmatrix}
=
\begin{pmatrix} \Delta p\\ 0\\ 0\\ 0\\ 0\\ 0\\ 0 \end{pmatrix}.
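Once numerical values are chosen for the resistances and for the applied pressure, the system can be assembled and solved directly; a Matlab® sketch with illustrative data (unit resistances and unit pressure are assumptions, not values from the example):

  % Flow rates in the hydraulic circuit: assemble A and b, then solve A*q = b.
  R  = ones(7,1);          % assumed resistances R_1, ..., R_7
  dp = 1;                  % assumed applied pressure Delta p
  A = [R(1)  R(2)   0     0    R(5)   0    R(7);
       0     R(2) -R(3)  R(4)   0     0     0  ;
       0     0     0     R(4) -R(5)  R(6)   0  ;
       1    -1    -1     0     0     0     0  ;
       0     1     0    -1    -1     0     0  ;
       0     0     1     1     0    -1     0  ;
       0     0     0     0     1     1    -1 ];
  b = [dp; 0; 0; 0; 0; 0; 0];
  q = A\b                  % flow rates q_1, ..., q_7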

We observe that if the matrix A ∈ Rn×n is dense, at least n^2 operations would theoretically be required
to solve the linear system. Even though this assumption is very optimistic, it is appropriate to consider and
develop methods where the number of operations is as close as possible to this ideal number. Different
considerations can be made for sparse matrices, i.e., matrices A ∈ Rn×n where the number of non-zero
elements is O(n) ≪ n^2.

Remark 2.1. The solution of the linear system (2.1) as x = A−1 b, i.e., by explicitly computing and
assembling the inverse matrix of A, is a computationally inefficient and inaccurate procedure that should
be avoided even for relatively small matrices.

Numerical methods for solving linear systems can be classified into direct and iterative methods.

Definition 2.2. With a direct method, the solution x of the linear system (2.1) is obtained in a finite number
of operations, known a priori. In contrast, with an iterative method, the solution x is obtained, in principle,
in an infinite number of steps.

The choice of a direct or iterative method for solving the linear system (2.1) depends on multiple factors,
such as the properties, size, and sparsity of the matrix A, as well as the available computational resources
(CPU and memory).

2.2 Direct Methods


Let us consider some direct methods for solving the linear system A x = b from Eq. (2.1) and analyze their
properties. The underlying idea of this family of methods is to reduce the solution of the generic linear
system A x = b to that of a “simpler” linear system through an appropriate manipulation of the matrix A.

2.2.1 “Simple” linear systems


We provide some examples of “simple” linear systems, that is, systems that are “easy” to solve. As men-
tioned earlier, this property depends on the matrix A ∈ Rn×n under consideration.


Diagonal matrix
Let us consider the case of a diagonal matrix D ∈ Rn×n, that is, (D)ii = dii for i = 1, . . . , n and
(D)ij = 0 for i, j = 1, . . . , n, with j ≠ i. In this case, the diagonal matrix D and the associated linear
system D x = b are given by:

D = \begin{pmatrix}
d_{11} & 0 & \cdots & 0\\
0 & d_{22} & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots\\
0 & 0 & \cdots & d_{nn}
\end{pmatrix}
\quad \text{and} \quad
\begin{cases}
d_{11} x_1 = b_1\\
d_{22} x_2 = b_2\\
\quad\vdots\\
d_{nn} x_n = b_n.
\end{cases}

Setting A = D in Eq. (2.1), the solution x ∈ Rn of the diagonal system D x = b is given by:

x_i = \frac{b_i}{d_{ii}} \quad \text{for } i = 1, \ldots, n,

and is obtained with n operations (divisions).


Remark 2.2. Since D is a diagonal matrix, its determinant is computed as det(D) = \prod_{i=1}^{n} d_{ii}. It follows
that det(D) ≠ 0 if and only if dii ≠ 0 for each i = 1, . . . , n.

Lower triangular matrix: forward substitution algorithm

Definition 2.3. L ∈ Rn×n is a lower triangular matrix if and only if its elements satisfy (L)ij = lij ∈ R
for i = 1, . . . , n, j = 1, . . . , i and (L)ij = 0 for i = 1, . . . , n − 1, j = i + 1, . . . , n; the lower triangular
matrix L is given by:

L = \begin{pmatrix}
l_{11} & 0 & \cdots & \cdots & 0\\
l_{21} & l_{22} & 0 & \cdots & 0\\
l_{31} & l_{32} & l_{33} & \ddots & \vdots\\
\vdots & \vdots & \vdots & \ddots & 0\\
l_{n1} & l_{n2} & l_{n3} & \cdots & l_{nn}
\end{pmatrix}.

Given a lower triangular matrix L ∈ Rn×n, let us consider the solution of the lower triangular system:

L\,x = b, \quad \text{that is} \quad
\begin{cases}
l_{11} x_1 = b_1\\
l_{21} x_1 + l_{22} x_2 = b_2\\
l_{31} x_1 + l_{32} x_2 + l_{33} x_3 = b_3\\
\quad\vdots\\
l_{n1} x_1 + l_{n2} x_2 + l_{n3} x_3 + \cdots + l_{nn} x_n = b_n,
\end{cases}

where, referring to Eq. (2.1), we set A = L. The lower triangular system L x = b can be solved using the
forward substitution algorithm, that is:

x_1 = \frac{b_1}{l_{11}},
\qquad
x_i = \frac{1}{l_{ii}} \left( b_i - \sum_{j=1}^{i-1} l_{ij} x_j \right) \quad \text{for } i = 2, \ldots, n. \qquad (2.2)


The forward substitution algorithm solves the lower triangular system L x = b in n^2 operations, where n
is the dimension of the matrix L; in fact, the algorithm performs n divisions, \sum_{i=2}^{n} (i-1) subtractions, and
\sum_{i=2}^{n} (i-1) multiplications, thus bringing the total count of operations to n + 2 \sum_{i=2}^{n} (i-1) = n^2.

Remark 2.3. Since L is a triangular matrix, we have det(L) = \prod_{i=1}^{n} l_{ii}; therefore, det(L) ≠ 0 if and only if
lii ≠ 0 for each i = 1, . . . , n.

Upper triangular matrix: backward substitution algorithm

Definition 2.4. U ∈ Rn×n is an upper triangular matrix if and only if its elements satisfy (U )ij = uij ∈ R
for i = 1, . . . , n, j = i, . . . , n and (U )ij = 0 for i = 2, . . . , n, j = 1, . . . , i − 1; the upper triangular
matrix U is given by:

U = \begin{pmatrix}
u_{11} & u_{12} & u_{13} & \cdots & u_{1n}\\
0 & u_{22} & u_{23} & \cdots & u_{2n}\\
0 & 0 & u_{33} & \cdots & u_{3n}\\
\vdots & & & \ddots & \vdots\\
0 & \cdots & & 0 & u_{nn}
\end{pmatrix}. \qquad (2.3)

Given an upper triangular matrix U ∈ Rn×n, let us consider the solution of the upper triangular system:

U\,x = b, \quad \text{that is} \quad
\begin{cases}
u_{11} x_1 + u_{12} x_2 + u_{13} x_3 + \cdots + u_{1n} x_n = b_1\\
u_{22} x_2 + u_{23} x_3 + \cdots + u_{2n} x_n = b_2\\
u_{33} x_3 + \cdots + u_{3n} x_n = b_3\\
\quad\vdots\\
u_{nn} x_n = b_n,
\end{cases}

where, referring to Eq. (2.1), we set A = U. The upper triangular system U x = b can be solved using the
backward substitution algorithm, that is:

x_n = \frac{b_n}{u_{nn}},
\qquad
x_i = \frac{1}{u_{ii}} \left( b_i - \sum_{j=i+1}^{n} u_{ij} x_j \right) \quad \text{for } i = n-1, \ldots, 1. \qquad (2.4)

The backward substitution algorithm solves the upper triangular system U x = b in n^2 operations, analogously
to the forward substitution algorithm for lower triangular systems.
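Analogously, a Matlab® transcription of the backward substitution algorithm (2.4); again, the function name is an arbitrary choice:

  function x = backward_substitution(U, b)
  % Solves the upper triangular system U*x = b by backward substitution, Eq. (2.4).
  n = length(b);
  x = zeros(n,1);
  x(n) = b(n)/U(n,n);
  for i = n-1:-1:1
      x(i) = (b(i) - U(i,i+1:n)*x(i+1:n))/U(i,i);
  end
  end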
Remark 2.4. Since U is a triangular matrix, det(U) = \prod_{i=1}^{n} u_{ii}, so det(U) ≠ 0 if and only if uii ≠ 0 for
each i = 1, . . . , n.


2.2.2 LU factorization method


Let us consider the non–singular matrix A ∈ Rn×n . The LU factorization (or LU decomposition), assum-
ing it exists, of the matrix A consists of determining a lower triangular matrix L ∈ Rn×n and an upper
triangular matrix U ∈ Rn×n such that:
A = L U.

If the LU factorization of the matrix A exists, then the linear system A x = b can be solved as a
sequential solution of the following systems, the first being lower triangular and the second being upper
triangular:
Ly = b and U x = y.
In fact, since A = L U , we have L U x = b, from which, introducing the auxiliary vector y = (U x) ∈ Rn ,
we obtain the previous result.

Definition 2.5. The LU factorization method for solving the linear system A x = b consists of:
1. determining, if it exists, the LU factorization of the matrix A (A = L U );

2. solving the lower triangular system L y = b using the forward substitution algorithm (2.2);
3. solving the upper triangular system U x = y using the backward substitution algorithm (2.4).

Since the LU factorization method is based on the LU factorization of A, it is necessary to determine the
matrices L and U of A, if they exist.

Example 2.2. Let us illustrate the LU factorization of the matrix A ∈ R^{n×n} with n = 2, which is given by:

    [ a_11  a_12 ]   [ l_11  0    ] [ u_11  u_12 ]            l_11 u_11 = a_11
    [ a_21  a_22 ] = [ l_21  l_22 ] [ 0     u_22 ]     or     l_11 u_12 = a_12
          A      =        L              U                    l_21 u_11 = a_21
                                                               l_21 u_12 + l_22 u_22 = a_22 .

We observe that the matrices L and U involve a total of 6 unknowns: l11 , l21 , l22 , u11 , u12 , and u22 . On the other
hand, only 4 constraints are available for their determination.

Remark 2.5. From the previous example, for the LU factorization of a generic matrix A ∈ R^{n×n}, there
are n^2 + n unknowns in the matrices L and U, but only n^2 constraints to impose; indeed, we have
a_ij = Σ_{r=1}^{min{i,j}} l_ir u_rj for i, j = 1, . . . , n. To overcome this issue, by convention, the diagonal elements of the
lower triangular matrix L obtained through the LU factorization of the matrix A ∈ R^{n×n} are set to 1; that is,
l_ii = 1 for every i = 1, . . . , n:

    L = [ 1     0     0     ...   0
          l_21  1     0     ...   0
          l_31  l_32  1     ...   0
          ...   ...   ...   ...   ...
          l_n1  l_n2  l_n3  ...   l_n,n−1   1 ] .                                               (2.5)


Gaussian elimination method (GEM)


The Gaussian Elimination Method (GEM) allows one to determine the LU factorization of a matrix A ∈ R^{n×n}.
To illustrate the GEM algorithm, we introduce appropriate notation; in particular, we define the matrix
A^(k) ∈ R^{n×n}, for k = 1, . . . , n, as:

    A^(k) := [ a_11^(1)  a_12^(1)  a_13^(1)   ...           ...           a_1n^(1)
               0         a_22^(2)  a_23^(2)   ...           ...           a_2n^(2)
               0         0         ...        ...           ...           ...
               0         ...       0          a_kk^(k)      ...           a_kn^(k)
               0         ...       0          a_{k+1,k}^(k) ...           a_{k+1,n}^(k)
               ...       ...       ...        ...           ...           ...
               0         ...       0          a_{n,k}^(k)   ...           a_{n,n}^(k) ]     for k = 1, . . . , n,    (2.6)

or, equivalently:

    (A^(k))_ij = a_ij^(i)   for i = 1, . . . , k − 1, j = i, . . . , n,
    (A^(k))_ij = a_ij^(k)   for i, j = k, . . . , n,                        for k = 1, . . . , n.                    (2.7)
    (A^(k))_ij = 0          otherwise,

We set A^(1) ≡ A, that is, a_ij^(1) = a_ij for all i, j = 1, . . . , n.

Definition 2.6. Given an index k, with 1 ≤ k ≤ n − 1, with reference to the corresponding matrix A^(k) in
Eq. (2.6), its element a_kk^(k) is called the pivotal element (or pivot).

The following GEM algorithm finds the elements of the matrices L ∈ R^{n×n} in Eq. (2.5) and U ∈ R^{n×n}
in Eq. (2.3), determining the LU factorization of A ∈ R^{n×n}; the matrix U coincides with the matrix A^(n) obtained at
the end of the GEM (U = A^(n)).

Algorithm 2.1: Gauss elimination method (GEM)

set A^(1) = A;
for k = 1, . . . , n − 1 do
    for i = k + 1, . . . , n do
        l_ik = a_ik^(k) / a_kk^(k) ;
        for j = k + 1, . . . , n do
            a_ij^(k+1) = a_ij^(k) − l_ik a_kj^(k) ;
        end
    end
    set A^(k+1) as in Eq. (2.7) ;
end
set L as in Eq. (2.5) and U = A^(n) ;
 
The number of operations associated with the GEM for the LU factorization of A is O(2n^3/3).

Remark 2.6. To perform the LU factorization of the matrix A using the GEM, all pivotal elements a_kk^(k)
associated with the matrices A^(k) must be non–zero, i.e., a_kk^(k) ≠ 0 for every k = 1, . . . , n − 1.
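A possible Matlab translation of Algorithm 2.1 is sketched below; it overwrites the working matrix A^(k) in place and stops with an error if a zero pivot is met (the function name lu_gem is illustrative, not part of these notes):

    function [L, U] = lu_gem(A)
    % LU factorization of A by the GEM (Algorithm 2.1), without pivoting.
    n = size(A, 1);
    L = eye(n);
    for k = 1:n-1
        if A(k,k) == 0
            error('Zero pivot at step k = %d: pivoting is required.', k);
        end
        for i = k+1:n
            L(i,k) = A(i,k) / A(k,k);                 % multiplier l_ik
            A(i,k:n) = A(i,k:n) - L(i,k) * A(k,k:n);  % update of row i of A^(k+1)
        end
    end
    U = triu(A);                                      % U = A^(n)
    end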


We provide a sketch of the GEM algorithm in the following example, showing how the transformation from
the matrix A^(k) to the matrix A^(k+1) is performed at the generic step k.
Example 2.3. We form the LU factorization of the matrix A = [ 5 2 −1 ; 1 4 2 ; −3 −1 7 ] using the GEM. We start by
setting A^(1) = A and noting that n = 3; thus, we obtain the LU factorization by following the GEM algorithm 2.1.

• k = 1: a_11^(1) = 5 (pivot),
    – i = k + 1 = 2: l_21 = a_21^(1) / a_11^(1) = 1/5,
        ∗ j = k + 1 = 2: a_22^(2) = a_22^(1) − l_21 a_12^(1) = 4 − (1/5) 2 = 18/5,
        ∗ j = n = 3:     a_23^(2) = a_23^(1) − l_21 a_13^(1) = 2 − (1/5)(−1) = 11/5;
    – i = n = 3: l_31 = a_31^(1) / a_11^(1) = −3/5,
        ∗ j = k + 1 = 2: a_32^(2) = a_32^(1) − l_31 a_12^(1) = (−1) − (−3/5) 2 = 1/5,
        ∗ j = n = 3:     a_33^(2) = a_33^(1) − l_31 a_13^(1) = 7 − (−3/5)(−1) = 32/5.

        L = [ 1 0 0 ; 1/5 1 0 ; −3/5 ? 1 ],    A^(2) = [ 5 2 −1 ; 0 18/5 11/5 ; 0 1/5 32/5 ].

• k = 2: a_22^(2) = 18/5 (pivot),
    – i = k + 1 = n = 3: l_32 = a_32^(2) / a_22^(2) = (1/5)/(18/5) = 1/18,
        ∗ j = k + 1 = n = 3: a_33^(3) = a_33^(2) − l_32 a_23^(2) = 32/5 − (1/18)(11/5) = 113/18.

        L = [ 1 0 0 ; 1/5 1 0 ; −3/5 1/18 1 ],    U = A^(3) = [ 5 2 −1 ; 0 18/5 11/5 ; 0 0 113/18 ].


Calculation of the determinant of the matrix A via LU factorization


If the matrix A admits the LU factorization, then we have:
det(A) = det(L U ) = det(L) det(U ) = det(U ),
since det(L) = 1. Therefore, the LU factorization can be conveniently used to calculate the determinant of
the matrix A in a number of operations O(2n^3/3).

Properties of LU factorization
The GEM provides the LU factorization of the matrix A ∈ R^{n×n} required to solve the linear system
A x = b using the LU factorization method of Definition 2.5. The number of operations associated with
the LU factorization method is O(2n^3/3); in fact, O(2n^3/3) operations are required by the GEM, while
n^2 operations are needed for each of the forward and backward substitution algorithms.
Remark 2.7. The LU factorization of the matrix A ∈ Rn×n is independent of the vector b ∈ Rn associated
with the linear system A x = b. The LU factorization method can therefore be efficiently used to solve the
linear system for different vectors b since L and U can be assembled only once. The computational costs
associated with solving the linear system for each new vector b are determined solely by solving the lower
triangular system L y = b and the upper triangular system U x = y, using the forward and backward
substitution algorithms, respectively.
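As a sketch of this remark, the following Matlab lines factor the matrix of Example 2.3 once and reuse the factors for two illustrative right–hand sides; lu_gem, fwdsub and bwdsub are the hypothetical functions sketched above:

    A = [5 2 -1; 1 4 2; -3 -1 7];        % matrix of Example 2.3
    B = [1 0; 2 1; 3 -1];                % two illustrative right-hand sides, stored as columns
    [L, U] = lu_gem(A);                  % the factorization is computed only once
    X = zeros(size(B));
    for j = 1:size(B, 2)
        y = fwdsub(L, B(:,j));           % L y = b_j
        X(:,j) = bwdsub(U, y);           % U x_j = y
    end
    disp(norm(A*X - B))                  % residual of the order of round-off errors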
We determine the cases in which the LU factorization of a non–singular matrix A exists and is unique.
To this end, we recall the following definition.

Definition 2.7. The principal submatrix of A ∈ Rn×n of order i, with 1 ≤ i ≤ n, is the matrix Ai ∈ Ri×i
such that (Ai )lm = (A)lm for every l, m = 1, . . . , i.

The following proposition expresses the necessary and sufficient condition for the existence and uniqueness
of the LU factorization of a non–singular matrix A.

Proposition 2.2 (Necessary and sufficient condition for LU factorization). Given a non–singular matrix
A ∈ Rn×n , its LU factorization exists and is unique if and only if det(Ai ) 6= 0 for every i = 1, . . . , n − 1
(i.e., all principal submatrices of A of order i, with 1 ≤ i ≤ n − 1, are non–singular).

 
Example 2.4. The LU factorization of the non–singular matrix A = [ 1 1 4 ; 2 2 3 ; 4 6 7 ] obtained using the GEM does
not exist. In fact, from Proposition 2.2, we have det(A_1) = det([1]) ≠ 0, but det(A_2) = det([ 1 1 ; 2 2 ]) = 0; in
this specific case, the pivotal element a_22^(2) = 0 is found during the application of the GEM algorithm.

It is not always necessary to verify the condition of Proposition 2.2 to establish the existence and
uniqueness of the LU factorization of A: one can instead check some simpler conditions, which are only
sufficient. To this end, we recall the following definitions.

Definition 2.8. The matrix A ∈ R^{n×n} is:

• symmetric if and only if A^T ≡ A;

• positive definite if and only if z^T A z > 0 for every z ∈ R^n with z ≠ 0.


Definition 2.9. The matrix A ∈ R^{n×n} is:

• strictly diagonally dominant by rows if and only if |a_ii| > Σ_{j=1, j≠i}^{n} |a_ij| for every i = 1, . . . , n;

• strictly diagonally dominant by columns if and only if |a_ii| > Σ_{j=1, j≠i}^{n} |a_ji| for every i = 1, . . . , n.

The following proposition expresses a series of conditions that are only sufficient to guarantee the existence
and uniqueness of the LU factorization of a non–singular matrix A.

Proposition 2.3 (Sufficient conditions for LU factorization). Given the matrix A ∈ Rn×n , if one of the
following conditions is satisfied:
• A is symmetric and positive definite,
• or A is strictly diagonally dominant by rows,

• or A is strictly diagonally dominant by columns,


then the LU factorization of A exists and is unique.
 
Example 2.5. The LU factorization of A = [ 4 −2 1 ; −2 −5 −1 ; 1 3 7 ] exists and is unique based on Proposition 2.3 since
A is strictly diagonally dominant by rows; indeed, |4| > |−2| + |1|, |−5| > |−2| + |−1|, and |7| > |1| + |3|.
 
Example 2.6. Consider the non–singular matrix A = [ 1 −2 6 ; −2 5 −1 ; 1 1 0 ]. None of the sufficient conditions of
Proposition 2.3 are satisfied; therefore, no conclusions can be drawn regarding the existence and uniqueness of the LU
factorization of A using this proposition. In this case, the existence and uniqueness of the LU factorization of A must
be verified in terms of the necessary and sufficient condition of Proposition 2.2, from which we deduce that it exists
and is unique since det(A_1) = det([1]) ≠ 0 and det(A_2) = det([ 1 −2 ; −2 5 ]) = 1 ≠ 0.

LU factorization with pivoting (row pivoting technique)


If the necessary and sufficient condition of Proposition 2.2 is not satisfied, the LU factorization of the
matrix A cannot be determined by the GEM algorithm 2.1. However, we still need to solve the linear
system A x = b for any non–singular matrix A using the LU factorization. To this end, it is possible to use
the so–called pivoting technique (or pivoting) in combination with the GEM for the LU factorization of A.
Definition 2.10. The row pivoting technique consists of applying appropriate permutations of the rows of
the non–singular matrix A in the presence of zero pivotal elements a_kk^(k) encountered during the application
of the GEM algorithm, for some index k = 1, . . . , n − 1.

Remark 2.8. The application of the GEM with the pivoting technique (row pivoting) ensures the existence
and uniqueness of the LU factorization for any non–singular matrix A ∈ Rn×n .

Consider the specific case of pivoting technique with row permutation of the matrix A. This row
permutation of the matrix A ∈ Rn×n consists of pre–multiplying it by a permutation matrix P ∈ Rn×n ,
that is, P A. The permutation matrix P is orthogonal, meaning P T = P −1 (P T P = I); if P = I, then no
permutations are applied to the matrix A. In general, the permutation matrix P is obtained simultaneously
with the assembly of the matrices L and U during the use of the GEM with the pivoting technique (for
rows) applied to the non–singular matrix A.


Example 2.7. The non–singular matrix A from Example 2.4 does not admit the LU factorization with the stan-
dard GEM (without pivoting) since the pivot element a_22^(2) = 0. By introducing the row permutation matrix
P = [ 1 0 0 ; 0 0 1 ; 0 1 0 ], and applying it to A, we obtain the matrix Ã = P A = [ 1 1 4 ; 4 6 7 ; 2 2 3 ], where the second and third
rows have been permuted. Applying the standard GEM to the permuted matrix Ã, we obtain the LU factorization with
L = [ 1 0 0 ; 4 1 0 ; 2 0 1 ] and U = [ 1 1 4 ; 0 2 −9 ; 0 0 −5 ], where Ã = P A = L U; in this case, the new pivot elements are
ã_11^(1) = 1 ≠ 0 and ã_22^(2) = 2 ≠ 0.

In general, the pivoting technique is applied in conjunction with the GEM even if the pivot elements
are not necessarily zero. In fact, the pivoting technique can also be used to reduce and contain the prop-
agation of rounding errors associated with the application of the GEM. Specifically, at the generic iteration
k = 1, . . . , n − 1 of the GEM, row k is permuted with row l, where

    l = arg max_{i=k,...,n} |a_ik^(k)|,

with ã_kk^(k) = a_lk^(k) being the new pivot element for the k–th iteration.

Algorithm 2.2: Gaussian Elimination Method (GEM) with pivoting (row pivoting)

set A^(1) = A and P = I ;
for k = 1, . . . , n − 1 do
    find r̄ such that |a_r̄k^(k)| = max_{r=k,...,n} |a_rk^(k)| and swap row k with row r̄ in both A^(k) and P ;
    for i = k + 1, . . . , n do
        l_ik = a_ik^(k) / a_kk^(k) ;
        for j = k + 1, . . . , n do
            a_ij^(k+1) = a_ij^(k) − l_ik a_kj^(k) ;
        end
    end
    assign A^(k+1) as in Eq. (2.7) ;
end
assign L as in Eq. (2.5) and set U = A^(n) ;

If the row pivoting technique is applied to determine the LU factorization of the nonsingular matrix A,
specifically using the row permutation matrix P , then the matrices L and U provide the LU factorization
of the permuted matrix P A as:
P A = L U.
It follows that the linear system A x = b can be solved by sequentially solving the following lower and
upper triangular systems:
Ly = P b and U x = y;
Indeed, we have P A x = P b and L U x = P b, so introducing the vector y = U x, we obtain the
previous result.

Definition 2.11. The LU factorization method with row pivoting for solving the linear system A x = b,
which is based on row permutation with the matrix P , consists of:
1. determining the LU factorization of the matrix P A (P A = L U );
2. solving the lower triangular linear system L y = P b using the forward substitution algorithm (2.2);

3. solving the upper triangular system U x = y using the backward substitution algorithm (2.4).
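In Matlab, the built–in command lu implements the LU factorization with row pivoting; a minimal sketch of the LU factorization method with row pivoting (Definition 2.11), applied to the matrix of Example 2.4 with an illustrative right–hand side, may read:

    A = [1 1 4; 2 2 3; 4 6 7];     % matrix of Example 2.4: no LU factorization without pivoting
    b = [1; 2; 3];                 % illustrative right-hand side
    [L, U, P] = lu(A);             % built-in LU with row pivoting: P*A = L*U
    y = L \ (P*b);                 % forward substitution on L y = P b
    x = U \ y;                     % backward substitution on U x = y
    disp(norm(A*x - b))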


2.2.3 Cholesky factorization method


If the matrix A is symmetric and positive definite, the more computationally convenient Cholesky factor-
ization can be used instead of LU factorization.

Definition 2.12. Let A ∈ Rn×n be symmetric and positive definite. Then, its Cholesky factorization
consists of determining an upper triangular matrix R ∈ Rn×n such that:
A = RT R.

The general expression of the upper triangular matrix R ∈ R^{n×n} is:

    R = [ r_11  r_12  r_13  ...  r_1n
          0     r_22  r_23  ...  r_2n
          ...   ...   ...   ...  ...
          0     ...   0     ...  r_nn ] ,

which, for A ∈ Rn×n symmetric and positive definite, is determined via the Cholesky algorithm.

Algorithm 2.3: Cholesky factorization

r_11 = √(a_11) ;
for i = 2, . . . , n do
    for j = 1, . . . , i − 1 do
        r_ji = ( a_ij − Σ_{k=1}^{j−1} r_ki r_kj ) / r_jj ;
    end
    r_ii = √( a_ii − Σ_{k=1}^{i−1} r_ki^2 ) ;
end

The Cholesky algorithm requires O(n^3/3) operations to determine the upper triangular matrix R, approx-
imately half the number of flops associated with the LU factorization; additionally, the memory used by the
computer is also lower.
If A is symmetric and positive definite, the Cholesky factorization exists (A = RT R) and the solution
of the linear system A x = b can be obtained sequentially as the solution of the following lower and upper
triangular systems:

RT y = b and R x = y,

where RT is a lower triangular matrix; indeed, since A = RT R, we have RT R x = b, from which the
previous result follows by introducing the vector y = R x ∈ Rn .

Definition 2.13. The Cholesky factorization method for solving the linear system Ax = b, with A sym-
metric and positive definite, consists of:
1. determining the Cholesky factorization of the matrix A (A = RT R);
2. solving the lower triangular system RT y = b using the forward substitution algorithm (2.2);

3. solving the upper triangular system R x = y using the backward substitution algorithm (2.4).
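In Matlab, the built–in command chol returns the upper triangular factor R such that A = R^T R; a minimal sketch of the Cholesky factorization method (Definition 2.13) on an illustrative symmetric and positive definite matrix (not taken from these notes) may read:

    A = [4 1 0; 1 3 1; 0 1 2];     % illustrative symmetric and positive definite matrix
    b = [1; 1; 1];
    R = chol(A);                   % upper triangular R with A = R'*R
    y = R' \ b;                    % forward substitution on R^T y = b
    x = R \ y;                     % backward substitution on R x = y
    disp(norm(A*x - b))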


2.2.4 Thomas algorithm


Consider the non–singular matrix A ∈ R^{n×n}, with n ≥ 2, which is tridiagonal, expressed as:

    A = [ a_1  c_1  0    ...      ...      0
          e_2  a_2  c_2  0        ...      0
          0    e_3  a_3  c_3      ...      0
          ...  ...  ...  ...      ...      ...
          0    ...  0    e_{n−1}  a_{n−1}  c_{n−1}
          0    ...  ...  0        e_n      a_n ] ,

with elements {a_i}_{i=1}^{n}, {c_i}_{i=1}^{n−1}, and {e_i}_{i=2}^{n}. We assume that this matrix A admits the existence and
uniqueness of the LU factorization without pivoting; in the case of a tridiagonal matrix A, the LU factorization
generates the following bidiagonal matrices L and U:

    L = [ 1    0    0    ...  0
          β_2  1    0    ...  0
          0    β_3  1    ...  0
          ...  ...  ...  ...  ...
          0    ...  0    β_n  1 ]

and

    U = [ α_1  c_1  0    ...   0
          0    α_2  c_2  ...   0
          0    0    α_3  ...   0
          ...  ...  ...  ...   c_{n−1}
          0    ...  ...  0     α_n ] ,

with elements {α_i}_{i=1}^{n} and {β_i}_{i=2}^{n} determined as follows:

    α_1 = a_1,
    β_i = e_i / α_{i−1}   and   α_i = a_i − β_i c_{i−1}    for i = 2, . . . , n.                (2.8)

Now consider the linear system Ax = b, with A ∈ Rn×n being the former tridiagonal matrix, which we
solve using the LU factorization method. The LU factorization of A is performed using Eq. (2.8); in this
way, the linear system L y = b is solved using the following forward substitution algorithm adapted to the
lower bidiagonal matrix L:

    y_1 = b_1,
    y_i = b_i − β_i y_{i−1}    for i = 2, . . . , n.                                            (2.9)

Finally, the linear system U x = y is solved using the following backward substitution algorithm adapted
to the upper bidiagonal matrix U:

    x_n = y_n / α_n,
    x_i = ( y_i − c_i x_{i+1} ) / α_i    for i = n − 1, . . . , 1.                              (2.10)

Definition 2.14. The Thomas algorithm for solving the linear system A x = b, with A being a non–
singular tridiagonal matrix that admits a unique LU factorization without pivoting, consists of:

1. determining the LU factorization of the matrix A (A = L U ) using the algorithm from Eq. (2.8);
2. solving the lower bidiagonal system L y = b with the algorithm from Eq. (2.9);
3. solving the upper bidiagonal system U x = y with the algorithm from Eq. (2.10).

The Thomas algorithm requires only O(8n) operations (specifically 8n − 7) to solve the linear system
associated with the tridiagonal matrix A ∈ Rn×n .
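A minimal Matlab sketch of the Thomas algorithm may read as follows (the function name thomas and the choice of passing the three diagonals as separate vectors are illustrative):

    function x = thomas(e, a, c, b)
    % Thomas algorithm (Definition 2.14) for a tridiagonal system.
    % a: main diagonal (length n), c: upper diagonal (length n-1),
    % e: lower diagonal (length n-1, e(i-1) plays the role of e_i), b: right-hand side.
    n = length(a);
    alpha = zeros(n,1); beta = zeros(n,1); y = zeros(n,1); x = zeros(n,1);
    alpha(1) = a(1);
    for i = 2:n                               % LU factorization, Eq. (2.8)
        beta(i)  = e(i-1) / alpha(i-1);
        alpha(i) = a(i) - beta(i) * c(i-1);
    end
    y(1) = b(1);
    for i = 2:n                               % forward substitution, Eq. (2.9)
        y(i) = b(i) - beta(i) * y(i-1);
    end
    x(n) = y(n) / alpha(n);
    for i = n-1:-1:1                          % backward substitution, Eq. (2.10)
        x(i) = (y(i) - c(i) * x(i+1)) / alpha(i);
    end
    end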


2.2.5 Accuracy of the numerical solution obtained by direct methods


We address the accuracy of the solution to the linear system A x = b obtained through direct methods.
Indeed, when using a computer, the numerical solution may be affected by the propagation of round–off
errors, whereas there are no errors in exact arithmetic.

Some definitions from linear algebra

Definition 2.15. Given a vector v ∈ R^n, its norm p is defined as

    ‖v‖_p := ( Σ_{i=1}^{n} |v_i|^p )^{1/p}    for 1 ≤ p < +∞,

with ‖v‖_∞ := max_{i=1,...,n} |v_i|. In particular, for a vector v ∈ R^n, we have:

    ‖v‖_2 = √(v · v) = ( Σ_{i=1}^{n} |v_i|^2 )^{1/2},    ‖v‖_1 = Σ_{i=1}^{n} |v_i|,    and    ‖v‖_∞ = max_{i=1,...,n} |v_i|.

Typically, the 2–norm of a vector v is simply denoted as ‖v‖ ≡ ‖v‖_2.

Definition 2.16. Given a matrix A ∈ C^{n×n}, its eigenvalues {λ_i(A)}_{i=1}^{n} ∈ C and the corresponding
eigenvectors {v_i}_{i=1}^{n} ∈ C^n satisfy:

    A v_i = λ_i v_i    for each i = 1, . . . , n.

Then, the spectral radius of A ∈ C^{n×n} is defined as:

    ρ(A) := max_{i=1,...,n} |λ_i(A)|.

Given a matrix A ∈ C^{n×n}, we observe that: det(A) = Π_{i=1}^{n} λ_i(A); λ_i(A^{−1}) = 1/λ_{n+1−i}(A) for i =
1, . . . , n, if the inverse A^{−1} exists; ρ(A) ≥ 0. Next, we focus on a real–valued matrix A ∈ R^{n×n}.

Proposition 2.4. If the matrix A ∈ Rn×n is symmetric, then its eigenvalues are real, i.e., λi (A) ∈ R for
every i = 1, . . . , n. Consequently, if A ∈ Rn×n is symmetric, then it is also positive definite if and only if
all its eigenvalues are strictly positive, i.e., λi (A) > 0 for every i = 1, . . . , n.

Definition 2.17. Given the matrix A ∈ R^{n×n}, its norm p is defined as:

    ‖A‖_p := sup_{v ∈ R^n, v ≠ 0} ‖A v‖_p / ‖v‖_p    for 1 ≤ p ≤ +∞.                            (2.11)

Given A ∈ R^{n×n}, we have:

    ‖A‖_2 = sup_{v ∈ R^n, v ≠ 0} ‖A v‖ / ‖v‖ = √( λ_max(A^T A) ).

If A is symmetric and positive definite, then ‖A‖_2 = λ_max(A), since λ_max(A^T A) = (λ_max(A))^2.

Definition 2.18. The condition number in norm p of a non–singular matrix A ∈ R^{n×n} is defined as:

    K_p(A) := ‖A‖_p ‖A^{−1}‖_p    for 1 ≤ p ≤ +∞.


By convention, if A is singular, then K_p(A) = +∞. For a non–singular matrix A ∈ R^{n×n}, we have
K_p(A) ≥ 1 for every 1 ≤ p ≤ +∞. Moreover,

    K_2(A) = ‖A‖_2 ‖A^{−1}‖_2 = √( λ_max(A^T A) / λ_min(A^T A) ).

Definition 2.19. The spectral condition number of a non–singular matrix A ∈ R^{n×n} is defined as:

    K(A) := ρ(A) ρ(A^{−1}) = max_{i=1,...,n} |λ_i(A)| / min_{i=1,...,n} |λ_i(A)|,

where ρ(A) and ρ(A^{−1}) are the spectral radii of the matrices A and A^{−1}, respectively.

If the eigenvalues of A are real and strictly positive, K(A) = λ_max(A)/λ_min(A), where λ_max(A) and λ_min(A)
are the maximum and minimum eigenvalues of A, respectively. Therefore, if A is symmetric and positive
definite, then

    K_2(A) ≡ K(A) = λ_max(A)/λ_min(A).
Remark 2.9. The condition number of a matrix A provides a measure of the sensitivity of the solution of
the linear system A x = b to perturbations in the data, i.e., b and the matrix A itself. The system is said
to be well–conditioned if K_p(A) is relatively "small", and ill–conditioned if K_p(A) is "very large" (e.g.,
O(10^9) or larger).


Accuracy of the numerical solution


Numerically solving the linear system A x = b using a direct method at the computer is equivalent to
solving the following perturbed linear system in exact arithmetic (Wilkinson principle):

    (A + δA) x̂ = b + δb,                                                                       (2.12)

where x̂ ∈ R^n is the numerical solution, δA ∈ R^{n×n} is the perturbation matrix of A, and δb ∈ R^n is the
perturbation vector of b. This remark stands at the basis of the procedure to quantify the accuracy of the
numerical solution x̂ with respect to the exact one x.

Definition 2.20. For the linear system A x = b:

• the absolute error is e := x − x̂, where e ∈ R^n, while the relative error is e_rel := ‖x − x̂‖ / ‖x‖, with
  x ≠ 0 and e_rel ∈ R;

• the residual is r := b − A x̂, where r ∈ R^n, while the relative residual is r_rel := ‖r‖ / ‖b‖, for b ≠ 0,
  where r_rel ∈ R.

Remark 2.10. In general, the residual r and the relative residual r_rel are used as estimators of the error
associated with the numerical solution x̂; in fact, the exact solution x of the linear system A x = b is
generally unknown.

Proposition 2.5 (Stability estimate). The relative error associated with the numerical solution x̂ of the
linear system A x = b is estimated as:

    e_rel ≤ K_2(A) r_rel = K_2(A) ‖r‖ / ‖b‖.                                                    (2.13)


Proof. From the definition of the residual, it follows that r := b − A x̂ = A x − A x̂; thus, x − x̂ = A^{−1} r,
from which we obtain that ‖x − x̂‖ = ‖A^{−1} r‖ ≤ ‖A^{−1}‖_2 ‖r‖. Furthermore, since ‖b‖ = ‖A x‖ ≤ ‖A‖_2 ‖x‖,
it follows that 1/‖x‖ ≤ ‖A‖_2 / ‖b‖, leading to

    e_rel := ‖x − x̂‖ / ‖x‖ ≤ ‖A^{−1}‖_2 ‖r‖ ‖A‖_2 / ‖b‖ = K_2(A) ‖r‖/‖b‖ = K_2(A) r_rel.

Remark 2.11. The error (stability) estimate from Eq. (2.13) is an a posteriori error estimate and can be
evaluated once the numerical solution x̂ has been computed.

Based on the result in (2.13), the relative residual r_rel represents a criterion that satisfactorily estimates
the error associated with the numerical solution x̂ of the linear system obtained using a direct method on
the computer only if the condition number is "small", that is, when the matrix A is well–conditioned.
Conversely, if the condition number of the matrix A is "large", meaning that A is ill–conditioned, then
the error associated with x̂ could be very "large" even if r_rel is "small", due to the propagation of round–off
errors during the application of the direct method on the computer.
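The following Matlab lines sketch the a posteriori estimate (2.13) on the Hilbert matrix, a classical ill–conditioned example not discussed in these notes: the relative residual stays small, while the bound K_2(A) r_rel (and the true error) can be much larger:

    n = 10;
    A = hilb(n);                         % Hilbert matrix: a classical ill-conditioned example
    x_ex = ones(n, 1);
    b = A * x_ex;                        % the exact solution is known by construction
    x_hat = A \ b;                       % numerical solution by a direct method
    r_rel = norm(b - A*x_hat) / norm(b); % relative residual
    e_rel = norm(x_ex - x_hat) / norm(x_ex);
    fprintf('r_rel = %.2e, e_rel = %.2e, K2(A)*r_rel = %.2e\n', r_rel, e_rel, cond(A)*r_rel)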

2.2.6 The Matlab® command \

The Matlab® command \ represents an extremely efficient implementation of direct methods for solving
the linear system A x = b:
» x = A \ b;
The Matlab® command \ bases its choice of the direct method to be employed on the properties of
the matrix A. If A ∈ R^{n×n} is sparse and banded, then Matlab® uses a method based on generalizations
of Thomas's algorithm, depending on the type of band of the matrix. If A is lower or upper triangular, it
employs the forward or backward substitution methods, respectively. If A is symmetric and positive definite,
Matlab® uses the Cholesky factorization method. Finally, for a generic matrix A ∈ R^{n×n}, it applies the LU
factorization method with pivoting.

2.3 Iterative Methods


We consider iterative methods for solving the linear system A x = b. These methods yield the solution
x, in principle, by means of an infinite number of iterations, that is, as x = lim_{k→+∞} x^(k), where the iterates
{x^(k)}_{k=0}^{+∞} represent a sequence of solution vectors, while x^(0) is the initial guess (or initial solution).

2.3.1 The general algorithm


A possible manner to write an iterative algorithm to approximate the solution of A x = b, with A ∈ R^{n×n}
non–singular and x, b ∈ R^n, is the following:

    given x^(0) ∈ R^n,
    x^(k+1) = B x^(k) + g    for k = 0, 1, . . . ,                                              (2.14)

where B ∈ R^{n×n} is the iteration matrix and g ∈ R^n is the iteration vector. B and g depend on A, b,
and the specific method under consideration. However, the iterative method must comply with the strong
consistency condition, which is such that, if x is the solution of A x = b, then it must hold x = B x + g.
It follows that the iteration vector must read g = (I − B) A^{−1} b, since x = A^{−1} b.

Copyright c Luca Dede’ 2024


30 Numerical Solution of Linear Systems

Definition 2.21. We define the error e^(k) ∈ R^n corresponding to x^(k) ∈ R^n of the iterative method (2.14)
as:
    e^(k) := x − x^(k)    for k = 0, 1, . . . ,
while the residual r^(k) ∈ R^n is:
    r^(k) := b − A x^(k)    for k = 0, 1, . . . .

It follows that e^(k+1) = x − x^(k+1) = (B x + g) − (B x^(k) + g) = B (x − x^(k)) = B e^(k) for
k = 0, 1, . . .. By recursion, we obtain:

    e^(k) = B^k e^(0)    for k = 0, 1, . . . ,                                                  (2.15)

from which we get the following error estimate:

    ‖e^(k)‖ ≤ ‖B^k‖_2 ‖e^(0)‖    for k = 0, 1, . . . .                                          (2.16)

We see that lim_{k→+∞} e^(k) = 0 if and only if lim_{k→+∞} B^k = 0, which is verified if and only if ρ(B) < 1, with
ρ(B) the spectral radius of B. Indeed, in general, the following holds:

    ‖e^(k)‖ ≤ (ρ(B))^k ‖e^(0)‖    for k = 0, 1, . . . .

Proposition 2.6 (Necessary and sufficient condition for convergence). The iterative method (2.14) is con-
vergent to the exact solution x ∈ R^n of the linear system A x = b for every choice of the initial guess
x^(0) ∈ R^n if and only if the spectral radius of the iteration matrix B is strictly smaller than one, that is,
ρ(B) < 1. Moreover, the smaller ρ(B) is, the faster the convergence.

2.3.2 Splitting methods


This is a family of iterative methods for which the iteration matrix B is obtained by an additive splitting of A.
We introduce the non–singular matrix P ∈ R^{n×n}, called preconditioning matrix (or preconditioner). By
noticing that A = P − (P − A) and that A x = b, we have

    P x = (P − A) x + b,

from which x = P^{−1} (P − A) x + P^{−1} b. From the last equality, due to the strong consistency condition,
we obtain the iteration matrix and the iteration vector, respectively:

    B = I − P^{−1} A                                                                            (2.17)

and g = P^{−1} b. Hence, the iterative method (2.14) can be written as P x^(k+1) = (P − A) x^(k) + b, from
which
    P ( x^(k+1) − x^(k) ) = r^(k).

Definition 2.22. The preconditioned residual z(k) ∈ Rn is the solution of the following linear system:

P z(k) = r(k) for k = 0, 1, . . . ,

where P ∈ Rn×n is the non–singular preconditioner.

Hence, the iterative method (2.14) can be more conveniently written as:

given x(0) ∈ Rn ,
(2.18)
solve P z(k) = r(k) and set x(k+1) = x(k) + z(k) for k = 0, 1, . . . .


We remark that r(k+1) = b − A x(k+1) = b − A x(k) − A z(k) = r(k) − A z(k) . Therefore, we write the
following algorithm.

Algorithm 2.4: Preconditioned iterative method


choose x(0) ∈ Rn and P ∈ Rn×n non–singular, set r(0) = b − A x(0) ;
for k = 0, 1, . . ., till a stopping criterion is satisfied do
solve the linear system P z(k) = r(k) (by means of a direct method);
set x(k+1) = x(k) + z(k) ;
set r(k+1) = r(k) − A z(k) ;
end

The iterative method must be stopped by means of suitable stopping criteria. We can consider the stopping
criterion based on the normalized residual, such that iterations are stopped at the first k ≥ 0 for which
‖r^(k)‖ / ‖b‖ < tol, for a prescribed tolerance tol. The number of iterations should also be limited to a
maximum value k_max.
At each iteration of the method (2.18) we need to solve the linear system P z(k) = r(k) . Therefore,
the choice of the preconditioner P should be such that the linear system P z(k) = r(k) is solved in a
computationally efficient manner by means of a direct method. In other words, this linear system should
be “easy" to solve by means of a direct method.
In addition, the choice of P must guarantee that the iterative method is convergent, that is, the iteration
matrix B = I − P^{−1} A is such that ρ(B) < 1. Moreover, it is desirable that ρ(B) ≪ 1 to ensure a fast
convergence to x.
In this context, it is clear that the choice of the preconditioning matrix P is a trade–off between the
"simplicity" of solving the linear system P z^(k) = r^(k) at each iteration k of the iterative method and the
need to ensure a (fast) convergence of the iterative method (that is, ρ(B) < 1 and possibly ρ(B) ≪ 1).
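A minimal Matlab sketch of Algorithm 2.4 may read as follows (the function name prec_iter and the input arguments are illustrative; P is passed as a matrix and the system P z = r is solved by the direct method behind the backslash command):

    function [x, k] = prec_iter(A, b, P, x0, tol, kmax)
    % Preconditioned iterative method (Algorithm 2.4), stopped on the normalized residual.
    x = x0;  r = b - A*x;  k = 0;
    while norm(r)/norm(b) > tol && k < kmax
        z = P \ r;          % solve P z^(k) = r^(k) by a direct method
        x = x + z;          % x^(k+1) = x^(k) + z^(k)
        r = r - A*z;        % r^(k+1) = r^(k) - A z^(k)
        k = k + 1;
    end
    end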

2.3.3 Jacobi and Gauss–Seidel methods


These iterative methods are examples of splitting methods and can be written as in Eq. (2.18) with a suitable
choice of the preconditioner P .

Jacobi method
The Jacobi method can be applied only to a non–singular matrix A ∈ R^{n×n} whose diagonal elements are
non–zero, that is, when a_ii ≠ 0 for all i = 1, . . . , n.
The Jacobi method selects as preconditioner P in the algorithm (2.18) the diagonal matrix extracted
from A. We have P = P_J, where
    P_J = D,
with D ∈ R^{n×n} the diagonal matrix with elements (D)_ii = a_ii for all i = 1, . . . , n and (D)_ij = 0 for
every i ≠ j. In this manner, det(P_J) ≠ 0. The linear system P_J z^(k) = r^(k) of Eq. (2.18) is "easy" to
solve by means of a direct method (only n divisions) since P_J = D is diagonal. The iteration matrix
corresponding to the method is:
    B_J = I − P_J^{−1} A = I − D^{−1} A.

Hence, the convergence of the Jacobi method to x for every choice of the initial guess x^(0) depends on the
value of ρ(B_J), according to Proposition 2.6.

Gauss–Seidel method
The Gauss–Seidel method can be applied only to a non–singular matrix A ∈ R^{n×n} whose diagonal ele-
ments are non–zero, that is, when a_ii ≠ 0 for all i = 1, . . . , n. The preconditioner P in Eq. (2.18) is the
lower triangular matrix extracted from A.


By convention, we indicate by D the diagonal matrix extracted from A and by E ∈ R^{n×n} the lower
triangular matrix with zero diagonal elements such that (E)_ij = −a_ij for i = 2, . . . , n and j = 1, . . . , i − 1,
(E)_ij = 0 otherwise. Finally, F ∈ R^{n×n} is the upper triangular matrix with zero diagonal elements such
that (F)_ij = −a_ij for i = 1, . . . , n − 1 and j = i + 1, . . . , n, (F)_ij = 0 otherwise. Thus, we have
A = D − E − F. The preconditioner for the Gauss–Seidel method is P = P_GS, where

    P_GS = D − E,

with det(P_GS) ≠ 0 according to the hypothesis. The linear system P_GS z^(k) = r^(k) of Eq. (2.18) is
"easy" to solve by means of a direct method, in particular by means of the forward substitution method (in
n^2 operations) since P_GS = (D − E) is lower triangular. The corresponding iteration matrix is:

    B_GS = I − P_GS^{−1} A = I − (D − E)^{−1} A.

The convergence properties of the method depend on ρ(B_GS), in agreement with Proposition 2.6.

Sufficient conditions for the convergence of Jacobi and Gauss–Seidel methods


According to Proposition 2.6, the necessary and sufficient condition for the convergence of the Jacobi and
Gauss–Seidel methods for every x^(0) ∈ R^n is that ρ(B_J) < 1 and ρ(B_GS) < 1, respectively. However,
in some particular cases, it is possible to establish whether the method is convergent simply by inspecting
the matrix A, without the computationally expensive operations of assembling B and calculating ρ(B)
(eigenvalues must be computed too). The following are sufficient conditions.

Proposition 2.7. If A is non–singular and strictly diagonally dominant by row, then both the Jacobi and
Gauss–Seidel methods converge to x for every x(0) ∈ Rn .

Proposition 2.8. If A is symmetric and positive definite, then the Gauss–Seidel method converges to x for
every x(0) ∈ Rn .

Proposition 2.9. If A is non–singular and tridiagonal with every diagonal element non–zero, then the
Jacobi and Gauss–Seidel methods are either both convergent to x for every x^(0) ∈ R^n or both divergent. If they
are convergent, the Gauss–Seidel method is faster than the Jacobi method since ρ(B_GS) = (ρ(B_J))^2.

The former conditions are only sufficient. If these are not satisfied, then the necessary and sufficient condi-
tion of Proposition 2.6 must be verified to establish the convergence of the iterative method.
 
Example 2.8. Consider A = [ 3 1 ; 1 2 ], which is strictly diagonally dominant by rows. Since the conditions of
Proposition 2.7 are met, both the Jacobi and Gauss–Seidel methods are convergent for every x^(0) ∈ R^2 to the
solution x ∈ R^2 of the linear system A x = b, regardless of b ∈ R^2. The hypotheses of Propositions 2.8 and 2.9
are satisfied too. We verify the result by means of the necessary and sufficient condition of Proposition 2.6. For the
Jacobi method, we have P_J = [ 3 0 ; 0 2 ] and B_J = I − P_J^{−1} A = [ 0 −1/3 ; −1/2 0 ], hence ρ(B_J) = 1/√6 < 1.
For the Gauss–Seidel method, we have P_GS = [ 3 0 ; 1 2 ] and B_GS = I − P_GS^{−1} A = [ 0 −1/3 ; 0 1/6 ], from which
ρ(B_GS) = 1/6 < 1; the method converges faster than the Jacobi method.

 
Example 2.9. We consider the non–singular matrix A = [ 1 0 −1 ; 3 2 0 ; −1 −1 2 ] for the linear system A x = b. None
of the sufficient conditions of Propositions 2.7, 2.8, or 2.9 are satisfied. Therefore, we need to verify the necessary
and sufficient condition of Proposition 2.6. For the Jacobi method, we have P_J = [ 1 0 0 ; 0 2 0 ; 0 0 2 ] and
B_J = I − P_J^{−1} A = [ 0 0 1 ; −3/2 0 0 ; 1/2 1/2 0 ], for which ρ(B_J) ≃ 109/100 > 1. Therefore, the Jacobi method does not converge
to x ∈ R^3 for every choice of x^(0) ∈ R^3. For the Gauss–Seidel method, we have P_GS = [ 1 0 0 ; 3 2 0 ; −1 −1 2 ] and
B_GS = I − P_GS^{−1} A = [ 0 0 1 ; 0 0 −3/2 ; 0 0 −1/4 ], from which ρ(B_GS) = 1/4 < 1. Therefore, the Gauss–Seidel method
converges to x for every choice of x^(0) ∈ R^3.
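The spectral radii used in Examples 2.8 and 2.9 can be checked numerically; for instance, for the matrix of Example 2.9 a Matlab sketch may read:

    A   = [1 0 -1; 3 2 0; -1 -1 2];   % matrix of Example 2.9
    PJ  = diag(diag(A));              % Jacobi preconditioner P_J = D
    PGS = tril(A);                    % Gauss-Seidel preconditioner P_GS = D - E
    BJ  = eye(3) - PJ  \ A;           % Jacobi iteration matrix
    BGS = eye(3) - PGS \ A;           % Gauss-Seidel iteration matrix
    fprintf('rho(BJ) = %.4f, rho(BGS) = %.4f\n', max(abs(eig(BJ))), max(abs(eig(BGS))))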

2.3.4 Richardson methods


Let us consider a sequence of parameters {α_k}_{k=0}^{+∞} ∈ R. Then, the preconditioned Richardson method is
a generalization of the iterative method (2.18) and reads:

    given x^(0) ∈ R^n,
    solve P z^(k) = r^(k) and set x^(k+1) = x^(k) + α_k z^(k)    for k = 0, 1, . . . ,          (2.19)

with P ∈ R^{n×n} a non–singular preconditioning matrix.


If αk = α ∈ R for all k = 0, 1, . . ., the Richardson method (2.19) is called stationary, otherwise if αk
changes with the iteration index k = 0, 1, . . ., it is called dynamic. Let us remark that, if αk = α = 1,
we obtain the method of Eq. (2.18). Moreover, r(k+1) = b − A x(k+1) = b − A x(k) − αk A z(k) =
r(k) − αk A z(k) . For a dynamic Richardson method, the iteration matrix Bk changes with the iteration
index k, indeed Bk = I − αk P −1 A for k = 0, 1, . . .. Therefore, the convergence properties of the method
change in general with k.

Let us focus on the (preconditioned) stationary Richardson method, whose algorithm reads:

Algorithm 2.5: Preconditioned stationary Richardson method


Given x(0) ∈ Rn and α ∈ R, set r(0) = b − A x(0) ;
for k = 0, 1, . . ., till a stopping criterion is satisfied do
solve the linear system P z(k) = r(k) (by means of a direct method);
set x(k+1) = x(k) + α z(k) ;
set r(k+1) = r(k) − α A z(k) ;
end

For the stationary Richardson method, the iteration matrix reads:

Bα = I − α P −1 A.

Therefore, the convergence properties depend on the spectral radius of Bα , that is on ρ (Bα ). We notice
that the Richardson method is specified by the choice of the parameter α and the preconditioner P .
Let us consider the convergence conditions of the Richardson method. We introduce the following
definition.


Definition 2.23. The energy norm of a vector v ∈ R^n with respect to a symmetric and positive definite
matrix A ∈ R^{n×n} is defined as:
    ‖v‖_A := √( v^T A v ).

Proposition 2.10. If the matrices A and P ∈ R^{n×n} are symmetric and positive definite, then the stationary
Richardson method converges to x ∈ R^n for every choice of x^(0) ∈ R^n if and only if

    0 < α < 2 / λ_max(P^{−1} A),

where λ_max(P^{−1} A) is the largest eigenvalue of P^{−1} A. Moreover, the spectral radius of the iteration
matrix B_α is minimum for α = α_opt, where

    α_opt := 2 / ( λ_min(P^{−1} A) + λ_max(P^{−1} A) ),

with λ_min(P^{−1} A) the smallest eigenvalue of P^{−1} A. For α = α_opt, we also have:

    ‖e^(k)‖_A ≤ d^k ‖e^(0)‖_A    for k = 0, 1, . . . ,                                          (2.20)

with d := ( K(P^{−1} A) − 1 ) / ( K(P^{−1} A) + 1 ), where K(P^{−1} A) = λ_max(P^{−1} A) / λ_min(P^{−1} A) is
the spectral condition number of P^{−1} A.

Under the assumptions of Proposition 2.10, an optimal choice for the parameter in a stationary Richardson
method is available. On the other hand, the result (2.20) also indicates that the closer the preconditioning
matrix P is to the matrix A, the closer the spectral condition number of the matrix P^{−1} A is to one, and
the faster the method converges; however, in this case, solving the linear system P z^(k) = r^(k) could be
relatively complex. In particular, for P = A, we have α_opt = 1 and d = 0, meaning that the convergence
occurs in a single iteration. Conversely, if P = I, we have α_opt = 2 / ( λ_min(A) + λ_max(A) ) and
d = ( K(A) − 1 ) / ( K(A) + 1 ); in this case, the convergence of the iterative method can be slow if K(A) ≫ 1, since d ≲ 1.
In general, the closer K(P^{−1} A) is to one, the faster the method converges.

   
Example 2.10. Consider A = [ 4 1 ; 1 2 ] and the preconditioner P = [ 4 0 ; 0 4 ], both of which are symmetric and
positive definite. Therefore, to study the convergence properties of the stationary Richardson method, we can use
the results of Proposition 2.10. These depend on the matrix P^{−1} A = [ 1 1/4 ; 1/4 1/2 ] and its eigenvalues
λ_min = λ_min(P^{−1} A) = 3/4 − √2/4 and λ_max = λ_max(P^{−1} A) = 3/4 + √2/4. In particular, the stationary Richardson
method converges to the solution x ∈ R^2 of a linear system associated with A for every x^(0) ∈ R^2 if and only
if 0 < α < 2/λ_max = 8/(3 + √2). Furthermore, the optimal parameter α_opt = 2/(λ_min + λ_max) = 4/3 minimizes the
spectral radius among the iteration matrices B_α; specifically, B_{α_opt} = I − α_opt P^{−1} A = [ −1/3 −1/3 ; −1/3 1/3 ] and
ρ(B_{α_opt}) = √2/3 < 1. From Eq. (2.20), it follows that d = ( K(P^{−1} A) − 1 ) / ( K(P^{−1} A) + 1 ) = (λ_max − λ_min)/(λ_max + λ_min) = ρ(B_{α_opt}) = √2/3;
that is, the error in the A–energy norm decreases at least by a factor of d = √2/3 at each iteration.
In general, for a preconditioned stationary Richardson method, determining the optimal parameter
αopt can be computationally expensive, as it is related to the eigenvalues of P −1 A. To avoid explicitly
computing these eigenvalues for the determination of the parameter α, one can appropriately use a dynamic
preconditioned Richardson method.


2.3.5 Gradient methods


Consider the case of a symmetric positive definite matrix A ∈ R^{n×n}. The preconditioned gradient method
is a preconditioned dynamic Richardson method (2.19), where the parameters α_k are chosen as:

    α_k = ( (z^(k))^T r^(k) ) / ( (z^(k))^T A z^(k) )    for k = 0, 1, . . . ,

where P is a symmetric and positive definite matrix, and z^(k) is the preconditioned residual.

Algorithm 2.6: Preconditioned gradient method

Given x^(0) ∈ R^n, set r^(0) = b − A x^(0) ;
for k = 0, 1, . . ., until a stopping criterion is satisfied do
    solve P z^(k) = r^(k) ;
    set α_k = ( (z^(k))^T r^(k) ) / ( (z^(k))^T A z^(k) ) ;
    set x^(k+1) = x^(k) + α_k z^(k) ;
    set r^(k+1) = r^(k) − α_k A z^(k) ;
end

Similarly, the gradient method is an unpreconditioned dynamic Richardson method with parameters α_k
chosen as:

    α_k = ( (r^(k))^T r^(k) ) / ( (r^(k))^T A r^(k) )    for k = 0, 1, . . . .                  (2.21)

It is possible to obtain the gradient method from the preconditioned gradient method by choosing P = I;
in this case, z^(k) ≡ r^(k) for every k = 0, 1, . . .. For the gradient method, the residual vector r^(k) represents
the descent direction for the error at iteration k = 0, 1, . . ., and if A is symmetric and positive definite, the
choice of α_k given by (2.21) minimizes the error ‖e^(k+1)‖_A along the direction r^(k). The algorithm reads:

Algorithm 2.7: Gradient method

Given x^(0) ∈ R^n, set r^(0) = b − A x^(0) ;
for k = 0, 1, . . ., until a stopping criterion is satisfied do
    set α_k = ( (r^(k))^T r^(k) ) / ( (r^(k))^T A r^(k) ) ;
    set x^(k+1) = x^(k) + α_k r^(k) ;
    set r^(k+1) = r^(k) − α_k A r^(k) ;
end
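A minimal Matlab sketch of the (unpreconditioned) gradient method of Algorithm 2.7 may read as follows (the function name grad_method is illustrative; A is assumed symmetric and positive definite):

    function [x, k] = grad_method(A, b, x0, tol, kmax)
    % Gradient method (Algorithm 2.7) for A symmetric and positive definite.
    x = x0;  r = b - A*x;  k = 0;
    while norm(r)/norm(b) > tol && k < kmax
        Ar    = A * r;
        alpha = (r' * r) / (r' * Ar);   % optimal step length along the direction r^(k)
        x = x + alpha * r;
        r = r - alpha * Ar;
        k = k + 1;
    end
    end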

We provide the following interpretation of the gradient method (P = I). Let A ∈ R^{n×n} be symmetric
and positive definite. Then, the vector x ∈ R^n is a solution of the linear system A x = b if and only
if the system energy function Φ : R^n → R, defined as Φ(y) = (1/2) y^T A y − y^T b, achieves its minimum
at y = x. By rewriting Φ : R^n → R as Φ(y) = (1/2) y^T A y − y^T b = (1/2) Σ_{i,j=1}^{n} y_i a_ij y_j − Σ_{i=1}^{n} b_i y_i and
since A is symmetric, we have ∂Φ/∂y_m = (1/2) ( Σ_{j=1}^{n} a_mj y_j + Σ_{i=1}^{n} y_i a_im ) − b_m or, grouping all components,
∇Φ(y) = (1/2) (A^T + A) y − b = A y − b.


Let us assume that y now corresponds to the k–th iterate x^(k) of the method, with k = 0, 1, . . .,
meaning that Φ(x^(k)) = Φ(x) + (1/2) ‖e^(k)‖_A^2 and ∇Φ(x^(k)) = −r^(k). The goal is to determine x^(k+1) such
that Φ(x^(k+1)) ≤ Φ(x^(k)), so we can choose the descent direction

    −∇Φ(x^(k)) = r^(k),

which is opposite to the gradient of Φ, giving the method its name. Therefore, we write

    x^(k+1) = x^(k) − α_k ∇Φ(x^(k)) = x^(k) + α_k r^(k),
with αk ∈ R to be determined. Once the descent direction r(k) is determined, the parameter αk expresses
the step size to travel along r(k) to find x(k+1) . The intersection of Φ(y) with the hyperplane passing
through (x(k) , Φ(x(k) )) and orthogonal to Rn defines the function
ϕ(α) = Φ(x(k) + α r(k) ).
We determine the step size α_k such that, once the descent direction r^(k) is chosen, we achieve the maximum
decrease of Φ. In other words, we want ϕ(α) to have a minimum at α = α_k. We have that

    ϕ′(α) = ∇Φ(x^(k) + α r^(k)) · r^(k) = ( A (x^(k) + α r^(k)) − b ) · r^(k).

By imposing that α_k is such that ϕ′(α_k) = 0, we obtain ( A (x^(k) + α_k r^(k)) − b ) · r^(k) = 0 if and only if
(−r^(k) + α_k A r^(k)) · r^(k) = 0, from which we get the expression for α_k of Eq. (2.21).
Finally, we observe that, by virtue of the choice of the descent direction and the parameter α_k, the
descent directions turn out to be pairwise orthogonal, that is:

    r^(k) · r^(k+1) = (r^(k))^T r^(k+1) = 0    for each k = 0, 1, . . . .

Indeed, we have (r^(k))^T r^(k+1) = (r^(k))^T ( r^(k) − α_k A r^(k) ) = (r^(k))^T r^(k) − ( (r^(k))^T r^(k) / (r^(k))^T A r^(k) ) (r^(k))^T A r^(k) = 0.
This indicates that the new iterate x^(k+1) is optimal with respect to the direction r^(k), although it is not
guaranteed that x^(k+1) is optimal with respect to the descent directions of all previous steps.
We have presented the algorithms for the gradient method and the preconditioned gradient method, i.e.,
with a preconditioning matrix P ≠ I. We now provide a convergence result for both these methods.

Proposition 2.11. If the matrices A and P ∈ R^{n×n} are symmetric and positive definite, the preconditioned
gradient method converges to the solution x ∈ R^n for all choices of x^(0) ∈ R^n and

    ‖e^(k)‖_A ≤ d^k ‖e^(0)‖_A    for k = 0, 1, . . . ,                                          (2.22)

where d := ( K(P^{−1} A) − 1 ) / ( K(P^{−1} A) + 1 ) and K(P^{−1} A) = λ_max(P^{−1} A) / λ_min(P^{−1} A) is the
spectral condition number of P^{−1} A.

The previous result can be applied to the (non–preconditioned) gradient method by setting P = I. More-
over, the error estimate (2.22) can be used to predict in advance the number of iterations needed by the
preconditioned gradient method to converge to the solution with the desired tolerance. The previous result
also highlights the role of the preconditioner P, which should act in such a manner that K(P^{−1} A) gets
closer to 1.

2.3.6 Conjugate gradient methods


Consider a symmetric and positive definite matrix A ∈ Rn×n , then one can develop an iterative method
based on descent directions such that (p(j) )T r(k+1) = 0 for every j = 0, 1, . . . , k, meaning that the new
iterate x(k+1) is optimal not only with respect to p(k) but also all previous directions. We want to identify
descent directions that maintain optimality at each iteration. Let x(k+1) = x(k) + q. If x(k) is optimal with
respect to some direction p, we have (r(k) )T p = 0. Imposing that x(k+1) remains optimal with respect to

p gives p · ( b − A x^(k+1) ) = p · ( r^(k) − A q ) = 0, which implies p · (A q) = 0. Thus, the directions p
and q must be A–conjugate.
Thus, for a symmetric and positive definite matrix A ∈ R^{n×n}, the conjugate gradient method mini-
mizes the error ‖e^(k+1)‖_A at each iteration k = 0, 1, . . ., along the descent direction p^(k) ∈ R^n, which is
A–conjugate to all previously calculated descent directions p^(j) for j = 0, . . . , k − 1.

Algorithm 2.8: Conjugate gradient method (unpreconditioned)

Given x^(0) ∈ R^n, set r^(0) = b − A x^(0) and p^(0) = r^(0) ;
for k = 0, 1, . . ., until a stopping criterion is satisfied do
    set α_k = ( (p^(k))^T r^(k) ) / ( (p^(k))^T A p^(k) ) ;
    set x^(k+1) = x^(k) + α_k p^(k) ;
    set r^(k+1) = r^(k) − α_k A p^(k) ;
    set β_k = ( (p^(k))^T A r^(k+1) ) / ( (p^(k))^T A p^(k) ) ;
    set p^(k+1) = r^(k+1) − β_k p^(k) ;
end

The conjugate gradient method is not part of the dynamic Richardson methods, as it requires determining
two parameters, αk and βk , at each iteration.

Proposition 2.12. If A ∈ R^{n×n} is symmetric and positive definite, the conjugate gradient method con-
verges to x ∈ R^n for any choice of x^(0) ∈ R^n in at most n iterations (in exact arithmetic), and

    ‖e^(k)‖_A ≤ ( 2 c^k / (1 + c^{2k}) ) ‖e^(0)‖_A    for k = 0, 1, . . . ,                     (2.23)

where c := ( √(K(A)) − 1 ) / ( √(K(A)) + 1 ) and K(A) is the spectral condition number of A.

The conjugate gradient method can be interpreted as a direct method, since the convergence to x ∈ R^n
occurs in at most n iterations in exact arithmetic. However, typically the algorithm is stopped before the n
iterations are completed. For sufficiently large k, the term 2 c^k / (1 + c^{2k}) in the error estimate (2.23) decreases
like 2 c^k. Therefore, if A is symmetric and positive definite, the conjugate gradient method converges more
rapidly than the gradient method, since 2 c^k < d^k for "sufficiently" large k.
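In Matlab, the conjugate gradient method is available through the built–in command pcg; an illustrative usage on a symmetric and positive definite tridiagonal matrix (not taken from these notes) may read:

    n = 50;
    A = diag(4*ones(n,1)) + diag(-ones(n-1,1),1) + diag(-ones(n-1,1),-1);  % SPD tridiagonal matrix
    b = ones(n, 1);
    [x, flag, relres, iter] = pcg(A, b, 1e-10, n);    % built-in (preconditioned) conjugate gradient
    fprintf('flag = %d, iterations = %d, relative residual = %.2e\n', flag, iter, relres)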

Example 2.11. We visualize in the following figure the convergence paths of the gradient and conjugate gradient
methods (both without preconditioning), when applied to the linear system A x = b, with A = [ 6 −1 ; −1 2 ] and
b = (5, 1)^T, whose solution is x = (1, 1)^T.
As can be seen, the orthogonality of the descent directions (taken pairwise) in the case of the gradient method
leads to a slower convergence compared to the conjugate gradient method; conversely, the conjugate gradient method,
although it considers the same descent direction in the first iteration, arrives at the exact solution (up to round–off
errors) by the second iteration, as stated in Proposition 2.12.


[Figure: iterates x^(k) of the gradient method (left) and of the conjugate gradient method (right).]

Given a preconditioning matrix P ∈ Rn×n that is symmetric and positive definite, it is possible to de-
fine the preconditioned conjugate gradient (PCG) method by generalizing the conjugate gradient method.
We do not present the algorithm here, but we highlight the following result.

Proposition 2.13. If A and P ∈ R^{n×n} are symmetric and positive definite matrices, the preconditioned
conjugate gradient method converges to x ∈ R^n for every choice of x^(0) ∈ R^n. The error ‖e^(k)‖_A satisfies
the estimate (2.23), where now c = ( √(K(P^{−1} A)) − 1 ) / ( √(K(P^{−1} A)) + 1 ).

2.3.7 Computational error and stopping criteria


Iterative methods approximate the solution x of the linear system A x = b with a sequence of iterates
x^(k) such that lim_{k→+∞} x^(k) = x. However, it is necessary to limit the sequence of iterates to a finite step k
such that x^(k) ≃ x. This introduces, in exact arithmetic, the truncation error e_t = ‖x^(k) − x‖ = ‖e^(k)‖.
The use of a computer for applying an iterative method inevitably incurs round–off errors, resulting in the
approximated solution x̂^(k). The round–off error is therefore e_r = ‖x^(k) − x̂^(k)‖, while the computational
error is e_c = e_t + e_r. Typically, for an iterative method, it holds that e_t ≫ e_r, so the computational error
and the truncation error are of the same order of magnitude, i.e., e_c ≃ e_t.
Iterative methods must be stopped according to an appropriate stopping criterion. Additionally, the
number of iterations of the algorithm should be limited by some sufficiently large integer k_max. The stop-
ping criterion consists of terminating the execution of the algorithm at the iterate k such that an appropriate
estimator of the true error, denoted ẽ^(k), is smaller than a predefined tolerance tol, that is, ẽ^(k) < tol.
The first error estimator and stopping criterion is that of the normalized residual. In this case,

    ẽ^(k) = r_rel^(k) := ‖r^(k)‖ / ‖b‖

is used to estimate the relative error e_rel^(k) := ‖x − x^(k)‖ / ‖x‖, for x ≠ 0. Let us recall the result of Proposi-
tion 2.5 and, specifically, Eq. (2.13); by setting x̂ = x^(k) and r = r^(k), we have, for a generic (preconditioned)
iterative method:

    e_rel^(k) ≤ K_2(A) r_rel^(k).
Thus, the stopping criterion based on the residual is satisfactory if the condition number K_2(A) of the
matrix A of the linear system to be solved is relatively "small", that is, if A is well–conditioned. Conversely,
if the matrix A is ill–conditioned, the stopping criterion based on the residual is unsatisfactory, since the
true error is underestimated by the residual.
The second criterion is the difference between successive iterates, for which ẽ^(k) = ‖x^(k) − x^(k−1)‖,
for k ≥ 1. It can be shown that this stopping criterion is satisfactory if the spectral radius of the iteration
matrix B is very small, that is, ρ(B) ≪ 1. Conversely, the criterion is unsatisfactory if ρ(B) ≲ 1, since the
true error is underestimated by the error estimator.


2.4 A (Brief) Comparison Between Direct and Iterative Methods


The standard for solving linear systems A x = b, where A ∈ Rn×n is a non–singular matrix, with a direct
method is LU factorization, or Cholesky factorization for symmetric and positive definite matrices. For
tridiagonal matrices, the Thomas algorithm is used, and generalized forms can be applied to band matrices.
Iterative methods, such as stationary and dynamic preconditioned Richardson methods, are often used
for matrices with well–defined properties. The preconditioned gradient and conjugate gradient methods
represent the state of the art for symmetric positive-definite systems. If A is non–singular, solving A x = b
is equivalent to solving AT Ax = AT b. However, assembling AT A and AT b can be computationally
expensive and can further propagate numerical errors, so this approach is generally not recommended. For
non–symmetric systems, the GMRES method is widely used.
Direct methods can also be used to generate preconditioners for iterative methods. For example, incom-
plete LU factorization (ILU) and incomplete Cholesky factorization (IC) can be used for non–symmetric
and symmetric matrices, respectively.
In general, the choice between a direct or iterative method depends on the properties of A and available
computational resources. For large matrices, a rough guideline is to use iterative methods for dense matrices
and direct methods for sparse, structured matrices.



Chapter 3

Approximation of Zeros of Nonlinear Equations and Systems

The goal is to numerically approximate the zero α ∈ R of a function f (x) in the interval I = (a, b) ⊆ R.
The problem is commonly referred to as the numerical solution of a nonlinear equation.

We also want to approximate the solution of systems of nonlinear equations. Given F : R^n → R^n, for
some n ≥ 1, the problem consists of finding the (zeros) vector α ∈ R^n such that F(α) = 0. More specifi-
cally, we have:

    x = ( x_1, . . . , x_n )^T    and    F(x) = ( f_1(x), . . . , f_n(x) )^T = ( f_1(x_1, . . . , x_n), . . . , f_n(x_1, . . . , x_n) )^T.
We focus on Newton and fixed point iteration methods. We omit in this presentation the bisection
method, which is often used for continuous functions. A common feature of these numerical methods is
that they are iterative methods.

3.1 Newton Methods


We consider Newton, modified Newton, and inexact Newton methods to approximate the zero α of the
function f (x) and, subsequently, the zero α ∈ Rn of the system of nonlinear equations F(x).

3.1.1 Newton method


Let us start with the scalar case. We assume that f ∈ C^0(I) and that it is differentiable in the interval
I = (a, b) ⊆ R. Given a generic iterate x^(k) ∈ I, the equation of the tangent line to the curve (x, f(x)) at
the point (x^(k), f(x^(k))) is y(x) = f(x^(k)) + f′(x^(k)) (x − x^(k)). If we assume that y(x^(k+1)) = 0, then
we compute the iterate x^(k+1) as:

    x^(k+1) = x^(k) − f(x^(k)) / f′(x^(k))    for all k ≥ 0,                                    (3.1)


given the initial iterate x^(0) and assuming that f′(x^(k)) ≠ 0 for all k ≥ 0. Eq. (3.1) is called the Newton
iterate. The zero α is obtained as the limit of the sequence of iterates {x^(k+1)}_{k=0}^{+∞} that solve the tangent
line equation to the curve (x, f(x)) evaluated at each iterate {x^(k)}_{k=0}^{+∞}.

Example 3.1. We graphically illustrate the Newton method in the following pictures, where the first two iterations of the
method are highlighted.
[Figure: the first two steps of the Newton method (Step 1, Step 2).]

The Newton method is applicable to a function f ∈ C^0(I) that is differentiable in I; furthermore, given
x^(0) ∈ I, the Newton method consists of sequentially applying the Newton iterate (3.1), provided that
f′(x^(k)) ≠ 0 for all k ≥ 0.

Algorithm 3.1: Newton method

set k = 0 and choose the initial iterate x^(0) ;
while (the stopping criterion is not satisfied) do
    x^(k+1) = x^(k) − f(x^(k)) / f′(x^(k)) ;
    set k = k + 1;
end
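A minimal Matlab sketch of Algorithm 3.1 may read as follows (the function name newton is illustrative; the stopping criterion based on the difference between successive iterates, discussed later in this chapter, is used here):

    function [x, k] = newton(f, df, x0, tol, kmax)
    % Newton method (Algorithm 3.1); f and df are function handles for f and f'.
    x = x0;  k = 0;  incr = tol + 1;
    while incr > tol && k < kmax
        xnew = x - f(x) / df(x);   % Newton iterate (3.1)
        incr = abs(xnew - x);
        x = xnew;
        k = k + 1;
    end
    end

For instance, [alpha, k] = newton(@(x) x.^2 - 2, @(x) 2*x, 1, 1e-10, 100) approximates the zero α = √2 of f(x) = x^2 − 2.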

Assuming that f ∈ C^2(I), the Taylor series expansion of f(x) around x^(k) is written as f(x^(k+1)) =
f(x^(k)) + f′(x^(k)) δ^(k) + O((δ^(k))^2), where δ^(k) := x^(k+1) − x^(k), for k ≥ 0, is the difference between
successive iterates. If f(x^(k+1)) = 0, then the Newton method represents the first–order approximation of the
Taylor series expansion of f(x) around x^(k); we observe that this assumption is indeed satisfied if δ^(k) is
"small".
The choice of the initial iterate x(0) is crucial for the success of Newton method. In fact, it is necessary
to choose x(0) “sufficiently” close to the zero α. In practice, the sequence of Newton iterates {x(k+1) }+∞ k=0
may diverge instead of converging to α if the initial iterate x(0) is not “sufficiently” close to the zero
α. Since the zero α is unknown, the choice of x(0) may not be trivial; from this perspective, the graph
of the function or the use of the bisection method can be extremely helpful in selecting values of x(0)
“sufficiently” close to α and thus initializing the Newton method.

Example 3.2. The following example illustrates how Newton iterates do not converge to the zero α: this is due to the
fact that x(0) is not “sufficiently” close to α.


For affine (or linear) functions, that is, functions of the form f(x) = c x + d with c and d ∈ R, the Newton method
converges to the zero α = −d/c in a single iteration, regardless of the choice of x^(0). Indeed, from Eq. (3.1),
we obtain that x^(1) = x^(0) − f(x^(0)) / f′(x^(0)) = x^(0) − (c x^(0) + d)/c = −d/c = α, for any x^(0) ∈ R.
Below, we characterize the convergence properties of the Newton method. With this aim, let us introduce
the following definition, which will allow us to characterize the convergence properties of an iterative method.

Definition 3.1. An iterative method for approximating the zero α of the function f(x) is convergent with
order p if and only if

    lim_{k→+∞} |x^(k+1) − α| / |x^(k) − α|^p = µ,                                               (3.2)

where µ > 0 is a real number independent of k, called the asymptotic convergence factor. In the case of
linear convergence, i.e., for p = 1, it is necessary that 0 < µ < 1.

We illustrate in a typical graph the sequence of errors e(k) := x(k) − α against the number of iterations
k for hypothetical iterative methods with convergence orders p = 1 and 2. A logarithmic scale is used on
the error axis, while a linear scale is used on the number of iterations axis. We note that linear convergence
(p = 1) is graphically represented by a straight line whose slope depends on the asymptotic convergence
factor µ. A parabola is obtained in the case of quadratic convergence (p = 2).

Proposition 3.1 (Convergence order of Newton method). Let Iα be a neighborhood of α. If f ∈ C²(Iα),
x(0) is "sufficiently" close to α, and f'(α) ≠ 0, then the Newton method converges with order 2 (quadrat-
ically) to α, provided that f'(x(k)) ≠ 0 for all k ≥ 0. In particular, we have:

    lim_{k→+∞} (x(k+1) − α) / (x(k) − α)² = (1/2) f''(α) / f'(α);

based on Eq. (3.2), p = 2 is the convergence order and µ = (1/2) f''(α)/f'(α) is the asymptotic convergence factor.

Definition 3.2. Let f ∈ C^m(Iα), with m ∈ N such that m ≥ 1. The zero α ∈ Iα is a zero of multiplicity
m if f^(i)(α) = 0 for every i = 0, . . . , m − 1 and f^(m)(α) ≠ 0. If the previous condition is satisfied for
m = 1, the zero α is called simple; otherwise, it is called multiple.

Proposition 3.2 (Convergence order of Newton method, zero of multiplicity m). If f ∈ C²(Iα) ∩ C^m(Iα)
and x(0) is "sufficiently" close to the zero α of multiplicity m > 1, then the Newton method converges
with order 1 (linearly) to α, provided that f'(x(k)) ≠ 0 for all k ≥ 0. In particular, based on Eq. (3.2), we
have:

    lim_{k→+∞} (x(k+1) − α) / (x(k) − α) = 1 − 1/m,

with p = 1 as the convergence order and µ = 1 − 1/m ∈ (0, 1) as the asymptotic convergence factor.


If the zero α is simple, i.e., m = 1, Newton method converges at least quadratically based on Proposi-
tion 3.1. On the contrary, if the zero α is multiple (m > 1), Newton method converges only linearly based
on Proposition 3.2. We observe that, in general, the higher the order of convergence p, the fewer iterations
will be needed to achieve a desired error value, meaning the method will be more efficient.

3.1.2 Modified Newton method


Assuming that f ∈ C^m(Iα), with α ∈ Iα and m ≥ 1 the multiplicity of α, the k–th iterate of the modified
Newton method is written as:

    x(k+1) = x(k) − m f(x(k)) / f'(x(k))    for every k ≥ 0,    (3.3)

given the initial iterate x(0) and provided that f'(x(k)) ≠ 0 for all k ≥ 0. Based on Algorithm 3.1, we
obtain the following for the modified Newton method.

Algorithm 3.2: Modified Newton method

select m;
set k = 0 and choose the initial iterate x(0);
while (the stopping criterion is not satisfied) do
    x(k+1) = x(k) − m f(x(k)) / f'(x(k));
    set k = k + 1;
end

The convergence properties of the modified Newton method are characterized as follows.

Proposition 3.3 (Convergence order of the modified Newton method). If f ∈ C²(Iα) ∩ C^m(Iα), where
m ≥ 1 is the multiplicity of the zero α ∈ Iα, and x(0) is "sufficiently" close to α, then the modified Newton
method converges with order 2 (quadratically) to α, provided that f'(x(k)) ≠ 0 for all k ≥ 0.
The modified Newton method requires prior knowledge of the multiplicity m of the zero α. Alternatively,
m can be estimated by means of appropriate numerical methods or adaptive techniques.
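The following Python sketch compares the standard and modified Newton iterates on the illustrative function f(x) = (x − 1)², which has a zero α = 1 of multiplicity m = 2; the function name, tolerance, and starting point are arbitrary choices made here only for illustration.

    def newton_like(f, df, x0, m=1, tol=1e-12, kmax=60):
        """Newton (m=1) or modified Newton (m>1); stops on small increment or zero residual."""
        x = x0
        for k in range(1, kmax + 1):
            if f(x) == 0.0:                # exact zero reached
                return x, k - 1
            x_new = x - m * f(x) / df(x)
            if abs(x_new - x) < tol:
                return x_new, k
            x = x_new
        return x, kmax

    f  = lambda x: (x - 1.0)**2            # zero alpha = 1 with multiplicity m = 2
    df = lambda x: 2.0 * (x - 1.0)

    print(newton_like(f, df, 2.0, m=1))    # ~40 iterations: only linear convergence
    print(newton_like(f, df, 2.0, m=2))    # 1 iteration here: modified Newton is much faster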

3.1.3 Stopping criteria for Newton methods


We consider the stopping criteria for Newton methods. Since the zero α is generally unknown, the error
e(k) = |x(k) − α| is not computable either; therefore, it is necessary to introduce an appropriate error
estimator (error indicator) ẽ(k) such that ẽ(k) ≃ e(k). Referring, for example, to Algorithms 3.1 and 3.2 for
the Newton and modified Newton methods, the iterations are stopped at k = kmin such that ẽ(kmin) < tol,
where tol is a prescribed tolerance, or when the maximum allowed number of iterations is reached.
First, we consider the stopping criterion based on the difference between successive iterates, where the
error estimator is chosen as:

    ẽ(k) = |δ(k−1)| if k ≥ 1,    ẽ(k) = tol + 1 if k = 0,    with δ(k) := x(k+1) − x(k) for k ≥ 0.

This criterion is satisfactory if the zero α is simple; this can be shown by interpreting the Newton method as
a fixed point iteration method. In particular, for the error and the error estimator we have:

    e(k) = |x(k) − α| ≃ m ẽ(k+1) = m |δ(k)|,


where ẽ(k+1) = |δ(k)| = |x(k+1) − x(k)| and m ≥ 1 is the multiplicity of the zero α; hence, for a simple zero the estimator is reliable, while for a multiple zero it underestimates the error by a factor m.
Another stopping criterion is based on the residual (absolute), defined as follows:

    ẽ(k) = |r(k)|    with r(k) := f(x(k)) for k ≥ 0.

This criterion is satisfactory if |f'(x)| ≃ 1 for x ∈ Iα, where Iα is a neighborhood of the zero α. In this
case, ẽ(k) ≃ e(k). On the other hand, the criterion is unsatisfactory if |f'(x)| ≫ 1 or if |f'(x)| ≃ 0 for
x ∈ Iα. Specifically, if |f'(x)| ≫ 1 for x ∈ Iα, then the error is overestimated by the error estimator
(ẽ(k) ≫ e(k)), leading to more Newton iterations than necessary and thus making the criterion inefficient.
Conversely, if |f'(x)| ≃ 0 for x ∈ Iα, then the error is underestimated by the error estimator (ẽ(k) ≪ e(k)),
causing the Newton iterations to terminate prematurely, as the actual error is larger than indicated by the
estimator.

Example 3.3. The following examples graphically illustrate situations where the stopping criterion based on the resid-
ual is either satisfactory or unsatisfactory.

(Figures: ẽ(k) ≃ e(k), satisfactory; ẽ(k) ≫ e(k), unsatisfactory, the error is overestimated; ẽ(k) ≪ e(k), unsatisfactory, the error is underestimated.)

A popular variant of this criterion is based on the relative residual, for which the error estimator is:

    ẽ(k) = |r(k)| / |r(0)|    for k ≥ 0.

3.1.4 Inexact and quasi–Newton methods


The Newton and modified Newton methods require the computation and evaluation of the first derivative
of the function f(x); see Eqs. (3.1) and (3.3). However, in many practical cases of interest, evaluating
f'(x) may be "difficult" or computationally expensive. Therefore, referring, for example, to the Newton
iterate (3.1), f'(x(k)) can be replaced by a computationally more convenient quantity q(k) ≃ f'(x(k)).
The quasi–Newton or inexact methods¹ are based on the use of such approximations of f'(x(k)). The
generic iterate of an inexact Newton method can be written as:

    x(k+1) = x(k) − f(x(k)) / q(k)    for every k ≥ 0;

the choice of q(k) determines the specific method. Consider the following cases:

• for q(k) ≡ f'(x(k)), we obtain the Newton method;

• for q(k) ≡ f'(x(k)) / m, we obtain the modified Newton method (where m is the multiplicity of α);

• for q(k) = qR = (f(b) − f(a)) / (b − a) for every k ≥ 0, with α ∈ (a, b), we obtain the rope method;
¹ The terminology used here for inexact and quasi–Newton methods is not entirely precise.


• for q(k) = (f(x(k)) − f(x(k−1))) / (x(k) − x(k−1)) for every k ≥ 1, we obtain the secant method (for the
  secant method, q(0) can be chosen as for the rope method).

The rope method converges linearly (p = 1) if the zero α is simple and under suitable conditions on qR,
while it may or may not converge if the zero α is multiple. The secant method converges with order
p ≃ 1.6 if the zero α is simple, while it converges linearly (p = 1) if the zero α is multiple (m > 1).
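A minimal Python sketch of the secant method is given below; the test function f(x) = x³ − 5 (with simple zero α = 5^(1/3)) and all names are illustrative choices.

    def secant(f, x0, x1, tol=1e-12, kmax=100):
        """Secant iterations: the derivative is replaced by a difference quotient q(k)."""
        for k in range(kmax):
            q = (f(x1) - f(x0)) / (x1 - x0)   # q(k) ~ f'(x(k))
            x2 = x1 - f(x1) / q
            if abs(x2 - x1) < tol:
                return x2, k + 1
            x0, x1 = x1, x2
        return x1, kmax

    root, iters = secant(lambda x: x**3 - 5.0, 1.0, 2.0)
    print(root, 5.0 ** (1.0 / 3.0), iters)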

Example 3.4. The following examples graphically illustrate the rope and secant methods for the generic iterate x(k) .

(Figures: rope method, left; secant method, right.)

3.1.5 Newton methods for systems of nonlinear equations


Newton method can be used to approximate the solution of systems of nonlinear equations. Given F :
Rⁿ → Rⁿ, for some n ≥ 1, the problem consists of finding the vector α ∈ Rⁿ such that F(α) = 0.
Here, we have x = (x₁, . . . , xₙ)ᵀ ∈ Rⁿ and F(x) = (f₁(x), . . . , fₙ(x))ᵀ, for which we look for
α = (α₁, . . . , αₙ)ᵀ ∈ Rⁿ such that F(α) = 0.

Definition 3.3. Let F : Rⁿ → Rⁿ be differentiable in Ix ⊆ Rⁿ, a neighborhood of x ∈ Rⁿ; then the
Jacobian of F at x is the matrix JF : Rⁿ → Rⁿˣⁿ such that (JF(x))ij = ∂fi/∂xj (x) for i, j = 1, . . . , n.

The Newton method is applicable to a system of equations F ∈ C⁰(Iα) that is differentiable in Iα ⊆ Rⁿ,
a neighborhood of α. Given the initial iterate x(0) ∈ Iα, the Newton method consists of sequentially applying
the following Newton iterate:

    solve JF(x(k)) δ(k) = −F(x(k))  and set  x(k+1) = x(k) + δ(k)    for every k ≥ 0,    (3.4)

provided that det(JF(x(k))) ≠ 0 for every k ≥ 0. Note that the Newton iterate (3.4) can be obtained by
setting to zero the first–order Taylor series expansion of F(x) around x(k), that is, F(x(k)) + JF(x(k)) (x(k+1) − x(k)) = 0.

Algorithm 3.3: Newton method


set k = 0 and choose the initial iterate x(0) ;
while (the stopping criterion is not satisfied) do
assemble the Jacobian matrix JF (x(k) );
solve the linear system JF (x(k) ) δ (k) = −F(x(k) );
x(k+1) = x(k) + δ (k) ;
set k = k + 1;
end

The Newton method must employ an appropriate stopping criterion. As seen in the case of a scalar nonlinear
function, the criterion based on the difference between successive iterates, namely ẽ(k) = ‖δ(k−1)‖ < tol


for k ≥ 1, or the criterion based on the residual, namely ẽ(k) = r(k) = ‖F(x(k))‖ < tol for k ≥ 0, can be
used, where tol is a given tolerance. In the latter case, a popular stopping criterion is based on the relative
residual, for which ẽ(k) = r(k)/r(0) = ‖F(x(k))‖ / ‖F(x(0))‖ < tol for k ≥ 0.
We present the following result concerning the convergence of Newton method for systems of nonlinear
equations, which generalizes the case of a nonlinear function from Proposition 3.1.

Proposition 3.4. Let F ∈ C²(Iα), with Iα ⊆ Rⁿ a neighborhood of α, let x(0) ∈ Rⁿ be "sufficiently"
close to α, and let det(JF(α)) ≠ 0. Then, the Newton method converges to α with order p = 2, provided that
det(JF(x(k))) ≠ 0 for every k ≥ 0, that is:

    lim_{k→+∞} ‖x(k+1) − α‖ / ‖x(k) − α‖² = µ.

Example 3.5. Consider the system of nonlinear equations

    F(x) = ( sin(x₁x₂) − x₂ ,  x₁ + x₂ − (1/2) e^{−x₁x₂} )ᵀ

and the zero α = (1/2, 0)ᵀ. Its Jacobian matrix is

    JF(x) = [ x₂ cos(x₁x₂)              x₁ cos(x₁x₂) − 1
              1 + (x₂/2) e^{−x₁x₂}      1 + (x₁/2) e^{−x₁x₂} ].

Since det(JF(α)) = 1/2 ≠ 0, we expect at least quadratic convergence to α if x(0) is "sufficiently" close to α.
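A minimal Python (NumPy) sketch of Algorithm 3.3 applied to the system of Example 3.5 is reported below; the function names, tolerance, and initial iterate are illustrative choices.

    import numpy as np

    def F(x):
        x1, x2 = x
        return np.array([np.sin(x1 * x2) - x2,
                         x1 + x2 - 0.5 * np.exp(-x1 * x2)])

    def JF(x):
        x1, x2 = x
        e = np.exp(-x1 * x2)
        return np.array([[x2 * np.cos(x1 * x2), x1 * np.cos(x1 * x2) - 1.0],
                         [1.0 + 0.5 * x2 * e,   1.0 + 0.5 * x1 * e]])

    x = np.array([0.3, 0.2])                      # initial iterate (illustrative)
    for k in range(20):
        delta = np.linalg.solve(JF(x), -F(x))     # linear system of the Newton iterate (3.4)
        x = x + delta
        if np.linalg.norm(delta) < 1e-12:         # increment-based stopping criterion
            break

    print(x, k + 1)                               # expected to approach alpha = (1/2, 0)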

At each iterate of Newton method (3.4), it is necessary to assemble and solve a linear system, unless
n = 1, in which case JF(x(k)) ≡ f'(x(k)). For n ≫ 1, these operations can prove to be computationally
expensive, which necessitates the use of numerical methods for solving linear systems discussed in Chap-
ter 2. In addition, the assembly of the Jacobian matrix JF (x(k) ) can be computationally expensive too,
especially if n is large. Roughly speaking, the computational cost of the Newton method consists of the
number of operations given for the assembly and solving of the linear system at each iteration times the
number of iterations performed. Inexact and quasi–Newton methods aim at building an approximation of
the Jacobian matrix JF (x(k) ), and possibly solving the corresponding linear system, at a reduced cost.
A possible strategy consists in assembling the Jacobian matrix only every l ∈ N iterations, thereby
reducing assembly costs. Additionally, since the Jacobian matrix remains fixed for l iterations, the LU
factorization method can be employed, allowing the construction of the L and U matrices (the most com-
putationally demanding operation) only every l iterations.
The Broyden method is an inexact Newton method that generalizes the secant method for n ≥ 1.
It consists in selecting an approximation matrix B0 ∈ Rn×n of the Jacobian matrix JF (x(0) ), and then
sequentially building approximations of the Jacobian matrices for every k ≥ 0. The algorithm reads as
follows.

Algorithm 3.4: Broyden method

set k = 0, choose the initial iterate x(0) and B₀ ∈ Rⁿˣⁿ;
while (the stopping criterion is not satisfied) do
    solve the linear system Bk δ(k) = −F(x(k));
    x(k+1) = x(k) + δ(k);
    Bk+1 = Bk + ( F(x(k+1)) (δ(k))ᵀ ) / ( (δ(k))ᵀ δ(k) );
    set k = k + 1;
end

Since the method is inexact, we cannot expect the same convergence order as the Newton method (p = 2).
However, the convergence order is in general superlinear, with p ∈ (1, 2).
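A minimal Python (NumPy) sketch of Algorithm 3.4 is shown below; the test system, the finite-difference construction of B₀, and all names are illustrative choices, not part of the notes.

    import numpy as np

    def F(x):
        # illustrative test system with a zero at (sqrt(2)/2, sqrt(2)/2)
        return np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])

    def fd_jacobian(F, x, h=1e-6):
        """Forward finite-difference approximation of the Jacobian (used here as B0)."""
        n = x.size
        J = np.zeros((n, n))
        for j in range(n):
            e = np.zeros(n); e[j] = h
            J[:, j] = (F(x + e) - F(x)) / h
        return J

    x = np.array([1.0, 0.5])
    B = fd_jacobian(F, x)                         # B0 ~ JF(x(0))
    for k in range(50):
        delta = np.linalg.solve(B, -F(x))         # solve B_k delta(k) = -F(x(k))
        x = x + delta
        if np.linalg.norm(delta) < 1e-10:
            break
        B = B + np.outer(F(x), delta) / (delta @ delta)   # Broyden rank-one update

    print(x, k + 1)                               # expected ~ (0.7071, 0.7071)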


3.2 Fixed Point Iterations


Let us consider the fixed point iteration method, both to find the fixed point of an iteration function and as
a method for solving nonlinear equations.

3.2.1 Fixed point iterations for scalar functions


Given a function f : R → R, the goal is to determine the zero α (that is, such that f (α) = 0). To this end,
we transform the problem of finding the zero α of f (x) into solving a fixed point iteration problem.

Definition 3.4. Given the iteration function φ : [a, b] ⊆ R → R, we say that α ∈ R is a fixed point of φ if
and only if φ(α) ≡ α.

Example 3.6. Let us graphically illustrate the fixed points of some iteration functions.

The goal is to find the zero α of the nonlinear function f (x). We transform this problem into a fixed
point iteration problem by appropriately selecting an iteration function φ(x) such that f (α) = 0 if and only
if φ(α) = α for α ∈ [a, b]. Note that there are several iteration functions φ(x) that can perform this task
and multiple ways to derive them.
The simplest way to obtain φ(x) from f(x) is the following observation: since f(α) = 0, we have
f(α) + α = α, so we can set φ(x) = f(x) + x. However, this is often an inadequate choice for the iteration
function, as we will see later.

Example 3.7. Let us consider f(x) = 2x² − x − 1, for which we are interested in the zero α = 1. One possibility is
to set φ₁(x) = f(x) + x = 2x² − 1. A second choice can be derived by setting f(x) = 0, which gives x² = (x + 1)/2
and thus x = ±√((x + 1)/2); in this case, we can take φ₂(x) = √((x + 1)/2).

We present the fixed point iteration algorithm, which reads:

x(k+1) = φ(x(k) ) for k ≥ 0, (3.5)

for some initial iterate x(0) .

Algorithm 3.5: Fixed Point Iterations


set k = 0 and the initial iterate x(0) ;
while (the stopping criterion is not satisfied) do
x(k+1) = φ(x(k) );
set k = k + 1;
end


Example 3.8. We graphically illustrate the fixed point iteration algorithm below. Let us consider the case φ(x) =
cos(x), with a = 0.1, b = 1.1, and x(0) = 0.2, where we observe that the algorithm converges to α = cos(α) '
0.7391.

Let us now consider φ(x) = 2x2 − 1, with a = 0.5, b = 2, and x(0) = 1.1, where the algorithm diverges from the
fixed point α = 1. We observe that φ(x) corresponds to the iteration function φ1 (x) from Example 3.7.

Finally, let us consider φ(x) = √((1 + x)/2), with a = 0.5, b = 2, and x(0) = 1.9, where the algorithm converges to
α = 1; in this case, φ(x) corresponds to φ₂(x) from Example 3.7.
α = 1; in this case, φ(x) corresponds to φ2 (x) from Example 3.7.


We now state the properties that we must require from the iteration function φ(x) to ensure the existence
and uniqueness of the fixed point α within a given interval; furthermore, we will discuss the convergence
properties of the fixed point iteration method.

Proposition 3.5 (Global convergence in the interval). If φ ∈ C¹([a, b]), φ(x) ∈ [a, b] for every x ∈ [a, b],
and |φ'(x)| < 1 for every x ∈ [a, b], then there exists a unique fixed point α ∈ [a, b], and the fixed point
iteration method converges for every x(0) ∈ [a, b] with order at least equal to 1 (linearly), that is:

    lim_{k→+∞} (x(k+1) − α) / (x(k) − α) = φ'(α),

where φ'(α) is the asymptotic convergence factor.

Let us introduce the constant L = max_{x∈[a,b]} |φ'(x)|. Then, under the hypotheses of Proposition 3.5, we
observe that the error satisfies

    e(k+1) = |x(k+1) − α| = |φ(x(k)) − φ(α)| ≤ L |x(k) − α| = L e(k)    for every k ≥ 0.

By recursion, we thus have e(k) ≤ Lᵏ e(0) for every k ≥ 0. Since L < 1, we obtain lim_{k→+∞} e(k) = 0,
which means that the method converges for every x(0) ∈ [a, b].

Example 3.9. We illustrate the results of Proposition 3.5 through the following examples.

All the hypotheses of Proposition 3.5 are satisfied. Therefore, there exists a unique fixed point α ∈ [a, b], and the
method converges to α for every x(0) ∈ [a, b].

Here, φ ∈ C¹([a, b]) and φ(x) ∈ [a, b] for every x ∈ [a, b], but φ'(x) > 1 for some x ∈ [a, b]. Therefore, based
on Proposition 3.5, we cannot guarantee the uniqueness of the fixed point α ∈ [a, b] nor the convergence of the
method for every x(0) ∈ [a, b].

The assumptions of Proposition 3.5 are not satisfied. Therefore, there may not exist any fixed point α ∈ [a, b].


We illustrate some results on the local convergence to the fixed point α, that is, in the vicinity of α.

Proposition 3.6 (Ostrowski, local convergence in the vicinity of the fixed point). If φ ∈ C¹(Iα), with
Iα being a neighborhood of the fixed point α of φ(x), and |φ'(α)| < 1, then, if the initial iterate x(0) is
"sufficiently" close to α, the fixed point iteration method converges with order at least equal to 1 (linearly),
that is:

    lim_{k→+∞} (x(k+1) − α) / (x(k) − α) = φ'(α),

where φ'(α) is the asymptotic convergence factor.


Based on Proposition 3.6, we observe that for φ ∈ C¹(Iα):

• if |φ'(α)| < 1, the fixed point iteration method converges to α with order at least 1, provided that x(0) is
  "sufficiently" close to α;

• if |φ'(α)| = 1, the convergence of the method to α depends on the properties of φ(x) in the neighborhood Iα
  and on the choice of the initial iterate x(0) (in fact, the method may or may not converge);

• if |φ'(α)| > 1, convergence of the method to α is impossible, unless x(0) ≡ α.

Proposition 3.7 (Local convergence in the vicinity of the fixed point). If φ ∈ C²(Iα), with Iα being a
neighborhood of the fixed point α of φ(x), φ'(α) = 0, and φ''(α) ≠ 0, then, if the initial iterate x(0) is
"sufficiently" close to α, the fixed point iteration method converges with order 2 (quadratically), that is:

    lim_{k→+∞} (x(k+1) − α) / (x(k) − α)² = (1/2) φ''(α),

where (1/2) φ''(α) is the asymptotic convergence factor.

3.2.2 Stopping criterion for fixed point iterations


It is necessary to consider a stopping criterion to terminate the fixed point iterations of Algorithm 3.5. To
this end, we introduce an appropriate error estimator ẽ(k) for the error e(k) := |x(k) − α|. This error
estimator is based on the difference between successive iterates, that is:

    ẽ(k) = |δ(k−1)| if k ≥ 1,    ẽ(k) = tol + 1 if k = 0,    with δ(k) := x(k+1) − x(k) for k ≥ 0,

where tol > 0 is a prescribed tolerance. The fixed point iteration algorithm stops at the first iteration k
such that ẽ(k) < tol or when k = kmax, with kmax being the maximum allowed number of iterations.
We can show that x(k) − α = −(1 / (1 − φ'(ξ(k)))) δ(k), for some ξ(k) between x(k) and α. We use this
identity to determine whether the stopping criterion is satisfactory or not. If φ'(x) ≃ 0 in a neighborhood of
α (φ'(α) ≃ 0), the criterion is satisfactory since e(k) ≃ ẽ(k+1). If φ'(x) > −1, but φ'(x) ≃ −1 in a
neighborhood of α, the criterion is still satisfactory since e(k) ≃ (1/2) ẽ(k+1) (the error is overestimated by the
estimator by a factor of 2). Conversely, if φ'(x) < 1, but φ'(x) ≃ 1 in a neighborhood of α, the criterion is
unsatisfactory since ẽ(k+1) ≪ e(k), i.e., the error is underestimated by the error estimator.

3.2.3 Newton methods as fixed point iterations methods


The Newton method can be suitably used to approximate the zero α of a generic function f (x), whose
Newton iterate is specified in Eq. (3.1). The problem of finding the zero α of f (x) using Newton method

can be reformulated as a fixed point iteration method via the iteration function φN(x) such that φN(α) = α.
From Eqs. (3.1) and (3.5), we deduce that the iteration function associated with the Newton method is:

    φN(x) = x − f(x) / f'(x).

It follows that the properties of the Newton method, including convergence to α, can be inferred from those of
the iteration function φN(x).
Similarly, to the modified Newton method (Sec. 3.1.2), based on the iterate of Eq. (3.3), we associate the
iteration function φmN(x) defined as:

    φmN(x) = x − m f(x) / f'(x).

As with the Newton method, the rope method can also be interpreted as a fixed point iteration method with
the iteration function²:

    φR(x) = x − f(x) / qR,    where qR = (f(b) − f(a)) / (b − a).
Remark 3.1. Instead, the secant method cannot be interpreted as a fixed-point iteration method.

3.2.4 Fixed point iterations for vector–valued functions


The fixed point iteration method can be used with vector iteration functions φ : Rⁿ → Rⁿ, for n ≥ 1.
The problem consists of finding the vector α ∈ Rⁿ, the fixed point, such that φ(α) = α. The fixed point
iteration method consists of sequentially applying the following fixed point iteration:

    x(k+1) = φ(x(k))    for k ≥ 0,

given the initial iterate x(0) ∈ Rⁿ. As a stopping criterion, we can use one based on the difference between
successive iterates, similarly to Section 3.2.2, that is, ẽ(k) = ‖δ(k−1)‖ < tol for k ≥ 1, with tol a prescribed
tolerance and δ(k) = x(k+1) − x(k).
Let φ : Rⁿ → Rⁿ be differentiable in Ix ⊆ Rⁿ, a neighborhood of x ∈ Rⁿ. Then its Jacobian matrix at x is
Jφ : Rⁿ → Rⁿˣⁿ such that (Jφ(x))ij = ∂φi/∂xj (x) for i, j = 1, . . . , n. We also denote by ρ(Jφ(x)) the spectral
radius of Jφ(x) at x ∈ Rⁿ. We then have the following convergence result.

Proposition 3.8. If φ ∈ C¹(Iα), with Iα ⊆ Rⁿ a neighborhood of the fixed point α ∈ Rⁿ of φ(x), the
initial iterate x(0) ∈ Rⁿ is "sufficiently" close to α, and ρ(Jφ(α)) < 1, then the fixed point iteration
method converges with order at least equal to 1 (linearly), that is:

    lim_{k→+∞} ‖x(k+1) − α‖ / ‖x(k) − α‖ = ρ(Jφ(α)),

where ρ(Jφ(α)) is the asymptotic convergence factor.

² Using the result of Proposition 3.6, since φ'R(x) = 1 − f'(x)/qR, the convergence of the method is guaranteed if
|φ'R(α)| < 1, for x(0) "sufficiently" close to α. That is, the following conditions on qR must hold: qR > (1/2) f'(α) if
f'(α) > 0, while qR < (1/2) f'(α) if f'(α) < 0. If the zero α is multiple, then φ'R(α) = 1, so the convergence of the rope
method is no longer guaranteed. In general, if |φ'R(α)| < 1, the rope method converges linearly to α with asymptotic
convergence factor 1 − f'(α)/qR, for x(0) "sufficiently" close to α, as deduced from Proposition 3.6. Finally, if qR = f'(α),
we have φ'R(α) = 0, so the rope method, under the assumptions of Proposition 3.7, converges quadratically to α (with
order p = 2) with asymptotic convergence factor −f''(α)/(2 qR).



Chapter 4

Approximation of Functions and Data

Let us consider the approximation of functions and data, particularly by means of interpolation methods
and approximation in the sense of least squares.

4.1 Motivations and Examples


We illustrate through examples some of the motivations behind the approximation of functions and data.
Example 4.1. We are interested in calculating the definite integral I of a function f(x) over the interval [a, b], i.e.,
I = I(f) = ∫ₐᵇ f(x) dx, but we are unable to determine a primitive of the function f(x). A possible alternative
is to approximate f(x) with another function f̃(x), for which we can determine a primitive and thus compute the
integral in closed form (analytically) as Ĩ = I(f̃) = ∫ₐᵇ f̃(x) dx, such that Ĩ ≃ I.

Example 4.2. Let us assume that the function f(x) is known only through its evaluations at a set of (n + 1) nodes
{xi}_{i=0}^{n}, that is, the data pairs {(xi, f(xi))}_{i=0}^{n} are given. We might therefore be interested in defining an
approximating function f̃(x) for the unknown function f(x).

Example 4.3. Given the set of data pairs {(xi, yi)}_{i=0}^{n}, we want to determine intermediate values between the data
pairs or make predictions about the values of the data outside the interval determined by the set of (n + 1) nodes
{xi}_{i=0}^{n}.


One way to approximate a function f ∈ Cⁿ(Ix₀) in a neighborhood Ix₀ of a point x₀ ∈ R is based
on the Taylor series expansion (Taylor polynomial) of order n. The Taylor expansion of order n of the
function f(x) around x₀ is written as:

    f̃(x) = f(x₀) + Σ_{i=1}^{n} (1/i!) f^(i)(x₀) (x − x₀)^i.

However, the approximation of f(x) by f̃(x) presents some issues. First, it is necessary to compute and
evaluate n derivatives of the function f(x), operations that can be computationally expensive. Furthermore,
the Taylor expansion is accurate only in a neighborhood Ix₀ of x₀, while it is generally quite inaccurate
outside of this neighborhood.

Example 4.4. We consider the Taylor expansion of order n of f(x) = 1/x around x₀ = 1, which is denoted by f̃n(x).
Since f^(i)(x) = (−1)^i i! x^{−(i+1)} for i = 0, 1, . . . , n, we have f̃(x) = f̃n(x) = 1 + Σ_{i=1}^{n} (−1)^i (x − 1)^i.
As shown in the figure, the approximation provided by f̃n(x) becomes highly inaccurate "far" from the point x₀ = 1.

4.2 Interpolation
We introduce the concept of interpolation and classify the various interpolation methods.

Definition 4.1. Let us consider a set of (n + 1) data pairs {(xi, yi)}_{i=0}^{n}, where {xi}_{i=0}^{n} are (n + 1)
distinct nodes, i.e., such that xi ≠ xj for every i ≠ j with i, j = 0, . . . , n; if the function f(x) is known,
we set yi = f(xi) for each i = 0, . . . , n. Interpolating the data pairs {(xi, yi)}_{i=0}^{n} means determining an
approximating function f̃(x) such that f̃(xi) = yi for every i = 0, . . . , n, or, if f(x) is known, such that
f̃(xi) = f(xi) for every i = 0, . . . , n. The function f̃(x) is called the interpolant of the data at the nodes.

There are various types and methods of interpolation.

• Polynomial interpolation, such that f̃(x) = a₀ + a₁x + · · · + aₙxⁿ for suitable (n + 1) coefficients
  a₀, a₁, . . . , aₙ ∈ R.

• Rational interpolation, such that f̃(x) = (a₀ + a₁x + · · · + aₖxᵏ) / (aₖ₊₁ + aₖ₊₂x + · · · + aₖ₊ₙ₊₁xⁿ) for
  suitable coefficients a₀, a₁, . . . ∈ R with k, n ≥ 0.

• Trigonometric interpolation, such that f̃(x) = Σ_{j=−M}^{M} aj e^{ι j x}, where ι is the imaginary unit
  (ι² = −1) and e^{ι j x} = cos(j x) + ι sin(j x), for some M and complex coefficients aj.

• Piecewise polynomial interpolation (composite interpolation).

• Interpolation by means of spline functions.

In this course, we will focus on polynomial interpolation and piecewise polynomial interpolation.


4.2.1 Lagrange polynomial interpolation


Let us consider the polynomial interpolation, which we specifically realize using Lagrange interpolating
polynomials. Recall that Pn denotes the set of polynomials of degree less than or equal to n.
Polynomial interpolation is based on the following result that determines the correspondence between
the number of distinct nodes and the degree of the interpolant.
Proposition 4.1. For every set of data pairs {(xi, yi)}_{i=0}^{n}, where {xi}_{i=0}^{n} are (n + 1) distinct nodes, there exists
a unique polynomial of degree less than or equal to n, denoted by Πn(x), such that Πn(xi) = yi for every
i = 0, . . . , n. The polynomial Πn(x) ∈ Pn is called the interpolating polynomial of the data at the nodes
{xi}_{i=0}^{n}. If instead f(x) is a given continuous function, for which yi = f(xi) for every i = 0, . . . , n, then
Πnf(x) ∈ Pn is called the interpolating polynomial of the function f(x) at the nodes {xi}_{i=0}^{n}.

Example 4.5. We illustrate two cases for which n = 1 (left) and n = 2 (right).

The goal is to determine the expression of the interpolating polynomial Πn(x) (or Πnf(x)), that is,
Πn(x) = a₀ + a₁x + · · · + aₙxⁿ; to this end, it is necessary to calculate the coefficients {ai}_{i=0}^{n} of
this polynomial of degree n. For this purpose, let us consider a special family of polynomials associated
with the (n + 1) distinct nodes {xi}_{i=0}^{n}.

Definition 4.2. Given a set of (n + 1) distinct nodes {xi}_{i=0}^{n}, the Lagrange characteristic function
associated with the node xk, denoted as ϕk ∈ Pn, is the polynomial of degree n such that ϕk(xi) = δki for
every i = 0, . . . , n, where δki = 0 if i ≠ k and δki = 1 if i = k. Furthermore, it can be written as:

    ϕk(x) = Π_{i=0, i≠k}^{n} (x − xi) / (xk − xi).

The set of polynomials {ϕk(x)}_{k=0}^{n} is the basis of the Lagrange characteristic polynomials.

Example 4.6. Let us illustrate the basis of the Lagrange characteristic polynomials for n = 1 and n = 2.

n = 1:
    ϕ₀(x) = (x − x₁) / (x₀ − x₁) ∈ P₁,
    ϕ₁(x) = (x − x₀) / (x₁ − x₀) ∈ P₁.

n = 2:
    ϕ₀(x) = (x − x₁)(x − x₂) / ((x₀ − x₁)(x₀ − x₂)) ∈ P₂,
    ϕ₁(x) = (x − x₀)(x − x₂) / ((x₁ − x₀)(x₁ − x₂)) ∈ P₂,
    ϕ₂(x) = (x − x₀)(x − x₁) / ((x₂ − x₀)(x₂ − x₁)) ∈ P₂.


Definition 4.3. Given the basis of Lagrange characteristic polynomials {ϕk(x)}_{k=0}^{n} associated with the
(n + 1) distinct nodes {xi}_{i=0}^{n}, the Lagrange interpolating polynomial of the data pairs {(xi, yi)}_{i=0}^{n} can
be expressed as:

    Πn(x) = Σ_{k=0}^{n} yk ϕk(x).

If the function f(x) is given and continuous, the Lagrange interpolating polynomial of the function f(x) at
the nodes {xi}_{i=0}^{n} is expressed as:

    Πnf(x) = Σ_{k=0}^{n} f(xk) ϕk(x).

The Lagrange interpolating polynomial Πn(x) interpolates the data at the nodes; in fact,

    Πn(xi) = Σ_{k=0}^{n} yk ϕk(xi) = Σ_{k=0}^{n} yk δki = yi    for every i = 0, . . . , n.

Similarly, Πnf(x) interpolates f(x) at the nodes.
The Lagrange interpolating polynomial Πn(x) ∈ Pn uses the basis of Lagrange characteristic polynomials
{ϕk(x)}_{k=0}^{n} to determine the coefficients {ai}_{i=0}^{n} of this polynomial of degree n, that is,

    Πn(x) = Σ_{k=0}^{n} yk ϕk(x) = a₀ + a₁x + · · · + aₙxⁿ.

Example 4.7. Let us construct the Lagrange polynomial interpolant of the data pairs (1, 3), (2, 2), and (4, 6), for which
n = 2. With x₀ = 1, x₁ = 2, and x₂ = 4, we have: ϕ₀(x) = (1/3)(x − 2)(x − 4) = (1/3)x² − 2x + 8/3,
ϕ₁(x) = −(1/2)(x − 1)(x − 4) = −(1/2)x² + (5/2)x − 2, ϕ₂(x) = (1/6)(x − 1)(x − 2) = (1/6)x² − (1/2)x + 1/3.
The Lagrange interpolating polynomial of degree n = 2 is given by Π₂(x) = y₀ϕ₀(x) + y₁ϕ₁(x) + y₂ϕ₂(x) = x² − 4x + 6,
where y₀ = 3, y₁ = 2, and y₂ = 6.
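The computation of Example 4.7 can be checked with the short Python sketch below, which builds the Lagrange characteristic polynomials ϕk and evaluates Πn; the function name is an illustrative choice.

    def lagrange_interpolant(x_nodes, y_values):
        """Return a callable evaluating the Lagrange interpolating polynomial."""
        def Pi_n(x):
            total = 0.0
            for k, xk in enumerate(x_nodes):
                phi_k = 1.0
                for i, xi in enumerate(x_nodes):
                    if i != k:
                        phi_k *= (x - xi) / (xk - xi)   # Lagrange characteristic polynomial
                total += y_values[k] * phi_k
            return total
        return Pi_n

    Pi2 = lagrange_interpolant([1.0, 2.0, 4.0], [3.0, 2.0, 6.0])
    print([Pi2(x) for x in (1.0, 2.0, 4.0)])   # interpolation at the nodes: [3.0, 2.0, 6.0]
    print(Pi2(3.0), 3.0**2 - 4 * 3.0 + 6)      # both equal 3.0, consistent with x^2 - 4x + 6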

Definition 4.4. For a continuous function f(x) and the interval I = [a, b] partitioned by (n + 1) ordered
nodes a = x₀ < x₁ < · · · < xₙ = b, we define the error function Enf(x) := f(x) − Πnf(x) associated
with the interpolating polynomial Πnf(x). The (maximum) interpolation error is given by

    en(f) := max_{x∈I} |Enf(x)| = max_{x∈I} |f(x) − Πnf(x)|.

Since Πnf(x) interpolates f(x) at the nodes, Enf(xi) = 0 for every i = 0, . . . , n.


Example 4.8. Given the function f(x) = sin(x) + (1/4) sin(2πx + √3) + (1/10) sin(4πx + √7), let us consider its
polynomial interpolation at (n + 1) equidistant nodes in the interval I = [0, 2]. We set the polynomial degree n = 6,
for which we obtain the polynomial interpolant Π₆f(x) of f(x) at the nodes x₀ = 0, x₁ = 1/3, x₂ = 2/3, x₃ = 1,
x₄ = 4/3, x₅ = 5/3, and x₆ = 2. We graphically represent the error function E₆f(x) = f(x) − Π₆f(x) and observe
that E₆f(xi) = 0 for every i = 0, . . . , 6.


Proposition 4.2. Consider (n + 1) distinct nodes {xi}_{i=0}^{n} in an interval I = [a, b] such that a = x₀ < x₁ <
· · · < xₙ = b and the interpolating polynomial Πnf(x) of a function f(x) at those nodes. If f ∈ C^{n+1}(I),
then for every x ∈ I there exists ξ = ξ(x) ∈ I such that:

    Enf(x) = (1 / (n + 1)!) f^{(n+1)}(ξ(x)) ωn(x),    where ωn(x) := Π_{i=0}^{n} (x − xi).

Moreover, the error en(f) is bounded by the error estimator ẽn(f) as follows:

    en(f) ≤ ẽn(f) := (1 / (n + 1)!) max_{x∈I} |f^{(n+1)}(x)| max_{x∈I} |ωn(x)|.    (4.1)

Proposition 4.3. Consider (n + 1) equally spaced nodes {xi}_{i=0}^{n} in the interval I = [a, b] such that
xi = x₀ + i h for i = 0, . . . , n, with x₀ = a, xₙ = b, and h = (b − a)/n. Then the function ωn(x) of
Proposition 4.2 satisfies:

    max_{x∈I} |ωn(x)| ≤ (n!/4) h^{n+1} = (n!/4) ((b − a)/n)^{n+1}.

Therefore, we deduce from Eq. (4.1) the following estimate of the error en(f):

    en(f) ≤ ẽn(f) := (1 / (4(n + 1))) max_{x∈I} |f^{(n+1)}(x)| h^{n+1} = (1 / (4(n + 1))) ((b − a)/n)^{n+1} max_{x∈I} |f^{(n+1)}(x)|.    (4.2)

Under the same assumptions, we also have:

    max_{x∈I} |f'(x) − (Πnf)'(x)| ≤ Cn hⁿ max_{x∈I} |f^{(n+1)}(x)|

for some positive constant Cn.

If the (n + 1) nodes are equally spaced in I, the error en(f) may or may not tend to zero as n → +∞,
depending on the function f(x) being interpolated. From Eq. (4.2), we observe that lim_{n→+∞} h^{n+1}/(4(n + 1)) = 0.
On the other hand, max_{x∈I} |f^{(n+1)}(x)| may grow as n increases; indeed, there exist functions for which
lim_{n→+∞} max_{x∈I} |f^{(n+1)}(x)| = +∞. In such cases, the growth of max_{x∈I} |f^{(n+1)}(x)| may not be
compensated by the decrease of h^{n+1}/(4(n + 1)) with n, leading to lim_{n→+∞} ẽn(f) = +∞; thus, the error
estimator ẽn(f) "explodes," and typically the error en(f) does too. The so-called Runge phenomenon is an example of
this behavior, where the error function Enf(x) tends to "explode" for increasing values of n near the endpoints
of the interval I when equally spaced nodes are used for polynomial interpolation.
Example 4.9. Consider the polynomial interpolation of the Runge function f(x) = 1/(1 + x²) on (n + 1) equally spaced
nodes in the interval I = [−5, 5].

(Figures: interpolants for n = 8, left, and n = 10, right.)


In this case, the interpolating polynomial Πnf(x) of f(x) exhibits the so-called Runge phenomenon for increasing
values of n, as can be observed near the endpoints of the interval I. Moreover, lim_{n→+∞} en(f) = +∞.

A remedy to mitigate the Runge phenomenon related to polynomial interpolation is to use nodes that
are not equidistant within the interval I. The following definition provides a special family of nodes that
can be used for polynomial interpolation.

Definition 4.5. Given n ≥ 1, the (n + 1) Chebyshev–Gauss–Lobatto nodes in the reference interval
Î = [−1, 1] are:

    x̂i = −cos(π i / n),    i = 0, . . . , n;

in the generic interval I = [a, b], the (n + 1) Chebyshev–Gauss–Lobatto nodes are:

    xi = (a + b)/2 + ((b − a)/2) x̂i,    i = 0, . . . , n.

Example 4.10. We graphically highlight the (n + 1) Chebyshev–Gauss–Lobatto nodes {x̂i}_{i=0}^{n} in the reference
interval Î = [−1, 1] for n = 4 (left) and n = 9 (right).

Proposition 4.4. If f ∈ C n+1 (I) and the (n + 1) Chebyshev–Gauss–Lobatto nodes are used in I = [a, b],
then lim_{n→+∞} Πnf(x) = f(x) for every x ∈ I, that is, lim_{n→+∞} en(f) = 0.

Example 4.11. Referring to Example 4.9 and considering the same data, we show that the use of
Chebyshev–Gauss–Lobatto (CGL) nodes prevents the onset of the Runge phenomenon.

(Figures: interpolants with CGL nodes for n = 8, left, and n = 10, right.)
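The following Python (NumPy) sketch compares, for the Runge function of Example 4.9, the maximum interpolation error obtained with equally spaced nodes and with Chebyshev–Gauss–Lobatto nodes; the helper function and all parameters are illustrative choices.

    import numpy as np

    def lagrange_eval(x_nodes, y_nodes, x):
        """Evaluate the Lagrange interpolant at the points x (direct formula, small n)."""
        x = np.asarray(x, dtype=float)
        result = np.zeros_like(x)
        for k, xk in enumerate(x_nodes):
            phi = np.ones_like(x)
            for i, xi in enumerate(x_nodes):
                if i != k:
                    phi *= (x - xi) / (xk - xi)
            result += y_nodes[k] * phi
        return result

    f = lambda x: 1.0 / (1.0 + x**2)        # Runge function
    a, b, n = -5.0, 5.0, 10
    x_fine = np.linspace(a, b, 2001)

    x_eq  = np.linspace(a, b, n + 1)                                               # equally spaced
    x_cgl = (a + b) / 2 + (b - a) / 2 * (-np.cos(np.pi * np.arange(n + 1) / n))    # CGL nodes

    for name, nodes in (("equispaced", x_eq), ("CGL", x_cgl)):
        err = np.max(np.abs(f(x_fine) - lagrange_eval(nodes, f(nodes), x_fine)))
        print(name, err)     # the CGL error is expected to be markedly smaller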

The Lagrange interpolating polynomial Πn(x) ∈ Pn uses the basis of Lagrange characteristic polynomials
{ϕk(x)}_{k=0}^{n}, so that Πn(x) = Σ_{k=0}^{n} yk ϕk(x) = a₀ + a₁x + · · · + aₙxⁿ. An alternative approach to
polynomial interpolation is to directly determine the (n + 1) coefficients a = (a₀, a₁, . . . , aₙ)ᵀ ∈ R^{n+1}
by imposing the (n + 1) interpolation constraints

    Πn(xi) = a₀ + a₁xi + · · · + aₙxiⁿ = yi    for each i = 0, . . . , n.


In this way, the problem reduces to solving the following linear system:

    V a = y,    (4.3)

where V ∈ R^{(n+1)×(n+1)} is the Vandermonde matrix, with Vij = (x_{i−1})^{j−1} for i, j = 1, . . . , n + 1,
and y = (y₀, y₁, . . . , yₙ)ᵀ ∈ R^{n+1}. The linear system (4.3) admits a unique solution if and only if
det(V) ≠ 0, that is, if and only if the (n + 1) nodes {xi}_{i=0}^{n} are distinct. However, despite being intuitive,
this approach based on solving the linear system (4.3) may suffer from stability issues even for relatively
"small" values of n. This is due to the fact that the condition number of the matrix V is generally very large,
that is, K₂(V) ≫ 1. Consequently, the solution a computed in floating–point arithmetic will generally be affected by
significant errors (see Chapter 2).
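A quick Python (NumPy) check of this conditioning issue is sketched below; the choice of equispaced nodes in [0, 1] and the degrees tested are illustrative.

    import numpy as np

    for n in (5, 10, 15, 20):
        x = np.linspace(0.0, 1.0, n + 1)
        V = np.vander(x, increasing=True)     # Vandermonde matrix with V_ij = x_i^j
        print(n, np.linalg.cond(V, 2))        # K_2(V) grows rapidly with n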

Remark 4.1. Polynomial interpolation is generally not suitable for extrapolating information outside the
interval I containing the nodes (see Example 4.8).

4.2.2 Piecewise polynomial interpolation


Piecewise polynomial interpolation, also known as composite interpolation, approximates a function f (x)
locally using polynomials. Piecewise polynomial interpolation is a good alternative to polynomial interpo-
lation with equally spaced nodes for extracting information within an interval containing the nodes.

Definition 4.6. Let us consider (n + 1) distinct nodes {xi}_{i=0}^{n} in the interval I = [a, b] such that a =
x₀ < x₁ < · · · < xₙ = b, which delimit n subintervals Ii = [xi, xi+1] for i = 0, . . . , n − 1. We define
H := max_{i=0,...,n−1} |Ii| = max_{i=0,...,n−1} (xi+1 − xi) as the characteristic size of these subintervals.
Given the set of data pairs {(xi, yi)}_{i=0}^{n}, the piecewise linear interpolant ΠH1(x) of the data is a piecewise
polynomial of degree 1 such that ΠH1(x)|Ii ∈ P₁ for each i = 0, . . . , n − 1, with:

    ΠH1(x) = yi + ((yi+1 − yi) / (xi+1 − xi)) (x − xi)    for x ∈ Ii, i = 0, . . . , n − 1.

If the function f ∈ C⁰(I) is known, then the piecewise linear interpolating polynomial ΠH1 f(x) of the
function f(x) at the nodes is such that ΠH1 f(x)|Ii ∈ P₁ for each i = 0, . . . , n − 1, with:

    ΠH1 f(x) = f(xi) + ((f(xi+1) − f(xi)) / (xi+1 − xi)) (x − xi)    for x ∈ Ii, i = 0, . . . , n − 1.

Example 4.12. We present the piecewise linear interpolants of (n + 1) data pairs, denoted ΠH1(x), and of a
function f(x), denoted ΠH1 f(x), at the (n + 1) nodes in the interval I. Specifically, we consider n = 4 (that is, we
have n + 1 = 5 nodes). The characteristic size of the subintervals {Ii}_{i=0}^{3} is H = max_{i=0,...,3} |Ii|.

(Figures: ΠH1(x), left; ΠH1 f(x), right.)


Definition 4.7. If the function f ∈ C⁰(I) is known, we define the error associated with the piecewise linear
interpolating polynomial ΠH1 f(x) as eH1(f) := max_{x∈I} |f(x) − ΠH1 f(x)|.

Proposition 4.5. If f ∈ C²(I), then the error eH1(f) associated with the piecewise linear interpolating
polynomial ΠH1 f(x) can be bounded by the error estimator ẽH1(f) as:

    eH1(f) ≤ ẽH1(f) := (H²/8) max_{x∈I} |f''(x)|;

it follows that the error converges to zero with order 2 in H (quadratically).
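The quadratic convergence predicted by Proposition 4.5 can be observed with the Python (NumPy) sketch below, where the test function f(x) = sin(x) on [0, π] and the numbers of subintervals are illustrative choices.

    import numpy as np

    f = lambda x: np.sin(x)
    a, b = 0.0, np.pi
    x_fine = np.linspace(a, b, 10001)

    for n in (4, 8, 16, 32):
        nodes = np.linspace(a, b, n + 1)
        interp = np.interp(x_fine, nodes, f(nodes))    # piecewise linear interpolant of f
        err = np.max(np.abs(f(x_fine) - interp))
        H = (b - a) / n
        print(n, err, err / H**2)     # err / H^2 stays roughly constant: order 2 in H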

Similarly to ΠH1(x), it is possible to define the piecewise quadratic interpolating polynomial ΠH2(x)
such that ΠH2(x)|Ii ∈ P₂ for each subinterval Ii of I, i = 0, . . . , n − 1. If f ∈ C⁰(I) is known, we use
the notation ΠH2 f(x).
In a similar way, it is possible to define the piecewise polynomial interpolant of degree r ≥ 1, denoted
by ΠHr(x), such that ΠHr(x)|Ii ∈ Pr for every i = 0, . . . , n − 1 (or ΠHr f(x) if f ∈ C⁰(I) is known).

Example 4.13. Let us consider the piecewise quadratic interpolating polynomial of a continuous function f (x), de-
noted as ΠH
2 f (x), on (n + 1) nodes in the interval I.

Specifically, let us set n = 4. The piecewise quadratic interpolant ΠH2 f (x) interpolates f (x) at the (n + 1) = 5 nodes,
as well as at intermediate points within each subinterval of I, such as the midpoints of the subintervals.

Proposition 4.6. If f ∈ C^{r+1}(I), the error eHr(f) := max_{x∈I} |f(x) − ΠHr f(x)| associated with the piece-
wise polynomial interpolant of degree r ≥ 1, ΠHr f(x), is bounded by the error estimator ẽHr(f) as:

    eHr(f) ≤ ẽHr(f) := Cr H^{r+1} max_{x∈I} |f^{(r+1)}(x)|,

where Cr is a positive constant. Thus, we deduce that the error converges to zero with order (r + 1) in H.

For the composite polynomial interpolant ΠHr(x), the (r + 1) nodes within each subinterval Ii can be
chosen to be equally spaced or as the Chebyshev–Gauss–Lobatto nodes.
Remark 4.2. The piecewise polynomial interpolants ΠHr f(x) of any degree r ≥ 1 are only C⁰–continuous
between one subinterval and another (across the external nodes of each subinterval), as can be seen in
Example 4.13.


4.3 Least Squares Method


The least squares approximation is ideal for extracting information from a relatively large dataset, with
data potentially affected by uncertainty and noise, as well as for making predictions outside the range in
which such data are available.
Definition 4.8. Given the data pairs {(xi, yi)}_{i=0}^{n} (or {(xi, f(xi))}_{i=0}^{n} if the function f(x) is known) and
an integer m ≥ 0, we seek the approximating polynomial f̃m(x) of degree m such that:

    Σ_{i=0}^{n} (yi − f̃m(xi))² ≤ Σ_{i=0}^{n} (yi − pm(xi))²    for every pm ∈ Pm.

If f̃m ∈ Pm exists, then it is called the polynomial of degree m approximating the data (or the function f(x),
if known) in the least squares sense.

By convention, we assume that the nodes {xi}_{i=0}^{n} are distinct and that 0 ≤ m ≤ n. In a typical scenario
for the use of the least squares method, we have 0 ≤ m ≪ n.
Remark 4.3. The least squares approximating polynomial fem (x) does not generally interpolate the data
(or the function f (x)) at the nodes. In particular, it is only when m = n that we can ensure fem (xi ) = yi
(or fem (xi ) = f (xi )) for every i = 0, . . . , n. In fact, in this case, fem (x) coincides with the interpolating
polynomial Πn (x) of degree n (or Πn f (x)).

Example 4.14. Let us graphically illustrate least squares approximating polynomials f̃m(x) for relatively large datasets
{(xi, yi)}_{i=0}^{n}, with n = 100. Consider m = 1 and 2 (on the left) and m = 2 and 3 (on the right).

As with polynomial interpolation, determining the least squares approximating polynomial f̃m(x) of degree
m consists of finding the (m + 1) coefficients {ai}_{i=0}^{m}; indeed, f̃m(x) = a₀ + a₁x + · · · + aₘxᵐ. To this
end, we define the vector a = (a₀, a₁, . . . , aₘ)ᵀ ∈ R^{m+1} and the function Φ : R^{m+1} → R as:

    Φ(b) = Σ_{i=0}^{n} [yi − (b₀ + b₁xi + · · · + bₘxiᵐ)]²,

which is associated with the dataset {(xi, yi)}_{i=0}^{n} for a generic vector b = (b₀, b₁, . . . , bₘ)ᵀ ∈ R^{m+1}.
The least squares method consists of determining the coefficients a of the polynomial f̃m(x) such that:

    Φ(a) = min_{b∈R^{m+1}} Φ(b).

Given that Φ is differentiable and convex, the previous minimization problem corresponds to solving the
following problem (first–order optimality conditions):

    find a ∈ R^{m+1} :  ∂Φ/∂bj (a) = 0    for every j = 0, . . . , m,    (4.4)


which leads to the solution of the linear system:

    A a = q,    (4.5)

where A ∈ R^{(m+1)×(m+1)} and q ∈ R^{m+1} are given by:

    A = [ (n + 1)              Σ_{i=0}^{n} xi         · · ·   Σ_{i=0}^{n} xiᵐ
          Σ_{i=0}^{n} xi        Σ_{i=0}^{n} xi²        · · ·   Σ_{i=0}^{n} xi^{m+1}
          ...                  ...                            ...
          Σ_{i=0}^{n} xiᵐ       Σ_{i=0}^{n} xi^{m+1}   · · ·   Σ_{i=0}^{n} xi^{2m} ]

and

    q = ( Σ_{i=0}^{n} yi,  Σ_{i=0}^{n} xi yi,  . . . ,  Σ_{i=0}^{n} xiᵐ yi )ᵀ.

Referring to the (rectangular) Vandermonde matrix V ∈ R^{(n+1)×(m+1)}, with Vij = (x_{i−1})^{j−1} for
i = 1, . . . , n + 1 and j = 1, . . . , m + 1, and to the vector y = (y₀, y₁, . . . , yₙ)ᵀ ∈ R^{n+1}, we observe that:

    A = VᵀV    and    q = Vᵀy.

The linear system (4.5) represents a generalization of the linear system (4.3) used for polynomial interpolation.
In fact, if the nodes are distinct and m = n, we obtain f̃n(x) = Πn(x), so solving the two linear systems (4.3)
and (4.5) provides equivalent results.
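A compact Python (NumPy) sketch of the least squares method via the normal equations (4.5) is reported below; the synthetic noisy data and the degree m = 2 are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 100, 2
    x = np.linspace(0.0, 1.0, n + 1)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.05 * rng.standard_normal(n + 1)   # noisy data

    V = np.vander(x, m + 1, increasing=True)      # rectangular Vandermonde matrix, (n+1) x (m+1)
    A = V.T @ V                                   # normal equations: A = V^T V, q = V^T y
    q = V.T @ y
    a = np.linalg.solve(A, q)                     # least squares coefficients a_0, ..., a_m

    print(a)                                      # expected close to [1, 2, -3]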

Definition 4.9. Based on Definition 4.8, the least squares approximating polynomial fe1 (x) of degree m =
1 is called a regression line or a least squares line.

We illustrate the derivation of the system (4.5) for f̃₁(x) (i.e., for the regression line, m = 1) with
n ≥ 1. In this case, the function Φ(b) can be expressed as:

    Φ(b) = Σ_{i=0}^{n} [yi − (b₀ + b₁xi)]² = Σ_{i=0}^{n} (yi² + b₀² + b₁²xi² − 2b₀yi − 2b₁xiyi + 2b₀b₁xi).

To reformulate the problem as in Eq. (4.4), we compute the partial derivatives of Φ:

    ∂Φ/∂b₀ (b) = Σ_{i=0}^{n} (2b₀ − 2yi + 2b₁xi),
    ∂Φ/∂b₁ (b) = Σ_{i=0}^{n} (2b₁xi² − 2xiyi + 2b₀xi).

At this point, problem (4.4) can be written as the linear system (4.5) with A ∈ R²ˣ² and q ∈ R², where:

    A = [ (n + 1)           Σ_{i=0}^{n} xi
          Σ_{i=0}^{n} xi     Σ_{i=0}^{n} xi² ]    and    q = ( Σ_{i=0}^{n} yi,  Σ_{i=0}^{n} xi yi )ᵀ,

respectively. We observe that, in this case, the Vandermonde matrix V ∈ R^{(n+1)×2} can be expressed as:

    V = [ 1  x₀
          1  x₁
          ...
          1  xₙ ].



Chapter 5

Numerical Differentiation and Integration

This chapter considers numerical methods for approximating the derivatives of a function f(x); this
approach is known as numerical differentiation.
Next, we discuss numerical methods for approximating the definite integral of the function, commonly
referred to as numerical integration, which is typically performed using the so–called quadrature formulas.

5.1 Numerical Differentiation


Given a function f ∈ C 1 ((a, b)), the objective is to approximate, for example, f 0 (x) for some x ∈ (a, b) ⊆
R. In fact, given a function f (x), it may be preferable to avoid explicitly computing f 0 (x) for some
x ∈ (a, b), an operation that is computationally expensive in many cases. The same considerations apply
to the approximation of f 00 (x).
Numerical differentiation, however, becomes essential when only the set of data pairs {(xi, yi)}_{i=0}^{n} is
available, and the function f(x) from which such data were generated is not provided. In such a case, it
may still be of interest to provide information about the first (or second) derivative of the unknown function
f(x) at one or more of the nodes {xi}_{i=0}^{n}.

5.1.1 Finite difference schemes for the first derivative


We consider some finite difference schemes for approximating f 0 (x) for some x ∈ (a, b), starting with the
simplest schemes based on forward and backward finite differences.

Definition 5.1. Given a function f(x) and a step size h > 0, the approximation of f'(x) at some x ∈
(a, b) ⊆ R using the forward finite difference scheme is defined as:

    δ+f(x) := (f(x + h) − f(x)) / h,

while the approximation using the backward finite difference scheme is:

    δ−f(x) := (f(x) − f(x − h)) / h.

Example 5.1. We illustrate graphically the forward finite difference scheme (on the left) and the backward finite
difference scheme (on the right) for approximating f 0 (x) for some h > 0. To this end, we plot the straight lines
passing through the points (x, f (x)) and (x ± h, f (x ± h)), with slopes δ+ f (x) and δ− f (x), respectively.


(Figures: forward finite difference δ+f(x), left; backward finite difference δ−f(x), right.)

Proposition 5.1. If f ∈ C²((a, b)) and x ∈ (a, b), the error E+f(x) associated with the forward finite
difference scheme is:

    E+f(x) := f'(x) − δ+f(x) = −(1/2) h f''(ξ+)    for some ξ+ ∈ [x, x + h],

while the error E−f(x) associated with the backward finite difference scheme is:

    E−f(x) := f'(x) − δ−f(x) = (1/2) h f''(ξ−)    for some ξ− ∈ [x − h, x].

Proof. Forward finite difference scheme. Consider the Taylor series expansion of f(x + h) around x, that
is, f(x + h) = f(x) + f'(x) h + (1/2) f''(ξ+)h² for some ξ+ ∈ [x, x + h]. The result is obtained by using the
definition of the error E+f(x). The proof for the backward finite difference scheme is completely analogous.

The forward and backward finite difference schemes are methods of order of accuracy 1. In fact, the
errors E+f(x) and E−f(x) converge to zero with order 1 with respect to the step size h. If f ∈ P₁, we
have δ+f(x) ≡ δ−f(x) ≡ f'(x) for every x ∈ R.

Another scheme for approximating the first derivative of a function is the one based on central finite
differences.

Definition 5.2. Given a function f(x) and a step size h > 0, the approximation of f'(x) at some x ∈
(a, b) ⊆ R using the central finite difference scheme is defined as:

    δcf(x) := (f(x + h) − f(x − h)) / (2h).

Example 5.2. We illustrate graphically the central finite difference scheme for approximating f 0 (x) for some h > 0;
specifically, we plot the straight line with slope δc f (x).
Central finite difference, δc f (x)


Proposition 5.2. If f ∈ C³((a, b)) and x ∈ (a, b), the error Ecf(x) associated with the central finite
difference scheme is:

    Ecf(x) := f'(x) − δcf(x) = −(1/12) h² [f'''(ξ+) + f'''(ξ−)]

for some ξ+ ∈ [x, x + h] and ξ− ∈ [x − h, x].

Proof. Consider the Taylor series expansion of f(x + h) around x, which gives f(x + h) = f(x) + f'(x) h +
(1/2) f''(x)h² + (1/6) f'''(ξ+)h³ for some ξ+ ∈ [x, x + h]; similarly, the Taylor expansion of f(x − h)
around x is f(x − h) = f(x) − f'(x) h + (1/2) f''(x)h² − (1/6) f'''(ξ−)h³ for some ξ− ∈ [x − h, x]. By applying
the definition of the error Ecf(x), we obtain the result.

The central finite difference scheme is a method with order of accuracy 2. In fact, the error Ecf(x)
converges to zero with order 2 with respect to the step size h. We observe that, for f ∈ P₂, it holds that
δcf(x) ≡ f'(x) for every x ∈ R.
It is possible to show that δcf(x) = (Π2,{x−h,x,x+h} f)'(x), where Π2,{x−h,x,x+h} f(x) is the polynomial of
degree 2 that interpolates the function f(x) at the nodes x − h, x, and x + h.

5.1.2 Finite difference scheme for the second derivative


We consider the central finite difference scheme for the approximation of the second derivative f 00 (x) for
some x ∈ (a, b).

Definition 5.3. Given a function f(x) and a step size h > 0, the approximation of f''(x) at x ∈ (a, b) ⊆ R
using the central finite difference scheme is defined as:

    δc2 f(x) := (f(x + h) − 2f(x) + f(x − h)) / h².

Proposition 5.3. If f ∈ C⁴((a, b)) and x ∈ (a, b), the error Ec2 f(x) associated with the central finite
difference scheme is given by:

    Ec2 f(x) := f''(x) − δc2 f(x) = −(1/24) h² [f^(4)(ξ+) + f^(4)(ξ−)]

for some ξ+ ∈ [x, x + h] and ξ− ∈ [x − h, x].

Proof. Consider the Taylor expansion of f(x + h) around x, which gives f(x + h) = f(x) + f'(x) h +
(1/2) f''(x)h² + (1/6) f'''(x)h³ + (1/24) f^(4)(ξ+)h⁴ for some ξ+ ∈ [x, x + h]; similarly, the Taylor expansion of
f(x − h) around x is f(x − h) = f(x) − f'(x) h + (1/2) f''(x)h² − (1/6) f'''(x)h³ + (1/24) f^(4)(ξ−)h⁴ for some
ξ− ∈ [x − h, x]. By applying the definition of the error Ec2 f(x) and substituting the Taylor expansions into
δc2 f(x), we obtain the result.

The central finite difference scheme for the approximation of f 00 (x) is a method of order 2. We observe
that if f ∈ P3 , then δc2 f (x) ≡ f 00 (x) for every x ∈ R.
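The orders of accuracy of the three finite difference schemes can be verified numerically with the Python sketch below; the test function f(x) = e^x at x = 1 and the step sizes are illustrative choices.

    import math

    f   = math.exp          # f(x) = e^x, so f'(1) = f''(1) = e
    x   = 1.0
    ref = math.e

    for h in (1e-1, 1e-2, 1e-3):
        d_fwd  = (f(x + h) - f(x)) / h                       # forward difference, order 1
        d_cen  = (f(x + h) - f(x - h)) / (2 * h)             # central difference, order 2
        d2_cen = (f(x + h) - 2 * f(x) + f(x - h)) / h**2     # central 2nd derivative, order 2
        print(h, abs(ref - d_fwd), abs(ref - d_cen), abs(ref - d2_cen))
    # reducing h by a factor 10 reduces the first error by ~10 and the other two by ~100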


5.2 Numerical Integration


In this section, we consider several methods for approximating definite integrals of real functions of a real
variable, using the so–called quadrature formulas (numerical integration). In many cases, it may happen
that a primitive of the function to be integrated is not known. In any case, it is crucial to approximate
such integrals since, as we will see later, the approximation of problems described by partial differential
equations using the finite element method calls for the computation of definite integrals.
Given a function f ∈ C⁰([a, b]), we aim to numerically approximate its (definite) integral over the
interval [a, b], namely

    I(f) = ∫ₐᵇ f(x) dx,

using suitable quadrature formulas, denoted by Iq(f), such that Iq(f) ≃ I(f). We can classify quadrature
formulas into two main categories: simple formulas and composite formulas.
The simple quadrature formulas are based on a global approximation of the function f(x) over the
interval [a, b] by functions f̃(x) that are "simple" to integrate over [a, b] (that is, they can be integrated
explicitly, or in closed form); this means that

    Iq(f) = I(f̃) = ∫ₐᵇ f̃(x) dx,

where f̃(x) is an approximation of f(x) for x ∈ [a, b]. Typically, for simple quadrature formulas, f̃(x) is
a polynomial of degree n that interpolates f(x) at (n + 1) nodes in [a, b]; in this case we speak of interpolatory
quadrature formulas.
The composite quadrature formulas are based on dividing the interval [a, b] into M subintervals, possibly
of the same size H = (b − a)/M, on which the function f(x) is approximated by a piecewise function
f̃(x). Denoting the (M + 1) nodes {xk}_{k=0}^{M} as xk = a + k H for k = 0, . . . , M, with x₀ = a and xM = b,
we recall that

    I(f) = Σ_{k=1}^{M} ∫_{xk−1}^{xk} f(x) dx;

thus, the composite quadrature formula becomes

    Iq(f) = Σ_{k=1}^{M} ∫_{xk−1}^{xk} f̃k(x) dx,    where f̃k(x) = f̃(x)|_{[xk−1, xk]}  for every k = 1, . . . , M.

In general, for composite quadrature formulas, f̃(x) is a piecewise polynomial of degree n that interpolates
f(x) at (n + 1) nodes in each of the subintervals {[xk−1, xk]}_{k=1}^{M} of [a, b].

Example 5.3. The difference between a simple quadrature formula and a composite one for approximating the integral
I(f) = ∫ₐᵇ f(x) dx of a generic function f(x) over the interval [a, b] is shown in the following figure; we denote the
approximated integral by I(f̃).

(Figures: simple formula, left; composite formula with M = 3, right.)


The following definitions are useful for characterizing quadrature formulas.

Definition 5.4. The degree of exactness of a quadrature formula is the largest integer r ≥ 0 such that
all polynomials of degree less than or equal to r are exactly integrated by the formula, that is, such that
Iq (p) ≡ I(p) for every p ∈ Pr .

Definition 5.5. The order of convergence of a composite quadrature formula (also called the order of
accuracy) is the order of convergence of the associated error with respect to H, the size of the subintervals.

5.2.1 Midpoint quadrature formulas


By approximating the function f(x) with its Lagrange polynomial interpolant of degree 0 (i.e., with a
constant function), f̃(x) = Π₀f(x), or with a composite interpolant of degree 0 (i.e., a piecewise constant
function), ΠH0 f(x), the integral ∫ₐᵇ f̃(x) dx leads to the simple and composite midpoint formulas,
respectively.

Definition 5.6. Let f(x) ∈ C⁰([a, b]). The simple midpoint quadrature formula is defined as:

    Imp(f) := I(Π₀f) = (b − a) f((a + b)/2),

where Π₀f(x) is the polynomial of degree 0 that interpolates f(x) at the midpoint x̄ = (a + b)/2 of the
interval [a, b]. The composite midpoint quadrature formula is defined as:

    Imp^c(f) := I(ΠH0 f) = H Σ_{k=1}^{M} f(x̄k),    (5.1)

where ΠH0 f(x) is the piecewise polynomial of degree 0 that interpolates f(x) at the midpoints {x̄k}_{k=1}^{M}
of the M subintervals of length H = (b − a)/M into which [a, b] is divided, with x̄k = (xk−1 + xk)/2 for each
k = 1, . . . , M.

Example 5.4. We graphically illustrate the simple midpoint formula (on the left) and the composite midpoint formula
with M = 3 (on the right) for approximating the integral I(f) of a generic function f(x).


Proposition 5.4. If f ∈ C²([a, b]), the error emp(f) associated with the simple midpoint quadrature
formula is given by

    emp(f) := I(f) − Imp(f) = ((b − a)³/24) f''(ξ)    for some ξ ∈ [a, b],

while the error emp^c(f) associated with the composite midpoint quadrature formula is given by

    emp^c(f) := I(f) − Imp^c(f) = ((b − a)/24) H² f''(ξ)    for some ξ ∈ [a, b].

Proof. Simple formula. Letting x̄ = (a + b)/2 be the midpoint of [a, b], we consider the Taylor expansion of
f(x) around x̄, given by f(x) = f(x̄) + f'(x̄)(x − x̄) + (1/2) f''(η(x))(x − x̄)² for some η(x) ∈ [a, b].
Integrating this expression, we obtain I(f) = Imp(f) + f'(x̄) ∫ₐᵇ (x − x̄) dx + (1/2) f''(ξ) ∫ₐᵇ (x − x̄)² dx for
some ξ ∈ [a, b], by virtue of the integral mean theorem¹. Since ∫ₐᵇ (x − x̄) dx = 0 and ∫ₐᵇ (x − x̄)² dx =
(b − a)³/12, we obtain the desired result.

The midpoint quadrature formulas have degree of exactness equal to 1; in fact, the errors emp(f) and emp^c(f)
are identically zero for all polynomials of degree less than or equal to 1 (if f ∈ P₁, then f''(ξ) = 0 for
every ξ ∈ R).
The composite midpoint formula has an order of convergence (or order of accuracy) of 2; in fact, the
error emp^c(f) is proportional to H².
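The second–order convergence of the composite midpoint formula can be checked numerically with the Python sketch below; the test integral ∫₀^π sin(x) dx = 2 and the values of M are illustrative choices.

    import math

    def midpoint_composite(f, a, b, M):
        """Composite midpoint quadrature formula (5.1)."""
        H = (b - a) / M
        return H * sum(f(a + (k - 0.5) * H) for k in range(1, M + 1))

    f, a, b, exact = math.sin, 0.0, math.pi, 2.0
    for M in (4, 8, 16, 32):
        err = abs(exact - midpoint_composite(f, a, b, M))
        print(M, err, err * M**2)    # err * M^2 roughly constant: order 2 in H = (b - a)/M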

5.2.2 Trapezoidal quadrature formulas


By approximating the function f(x) with its Lagrange interpolant of degree 1, f̃(x) = Π₁f(x), or with a
composite interpolant of degree 1 (that is, a piecewise linear interpolant), ΠH1 f(x), the integral ∫ₐᵇ f̃(x) dx
leads respectively to the simple and composite trapezoidal formulas.

Definition 5.7. Let f(x) ∈ C⁰([a, b]). The simple trapezoidal quadrature formula is defined as:

    It(f) := I(Π₁f) = (b − a) (f(a) + f(b)) / 2,    (5.2)

where Π₁f(x) is the polynomial of degree 1 that interpolates f(x) at the nodes a and b.
The composite trapezoidal quadrature formula is defined as:

    It^c(f) := I(ΠH1 f) = (H/2) Σ_{k=1}^{M} [f(xk−1) + f(xk)] = (H/2) [f(x₀) + f(xM)] + H Σ_{k=1}^{M−1} f(xk),

where ΠH1 f(x) is the piecewise polynomial of degree 1 interpolating f(x) at the nodes {xk}_{k=0}^{M} of the M
subintervals of length H = (b − a)/M into which the interval [a, b] is divided.

¹ Integral Mean Theorem. Let $f, g \in C^0([a,b])$ and let $g(x)$ have a constant sign on $[a,b]$. Then there exists a point $c \in [a,b]$
such that $\int_a^b f(x) g(x)\,dx = f(c) \int_a^b g(x)\,dx$.


Example 5.5. We illustrate graphically the simple trapezoidal formula (on the left) and the composite trapezoidal
formula (on the right) for the approximation of the integral $I(f)$ of a generic function $f(x)$.

[Figure: simple trapezoidal formula $I_t(f)$ (left); composite trapezoidal formula $I_t^c(f)$ with $M = 3$ (right).]

Proposition 5.5. If $f \in C^2([a,b])$, the error $e_t(f)$ associated with the simple trapezoidal quadrature
formula is given by
\[
e_t(f) := I(f) - I_t(f) = -\frac{(b-a)^3}{12}\, f''(\xi) \quad \text{for some } \xi \in [a,b],
\]
while the error $e_t^c(f)$ associated with the composite trapezoidal quadrature formula is given by
\[
e_t^c(f) := I(f) - I_t^c(f) = -\frac{(b-a)}{12}\, H^2 f''(\xi) \quad \text{for some } \xi \in [a,b].
\]
Z b
Proof. Simple formula. Since $e_t(f) = I(f) - I(\Pi_1 f)$, we have that $e_t(f) = \int_a^b \left( f(x) - \Pi_1 f(x) \right) dx$. By
recalling the error function $E_1 f(x)$ in the case of polynomial interpolation of degree 1 (see Proposition 4.2),
we have that $E_1 f(x) = \frac{1}{2} f''(\eta(x))\, \omega_1(x)$ for some $\eta(x) \in [a,b]$, with $\omega_1(x) = (x-a)(x-b)$. Since $\omega_1(x)$
has a constant sign on $[a,b]$, the integral mean theorem gives
\[
e_t(f) = \frac{1}{2} \int_a^b f''(\eta(x))\, \omega_1(x)\,dx = \frac{1}{2} f''(\xi) \int_a^b \omega_1(x)\,dx, \quad \text{for some } \xi \in [a,b].
\]
Thus, since $\int_a^b \omega_1(x)\,dx = -\frac{(b-a)^3}{6}$, we obtain the desired result.
The trapezoidal quadrature formulas have a degree of exactness $r = 1$; in fact, the errors $e_t(f)$ and $e_t^c(f)$
are identically zero for all polynomials of degree less than or equal to 1 (if $f \in \mathbb{P}_1$, then $f''(\xi) = 0$ for
every $\xi \in \mathbb{R}$).
The composite trapezoidal formula has an order of convergence (or order of accuracy) of 2; in fact, the
error $e_t^c(f)$ is proportional to $H^2$.
Example 5.6. It is possible to use the error estimates for composite formulas to determine the minimum number of
intervals needed to achieve a certain accuracy in the approximation of an integral. Consider, for example, the composite
trapezoidal formula for approximating the integral $\int_{-2}^{2} e^x\,dx$. If we want the quadrature error to be less than a tolerance
$\varepsilon > 0$, we impose that
\[
|e_t^c(f)| = \left| -\frac{b-a}{12}\, H^2 f''(\xi) \right| \le \frac{b-a}{12}\, H^2 \max_{x \in [a,b]} |f''(x)| < \varepsilon.
\]
Since $f''(x) = e^x$, we have $\max_{x \in [-2,2]} |f''(x)| = e^2$; with $H = \frac{b-a}{M}$, this leads to $\frac{b-a}{12} \left( \frac{b-a}{M} \right)^2 \max_{x \in [a,b]} |f''(x)| < \varepsilon$ and then $M^2 > \frac{(b-a)^3}{12\,\varepsilon} \max_{x \in [a,b]} |f''(x)|$.
Choosing, for example, $\varepsilon = 10^{-4}$, we obtain $M > \sqrt{\dfrac{4^3}{12 \cdot 10^{-4}}\, e^2} \simeq 627.76$, which means $M \ge 628$.
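As a numerical check of this estimate (a sketch of ours, not part of these notes), the composite trapezoidal approximation with $M = 628$ subintervals can be compared with the exact value $e^2 - e^{-2}$ using Matlab's built-in trapz:

    % Numerical check of Example 5.6 (a sketch; variable names are ours).
    a = -2;  b = 2;  M = 628;            % number of subintervals from the estimate
    x = linspace(a, b, M + 1);           % M+1 equally spaced nodes
    Itc = trapz(x, exp(x));              % composite trapezoidal approximation
    Iex = exp(2) - exp(-2);              % exact value of the integral
    err = abs(Iex - Itc)                 % expected to be (safely) below 1e-4

The bound is not sharp: the actual error is typically smaller than the tolerance, since the estimate uses the maximum of $|f''|$ over the whole interval.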
12 · 10−4


5.2.3 Simpson’s quadrature formulas


By approximating the function $f(x)$ with its second-degree Lagrange polynomial interpolant, $\tilde f(x) = \Pi_2 f(x)$,
or with a composite second-degree interpolant (that is, a piecewise quadratic function) $\Pi_2^H f(x)$, the
integral $\int_a^b \tilde f(x)\,dx$ leads respectively to the simple and composite Simpson's formulas.

Definition 5.8. Let $f(x) \in C^0([a,b])$. The simple Simpson's quadrature formula is defined as:
\[
I_s(f) := I(\Pi_2 f) = \frac{b-a}{6} \left[ f(a) + 4 f\!\left(\frac{a+b}{2}\right) + f(b) \right], \tag{5.3}
\]
where $\Pi_2 f(x)$ is the polynomial of degree 2 that interpolates $f(x)$ at the nodes $a$, $b$, and at the midpoint
$(a+b)/2$.
The composite Simpson's quadrature formula is defined as:
\[
I_s^c(f) := I\big(\Pi_2^H f\big) = \frac{H}{6} \sum_{k=1}^{M} \left[ f(x_{k-1}) + 4 f(\bar x_k) + f(x_k) \right],
\]
where $\Pi_2^H f(x)$ is the piecewise polynomial of degree 2 that interpolates $f(x)$ at the nodes $\{x_k\}_{k=0}^{M}$ and at
the midpoints $\{\bar x_k\}_{k=1}^{M}$ of the $M$ subintervals of length $H = \frac{b-a}{M}$ into which $[a,b]$ is divided; the midpoints
of the subintervals are defined as $\bar x_k = \frac{x_{k-1} + x_k}{2}$ for each $k = 1, \ldots, M$.

Example 5.7. We illustrate graphically the simple Simpson's formula (on the left) and the composite Simpson's
formula (on the right) for the approximation of the integral $I(f)$ of a generic function $f(x)$.

[Figure: simple Simpson's formula $I_s(f)$ (left); composite Simpson's formula $I_s^c(f)$ with $M = 3$ (right).]

Proposition 5.6. If $f \in C^4([a,b])$, the error $e_s(f)$ associated with the simple Simpson's quadrature
formula is given by
\[
e_s(f) := I(f) - I_s(f) = -\frac{(b-a)^5}{2880}\, f^{(4)}(\xi) \quad \text{for some } \xi \in [a,b],
\]
while the error $e_s^c(f)$ associated with the composite Simpson's quadrature formula is given by
\[
e_s^c(f) := I(f) - I_s^c(f) = -\frac{(b-a)}{2880}\, H^4 f^{(4)}(\xi) \quad \text{for some } \xi \in [a,b].
\]

The Simpson's quadrature formulas have a degree of exactness $r = 3$; in fact, the errors $e_s(f)$ and $e_s^c(f)$ are
identically zero for all polynomials of degree less than or equal to 3 (if $f \in \mathbb{P}_3$, then $f^{(4)}(\xi) = 0$ for every
$\xi \in \mathbb{R}$).


The composite Simpson's formula has an order of convergence (or order of accuracy) of 4; indeed, the
error $e_s^c(f)$ is proportional to $H^4$.
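The order of accuracy can be observed experimentally by halving $H$ and monitoring the error. The following Matlab sketch (our own illustration, with an arbitrary test integrand) does this for the composite Simpson's formula:

    % Empirical check of the order 4 of the composite Simpson's formula (a sketch).
    f = @(x) exp(x);   a = -2;   b = 2;   Iex = exp(2) - exp(-2);
    for M = [10 20 40 80]
        H    = (b - a) / M;
        x    = linspace(a, b, M + 1);        % nodes x_0, ..., x_M
        xbar = a + H*((1:M) - 0.5);          % midpoints of the subintervals
        Isc  = H/6 * sum( f(x(1:end-1)) + 4*f(xbar) + f(x(2:end)) );
        fprintf('M = %3d   error = %10.3e\n', M, abs(Iex - Isc));
    end

Halving $H$ (that is, doubling $M$) should divide the error by roughly $2^4 = 16$, in agreement with Proposition 5.6.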

5.2.4 Interpolatory quadrature formulas


The previous quadrature formulas are examples of interpolatory quadrature formulas. We can indeed gen-
eralize what has been seen so far to the case where f is approximated over the interval [a, b] by an interpo-
lating polynomial of degree n at n + 1 nodes. For simplicity, we will consider the case of simple formulas,
although extending to the case of composite formulas does not present any significant difficulties.

Definition 5.9. Let $f(x) \in C^0([a,b])$. A (simple) interpolatory quadrature formula is defined as
\[
I_{q,n}(f) := I(\tilde f) = \sum_{j=0}^{n} \alpha_j\, f(y_j), \tag{5.4}
\]
where $\tilde f(x)$ is an interpolating function for $f(x)$ at $n+1$ quadrature nodes $\{y_j\}_{j=0}^{n}$ in $[a,b]$ and $\{\alpha_j\}_{j=0}^{n}$
are the corresponding quadrature weights, with $n \ge 0$.

The interpolating function $\tilde f(x)$ should be “simple” to integrate, and it can be obtained in many different
ways; its choice determines the method (or family of formulas) of numerical quadrature.

If $\tilde f(x) = \Pi_n f(x)$, the interpolating polynomial of degree $n = 0, 1, 2$ at $n+1$ equally spaced nodes in
$[a,b]$, we recover the simple midpoint, trapezoidal, and Simpson's quadrature formulas, respectively. Specifically, choosing
\[
\tilde f(x) = \Pi_0 f(x) \quad \text{with } n = 0, \quad \alpha_0 = b-a \quad \text{and} \quad y_0 = \frac{a+b}{2}
\]
in Eq. (5.4) gives the simple midpoint quadrature formula (5.1). For
\[
\tilde f(x) = \Pi_1 f(x) \quad \text{with } n = 1, \quad \alpha_0 = \alpha_1 = \frac{b-a}{2} \quad \text{and} \quad y_0 = a, \ y_1 = b
\]
in Eq. (5.4), we obtain the simple trapezoidal quadrature formula (5.2). Finally, for
\[
\tilde f(x) = \Pi_2 f(x) \quad \text{with } n = 2, \quad \alpha_0 = \alpha_2 = \frac{b-a}{6}, \ \alpha_1 = \frac{2(b-a)}{3} \quad \text{and} \quad y_0 = a, \ y_1 = \frac{a+b}{2}, \ y_2 = b
\]
in Eq. (5.4), we obtain the simple Simpson's quadrature formula (5.3).
In general, if we choose
\[
\tilde f(x) = \Pi_n f(x) = \sum_{k=0}^{n} f(x_k)\, \varphi_k(x),
\]
the Lagrange interpolating polynomial of degree $n \ge 0$ at the $n+1$ nodes $\{x_k\}_{k=0}^{n}$ in $[a,b]$, with $\{\varphi_k(x)\}_{k=0}^{n}$
being the corresponding Lagrange characteristic functions, the interpolatory formula (5.4) is obtained
by selecting the quadrature nodes coinciding with the interpolation nodes, $y_j = x_j$, and the weights
$\alpha_j = \int_a^b \varphi_j(x)\,dx$ for each $j = 0, \ldots, n$.
In particular, the quadrature formulas (5.4) are called (simple) Newton–Cotes formulas if the approximating function of $f(x)$ is its interpolating polynomial of degree $n$, that is, $\Pi_n f(x)$, at equally spaced
nodes in $[a,b]$. The (simple) Newton–Cotes formulas have a degree of exactness $r = n+1$ if $n$ is even
($n = 0, 2, 4, \ldots$), while they have a degree of exactness $r = n$ if $n$ is odd ($n = 1, 3, 5, \ldots$); in both cases, $r \ge n$.
The interpolatory quadrature formulas (5.4) are specified by $n$, the quadrature nodes $\{y_j\}_{j=0}^{n}$, and
the quadrature weights $\{\alpha_j\}_{j=0}^{n}$. However, the quadrature nodes and weights depend on the interval


$[a,b] \subset \mathbb{R}$ in question. To provide general quadrature formulas that can be applied to functions $f(x)$
defined on any interval $[a,b]$, the nodes and quadrature weights are specified for the reference interval
$[-1,1]$ and denoted as $\{\bar y_j\}_{j=0}^{n}$ and $\{\bar\alpha_j\}_{j=0}^{n}$, respectively. The quadrature nodes and weights for the
general interval $[a,b]$ can then be obtained as follows²:
\[
y_j = \frac{a+b}{2} + \frac{b-a}{2}\, \bar y_j \quad \text{for } j = 0, \ldots, n,
\]
and
\[
\alpha_j = \frac{b-a}{2}\, \bar\alpha_j \quad \text{for } j = 0, \ldots, n.
\]
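A minimal Matlab sketch of this change of interval (the vectors ybar and abar below hold the reference nodes and weights of the simple Simpson's formula, an illustrative choice of ours) could be:

    % Map reference quadrature nodes/weights from [-1,1] to a generic [a,b] (a sketch).
    ybar = [-1, 0, 1];       abar = [1/3, 4/3, 1/3];   % simple Simpson's rule on [-1,1]
    a = 0;  b = 1;
    y     = (a + b)/2 + (b - a)/2 * ybar;   % mapped quadrature nodes
    alpha = (b - a)/2 * abar;               % mapped quadrature weights
    I     = sum( alpha .* exp(y) )          % approximates the integral of e^x on [0,1] (exact: e - 1)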

5.2.5 Gaussian quadrature formulas


The Newton–Cotes formulas have a degree of accuracy $r \ge n$; specifically, for even $n$, we have $r = n+1$,
while for odd $n$, $r = n$. However, it is possible to find, for a given $n \ge 0$, the optimal positions of
the quadrature nodes $\{\bar y_j\}_{j=0}^{n}$ in $[-1,1]$ and the corresponding values of the quadrature weights $\{\bar\alpha_j\}_{j=0}^{n}$
such that the degree of accuracy of the quadrature formula is maximal. The corresponding quadrature
formulas are obtained by considering interpolating polynomials constructed using appropriate families of
orthogonal polynomials (the Legendre polynomials) as bases, and are referred to as Gaussian quadrature
formulas. Their degree of accuracy is $r = 2n+1$ if the endpoints of the interval $[-1,1]$ are excluded (the
Gauss–Legendre formulas), or $r = 2n-1$ if they are included (the Gauss–Legendre–Lobatto formulas).
We note that these quadrature formulas can then be extended to any interval $[a,b]$ as seen previously.
The Gauss–Legendre quadrature formulas refer to a family of interpolatory quadrature formulas obtained
by approximating the integrand function $f(x)$ using Legendre polynomials. The Legendre polynomials
$\{L_k(x)\}_{k=0}^{n+1}$ on the interval $[-1,1]$ are defined recursively as follows:
\[
L_0(x) = 1, \quad L_1(x) = x, \quad \text{and} \quad L_{k+1}(x) = \frac{2k+1}{k+1}\, x\, L_k(x) - \frac{k}{k+1}\, L_{k-1}(x) \quad \text{for } k = 1, \ldots, n;
\]
these polynomials are orthogonal in the sense that $\int_{-1}^{1} L_{n+1}(x)\, L_k(x)\,dx = 0$ for every $k = 0, \ldots, n$. In
particular, we have $L_0(x) = 1$, $L_1(x) = x$, $L_2(x) = \frac{3}{2}\, x\, L_1(x) - \frac{1}{2}\, L_0(x) = \frac{3}{2} x^2 - \frac{1}{2}$, etc.
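The three-term recursion can be implemented directly; the following Matlab sketch (the function name legendre_rec is ours, and should not be confused with Matlab's built-in legendre, which returns associated Legendre functions) evaluates $L_k(x)$ pointwise:

    % Evaluate the Legendre polynomial L_k at the points x via the recursion (a sketch).
    function L = legendre_rec(k, x)
        Lm1 = ones(size(x));              % L_0(x) = 1
        if k == 0, L = Lm1; return; end
        L = x;                            % L_1(x) = x
        for j = 1:k-1
            Lp1 = (2*j + 1)/(j + 1) .* x .* L - j/(j + 1) .* Lm1;   % L_{j+1}(x)
            Lm1 = L;
            L   = Lp1;
        end
    end

For example, legendre_rec(2, 0.5) returns $-0.125$, consistent with $L_2(0.5) = \frac{3}{2}(0.5)^2 - \frac{1}{2}$.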
2 2 2 2

Definition 5.10. Let $f(x) \in C^0([a,b])$. The Gauss–Legendre quadrature formula for $n \ge 0$ on the
reference interval $[-1,1]$ is
\[
I_{GL,n}(f) = \sum_{j=0}^{n} \bar\alpha_j^{GL}\, f\big(\bar y_j^{GL}\big),
\]
where the nodes and weights of quadrature are given, respectively, by
\[
\bar y_j^{GL} := \text{zeros of } L_{n+1}(x) \quad \text{for each } j = 0, \ldots, n,
\]
\[
\bar\alpha_j^{GL} := \frac{2}{\left[ 1 - \big(\bar y_j^{GL}\big)^2 \right] \left[ L'_{n+1}\big(\bar y_j^{GL}\big) \right]^2} \quad \text{for each } j = 0, \ldots, n.
\]

The degree of accuracy of the Gauss–Legendre formula is $r = 2n+1$ for every $n \ge 0$.


In the following table, the nodes and weights of the Gauss–Legendre quadrature formulas on the interval
$[-1,1]$ are reported for $n = 0, 1, 2$, along with the corresponding degree of accuracy $r$. This formula
maximizes the degree of accuracy $r$ for any given $n \ge 0$.

² The first of these two relations is analogous to the transformation used in Definition 4.5 to map the Chebyshev–Gauss–Lobatto
nodes from the reference interval $[-1,1]$ to the generic interval $[a,b]$.


    n    nodes $\{\bar y_j^{GL}\}_{j=0}^{n}$                 weights $\{\bar\alpha_j^{GL}\}_{j=0}^{n}$       r
    0    $\{0\}$                                             $\{2\}$                                         1   (midpoint formula)
    1    $\{-1/\sqrt{3},\ +1/\sqrt{3}\}$                     $\{1,\ 1\}$                                     3
    2    $\{-\sqrt{15}/5,\ 0,\ +\sqrt{15}/5\}$               $\{5/9,\ 8/9,\ 5/9\}$                           5

We observe that the Gauss–Legendre quadrature formula for n = 0 coincides with the simple midpoint
formula.

Example 5.8. We can verify that the Gauss–Legendre formula for $n = 1$ has a degree of accuracy $r = 3$, meaning that
$I_{GL,1}(f) = I(f)$ for every $f \in \mathbb{P}_3$. Considering a generic polynomial of degree 3, say $f(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3$
for certain $c_0, c_1, c_2, c_3 \in \mathbb{R}$, on the reference interval $[-1,1]$ we have
\[
I(f) = \int_{-1}^{1} f(x)\,dx = 2 c_0 + \frac{2}{3} c_2.
\]

Taking $n = 1$, we find that
\[
I_{GL,1}(f) = \big(\bar\alpha_0^{GL} + \bar\alpha_1^{GL}\big)\, c_0 + \big(\bar\alpha_0^{GL}\, \bar y_0^{GL} + \bar\alpha_1^{GL}\, \bar y_1^{GL}\big)\, c_1 + \big(\bar\alpha_0^{GL} (\bar y_0^{GL})^2 + \bar\alpha_1^{GL} (\bar y_1^{GL})^2\big)\, c_2 + \big(\bar\alpha_0^{GL} (\bar y_0^{GL})^3 + \bar\alpha_1^{GL} (\bar y_1^{GL})^3\big)\, c_3.
\]

Imposing that $I_{GL,1}(f) = I(f)$ for every $c_0, c_1, c_2, c_3 \in \mathbb{R}$, we obtain the following constraints:
\[
\begin{cases}
\bar\alpha_0^{GL} + \bar\alpha_1^{GL} = 2, \\
\bar\alpha_0^{GL}\, \bar y_0^{GL} + \bar\alpha_1^{GL}\, \bar y_1^{GL} = 0, \\
\bar\alpha_0^{GL} (\bar y_0^{GL})^2 + \bar\alpha_1^{GL} (\bar y_1^{GL})^2 = \dfrac{2}{3}, \\
\bar\alpha_0^{GL} (\bar y_0^{GL})^3 + \bar\alpha_1^{GL} (\bar y_1^{GL})^3 = 0.
\end{cases}
\]
Solving this system, we obtain the quadrature nodes $\bar y_0^{GL} = -\frac{1}{\sqrt{3}}$ and $\bar y_1^{GL} = +\frac{1}{\sqrt{3}}$, with the corresponding weights
$\bar\alpha_0^{GL} = \bar\alpha_1^{GL} = 1$; thus, we conclude that the Gauss–Legendre formula $I_{GL,1}(f)$ integrates polynomials of degree 3 exactly,
regardless of the values of the coefficients $c_0, c_1, c_2, c_3 \in \mathbb{R}$. Therefore, we have verified that the formula has a
degree of accuracy equal to 3.
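The algebra of Example 5.8 can also be checked numerically with a few lines of Matlab (a sketch; the polynomial coefficients are arbitrary choices of ours):

    % Check that the Gauss-Legendre formula with n = 1 integrates cubics exactly (a sketch).
    c = [0.7, -1.3, 2.0, 0.4];                        % arbitrary coefficients c0, c1, c2, c3
    f = @(x) c(1) + c(2)*x + c(3)*x.^2 + c(4)*x.^3;
    ybar = [-1/sqrt(3), 1/sqrt(3)];   abar = [1, 1];  % Gauss-Legendre nodes/weights, n = 1
    IGL = sum( abar .* f(ybar) );                     % quadrature approximation
    Iex = 2*c(1) + 2/3*c(3);                          % exact integral on [-1,1]
    abs(Iex - IGL)                                    % zero up to round-off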

The Gauss–Legendre quadrature formulas maximize the degree of accuracy $r$ for every given value
of $n \ge 0$, but the corresponding quadrature nodes all lie in the interior of the reference interval $[-1,1]$,
so the endpoints are never quadrature nodes. In some situations, however, one might want to include the
endpoints $\{-1, 1\}$ in the set of quadrature nodes. The Gauss–Legendre–Lobatto quadrature formulas
maximize the degree of accuracy when the endpoints $\{-1, 1\}$ are included in the set of quadrature nodes.

Definition 5.11. Let $f(x) \in C^0([a,b])$. Then, the Gauss–Legendre–Lobatto quadrature formula for $n \ge 1$
on the reference interval $[-1,1]$ is
\[
I_{GLL,n}(f) = \sum_{j=0}^{n} \bar\alpha_j^{GLL}\, f\big(\bar y_j^{GLL}\big),
\]
where the nodes and weights of quadrature are given, respectively, by
\[
\bar y_0^{GLL} := -1, \quad \bar y_n^{GLL} := +1, \quad \text{and} \quad \bar y_j^{GLL} := \text{zeros of } L'_n(x) \quad \text{for each } j = 1, \ldots, n-1,
\]
\[
\bar\alpha_j^{GLL} := \frac{2}{n(n+1)}\, \frac{1}{\left[ L_n\big(\bar y_j^{GLL}\big) \right]^2} \quad \text{for each } j = 0, \ldots, n.
\]


The degree of accuracy of the Gauss–Legendre–Lobatto formula is r = 2n − 1 for every n ≥ 1.


In the following table, the nodes and weights of the Gauss–Legendre–Lobatto formulas on the interval
[−1, 1] for n = 1, 2, 3 are reported, along with the corresponding degree of accuracy r. Note that this
formula is not defined for n = 0.
    n    nodes $\{\bar y_j^{GLL}\}_{j=0}^{n}$                         weights $\{\bar\alpha_j^{GLL}\}_{j=0}^{n}$      r
    1    $\{-1,\ +1\}$                                                $\{1,\ 1\}$                                     1   (trapezoidal formula)
    2    $\{-1,\ 0,\ +1\}$                                            $\{1/3,\ 4/3,\ 1/3\}$                           3   (Simpson's formula)
    3    $\{-1,\ -1/\sqrt{5},\ +1/\sqrt{5},\ +1\}$                    $\{1/6,\ 5/6,\ 5/6,\ 1/6\}$                     5

We observe that the Gauss–Legendre–Lobatto quadrature formulas for n = 1 and n = 2 coincide with the
simple trapezoidal and Simpson’s formulas, respectively.
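As a further sanity check (a sketch of ours), one can verify from the table that the Gauss–Legendre–Lobatto formula with $n = 3$ integrates a polynomial of degree at most 5 exactly:

    % Check the degree of exactness r = 5 of the GLL formula with n = 3 (a sketch).
    ybar = [-1, -1/sqrt(5), 1/sqrt(5), 1];    % Gauss-Legendre-Lobatto nodes, n = 3
    abar = [1/6, 5/6, 5/6, 1/6];              % Gauss-Legendre-Lobatto weights, n = 3
    f    = @(x) x.^4;                         % a polynomial of degree <= 5
    IGLL = sum( abar .* f(ybar) );            % quadrature approximation
    Iex  = 2/5;                               % exact value of the integral of x^4 on [-1,1]
    abs(Iex - IGLL)                           % zero up to round-off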

5.2.6 Numerical integration in dimension d > 1


Numerical integration in multiple dimensions, that is, the integration of continuous functions $f : \Omega \to \mathbb{R}$,
with $\Omega \subset \mathbb{R}^d$ for $d \ge 2$, is based on the generalization of the quadrature formulas seen in the previous
sections. Simple quadrature formulas are defined for specific reference domains, such as triangles and
quadrilaterals for $d = 2$ or tetrahedra and hexahedra for $d = 3$. Numerical integration over complex
domains, on the other hand, relies on composite formulas. These formulas are of fundamental interest in
finite element codes for the numerical simulation of differential problems on general computational grids
(in dimension greater than 1). In the case $d = 2$, a schematic representation of some of these quadrature
formulas is provided below.

Example 5.9. Schematic representation of a quadrature formula for $d = 2$.

[Figure: a simple quadrature formula (left) and a composite quadrature formula (right) for $d = 2$.]


