Numerical Analysis Notes (In Progress)
Politecnico di Milano
Luca Dede’
The course introduces fundamental concepts of Numerical Analysis, focusing on Scientific Comput-
ing, numerical simulation, and numerical methods for solving Partial Differential Equations, particularly
through the Finite Element method. It covers both theoretical and practical aspects, aiming to equip stu-
dents with the skills necessary to numerically solve mathematical problems that are relevant to Civil En-
gineering. The objectives of the course include developing competencies in the following key topics:
numerical linear algebra, numerical solution of nonlinear equations, approximations of functions and data,
numerical differentiation and integration, numerical solution of Ordinary Differential Equations (ODEs),
boundary value problems, and Partial Differential Equations (PDEs) in 1D, particularly using the Finite
Element (FE) method.
Most of the content of these lecture notes refers to [1, 2, 3]. The course will feature the use of the
software Matlab® [4], a multi–paradigm programming language and numerical computing environment
developed by MathWorks, to (i) solve mathematical and numerical problems; (ii) perform numerical
simulations; and (iii) handle postprocessing, as well as visualization of the numerical solution and
scientific computing data.
Introduction to Numerical Analysis and Scientific Computing
Figure 1.1: Numerical Analysis: physical, mathematical, and numerical problems and solutions
lems, parallel architectures and High Performance Computing frameworks are typically employed. See the
sketches in Figs. 1.1 and 1.2.
Mathematical models are conventionally used together with theoretical (mathematical) models and
experimental tests. In several cases, however, theoretical models are not available – as in Computational
Mechanics and Fluid Dynamics – or experimental tests are not meaningful or cannot be performed (for
example, nuclear testing or extreme natural events such as earthquakes). Physics–based models have
played an increasing role in modern society by virtue of the massive development of Scientific
Computing and computational tools starting from the 1950s and 1960s. More recently, we have entered a new
era characterized by Data Science. Indeed, a large amount of data is now becoming available from multiple
sources (digital measurements, experiments, etc.). Data–driven models are still mathematical models,
but they are built from meaningful data rather than from physical principles – because the latter are not
available or are not reliable – and their construction calls for statistical learning methods.
In any case, data are crucial for physics–based models: as a matter of fact, despite the different
paradigms, data–driven and physics–based models can be and must be used in a synergistic way. Physics–
based models require input data (for example in the form of physical and geometrical parameters for differ-
ential equations) and validation through experimental tests and measurements. More recently, new ways
of balancing data–driven and physics–based models have emerged and are increasingly used as numerical
models. An example is represented by Scientific Machine Learning, an evolution of Scientific Comput-
ing that exploits statistical learning based on Machine and Deep Learning algorithms. This can be used
to accelerate or empower standard numerical methods in Scientific Computing, to reduce complexity and
computational costs of the numerical simulations, or to build digital twins. A digital twin is a virtual
replica of a physical system or process, serving as its digital counterpart for practical applications such as
simulation, testing, monitoring, maintenance, prediction, and forecasting. Real–time interaction between
the physical system and its digital twin is essential since it must guarantee effective observation and control.
Physics–based mathematical models (mathematical problems) are a fundamental pillar in the under-
standing and prediction of several physical phenomena and processes (physical problems). However,
these mathematical models may lead to problems that cannot be solved analytically, or in an exact way
(thus yielding the exact solution), especially for differential problems. In these cases, one is unable to
write their solution explicitly.
Numerical methods and numerical approximation techniques (numerical problems) serve to determine an
(approximate) numerical solution of a mathematical model. When a calculator is used to determine such
an (approximate) numerical solution, the latter is called the numerical solution at the calculator;
see Fig. 1.2.
Figure 1.2: Scientific Computing: physical, mathematical, and numerical problems and solutions
In the former, P generically indicates the model, u is the exact solution – a vector or a function of one or
more independent variables (space and/or time variables) – and g indicates the set of data.
In the following, we report some examples of MPs in the form of differential models, involving ODEs
and PDEs; we consider specifically linear models for the PDEs with a scalar unknown solution. The goal is
to illustrate some challenging models that we will address in this course, which are relevant and meaningful
for applications in Civil Engineering.
ODEs
ODEs are also known as initial value problems. A first order ODE, or Cauchy problem, is a differential
problem whose solution u = u(t) is a function of a single independent variable t, often interpreted as time.
A single condition is assigned on the solution at a point (usually, the left end of the integration interval).
Its form is the following: find u : I ⊂ R → R such that
\[
\begin{cases}
\dfrac{du}{dt}(t) = f(t, u(t)) & t \in I, \\
u(t_0) = u_0,
\end{cases}
\tag{1.1}
\]
This equation models a stationary phenomenon – the time variable does not appear in fact – and represents
a diffusion model. For example, this models the diffusion of a pollutant along a 1D channel Ω = (a, b) or
the vertical displacement of an elastic string (thread) fixed at its ends. In the first case, f = f (x) indicates
the source of the pollutant along the flow, while in the second case, f is the transverse force acting on the
elastic string, in the hypothesis of negligible mass and small displacements of the string.
We remark that the boundary value problem in 1D is a particular case of PDEs, even if it involves only
derivatives with respect to a single independent variable x. Indeed, even if apparently similar to a second
order ODE, the boundary value problem is in reality substantially different from an ODE: in Eq. (1.2), two
conditions are set at $t = t_0$, while in Eq. (1.3), one condition is set at x = a and the other one at x = b. The
conditions in the boundary value problem determine the so–called global nature of the model.
These problems concern equations that depend on space and time: the unknown solution u = u(x, t)
depends on both the space coordinate x ∈ Ω ⊂ R in 1D and the time variable t ∈ I ⊂ R. In this case, the initial
conditions at t = 0 must be prescribed, as well as the boundary conditions at the ends of the interval in 1D.
The heat equation – also known as diffusion equation – with Dirichlet boundary conditions assumes
the following form: find u : Ω × I → R such that
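\[
\begin{cases}
\dfrac{\partial u}{\partial t}(x, t) - \mu \dfrac{\partial^2 u}{\partial x^2}(x, t) = f(x, t) & x \in \Omega = (a, b),\ t \in I, \\
u(a, t) = u(b, t) = 0 & t \in I, \\
u(x, t_0) = u_0(x) & x \in \Omega = (a, b).
\end{cases}
\tag{1.4}
\]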
For example, the unknown function u(x, t) describes the temperature at a point x ∈ Ω = (a, b) and time
t ∈ I of a metallic bar covering the space interval Ω. The diffusion coefficient µ represents the thermal
response of the material and it is related to its thermal conductivity. The Dirichlet boundary conditions
express the fact that the ends of the bar are kept at a reference temperature (zero degrees in this case), while
at the time $t = t_0$, the temperature is assigned at each point x ∈ Ω through the initial function $u_0(x)$.
Finally, the bar is subjected to a heat source of linear density f(x, t).
Another example is the wave equation, which involves second order derivatives with respect to both
space and time variables. The model is used to describe wave phenomena, including standing wave fields,
such as mechanical waves (e.g., water waves, sound waves, and seismic waves) or electromagnetic waves
(e.g., light waves). It finds application in fields such as acoustics, seismology, electromagnetism, and fluid
dynamics. The wave equation reads: find u : Ω × I → R such that
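\[
\begin{cases}
\dfrac{\partial^2 u}{\partial t^2}(x, t) - c^2 \dfrac{\partial^2 u}{\partial x^2}(x, t) = f(x, t) & x \in \Omega = (a, b),\ t \in I, \\
u(a, t) = u(b, t) = 0 & t \in I, \\
\dfrac{\partial u}{\partial t}(x, t_0) = v_0(x) & x \in \Omega = (a, b), \\
u(x, t_0) = u_0(x) & x \in \Omega = (a, b).
\end{cases}
\tag{1.5}
\]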
For example, the unknown function u(x, t) describes the displacement at a point x ∈ Ω = (a, b) and time
t ∈ I of an elastic string covering the space interval Ω. The parameter c > 0 is a fixed real coefficient
representing the propagation speed of the wave. At the initial time $t = t_0$, both the fields $u$ and
$\frac{\partial u}{\partial t}$ need to be assigned at each point x ∈ Ω through the functions $u_0(x)$ and $v_0(x)$, respectively.
where
\[
\Delta u(x) := \sum_{i=1}^{d} \frac{\partial^2 u}{\partial x_i^2}(x)
\]
is the Laplace operator, the domain $\Omega \subset \mathbb{R}^d$ is endowed with boundary $\partial\Omega$, and f = f(x) is the external
forcing term. This equation is used for example to model the vertical displacement of an elastic membrane
fixed at the boundaries.
if $u \in C^1(I)$. The expression of u(t) can be determined only if f(t, y) assumes specific forms.
For the 1D Poisson problem (1.3), if we consider for simplicity the case in which Ω = (a, b) = (0, 1)
and $f \in C^0([0, 1])$, then there exists a unique solution $u \in C^2([0, 1])$, which is expressed as:
\[
u(x) = x \int_0^1 (1 - s)\, f(s)\, ds - \int_0^x (x - s)\, f(s)\, ds. \tag{1.7}
\]
We remark that it is possible to find the solution u(x) explicitly as long as the primitives of the functions
under integrals are available.
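As an illustration, formula (1.7) can be checked on a case where the primitives are elementary. The sketch below is written in Python rather than Matlab (a choice made here only for illustration) and assumes the constant forcing term f ≡ 1, for which (1.7) gives u(x) = x(1 − x)/2. The helper names `midpoint` and `poisson_solution` are hypothetical, and the integrals are approximated with a composite midpoint rule.

```python
def midpoint(g, a, b, n=2000):
    """Composite midpoint rule for the integral of g over (a, b)."""
    h = (b - a) / n
    return h * sum(g(a + (k + 0.5) * h) for k in range(n))


def poisson_solution(f, x):
    """Evaluate formula (1.7) at x, with both integrals computed numerically."""
    first = x * midpoint(lambda s: (1.0 - s) * f(s), 0.0, 1.0)
    second = midpoint(lambda s: (x - s) * f(s), 0.0, x)
    return first - second


def f(s):
    return 1.0  # illustrative forcing term (an assumption of this sketch)


def exact(x):
    return 0.5 * x * (1.0 - x)  # closed form of (1.7) when f = 1


for x in (0.25, 0.5, 0.75):
    print(x, poisson_solution(f, x), exact(x))
```

The two printed columns agree up to round-off, because the midpoint rule is exact for the linear integrands that arise when f is constant.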
For the heat equation (1.4), in the case Ω = (a, b) = (0, 1) and f = 0, the solution can be expressed as
a Fourier expansion:
\[
u(x, t) = \sum_{j=1}^{+\infty} u_{0,j}\, e^{-\mu (j\pi)^2 t} \sin(j \pi x),
\]
Other than calling for the calculation of integrals, the solution is often approximated – i.e. the Fourier
series is truncated – when the initial data is represented through a Fourier expansion with infinite terms.
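A minimal Python sketch of the truncation mentioned above (Python is used here for illustration instead of Matlab). It assumes that the coefficients are the sine coefficients of the initial datum, $u_{0,j} = 2 \int_0^1 u_0(x) \sin(j\pi x)\, dx$, approximated by a midpoint rule. The initial datum $u_0(x) = x(1 - x)$, the value µ = 1, and the number of retained terms are illustrative assumptions, and the function names are hypothetical.

```python
import math


def sine_coefficients(u0, n_terms, n_quad=2000):
    """Approximate u_{0,j} = 2 * int_0^1 u0(x) sin(j pi x) dx by the midpoint rule."""
    h = 1.0 / n_quad
    nodes = [(k + 0.5) * h for k in range(n_quad)]
    return [2.0 * h * sum(u0(x) * math.sin(j * math.pi * x) for x in nodes)
            for j in range(1, n_terms + 1)]


def heat_solution(coeffs, mu, x, t):
    """Truncated Fourier expansion of the heat equation solution with f = 0."""
    return sum(c * math.exp(-mu * (j * math.pi) ** 2 * t) * math.sin(j * math.pi * x)
               for j, c in enumerate(coeffs, start=1))


def u0(x):
    return x * (1.0 - x)  # illustrative initial datum (an assumption)


coeffs = sine_coefficients(u0, n_terms=25)
print(heat_solution(coeffs, mu=1.0, x=0.5, t=0.0))  # close to u0(0.5) = 0.25
print(heat_solution(coeffs, mu=1.0, x=0.5, t=0.1))  # smaller: the temperature decays
```

Both the quadrature of the coefficients and the truncation of the series introduce approximation errors, which is why the value at t = 0 only approximates $u_0(0.5)$.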
Analytical methods for differential models can be used to find the solution of some simple problems,
for example through the method of separation of variables for some PDEs. These methods are useful
for determining the analytical expression of the solution, as well as investigating some of its qualitative
properties that can be connected to the physical problem from which the mathematical problem under
examination originates.
In addition, analytical methods can be used to establish if the correct data have been assigned to the
problem – like boundary conditions for PDEs – and therefore if the problem is well–posed. It is indeed
crucial to determine if the solution of a differential model exists and is unique, even if an expression for
the analytical solution cannot be determined.
Definition 1.1. The set of floating–point numbers F is the subset of real numbers that can be represented
by the calculator, that is F ⊂ R with dim (F) < +∞. In general, F = F0 ∪ {0}, where the set F0 includes
all the floating–point numbers excluding the zero.
For the sake of simplicity, from now on, we will indicate with F the floating–point numbers excluding zero.
The set F = F(β, t, L, U ) is characterized by four parameters: β, t, L, and U . Every real number x ∈ F
can be written as:
\[
x = (-1)^s\, m\, \beta^{e-t} = (-1)^s\, (a_1 a_2 \cdots a_t)_\beta\, \beta^{e-t},
\]
where:
• β ≥ 2 is the base, a natural number determining the numeral system;
• $m = (a_1 a_2 \cdots a_t)_\beta$ is the mantissa, with $0 < m \le \beta^t - 1$, where t is the number of digits (significant
figures), such that $0 < a_1 \le \beta - 1$ and $0 \le a_i \le \beta - 1$ for i = 2, . . . , t;
• e ∈ Z is the exponent such that L ≤ e ≤ U, for L < 0 and U > 0;
• s ∈ {0, 1} is the sign.
The exponent e defines the range of machine numbers, while the number of digits t in the mantissa m
defines its precision. Given the floating–point set F(β, t, L, U ) of the calculator, a number x ∈ F is fully
defined by the values assumed by s, m, and e.
Example 1.1. For $x = (4.25)_{10}$, we have the representation $x = (100.01)_2 = 1.0001 \cdot 2^2 = 0.10001 \cdot 2^3$, that is
s = 0, β = 2, $m = (10001)_2$, $e = 3 = (11)_2$, t = 5.
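The representation in Example 1.1 can be reproduced with a few lines of Python (used here for illustration instead of Matlab). The function `to_floating_point` is a hypothetical helper, and the sketch assumes that x is nonzero and exactly representable with t digits in base β, so that no rounding is involved.

```python
def to_floating_point(x, beta=2, t=5):
    """Return (s, m, e) such that x = (-1)**s * m * beta**(e - t).

    Assumes x is nonzero and exactly representable with t digits in base beta.
    """
    s = 0 if x > 0 else 1
    y = abs(x)
    e = 0
    while y >= 1:          # scale down until y < 1
        y /= beta
        e += 1
    while y < 1 / beta:    # scale up until the leading digit a_1 is nonzero
        y *= beta
        e -= 1
    m = round(y * beta ** t)
    return s, m, e


s, m, e = to_floating_point(4.25)
print(s, bin(m), e)   # 0 0b10001 3
```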
Given a number x ∈ R, fl(x) ∈ F indicates the representation of the real number x as a floating–point
number. We observe that, in general, fl(x) ≠ x, unless x ∈ F.
The set of floating–point numbers F(β, t, L, U ) is associated with the following properties.
• Machine epsilon, which is the value
\[
\varepsilon_M := \beta^{1-t},
\]
representing the distance between 1 and the smallest floating–point number larger than 1, that is the
smallest (positive) real number such that $fl(1 + \varepsilon_M) > 1$.
• Round–off error, the relative error between x ∈ R \ {0} and its floating–point representation fl(x) ∈
F. This error is bounded as:
\[
\frac{|x - fl(x)|}{|x|} \le \frac{1}{2}\, \varepsilon_M, \qquad x \ne 0,
\]
where $\frac{1}{2}\varepsilon_M$ is the round–off unit; this is the largest relative error introduced by the calculator in
representing any real number x. Even if $\frac{1}{2}\varepsilon_M$ is small, the absolute error |x − fl(x)| can be very
large, especially if |x| is large.
• The smallest and largest (positive) numbers represented in F are $x_{min} = \beta^{L-1}$ and
$x_{max} = \beta^{U} (1 - \beta^{-t})$, respectively. Aside from zero, numbers smaller than $x_{min}$ (in modulus), or larger
than $x_{max}$, cannot be represented.
• The larger the values |fl(x)|, the less dense the numbers in F.
Example 1.2. Let us consider the set F(2, 2, −1, 2), for which β = 2 (numeral system in base 2), t = 2, L = −1,
and U = 2. Then, we have $\varepsilon_M = \beta^{1-t} = \frac{1}{2}$, $x_{min} = \beta^{L-1} = \frac{1}{4}$, and $x_{max} = \beta^{U}(1 - \beta^{-t}) = 3$; the cardinality
of F is $2(\beta - 1)\, \beta^{t-1} (U - L + 1) = 2 \cdot 1 \cdot 2 \cdot 4 = 16$. The exponent e can assume values −1, 0, 1, or 2. The
mantissa is $m = (a_1 a_2)_\beta$, as t = 2; then, as β = 2, we have $a_1 = 1$, and $a_2 = 0$ or $a_2 = 1$. The possible values of m are
$m = (10)_2 = 2$ or $m = (11)_2 = 3$. If s = 0, the positive real numbers in F are $x = m\, \beta^{e-t} = m\, 2^{e-2}$ and are reported
in the following table.

                    e = −1   e = 0   e = 1   e = 2
m = (10)_2 = 2       1/4      1/2      1       2
m = (11)_2 = 3       3/8      3/4     3/2      3
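The values in Example 1.2 can be enumerated directly. The following Python sketch (Python is used here for illustration instead of Matlab) relies on exact rational arithmetic from the standard `fractions` module; the function name `positive_floats` is hypothetical.

```python
from fractions import Fraction


def positive_floats(beta, t, L, U):
    """List the positive elements m * beta**(e - t) of F(beta, t, L, U)."""
    mantissas = range(beta ** (t - 1), beta ** t)   # leading digit a_1 > 0
    return sorted(Fraction(m) * Fraction(beta) ** (e - t)
                  for m in mantissas for e in range(L, U + 1))


beta, t, L, U = 2, 2, -1, 2
pos = positive_floats(beta, t, L, U)
print([str(x) for x in pos])   # ['1/4', '3/8', '1/2', '3/4', '1', '3/2', '2', '3']
print(2 * len(pos))            # 16 elements once the negative numbers are included
eps = min(x for x in pos if x > 1) - 1
print(eps)                     # 1/2, the distance between 1 and the next number
```

The spacing between consecutive elements doubles each time the exponent increases by one, which shows that the numbers become less dense as their modulus grows.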
The floating–point standard IEEE double precision uses a string of N = 64 bits to represent real
numbers, in base β = 2, at the calculator, with F = F(2, 52, −1022, 1023). The bits are allocated as follows:

field    s (sign)    e (exponent)    m (mantissa)
bits     1           11              52
Example 1.3. The real number $x = \frac{1}{10}$ cannot be represented exactly in numeral systems with base β = 2; indeed,
$x = \frac{1}{2^4} + \frac{1}{2^5} + \frac{0}{2^6} + \frac{0}{2^7} + \frac{1}{2^8} + \frac{1}{2^9} + \cdots$, which is an infinite series. Its floating–point representation in double
precision is $fl(x) = \frac{1}{2^4} + \frac{1}{2^5} + \frac{0}{2^6} + \frac{0}{2^7} + \frac{1}{2^8} + \frac{1}{2^9} + \cdots + \frac{0}{2^{51}} + \frac{1}{2^{52}}$, that is an approximation of x.
For 64 bits (double precision) CPUs, numbers are represented in base β = 2, for which t = 52.
However, as $0 < a_1 \le \beta - 1 = 1$, we always have $a_1 = 1$. In this case, the number of digits t used for the
mantissa m is in fact 52 + 1 = 53: since the first digit of the mantissa, $a_1$, is always 1, it is not necessary
to store it explicitly. By default, numbers in Matlab® are represented in base β = 2, with number of
equivalent digits t = 53 in the double precision representation (64 bits); the same format is available in other
programming languages, such as C++. It follows that in Matlab® (double precision format):
Example 1.4. We report some examples of round–off errors in Matlab® for the double precision standard.
• The assignment a = 1 - 3 * ( 4 / 3 - 1 ) returns a = 2.2204e-16 (not zero).
• The assignment b = sqrt(1e-16 + 1) - 1 returns b = 0.
• The assignments c = 1e-16 - 1e-16 + 1 and d = 1e-16 + 1 - 1e-16, together with the operation
f = c - d, return f = 1.1102e-16 (not zero).
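The same behaviour can be observed in any environment that uses IEEE double precision. As a sketch, the three assignments are repeated below in Python (an assumption of this illustration, in place of Matlab); Python prints more digits than the short Matlab format.

```python
import math

a = 1 - 3 * (4 / 3 - 1)          # exact arithmetic would give 0
b = math.sqrt(1e-16 + 1) - 1     # 1e-16 + 1 is rounded to 1
c = 1e-16 - 1e-16 + 1
d = 1e-16 + 1 - 1e-16            # same terms as c, different order
f = c - d

print(a)   # 2.220446049250313e-16
print(b)   # 0.0
print(f)   # 1.1102230246251565e-16
```

Here a equals 2^(−52) and f equals 2^(−53), so the errors are of the size of the machine epsilon and of the round-off unit for t = 53.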
The associative property is violated whenever an overflow or underflow situation occurs. Issues with
floating–point operations appear when two numbers with the same sign and similar value are subtracted
(see Example 1.7 later): we actually obtain the so–called loss (or cancellation) of significant figures.
Furthermore, zero is no longer unique: in fact, there exists at least one number b ≠ 0 such that in floating–point
arithmetic a + b = a; indeed, a + b is equal to a whenever |b| is sufficiently small compared with |a|, that is,
well below the spacing of the floating–point numbers around a.
Example 1.5. For every x ∈ R\{0}, one has $\frac{(1 + x) - 1}{x} \equiv 1$. However, in floating–point arithmetic, one has
\[
fl\!\left( \frac{fl\big( fl(1 + fl(x)) - 1 \big)}{fl(x)} \right) = y,
\]
where y is a real number that is, in general, different from 1. If we were to verify the first identity in
Matlab®, we would obtain a number y ≠ 1 with errors depending on the value of the chosen real number x.
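A short experiment makes the dependence on x visible. The sketch below uses Python in double precision (an assumption of this illustration, in place of Matlab); the sample values of x are arbitrary choices and the function name `identity_check` is hypothetical.

```python
def identity_check(x):
    """Evaluate ((1 + x) - 1) / x in double precision; exact arithmetic gives 1."""
    return ((1.0 + x) - 1.0) / x


for x in (1e-1, 1e-8, 1e-15, 1e-17):
    y = identity_check(x)
    print(f"x = {x:.0e}   y = {y:.16f}   |y - 1| = {abs(y - 1.0):.2e}")
```

For x = 1e-15 the computed y is about 1.11, and for x = 1e-17 it is exactly 0, since 1 + x is rounded to 1.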
P(u; g) = 0, (1.8)
where u ∈ U and g ∈ G, with U and G two suitable sets or spaces; G is the set or space of admissible data.
The error between the physical and mathematical solutions is called model error, say em := uph − u. This
source of error takes into account all those characteristics of the PP that are not represented or captured by
the MP.
Example 1.6. Let us consider as PP the elastic string of Eqs. (1.3) and (1.7), whose mathematical solution u is a
function representing the vertical displacement. Then, the MP can be written as:
\[
P(u; g) = u - x \int_0^1 (1 - s)\, f(s)\, ds + \int_0^x (x - s)\, f(s)\, ds = 0,
\]
where data are g = {(0, 1), 0, 0, f}, representing the domain Ω, the homogeneous Dirichlet data, and the forcing term
f, respectively. Here, $G = \mathbb{R}^2 \times \mathbb{R} \times \mathbb{R} \times C^0([0, 1])$ and $U = C^2([0, 1])$.
Definition 1.2. The MP P(u; g) = 0 is well–posed (stable) if and only if there exists a unique solution
u ∈ U that continuously depends on the data g ∈ G.
G is the set of admissible data, i.e. those for which the MP (1.8) admits a unique solution. Roughly
speaking, “continuous dependence on data” means that “small” perturbations on data g ∈ G lead to “small”
changes on the solution u ∈ U of the MP.
MPs that are well–posed can exhibit “large” variations of the solution u even for “small” variations of
the data values g. A measure of this sensitivity is given by the condition number of the MP.
Definition 1.3. The (relative) condition number of the MP P(u; g) = 0 for data g ∈ G is defined as
\[
K(g) := \sup_{\substack{\delta g \,:\, (g + \delta g) \in G \\ \text{and } \|\delta g\| \ne 0}} \frac{\|\delta u\| / \|u\|}{\|\delta g\| / \|g\|}.
\]
The norm ‖·‖ indicates a measure of data or solutions. By construction, we have K(g) ≥ 1. If K(g) is
“small”, then the MP is said to be well–conditioned; if K(g) is “large”, then the MP is ill–conditioned.
For a well–conditioned MP, the solution (u+δu) obtained with slightly perturbed data (g+δg) does not
differ much from the solution u of the MP with unperturbed data g. For ill–conditioned MPs, the solution
is instead very sensitive to small perturbations of the data. The condition number of a MP is independent
of the numerical method used to solve it. However, even for a simple MP, the notion of condition number
can explain how the propagation of small perturbations can lead to rather large errors in the result.
Example 1.7. Let us consider $g = \{g_1, g_2\}$, where $g_1, g_2 \in \mathbb{R}$, and the MP $P(u; g) = u - (g_1 - g_2) = 0$, that is the
problem of subtracting two real numbers. By indicating $\|g\| = |g_1| + |g_2|$, we have that the condition number of the MP
is:
\[
K(g) = \frac{|g_1| + |g_2|}{|g_1 - g_2|}.
\]
If $g_1$ and $g_2$ have opposite signs, then the MP is well–conditioned, indeed K(g) = 1. However, if $g_1$ and $g_2$ have the
same sign and $g_1 \approx g_2$, then K(g) can be very large, that is the MP is ill–conditioned. For example, for $g_1 = 1/51$
and $g_2 = 1/52$, we have $u = 1/2652 = 3.770739064856699 \cdot 10^{-4}$. By truncating the representations of the
data (perturbed data) as $(g_1 + \delta g_1) = 1.96 \cdot 10^{-2}$ and $(g_2 + \delta g_2) = 1.92 \cdot 10^{-2}$, we obtain the perturbed solution
$(u + \delta u) = 4 \cdot 10^{-4}$, which is significantly different from u. As a matter of fact, in this case, we have $K(g) = 103 \gg 1$.
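The numbers of Example 1.7 can be verified with the following Python sketch (Python is an assumption of this illustration, in place of Matlab). It takes the condition number of the subtraction to be ‖g‖/|g1 − g2| with ‖g‖ = |g1| + |g2|, which equals 1 for data of opposite sign, and compares it with the amplification actually observed for the truncated data.

```python
g1, g2 = 1 / 51, 1 / 52
u = g1 - g2

# Condition number of the subtraction with the norm |g1| + |g2|
K = (abs(g1) + abs(g2)) / abs(g1 - g2)

# Perturbed data: three significant digits, as in the example
g1_p, g2_p = 1.96e-2, 1.92e-2
u_p = g1_p - g2_p

rel_data = (abs(g1_p - g1) + abs(g2_p - g2)) / (abs(g1) + abs(g2))
rel_sol = abs(u_p - u) / abs(u)

print(K)                    # about 103
print(rel_data, rel_sol)    # the relative error grows from the data to the solution
print(rel_sol / rel_data)   # observed amplification, bounded by K
```

A relative perturbation of the data of about 0.1% produces a relative change of about 6% in the solution, an amplification that stays below the bound K(g).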
Ph (uh ; gh ) = 0, (1.9)
¹ If the NP is an iterative method, h is often replaced by the natural number n that refers to the number of iterations; in this case,
we have $h \sim \frac{1}{n}$.
Figure 1.3: Physical (PP), mathematical (MP), and numerical (NP) problems. Corresponding solutions
($u_{ph}$, $u$, $u_h$, and $\hat{u}_h$) and errors (model $e_m = u_{ph} - u$, truncation $e_h = u - u_h$, round–off $e_r = u_h - \hat{u}_h$,
and computational $e_c = e_h + e_r$ errors)
where uh ∈ Uh and gh ∈ Gh , with Uh and Gh two suitable sets or spaces; gh is the representation of the
data in the NP. The error between the mathematical and numerical solutions is called truncation error, say
eh := u − uh , as depicted in Fig. 1.3. This can be regarded as the error that stems from the discretization
of the MP.
Example 1.8. Let us consider the following MP: $P(u; g) = u - \int_0^T f(t)\, dt = 0$, where the data are g = {T, f(t)}.
A possible NP to approximate the integral in the MP is $P_h(u_h; g_h) = u_h - h \sum_{i=0}^{N-1} f(t_i) = 0$, where $t_i = i\, h$ for
i = 0, . . . , N, with $h = \frac{T}{N}$ for some N ∈ N. Here, $g_h = g$. We can also say that the size of the NP is N. The larger
N is, the smaller h is, and the more accurate the NP is.
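The NP of Example 1.8 is straightforward to implement. The Python sketch below (Python is used for illustration instead of Matlab) assumes the test case f(t) = exp(t) and T = 1, chosen here because the exact integral e − 1 is known; the function name `left_rectangle` is hypothetical.

```python
import math


def left_rectangle(f, T, N):
    """Approximate the integral of f over (0, T) by h * sum_{i=0}^{N-1} f(t_i)."""
    h = T / N
    return h * sum(f(i * h) for i in range(N))


T = 1.0
u = math.e - 1.0          # exact integral of exp(t) over (0, 1)
for N in (10, 100, 1000):
    u_h = left_rectangle(math.exp, T, N)
    print(N, T / N, abs(u - u_h))
```

The printed error decreases roughly by a factor of 10 each time N is multiplied by 10, in agreement with the statement that a larger N gives a more accurate NP.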
If the numerical solution is computed by executing the algorithm at the calculator, then the final solution
is indicated by $\hat{u}_h$ and is affected by round–off error, say $e_r := u_h - \hat{u}_h$. Such round-off errors depend on
the machine architecture, on the representation of the numbers at the calculator, and on operations made in
floating–point arithmetic. Both the truncation and round-off errors concur to determine the computational
error, say $e_c := u - \hat{u}_h = e_h + e_r$. For some NPs, we can have $|e_r| \ll |e_h|$, for which $e_c \approx e_h$.
As for the MP, we have to ensure that the NP is well–posed and we have to assess the condition number
of the NP.
Definition 1.4. The NP Ph (uh ; gh ) = 0 of Eq. (1.9) is well–posed (stable) if and only if there exists a
unique solution uh ∈ Uh that continuously depends on the data gh ∈ Gh .
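Definition 1.5. The (relative) condition number of the NP $P_h(u_h; g_h) = 0$ with data $g_h \in G_h$ is defined
as:
\[
K_h(g_h) := \sup_{\substack{\delta g_h \,:\, (g_h + \delta g_h) \in G_h \\ \text{and } \|\delta g_h\| \ne 0}} \frac{\|\delta u_h\| / \|u_h\|}{\|\delta g_h\| / \|g_h\|}.
\]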
Figure 1.4: Graphical estimation of the convergence order p of a NP: computational errors |ec | vs. h
As before, $\delta g_h$ is the perturbation on the data, while $\delta u_h$ is the corresponding perturbation on the solution of
the NP. If $K_h(g_h)$ is small, then the NP (1.9) is well–conditioned. Otherwise, if $K_h(g_h)$ is large, then the
NP is ill–conditioned.
The reason for which we are interested in the condition number is related to the Wilkinson principle.
According to this principle, the result of a numerical operation on the computer – that is in floating–point
arithmetic – is equivalent to the result of the same operation in exact arithmetic carried out on data affected
by a (small) perturbation. This principle therefore provides a tool to quantify the effect of the propagation
of round-off errors in the computational process.
We are interested in NPs whose computational errors tend to zero as the numerical
method “improves”, namely when the discretization parameter h goes to zero. This concept is related to
the accuracy of the NP and it is encoded in the definition of convergence.
Definition 1.6. The NP is convergent when the computational error tends to zero for h tending to zero,
that is:
\[
\lim_{h \to 0} e_c = 0.
\]
A crucial aspect is to qualify the convergence of the NP, that is determining the convergence order of
the NP.
Definition 1.7. If $|e_c| \le C h^p$, with C a positive constant independent of h and p, then the NP is convergent
with order p.
If there exists a constant $\widetilde{C} \le C$ independent of h and p such that $\widetilde{C} h^p \le |e_c| \le C h^p$, then we can
write $|e_c| \simeq C h^p$ and we can estimate the convergence order p of the NP by using the known solution u
of the MP. An approach to estimate p is algebraic. First, we compute the errors $e_{c1}$ and $e_{c2}$ for the NP
corresponding to two different values of h that are “sufficiently” small, say $h_1$ and $h_2$, respectively; then, by
writing $|e_{c1}| \simeq C h_1^p$, $|e_{c2}| \simeq C h_2^p$ and by noticing that $\frac{|e_{c1}|}{|e_{c2}|} = \left( \frac{h_1}{h_2} \right)^p$, we estimate the order p as:
\[
p = \frac{\log\left( |e_{c1}| / |e_{c2}| \right)}{\log\left( h_1 / h_2 \right)}.
\]
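A sketch of the algebraic estimate in Python (used here for illustration instead of Matlab). Taking logarithms of the ratio of the two errors gives p = log(|ec1|/|ec2|) / log(h1/h2), which is what `estimate_order` computes. The errors are generated with a left rectangle rule applied to the integral of exp(t) over (0, 1), a test case assumed here because its exact value e − 1 is known; both function names are hypothetical.

```python
import math


def estimate_order(e1, e2, h1, h2):
    """Estimate p from |e1| ~ C h1^p and |e2| ~ C h2^p."""
    return math.log(abs(e1) / abs(e2)) / math.log(h1 / h2)


def rectangle_error(N):
    """Error of the left rectangle rule for the integral of exp(t) over (0, 1)."""
    h = 1.0 / N
    approx = h * sum(math.exp(i * h) for i in range(N))
    return (math.e - 1.0) - approx, h


e1, h1 = rectangle_error(100)
e2, h2 = rectangle_error(200)
print(estimate_order(e1, e2, h1, h2))   # close to 1 for this first order rule
```

The estimate is reliable only when both values of h are small enough for the error to behave like C h^p.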
An alternative approach is based on the graphical estimate. We represent the errors $|e_c|$ vs. h on a plot in
log–log scale. As $\log |e_c| = \log (C h^p) = \log C + p \log h$, the curve $(h, e_c)$ is a straight line in log–log scale
whose slope is p; that is, p = tan(θ), where θ is the angle between this line and the horizontal axis. Instead of
computing θ, it is possible to verify that the curves $(h, e_c)$ and $(h, h^p)$ are parallel in log–log scale; see Fig. 1.4.
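When several pairs (h, |ec|) are available, the slope in log-log scale can also be obtained numerically with a least squares fit instead of reading it from the plot. The Python sketch below (Python is used for illustration instead of Matlab) works on synthetic errors of the form 3 h², an assumption made only to have a known answer; the function name `loglog_slope` is hypothetical.

```python
import math


def loglog_slope(hs, errors):
    """Least squares slope of log|e| versus log h."""
    xs = [math.log(h) for h in hs]
    ys = [math.log(abs(e)) for e in errors]
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


hs = [0.1 / 2 ** k for k in range(5)]
errors = [3.0 * h ** 2 for h in hs]      # synthetic errors with C = 3 and p = 2
print(loglog_slope(hs, errors))          # approximately 2
```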
We remark that a well–posed NP is not necessarily convergent. To ensure convergence of the NP, this
is required to satisfy the consistency property: roughly speaking, the NP must be a “faithful copy” of the
original MP.
Definition 1.8. The NP (1.9) is consistent if and only if $\lim_{h \to 0} P_h(u; g) = P(u; g) = 0$, with $g \in G_h$.
The NP (1.9) is strongly consistent if and only if $P_h(u; g) \equiv P(u; g) = 0$ for all h > 0, with $g \in G_h$.
It is clear that the NP must be well–posed (possibly, well-conditioned), consistent, and convergent. The
concepts of well–posedness (stability) and convergence are indeed strongly connected and encoded in the
following theoretical result.
It follows that, if the NP is well–posed and consistent, then the NP is also convergent. Equivalently,
if the NP is consistent and convergent, then the NP is also well–posed. The equivalence theorem is very
useful as it allows us to verify only two of the properties of a NP to obtain the third. In general, it is easier
to show the consistency of a NP, while it is harder to show well–posedness and/or convergence.
The choice of a numerical method (NP) to approximate the solution u of a MP must take into account:
the mathematical properties of the MP; the computational efficiency in terms of the expected convergence
order of the error; the flops involved in the calculation; the performance of the CPU installed on the
computer; and the access modes and the availability of memory.
Let us indicate with N the size of the NP. Then, the number of floating–point operations required to
calculate the numerical solution uh depends on the size N of the NP. Different kinds of dependence are
depicted in the following table.
cost     O(1)           O(N)      O(N^γ)        O(γ^N)         O(N!)
flops    independent    linear    polynomial    exponential    factorial
The following example highlights the role of selecting efficient numerical methods (NP) in relation
with available computing resources.
Example 1.9. If the matrix $A \in \mathbb{R}^{N \times N}$ is non–singular, the solution $x \in \mathbb{R}^N$ of the linear system Ax = b (MP) can
be obtained by applying the Cramer rule:
\[
x_i = \frac{\det(B_i)}{\det(A)} \quad \text{for } i = 1, \ldots, N,
\]
where $B_i \in \mathbb{R}^{N \times N}$ is the matrix obtained from A by replacing its i-th column by the vector $b \in \mathbb{R}^N$ as:
\[
B_i = \begin{bmatrix}
a_{11} & \cdots & b_1 & \cdots & a_{1N} \\
a_{21} & \cdots & b_2 & \cdots & a_{2N} \\
\vdots & & \vdots & & \vdots \\
a_{N1} & \cdots & b_N & \cdots & a_{NN}
\end{bmatrix},
\]
with the entries $b_1, \ldots, b_N$ placed in the i-th column.
However, the solution of the linear system with this algorithm requires $O(e\,(N + 1)!)$ floating–point operations. If
N = 100, then about $101!\, e \approx 2.56 \cdot 10^{160}$ operations are required. A calculator able to perform $10^{12}$ floating–
point operations per second (flops) – actually a supercomputer with a computing capability of 1 TeraFlop – yields the
numerical solution in $\frac{2.56 \cdot 10^{160}}{10^{12}} = 2.56 \cdot 10^{148}$ seconds, that is in about $8.11 \cdot 10^{140}$ years! More efficient numerical
methods are available to solve the linear system, for example the LU factorization method, which calls for $O\left( \frac{2}{3} N^3 \right)$
flops. If N = 100, then the same supercomputer will provide the solution in only about $10^{-6}$ seconds. A standard laptop
will achieve the result in less than a second.
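The orders of magnitude quoted in Example 1.9 can be reproduced with a few lines of Python (used here for illustration instead of Matlab). The sketch takes the operation counts e (N + 1)! and (2/3) N³ as given in the example and assumes a year of 365 days.

```python
import math

N = 100
flops_per_second = 1e12                      # 1 TeraFlop machine, as in the example

cramer_flops = math.e * math.factorial(N + 1)
cramer_seconds = cramer_flops / flops_per_second
cramer_years = cramer_seconds / (365 * 24 * 3600)

lu_flops = 2 * N ** 3 / 3
lu_seconds = lu_flops / flops_per_second

print(f"Cramer: {cramer_flops:.2e} flops, {cramer_years:.2e} years")
print(f"LU:     {lu_flops:.2e} flops, {lu_seconds:.2e} seconds")
```

The comparison illustrates that the choice of the algorithm matters far more than the speed of the machine.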
We consider the numerical solution of linear systems by means of direct and iterative methods, specifically
in the case of linear systems with real–valued matrix and vectors.
Definition 2.1. The matrix $A \in \mathbb{R}^{n \times n}$ with n ≥ 1 is non-singular if and only if det(A) ≠ 0.
Proposition 2.1. If A ∈ Rn×n is non-singular, then there exists a unique solution x ∈ Rn to the linear
system (2.1).
With reference to the linear system (2.1), we use the following notation to represent the elements of the
matrix A ∈ Rn×n , that is, (A)ij = aij for i, j = 1, . . . , n, and the vectors x ∈ Rn and b ∈ Rn , that is,
respectively (x)i = xi and (b)i = bi for i = 1, . . . , n. Furthermore, we will use the following notation to
express the linear system in terms of the elements of A, b, and x:
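\[
\begin{cases}
a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = b_1 \\
a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n = b_2 \\
\qquad \vdots \\
a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nn} x_n = b_n
\end{cases}
\quad \text{or} \quad
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}.
\]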
We observe that linear systems can be directly interpreted as mathematical problems that model physical
problems. However, in many cases, linear systems are obtained as numerical problems associated with
the discretization of mathematical problems; some of the examples shown in the previous chapter follow this
direction. This is, for example, the case for differential problems such as PDEs or ODEs; in such cases,
the approximation of the mathematical problem often leads to the numerical solution of linear systems. An
example is the Finite Element method for the spatial approximation of PDEs. In these cases, the larger
the dimension n of the linear system, the more accurate the approximation of the mathematical problem
that generated the linear system; for such problems, it is very common to solve linear systems of size
n of the order of $10^5$, $10^6$, or even $10^7$.
Numerical Solution of Linear Systems
Example 2.1. We determine the flows in a hydraulic circuit through the solution of a linear system.
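We obtain the following linear system, whose solution provides the distribution of flows $\{q_j\}_{j=1}^{7}$ in the circuit:
\[
\begin{bmatrix}
R_1 & R_2 & 0 & 0 & R_5 & 0 & R_7 \\
0 & R_2 & -R_3 & R_4 & 0 & 0 & 0 \\
0 & 0 & 0 & R_4 & -R_5 & R_6 & 0 \\
1 & -1 & -1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & -1 & -1 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & -1 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & -1
\end{bmatrix}
\begin{bmatrix} q_1 \\ q_2 \\ q_3 \\ q_4 \\ q_5 \\ q_6 \\ q_7 \end{bmatrix}
=
\begin{bmatrix} \Delta p \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.
\]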
We observe that if the matrix $A \in \mathbb{R}^{n \times n}$ is dense, at least $n^2$ operations would theoretically be required
to solve the linear system. Even though this assumption is very optimistic, it is appropriate to consider and
develop methods where the number of operations is as close as possible to this ideal number. Different
considerations can be made for sparse matrices, i.e., matrices $A \in \mathbb{R}^{n \times n}$ where the number of non-zero
elements is $O(n) \ll n^2$.
Remark 2.1. The solution of the linear system (2.1) as $x = A^{-1} b$, i.e., by explicitly computing and
assembling the inverse matrix of A, is a computationally inefficient and inaccurate procedure that should
be avoided even for relatively small matrices.
Numerical methods for solving linear systems can be classified into direct and iterative methods.
Definition 2.2. With a direct method, the solution x of the linear system (2.1) is obtained in a finite number
of operations, known a priori. In contrast, with an iterative method, the solution x is obtained, in principle,
in an infinite number of steps.
The choice of a direct or iterative method for solving the linear system (2.1) depends on multiple factors,
such as the properties, size, and sparsity of the matrix A, as well as the available computational resources
(CPU and memory).
Diagonal matrix
Let us consider the case of a diagonal matrix $D \in \mathbb{R}^{n \times n}$, that is, $(D)_{ii} = d_{ii}$ for i = 1, . . . , n and
$(D)_{ij} = 0$ for i, j = 1, . . . , n, with j ≠ i. In this case, the diagonal matrix D and the associated linear
system D x = b are given by:
\[
D = \begin{bmatrix}
d_{11} & 0 & \cdots & 0 \\
0 & d_{22} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & d_{nn}
\end{bmatrix}
\quad \text{and} \quad
\begin{cases}
d_{11} x_1 = b_1 \\
d_{22} x_2 = b_2 \\
\qquad \vdots \\
d_{nn} x_n = b_n.
\end{cases}
\]
Setting A = D in Eq. (2.1), the solution $x \in \mathbb{R}^n$ of the diagonal system D x = b is given by:
\[
x_i = \frac{b_i}{d_{ii}} \quad \text{for } i = 1, \ldots, n,
\]
Definition 2.3. $L \in \mathbb{R}^{n \times n}$ is a lower triangular matrix if and only if its elements satisfy $(L)_{ij} = l_{ij} \in \mathbb{R}$
for i = 1, . . . , n, j = 1, . . . , i and $(L)_{ij} = 0$ for i = 1, . . . , n − 1, j = i + 1, . . . , n; the lower triangular
matrix L is given by:
\[
L = \begin{bmatrix}
l_{11} & 0 & \cdots & & 0 \\
l_{21} & l_{22} & 0 & \cdots & 0 \\
l_{31} & l_{32} & l_{33} & \ddots & \vdots \\
\vdots & \vdots & & \ddots & 0 \\
l_{n1} & l_{n2} & l_{n3} & \cdots & l_{nn}
\end{bmatrix}.
\]
Given a lower triangular matrix L ∈ Rn×n , let us consider the solution of the lower triangular system:
\[
L x = b, \quad \text{that is} \quad
\begin{cases}
l_{11} x_1 = b_1 \\
l_{21} x_1 + l_{22} x_2 = b_2 \\
l_{31} x_1 + l_{32} x_2 + l_{33} x_3 = b_3 \\
\qquad \vdots \\
l_{n1} x_1 + l_{n2} x_2 + l_{n3} x_3 + \cdots + l_{nn} x_n = b_n,
\end{cases}
\]
where, referring to Eq. (2.1), we set A = L. The lower triangular system L x = b can be solved using the
forward substitution algorithm, that is:
b1
x1 = ,
l11
i−1 (2.2)
1 X
xi = bi − lij xj for i = 2, . . . , n.
lii j=1
The forward substitution algorithm solves the lower triangular system $L x = b$ in $n^2$ operations, where $n$
is the dimension of the matrix $L$; in fact, the algorithm performs $n$ divisions, $\sum_{i=2}^{n} (i-1)$ subtractions, and
$\sum_{i=2}^{n} (i-1)$ multiplications, thus bringing the total number of operations to $n + 2 \sum_{i=2}^{n} (i-1) = n + n(n-1) = n^2$.
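The forward substitution formula (2.2) can be sketched in a few lines of Python. This is a minimal illustration, not the notes' reference implementation: the function name `forward_substitution`, the list-of-rows storage, and the test data are assumptions made here.

```python
def forward_substitution(L, b):
    """Solve L x = b for a lower triangular L (list of rows) with non-zero diagonal."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        # Sum over the unknowns x_0, ..., x_{i-1} that are already computed.
        s = sum(L[i][j] * x[j] for j in range(i))
        x[i] = (b[i] - s) / L[i][i]
    return x


# Hypothetical test data, built so that the exact solution is (1, 2, 2).
L = [[2.0, 0.0, 0.0],
     [1.0, 3.0, 0.0],
     [4.0, -1.0, 5.0]]
b = [2.0, 7.0, 12.0]
x = forward_substitution(L, b)
print(x)
```

Row $i$ costs one division plus $i-1$ multiplications and $i-1$ subtractions, which matches the $n^2$ count above.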
Remark 2.3. Since $L$ is a triangular matrix, we have $\det(L) = \prod_{i=1}^{n} l_{ii}$; therefore, $\det(L) \neq 0$ if and only if
$l_{ii} \neq 0$ for each $i = 1, \ldots, n$.
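Definition 2.4. $U \in \mathbb{R}^{n \times n}$ is an upper triangular matrix if and only if its elements satisfy $(U)_{ij} = u_{ij} \in \mathbb{R}$
for $i = 1, \ldots, n$, $j = i, \ldots, n$ and $(U)_{ij} = 0$ for $i = 2, \ldots, n$, $j = 1, \ldots, i-1$; the upper triangular
matrix $U$ is given by:
\[
U = \begin{bmatrix} u_{11} & u_{12} & u_{13} & \cdots & u_{1n} \\ 0 & u_{22} & u_{23} & \cdots & u_{2n} \\ 0 & 0 & u_{33} & \cdots & u_{3n} \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & \cdots & 0 & u_{nn} \end{bmatrix}. \tag{2.3}
\]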
Given an upper triangular matrix $U \in \mathbb{R}^{n \times n}$, let us consider the solution of the upper triangular system:
\[
U x = b, \quad \text{that is} \quad
\begin{cases}
u_{11} x_1 + u_{12} x_2 + u_{13} x_3 + \cdots + u_{1n} x_n = b_1 \\
u_{22} x_2 + u_{23} x_3 + \cdots + u_{2n} x_n = b_2 \\
u_{33} x_3 + \cdots + u_{3n} x_n = b_3 \\
\quad \vdots \\
u_{nn} x_n = b_n,
\end{cases}
\]
where, referring to Eq. (2.1), we set $A = U$. The upper triangular system $U x = b$ can be solved using the
backward substitution algorithm, that is:
\[
x_n = \frac{b_n}{u_{nn}}, \qquad
x_i = \frac{1}{u_{ii}} \left( b_i - \sum_{j=i+1}^{n} u_{ij} x_j \right) \quad \text{for } i = n-1, \ldots, 1. \tag{2.4}
\]
The backward substitution algorithm solves the upper triangular system $U x = b$ in $n^2$ operations, analogous
to the forward substitution algorithm for lower triangular systems.
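Backward substitution (2.4) mirrors the forward algorithm, with the loop running from the last row to the first. The sketch below is illustrative only; the function name `backward_substitution` and the test data are assumptions.

```python
def backward_substitution(U, b):
    """Solve U x = b for an upper triangular U (list of rows) with non-zero diagonal."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        # Sum over the unknowns x_{i+1}, ..., x_{n-1} that are already computed.
        s = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / U[i][i]
    return x


# Hypothetical test data, built so that the exact solution is (0, 3, 2).
U = [[2.0, 1.0, -1.0],
     [0.0, 3.0, 2.0],
     [0.0, 0.0, 4.0]]
b = [1.0, 13.0, 8.0]
x = backward_substitution(U, b)
print(x)
```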
Remark 2.4. Since $U$ is a triangular matrix, $\det(U) = \prod_{i=1}^{n} u_{ii}$, so $\det(U) \neq 0$ if and only if $u_{ii} \neq 0$ for
each $i = 1, \ldots, n$.
If the LU factorization of the matrix A exists, then the linear system A x = b can be solved as a
sequential solution of the following systems, the first being lower triangular and the second being upper
triangular:
Ly = b and U x = y.
In fact, since A = L U , we have L U x = b, from which, introducing the auxiliary vector y = (U x) ∈ Rn ,
we obtain the previous result.
Definition 2.5. The LU factorization method for solving the linear system A x = b consists of:
1. determining, if it exists, the LU factorization of the matrix A (A = L U );
2. solving the lower triangular system L y = b using the forward substitution algorithm (2.2);
3. solving the upper triangular system U x = y using the backward substitution algorithm (2.4).
Since the LU factorization method is based on the LU factorization of A, it is necessary to determine the
matrices L and U of A, if they exist.
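Example 2.2. Let us illustrate the LU factorization of the matrix $A \in \mathbb{R}^{n \times n}$ with $n = 2$, which is given by:
\[
\underbrace{\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}}_{A}
=
\underbrace{\begin{bmatrix} l_{11} & 0 \\ l_{21} & l_{22} \end{bmatrix}}_{L}
\underbrace{\begin{bmatrix} u_{11} & u_{12} \\ 0 & u_{22} \end{bmatrix}}_{U}
\quad \text{or} \quad
\begin{cases}
l_{11} u_{11} = a_{11} \\
l_{11} u_{12} = a_{12} \\
l_{21} u_{11} = a_{21} \\
l_{21} u_{12} + l_{22} u_{22} = a_{22}.
\end{cases}
\]
We observe that the matrices $L$ and $U$ involve a total of 6 unknowns: $l_{11}$, $l_{21}$, $l_{22}$, $u_{11}$, $u_{12}$, and $u_{22}$. On the other
hand, only 4 constraints are available for their determination.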
Remark 2.5. From the previous example, for the LU factorization of a generic matrix $A \in \mathbb{R}^{n \times n}$, there
are $n^2 + n$ unknowns in the matrices $L$ and $U$, but only $n^2$ constraints to impose; indeed, we have
$a_{ij} = \sum_{r=1}^{\min\{i,j\}} l_{ir} u_{rj}$ for $i, j = 1, \ldots, n$. To overcome this issue, by convention, the diagonal elements of the
lower triangular matrix $L$ obtained through LU factorization of the matrix $A \in \mathbb{R}^{n \times n}$ are set to 1; that is,
$l_{ii} = 1$ for every $i = 1, \ldots, n$:
\[
L = \begin{bmatrix} 1 & 0 & \cdots & \cdots & 0 \\ l_{21} & 1 & 0 & \cdots & 0 \\ l_{31} & l_{32} & 1 & \ddots & \vdots \\ \vdots & \vdots & \ddots & \ddots & 0 \\ l_{n1} & l_{n2} & \cdots & l_{n,n-1} & 1 \end{bmatrix}. \tag{2.5}
\]
or, equivalently:
\[
\left( A^{(k)} \right)_{ij} =
\begin{cases}
a_{ij}^{(i)} & \text{for } i = 1, \ldots, k-1,\ j = i, \ldots, n \\
a_{ij}^{(k)} & \text{for } i, j = k, \ldots, n \\
0 & \text{otherwise}
\end{cases}
\quad \text{for } k = 1, \ldots, n. \tag{2.7}
\]
We set $A^{(1)} \equiv A$, that is, $a_{ij}^{(1)} = a_{ij}$ for all $i, j = 1, \ldots, n$.
Definition 2.6. Given an index $k$, with $1 \le k \le n-1$, with reference to the corresponding matrix $A^{(k)}$ in
Eq. (2.6), its element $a_{kk}^{(k)}$ is called the pivotal element (or pivot).
The following GEM algorithm finds the elements of the matrix $L \in \mathbb{R}^{n \times n}$ in Eq. (2.5) and $U \in \mathbb{R}^{n \times n}$
in Eq. (2.3), determining the LU factorization of $A \in \mathbb{R}^{n \times n}$; the matrix $U$ coincides with $A^{(n)}$ obtained at
the end of the GEM ($U = A^{(n)}$).
We provide a sketch of the GEM algorithm, showing how the transformation from the matrix $A^{(k)}$ to
the matrix $A^{(k+1)}$ is performed at the generic step $k$.
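Example 2.3. We form the LU factorization of the matrix $A = \begin{bmatrix} 5 & 2 & -1 \\ 1 & 4 & 2 \\ -3 & -1 & 7 \end{bmatrix}$ using the GEM. We start by
setting $A^{(1)} = A$ and noting that $n = 3$; thus, we obtain the LU factorization by following the GEM algorithm 2.1.
• $k = 1$: $a_{11}^{(1)} = 5$ (pivot),
    – $i = k + 1 = 2$: $l_{21} = \frac{a_{21}^{(1)}}{a_{11}^{(1)}} = \frac{1}{5}$,
        ∗ $j = k + 1 = 2$: $a_{22}^{(2)} = a_{22}^{(1)} - l_{21}\, a_{12}^{(1)} = 4 - \frac{1}{5} \cdot 2 = \frac{18}{5}$,
        ∗ $j = n = 3$: $a_{23}^{(2)} = a_{23}^{(1)} - l_{21}\, a_{13}^{(1)} = 2 - \frac{1}{5} \cdot (-1) = \frac{11}{5}$;
    – $i = n = 3$: $l_{31} = \frac{a_{31}^{(1)}}{a_{11}^{(1)}} = \frac{-3}{5} = -\frac{3}{5}$,
        ∗ $j = k + 1 = 2$: $a_{32}^{(2)} = a_{32}^{(1)} - l_{31}\, a_{12}^{(1)} = (-1) - \left( -\frac{3}{5} \right) \cdot 2 = \frac{1}{5}$,
        ∗ $j = n = 3$: $a_{33}^{(2)} = a_{33}^{(1)} - l_{31}\, a_{13}^{(1)} = 7 - \left( -\frac{3}{5} \right) \cdot (-1) = \frac{32}{5}$.
\[
L = \begin{bmatrix} 1 & 0 & 0 \\ \frac{1}{5} & 1 & 0 \\ -\frac{3}{5} & ? & 1 \end{bmatrix},
\qquad
A^{(2)} = \begin{bmatrix} 5 & 2 & -1 \\ 0 & \frac{18}{5} & \frac{11}{5} \\ 0 & \frac{1}{5} & \frac{32}{5} \end{bmatrix}.
\]
• $k = 2$: $a_{22}^{(2)} = \frac{18}{5}$ (pivot),
    – $i = k + 1 = n = 3$: $l_{32} = \frac{a_{32}^{(2)}}{a_{22}^{(2)}} = \frac{1/5}{18/5} = \frac{1}{18}$,
        ∗ $j = k + 1 = n = 3$: $a_{33}^{(3)} = a_{33}^{(2)} - l_{32}\, a_{23}^{(2)} = \frac{32}{5} - \frac{1}{18} \cdot \frac{11}{5} = \frac{113}{18}$.
\[
L = \begin{bmatrix} 1 & 0 & 0 \\ \frac{1}{5} & 1 & 0 \\ -\frac{3}{5} & \frac{1}{18} & 1 \end{bmatrix},
\qquad
U = A^{(3)} = \begin{bmatrix} 5 & 2 & -1 \\ 0 & \frac{18}{5} & \frac{11}{5} \\ 0 & 0 & \frac{113}{18} \end{bmatrix}.
\]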
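The elimination steps of Example 2.3 can be reproduced with a short Python sketch of the GEM without pivoting. Exact rational arithmetic (`fractions.Fraction`) is used so that the entries can be compared directly with the fractions above; the function name `lu_gem` and the choice to overwrite a copy of $A$ are assumptions of this illustration.

```python
from fractions import Fraction


def lu_gem(A):
    """LU factorization by Gaussian elimination without pivoting (unit diagonal in L)."""
    n = len(A)
    U = [[Fraction(v) for v in row] for row in A]  # working copy, becomes A^(n)
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        if U[k][k] == 0:
            raise ZeroDivisionError("zero pivot at step %d" % (k + 1))
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]           # multiplier l_ik
            for j in range(k + 1, n):
                U[i][j] -= L[i][k] * U[k][j]      # update of row i
            U[i][k] = Fraction(0)                 # entry below the pivot is eliminated
    return L, U


A = [[5, 2, -1], [1, 4, 2], [-3, -1, 7]]
L, U = lu_gem(A)
for row in L:
    print([str(v) for v in row])
for row in U:
    print([str(v) for v in row])
```

A zero pivot raises an exception here; handling that case requires the pivoting technique.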
Properties of LU factorization
The GEM provides the LU factorization of the matrix $A \in \mathbb{R}^{n \times n}$ required to solve the linear system
$A x = b$ using the LU factorization method of Definition 2.5. The number of operations associated with
the LU factorization method is $O\!\left( \frac{2n^3}{3} \right)$; in fact, $O\!\left( \frac{2n^3}{3} \right)$ operations are required by the GEM, while
$n^2$ operations are needed for each of the forward and backward substitution algorithms.
Remark 2.7. The LU factorization of the matrix A ∈ Rn×n is independent of the vector b ∈ Rn associated
with the linear system A x = b. The LU factorization method can therefore be efficiently used to solve the
linear system for different vectors b since L and U can be assembled only once. The computational costs
associated with solving the linear system for each new vector b are determined solely by solving the lower
triangular system L y = b and the upper triangular system U x = y, using the forward and backward
substitution algorithms, respectively.
We determine the cases in which the LU factorization of a non–singular matrix A exists and is unique.
To this end, we recall the following definition.
Definition 2.7. The principal submatrix of A ∈ Rn×n of order i, with 1 ≤ i ≤ n, is the matrix Ai ∈ Ri×i
such that (Ai )lm = (A)lm for every l, m = 1, . . . , i.
The following proposition expresses the necessary and sufficient condition for the existence and uniqueness
of the LU factorization of a non–singular matrix A.
Proposition 2.2 (Necessary and sufficient condition for LU factorization). Given a non–singular matrix
$A \in \mathbb{R}^{n \times n}$, its LU factorization exists and is unique if and only if $\det(A_i) \neq 0$ for every $i = 1, \ldots, n-1$
(i.e., all principal submatrices of $A$ of order $i$, with $1 \le i \le n-1$, are non–singular).
Example 2.4. The LU factorization of the non–singular matrix $A = \begin{bmatrix} 1 & 1 & 4 \\ 2 & 2 & 3 \\ 4 & 6 & 7 \end{bmatrix}$ obtained using the GEM does
not exist. In fact, from Proposition 2.2, we have $\det(A_1) = \det([1]) \neq 0$, but $\det(A_2) = \det \begin{bmatrix} 1 & 1 \\ 2 & 2 \end{bmatrix} = 0$; in
this specific case, the pivotal element $a_{22}^{(2)} = 0$ is found during the application of the GEM algorithm.
It is not always necessary to verify the condition of Proposition 2.2 to establish the existence and
uniqueness of the LU factorization of $A$; it is often enough to check conditions that are sufficient but not necessary.
To this end, we recall the following definitions.
• strictly diagonally dominant by columns if and only if $|a_{ii}| > \sum_{j=1,\, j \neq i}^{n} |a_{ji}|$ for every $i = 1, \ldots, n$.
The following proposition expresses a series of conditions that are only sufficient to guarantee the existence
and uniqueness of the LU factorization of a non–singular matrix A.
Proposition 2.3 (Sufficient conditions for LU factorization). Given the matrix A ∈ Rn×n , if one of the
following conditions is satisfied:
• A is symmetric and positive definite,
• or A is strictly diagonally dominant by rows,
Remark 2.8. The application of the GEM with the pivoting technique (row pivoting) ensures the existence
and uniqueness of the LU factorization for any non–singular matrix A ∈ Rn×n .
Consider the specific case of pivoting technique with row permutation of the matrix A. This row
permutation of the matrix A ∈ Rn×n consists of pre–multiplying it by a permutation matrix P ∈ Rn×n ,
that is, $P A$. The permutation matrix $P$ is orthogonal, meaning $P^T = P^{-1}$ ($P^T P = I$); if $P = I$, then no
permutations are applied to the matrix A. In general, the permutation matrix P is obtained simultaneously
with the assembly of the matrices L and U during the use of the GEM with the pivoting technique (for
rows) applied to the non–singular matrix A.
Example 2.7. The non–singular matrix $A$ from Example 2.4 does not admit the LU factorization with the standard
GEM (without pivoting) since the pivot element $a_{22}^{(2)} = 0$. By introducing the row permutation matrix
$P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}$, and applying it to $A$, we obtain the matrix $\tilde{A} = P A = \begin{bmatrix} 1 & 1 & 4 \\ 4 & 6 & 7 \\ 2 & 2 & 3 \end{bmatrix}$, where the second and third
rows have been permuted. Applying the standard GEM to the permuted matrix $\tilde{A}$, we obtain the LU factorization with
$L = \begin{bmatrix} 1 & 0 & 0 \\ 4 & 1 & 0 \\ 2 & 0 & 1 \end{bmatrix}$ and $U = \begin{bmatrix} 1 & 1 & 4 \\ 0 & 2 & -9 \\ 0 & 0 & -5 \end{bmatrix}$, where $\tilde{A} = P A = L U$; in this case, the new pivot elements are
$\tilde{a}_{11}^{(1)} = 1 \neq 0$ and $\tilde{a}_{22}^{(2)} = 2 \neq 0$.
In general, the pivoting technique is applied in conjunction with the GEM even if the pivot elements
are not necessarily zero. In fact, the pivoting technique can also be used to reduce and contain the prop-
agation of rounding errors associated with the application of GEM. Specifically, at the generic iteration
k = 1, . . . , n − 1 of the GEM, row k is permuted with row l, where
\[
l = \arg\max_{i = k, \ldots, n} \left| a_{ik}^{(k)} \right|,
\]
with $\tilde{a}_{kk}^{(k)} = a_{lk}^{(k)}$ being the new pivot element for the $k$–th iteration.
Algorithm 2.2: Gaussian Elimination Method (GEM) with pivoting (row pivoting)
set $A^{(1)} = A$ and $P = I$;
for $k = 1, \ldots, n-1$ do
    find $\bar{r}$ such that $|a_{\bar{r}k}^{(k)}| = \max_{r = k, \ldots, n} |a_{rk}^{(k)}|$ and swap row $k$ with row $\bar{r}$ in both $A^{(k)}$ and $P$;
    for $i = k+1, \ldots, n$ do
        $l_{ik} = a_{ik}^{(k)} / a_{kk}^{(k)}$;
        for $j = k+1, \ldots, n$ do
            $a_{ij}^{(k+1)} = a_{ij}^{(k)} - l_{ik}\, a_{kj}^{(k)}$;
        end
    end
    assign $A^{(k+1)}$ as in Eq. (2.7);
end
assign $L$ as in Eq. (2.5) and set $U = A^{(n)}$;
If the row pivoting technique is applied to determine the LU factorization of the nonsingular matrix A,
specifically using the row permutation matrix P , then the matrices L and U provide the LU factorization
of the permuted matrix P A as:
P A = L U.
It follows that the linear system A x = b can be solved by sequentially solving the following lower and
upper triangular systems:
\[
L y = P b \quad \text{and} \quad U x = y.
\]
Indeed, we have $P A x = P b$ and, since $P A = L U$, also $L U x = P b$; introducing the vector $y = U x$, we obtain the
previous result.
Definition 2.11. The LU factorization method with row pivoting for solving the linear system A x = b,
which is based on row permutation with the matrix P , consists of:
1. determining the LU factorization of the matrix P A (P A = L U );
2. solving the lower triangular linear system L y = P b using the forward substitution algorithm (2.2);
3. solving the upper triangular system U x = y using the backward substitution algorithm (2.4).
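The three steps of Definition 2.11 can be sketched as follows. This is an illustration under stated assumptions: the permutation is stored as an index list `perm` instead of a matrix $P$, the pivot row is chosen by largest magnitude, and the names `lu_pivot` and `solve_lu_pivot` are hypothetical. The test matrix is the one from Example 2.4; since the pivot is chosen by magnitude, the permutation found here differs from the one used in Example 2.7.

```python
def lu_pivot(A):
    """Return (L, U, perm) such that P A = L U, where row i of P A is row perm[i] of A."""
    n = len(A)
    U = [[float(v) for v in row] for row in A]
    L = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for k in range(n - 1):
        # Row pivoting: pick the entry of largest magnitude in column k, rows k..n-1.
        r = max(range(k, n), key=lambda i: abs(U[i][k]))
        if U[r][k] == 0.0:
            raise ValueError("matrix is singular")
        U[k], U[r] = U[r], U[k]
        L[k], L[r] = L[r], L[k]            # keep already computed multipliers aligned
        perm[k], perm[r] = perm[r], perm[k]
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k + 1, n):
                U[i][j] -= L[i][k] * U[k][j]
            U[i][k] = 0.0
    for i in range(n):
        L[i][i] = 1.0
    return L, U, perm


def solve_lu_pivot(A, b):
    """Solve A x = b via P A = L U: forward substitution on L y = P b, then U x = y."""
    L, U, perm = lu_pivot(A)
    n = len(b)
    pb = [b[p] for p in perm]
    y = [0.0] * n
    for i in range(n):
        y[i] = pb[i] - sum(L[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x


A = [[1.0, 1.0, 4.0], [2.0, 2.0, 3.0], [4.0, 6.0, 7.0]]
b = [6.0, 7.0, 17.0]  # chosen so that the exact solution is (1, 1, 1)
x = solve_lu_pivot(A, b)
print(x)
```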
Definition 2.12. Let $A \in \mathbb{R}^{n \times n}$ be symmetric and positive definite. Then, its Cholesky factorization
consists of determining an upper triangular matrix $R \in \mathbb{R}^{n \times n}$ such that:
\[
A = R^T R,
\]
which, for $A \in \mathbb{R}^{n \times n}$ symmetric and positive definite, is determined via the Cholesky algorithm.
The Cholesky algorithm requires $O(n^3/3)$ operations to determine the upper triangular matrix $R$, approximately
half the number of flops associated with LU factorization; additionally, the memory required is also lower.
If A is symmetric and positive definite, the Cholesky factorization exists (A = RT R) and the solution
of the linear system A x = b can be obtained sequentially as the solution of the following lower and upper
triangular systems:
RT y = b and R x = y,
where RT is a lower triangular matrix; indeed, since A = RT R, we have RT R x = b, from which the
previous result follows by introducing the vector y = R x ∈ Rn .
Definition 2.13. The Cholesky factorization method for solving the linear system Ax = b, with A sym-
metric and positive definite, consists of:
1. determining the Cholesky factorization of the matrix A (A = RT R);
2. solving the lower triangular system RT y = b using the forward substitution algorithm (2.2);
3. solving the upper triangular system R x = y using the backward substitution algorithm (2.4).
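As an illustration of step 1 of Definition 2.13, the sketch below computes $R$ row by row from the identity $A = R^T R$. It is one common ordering of the computations and is not necessarily the variant presented in these notes; the function name `cholesky_upper` and the test matrix are assumptions.

```python
import math


def cholesky_upper(A):
    """Return the upper triangular R with A = R^T R, for A symmetric positive definite."""
    n = len(A)
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Diagonal entry: a_ii = sum_k r_ki^2, solved for r_ii.
        s = A[i][i] - sum(R[k][i] ** 2 for k in range(i))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        R[i][i] = math.sqrt(s)
        # Remaining entries of row i: a_ij = sum_k r_ki r_kj, solved for r_ij.
        for j in range(i + 1, n):
            R[i][j] = (A[i][j] - sum(R[k][i] * R[k][j] for k in range(i))) / R[i][i]
    return R


# Hypothetical symmetric positive definite test matrix.
A = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
R = cholesky_upper(A)
for row in R:
    print(row)
```

Only the upper triangle of $A$ is read, which reflects the reduced memory requirement mentioned above.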
\[
\alpha_1 = a_1, \qquad
\beta_i = \frac{e_i}{\alpha_{i-1}} \ \text{ and } \ \alpha_i = a_i - \beta_i\, c_{i-1} \quad \text{for } i = 2, \ldots, n. \tag{2.8}
\]
Now consider the linear system $A x = b$, with $A \in \mathbb{R}^{n \times n}$ being the former tridiagonal matrix, which we
solve using the LU factorization method. The LU factorization of $A$ is performed using Eq. (2.8); in this
way, the linear system $L y = b$ is solved using the following forward substitution algorithm adapted to the
lower bidiagonal matrix $L$:
\[
y_1 = b_1, \qquad
y_i = b_i - \beta_i\, y_{i-1} \quad \text{for } i = 2, \ldots, n. \tag{2.9}
\]
Finally, the linear system $U x = y$ is solved using the following backward substitution algorithm adapted
to the upper bidiagonal matrix $U$:
\[
x_n = \frac{y_n}{\alpha_n}, \qquad
x_i = \frac{y_i - c_i\, x_{i+1}}{\alpha_i} \quad \text{for } i = n-1, \ldots, 1. \tag{2.10}
\]
Definition 2.14. The Thomas algorithm for solving the linear system A x = b, with A being a non–
singular tridiagonal matrix that admits a unique LU factorization without pivoting, consists of:
1. determining the LU factorization of the matrix A (A = L U ) using the algorithm from Eq. (2.8);
2. solving the lower bidiagonal system L y = b with the algorithm from Eq. (2.9);
3. solving the upper bidiagonal system U x = y with the algorithm from Eq. (2.10).
The Thomas algorithm requires only O(8n) operations (specifically 8n − 7) to solve the linear system
associated with the tridiagonal matrix A ∈ Rn×n .
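The three stages (2.8), (2.9), and (2.10) fit in a single short routine. The sketch below assumes that the diagonals are stored as Python lists with 0-based indices: `a` (main diagonal, length $n$), `c` (superdiagonal, length $n-1$), and `e` (subdiagonal, length $n-1$, so that `e[i-1]` is the entry of row `i` to the left of the diagonal). This storage convention and the function name `thomas` are assumptions of the illustration.

```python
def thomas(a, c, e, b):
    """Solve a tridiagonal system with main diagonal a, superdiagonal c, subdiagonal e."""
    n = len(a)
    alpha = [0.0] * n
    beta = [0.0] * n
    # LU factorization, Eq. (2.8).
    alpha[0] = a[0]
    for i in range(1, n):
        beta[i] = e[i - 1] / alpha[i - 1]
        alpha[i] = a[i] - beta[i] * c[i - 1]
    # Forward substitution on the lower bidiagonal system, Eq. (2.9).
    y = [0.0] * n
    y[0] = b[0]
    for i in range(1, n):
        y[i] = b[i] - beta[i] * y[i - 1]
    # Backward substitution on the upper bidiagonal system, Eq. (2.10).
    x = [0.0] * n
    x[n - 1] = y[n - 1] / alpha[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = (y[i] - c[i] * x[i + 1]) / alpha[i]
    return x


# Hypothetical test: tridiag(-1, 2, -1) of order 4, right-hand side built from x = (1, 1, 1, 1).
a = [2.0, 2.0, 2.0, 2.0]
c = [-1.0, -1.0, -1.0]
e = [-1.0, -1.0, -1.0]
b = [1.0, 0.0, 0.0, 1.0]
x = thomas(a, c, e, b)
print(x)
```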
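Definition 2.16. Given a matrix $A \in \mathbb{C}^{n \times n}$, its eigenvalues $\{\lambda_i(A)\}_{i=1}^{n} \subset \mathbb{C}$ and the corresponding
eigenvectors $\{v_i\}_{i=1}^{n} \subset \mathbb{C}^n$ satisfy:
\[
A v_i = \lambda_i v_i \quad \text{for each } i = 1, \ldots, n.
\]
Given a matrix $A \in \mathbb{C}^{n \times n}$, we observe that: $\det(A) = \prod_{i=1}^{n} \lambda_i(A)$; $\lambda_i\!\left( A^{-1} \right) = 1/\lambda_{n+1-i}(A)$ for $i = 1, \ldots, n$,
if the inverse $A^{-1}$ exists; $\rho(A) \ge 0$. Next, we focus on a real–valued matrix $A \in \mathbb{R}^{n \times n}$.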
Proposition 2.4. If the matrix A ∈ Rn×n is symmetric, then its eigenvalues are real, i.e., λi (A) ∈ R for
every i = 1, . . . , n. Consequently, if A ∈ Rn×n is symmetric, then it is also positive definite if and only if
all its eigenvalues are strictly positive, i.e., λi (A) > 0 for every i = 1, . . . , n.
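Definition 2.17. Given the matrix $A \in \mathbb{R}^{n \times n}$, its $p$–norm is defined as:
\[
\|A\|_p := \sup_{v \in \mathbb{R}^n,\ v \neq 0} \frac{\|A v\|_p}{\|v\|_p} \quad \text{for } 1 \le p \le +\infty. \tag{2.11}
\]
For $p = 2$, we have:
\[
\|A\|_2 = \sup_{v \in \mathbb{R}^n,\ v \neq 0} \frac{\|A v\|_2}{\|v\|_2} = \sqrt{\lambda_{\max}(A^T A)}.
\]
If $A$ is symmetric and positive definite, then $\|A\|_2 = \lambda_{\max}(A)$ since $\lambda_{\max}(A^T A) = (\lambda_{\max}(A))^2$.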
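Definition 2.18. The condition number in norm $p$ of a non–singular matrix $A \in \mathbb{R}^{n \times n}$ is defined as:
\[
K_p(A) := \|A\|_p\, \|A^{-1}\|_p.
\]
By convention, if $A$ is singular, then $K_p(A) = +\infty$. For a non–singular matrix $A \in \mathbb{R}^{n \times n}$, we have
$K_p(A) \ge 1$ for every $1 \le p \le +\infty$. Moreover,
\[
K_2(A) = \|A\|_2\, \|A^{-1}\|_2 = \sqrt{\frac{\lambda_{\max}(A^T A)}{\lambda_{\min}(A^T A)}}.
\]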
Definition 2.19. The spectral condition number of a non–singular matrix $A \in \mathbb{R}^{n \times n}$ is defined as:
\[
K(A) := \rho(A)\, \rho\!\left( A^{-1} \right),
\]
where $\rho(A)$ and $\rho\!\left( A^{-1} \right)$ are the spectral radii of the matrices $A$ and $A^{-1}$, respectively.
If the eigenvalues of $A$ are real and strictly positive, $K(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}$, where $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$
are the maximum and minimum eigenvalues of $A$, respectively. Therefore, if $A$ is symmetric and positive
definite, then
\[
K_2(A) \equiv K(A) = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}.
\]
Remark 2.9. The condition number of a matrix $A$ provides a measure of the sensitivity of the solution of
the linear system $A x = b$ to perturbations in the data, i.e., $b$ and the matrix $A$ itself. The system is said
to be well-conditioned if $K_p(A)$ is relatively "small", and ill-conditioned if $K_p(A)$ is "very large" (e.g.,
$O(10^9)$ or larger).
• the residual is $r := b - A \hat{x}$, where $r \in \mathbb{R}^n$, while the relative residual is $r_{rel} := \frac{\|r\|}{\|b\|}$, for $b \neq 0$,
where $r_{rel} \in \mathbb{R}$.
Remark 2.10. In general, the residual $r$ and the relative residual $r_{rel}$ are used as estimators of the error
associated with the numerical solution $\hat{x}$; in fact, the exact solution $x$ of the linear system $A x = b$ is
generally unknown.
Proposition 2.5 (Stability estimate). The relative error associated with the numerical solution $\hat{x}$ of the
linear system $A x = b$ is estimated as:
\[
e_{rel} \le K_2(A)\, r_{rel} = K_2(A)\, \frac{\|r\|}{\|b\|}. \tag{2.13}
\]
Proof. From the definition of the residual, it follows that $r := b - A \hat{x} = A x - A \hat{x}$; thus, $x - \hat{x} = A^{-1} r$,
from which we obtain that $\|x - \hat{x}\| = \|A^{-1} r\| \le \|A^{-1}\|_2\, \|r\|$. Furthermore, since $\|b\| = \|A x\| \le \|A\|_2\, \|x\|$,
it follows that $\frac{1}{\|x\|} \le \frac{\|A\|_2}{\|b\|}$, leading to
\[
e_{rel} := \frac{\|x - \hat{x}\|}{\|x\|} \le \|A^{-1}\|_2\, \frac{\|r\|}{\|x\|} \le \|A^{-1}\|_2\, \|r\|\, \frac{\|A\|_2}{\|b\|} = K_2(A)\, \frac{\|r\|}{\|b\|} = K_2(A)\, r_{rel}.
\]
Remark 2.11. The error (stability) estimate from Eq. (2.13) is an a posteriori error estimate and can be
evaluated once the numerical solution $\hat{x}$ has been computed.
Based on the result in (2.13), the relative residual $r_{rel}$ represents a criterion that satisfactorily estimates
the error associated with the numerical solution $\hat{x}$ of the linear system obtained using a direct method on
the computer only if the condition number is "small", that is, when the matrix $A$ is well–conditioned.
Conversely, if the condition number of the matrix $A$ is "large", meaning that $A$ is ill–conditioned, then
the error associated with $\hat{x}$ could be very "large" even if $r_{rel}$ is "small", due to the propagation of round–off
errors during the application of the direct method on the computer.
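A small numerical experiment makes the point concrete. The sketch below uses a hypothetical, nearly singular $2 \times 2$ symmetric positive definite matrix and a deliberately poor approximation $\hat{x}$ (both chosen here for illustration): the relative residual is tiny, the relative error is 100%, and the estimate (2.13) still holds because $K_2(A)$ is large. The closed-form eigenvalues are valid only for symmetric $2 \times 2$ matrices.

```python
import math


def norm2(v):
    return math.sqrt(sum(vi * vi for vi in v))


# Hypothetical ill-conditioned SPD matrix; exact solution x = (1, 1).
A = [[1.0, 1.0], [1.0, 1.0001]]
x_exact = [1.0, 1.0]
b = [2.0, 2.0001]
x_hat = [2.0, 0.0]  # a poor "numerical solution" with a small residual

r = [b[i] - sum(A[i][j] * x_hat[j] for j in range(2)) for i in range(2)]
r_rel = norm2(r) / norm2(b)
e_rel = norm2([x_exact[i] - x_hat[i] for i in range(2)]) / norm2(x_exact)

# Eigenvalues of a symmetric 2x2 matrix from its trace and determinant.
tr = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
lam_max = (tr + math.sqrt(tr * tr - 4.0 * det)) / 2.0
lam_min = det / lam_max  # avoids cancellation in (tr - sqrt(...)) / 2
K2 = lam_max / lam_min

print("relative residual:", r_rel)
print("relative error:   ", e_rel)
print("K2(A) * r_rel:    ", K2 * r_rel)
```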
\[
\text{given } x^{(0)} \in \mathbb{R}^n, \qquad
x^{(k+1)} = B x^{(k)} + g \quad \text{for } k = 0, 1, \ldots, \tag{2.14}
\]
where $B \in \mathbb{R}^{n \times n}$ is the iteration matrix and $g \in \mathbb{R}^n$ is the iteration vector. $B$ and $g$ depend on $A$, $b$,
and the specific method under consideration. However, the iterative method must comply with the strong
consistency condition: if $x$ is the solution of $A x = b$, then it must hold that $x = B x + g$.
It follows that the iteration vector must read $g = (I - B) A^{-1} b$ since $x = A^{-1} b$.
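Definition 2.21. We define the error $e^{(k)} \in \mathbb{R}^n$ corresponding to $x^{(k)} \in \mathbb{R}^n$ of the iterative method (2.14)
as:
\[
e^{(k)} := x - x^{(k)} \quad \text{for } k = 0, 1, \ldots,
\]
while the residual $r^{(k)} \in \mathbb{R}^n$ is:
\[
r^{(k)} := b - A x^{(k)} \quad \text{for } k = 0, 1, \ldots.
\]
Subtracting (2.14) from the consistency condition $x = B x + g$ gives $e^{(k+1)} = B e^{(k)}$, hence $e^{(k)} = B^k e^{(0)}$ and
\[
\| e^{(k)} \| \le \| B^k \|_2\, \| e^{(0)} \| \quad \text{for } k = 0, 1, \ldots. \tag{2.16}
\]
We see that $\lim_{k \to +\infty} e^{(k)} = 0$ if and only if $\lim_{k \to +\infty} B^k = 0$, which is verified if and only if $\rho(B) < 1$, with
$\rho(B)$ the spectral radius of $B$. Indeed, in general, the following holds:
\[
\| e^{(k)} \| \le (\rho(B))^k\, \| e^{(0)} \| \quad \text{for } k = 0, 1, \ldots.
\]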
Proposition 2.6 (Necessary and sufficient condition for convergence). The iterative method (2.14) is convergent
to the exact solution $x \in \mathbb{R}^n$ of the linear system $A x = b$ for every choice of the initial guess
$x^{(0)} \in \mathbb{R}^n$ if and only if the spectral radius of the iteration matrix $B$ is strictly smaller than one, that is,
$\rho(B) < 1$. Moreover, the smaller $\rho(B)$ is, the faster the convergence.
\[
B = I - P^{-1} A \tag{2.17}
\]
and $g = P^{-1} b$. Hence, the iterative method (2.14) can be written as $P x^{(k+1)} = (P - A) x^{(k)} + b$, from
which
\[
P \left( x^{(k+1)} - x^{(k)} \right) = r^{(k)}.
\]
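Definition 2.22. The preconditioned residual $z^{(k)} \in \mathbb{R}^n$ is the solution of the following linear system:
\[
P z^{(k)} = r^{(k)} \quad \text{for } k = 0, 1, \ldots.
\]
Hence, the iterative method (2.14) can be more conveniently written as:
\[
\text{given } x^{(0)} \in \mathbb{R}^n, \qquad
\text{solve } P z^{(k)} = r^{(k)} \ \text{ and set } \ x^{(k+1)} = x^{(k)} + z^{(k)} \quad \text{for } k = 0, 1, \ldots. \tag{2.18}
\]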
We remark that r(k+1) = b − A x(k+1) = b − A x(k) − A z(k) = r(k) − A z(k) . Therefore, we write the
following algorithm.
The iterative method must be stopped by means of suitable stopping criteria. We can consider the stopping
criterion based on the normalized residual, such that iterations are stopped at the first $k \ge 0$ for which $\frac{\| r^{(k)} \|}{\| b \|} < tol$,
for a prescribed tolerance $tol$. The number of iterations should also be limited to a maximum value $k_{max}$.
At each iteration of the method (2.18) we need to solve the linear system $P z^{(k)} = r^{(k)}$. Therefore,
the choice of the preconditioner $P$ should be such that the linear system $P z^{(k)} = r^{(k)}$ is solved in a
computationally efficient manner by means of a direct method. In other words, this linear system should
be "easy" to solve by means of a direct method.
In addition, the choice of $P$ must guarantee that the iterative method is convergent, that is, the iteration
matrix $B = I - P^{-1} A$ is such that $\rho(B) < 1$. Moreover, it is desirable that $\rho(B) \ll 1$ to ensure
fast convergence to $x$.
In this context, it is clear that the choice of the preconditioning matrix $P$ is a trade–off between the
"simplicity" of solving the linear system $P z^{(k)} = r^{(k)}$ at each iteration $k$ of the iterative method and the
need to ensure a (fast) convergence of the iterative method (that is, $\rho(B) < 1$ and possibly $\rho(B) \ll 1$).
Jacobi method
The Jacobi method can be applied only to a non–singular matrix $A \in \mathbb{R}^{n \times n}$ whose diagonal elements are
non–zero, that is, when $a_{ii} \neq 0$ for all $i = 1, \ldots, n$.
The Jacobi method selects as preconditioner $P$ in the algorithm (2.18) the diagonal matrix extracted
from $A$. We have $P = P_J$, where
\[
P_J = D,
\]
with $D \in \mathbb{R}^{n \times n}$ the diagonal matrix with elements $(D)_{ii} = a_{ii}$ for all $i = 1, \ldots, n$ and $(D)_{ij} = 0$ for
every $i \neq j$. In this manner, $\det(P_J) \neq 0$. The linear system $P_J z^{(k)} = r^{(k)}$ of Eq. (2.18) is "easy" to
solve by means of a direct method (only $n$ divisions) since $P_J = D$ is diagonal. The iteration matrix
corresponding to the method is:
\[
B_J = I - P_J^{-1} A = I - D^{-1} A.
\]
Hence, the convergence of the Jacobi method to $x$ for every choice of the initial guess $x^{(0)}$ depends on the
value of $\rho(B_J)$ according to Proposition 2.6.
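A minimal sketch of the Jacobi method in the form (2.18), with the stopping criterion on the normalized residual, is given below. The function name `jacobi`, the default values of `tol` and `kmax`, and the test system are assumptions of this illustration.

```python
import math


def jacobi(A, b, x0, tol=1e-10, kmax=500):
    """Jacobi iteration written as: solve D z = r, then x <- x + z."""
    n = len(b)
    x = list(x0)
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    for k in range(kmax):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if math.sqrt(sum(ri * ri for ri in r)) / norm_b < tol:
            return x, k
        # P_J = D is diagonal, so the preconditioned residual costs n divisions.
        x = [x[i] + r[i] / A[i][i] for i in range(n)]
    return x, kmax


# Hypothetical test: strictly diagonally dominant matrix, exact solution (1, 1).
A = [[3.0, 1.0], [1.0, 2.0]]
b = [4.0, 3.0]
x, k = jacobi(A, b, [0.0, 0.0])
print(x, k)
```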
Gauss–Seidel method
The Gauss–Seidel method can be applied only to a non–singular matrix $A \in \mathbb{R}^{n \times n}$ whose diagonal elements
are non–zero, that is, when $a_{ii} \neq 0$ for all $i = 1, \ldots, n$. The preconditioner $P$ in Eq. (2.18) is the
lower triangular matrix extracted from $A$.
By convention, we denote by $D$ the diagonal matrix extracted from $A$ and by $E \in \mathbb{R}^{n \times n}$ the lower
triangular matrix with zero diagonal elements such that $(E)_{ij} = -a_{ij}$ for $i = 2, \ldots, n$ and $j = 1, \ldots, i-1$,
$(E)_{ij} = 0$ otherwise. Finally, $F \in \mathbb{R}^{n \times n}$ is the upper triangular matrix with zero diagonal elements such
that $(F)_{ij} = -a_{ij}$ for $i = 1, \ldots, n-1$ and $j = i+1, \ldots, n$, $(F)_{ij} = 0$ otherwise. Thus, we have
$A = D - E - F$. The preconditioner for the Gauss–Seidel method is $P = P_{GS}$, where
\[
P_{GS} = D - E,
\]
with $\det(P_{GS}) \neq 0$ since the diagonal elements of $A$ are non–zero by hypothesis. The linear system $P_{GS} z^{(k)} = r^{(k)}$ of Eq. (2.18) is
"easy" to solve by means of a direct method, in particular by means of the forward substitution method (in
$n^2$ operations) since $P_{GS} = (D - E)$ is lower triangular. The corresponding iteration matrix is:
\[
B_{GS} = I - P_{GS}^{-1} A = I - (D - E)^{-1} A.
\]
The convergence properties of the method depend on ρ(BGS ), in agreement with Proposition 2.6.
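The Gauss–Seidel method differs from the Jacobi method only in the solve for the preconditioned residual, which becomes a forward substitution with the lower triangular part of $A$. The sketch below follows the form (2.18); the function name `gauss_seidel`, the defaults, and the $3 \times 3$ test system are assumptions of this illustration.

```python
import math


def gauss_seidel(A, b, x0, tol=1e-10, kmax=500):
    """Gauss-Seidel iteration written as: solve (D - E) z = r, then x <- x + z."""
    n = len(b)
    x = list(x0)
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    for k in range(kmax):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if math.sqrt(sum(ri * ri for ri in r)) / norm_b < tol:
            return x, k
        # Forward substitution with P_GS = lower triangular part of A (diagonal included).
        z = [0.0] * n
        for i in range(n):
            z[i] = (r[i] - sum(A[i][j] * z[j] for j in range(i))) / A[i][i]
        x = [x[i] + z[i] for i in range(n)]
    return x, kmax


# Hypothetical test: non-singular 3x3 matrix, right-hand side built from x = (1, 1, 1).
A = [[1.0, 0.0, -1.0], [3.0, 2.0, 0.0], [-1.0, -1.0, 2.0]]
b = [0.0, 5.0, 0.0]
x, k = gauss_seidel(A, b, [0.0, 0.0, 0.0])
print(x, k)
```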
Proposition 2.7. If $A$ is non–singular and strictly diagonally dominant by rows, then both the Jacobi and
Gauss–Seidel methods converge to $x$ for every $x^{(0)} \in \mathbb{R}^n$.
Proposition 2.8. If $A$ is symmetric and positive definite, then the Gauss–Seidel method converges to $x$ for
every $x^{(0)} \in \mathbb{R}^n$.
Proposition 2.9. If $A$ is non–singular and tridiagonal with every diagonal element non–zero, then the
Jacobi and Gauss–Seidel methods are either both convergent to $x$ for every $x^{(0)} \in \mathbb{R}^n$ or both divergent. If they
are convergent, the Gauss–Seidel method is faster than the Jacobi method since $\rho(B_{GS}) = (\rho(B_J))^2$.
The former conditions are only sufficient. If these are not satisfied, then the necessary and sufficient condi-
tion of Proposition 2.6 must be verified to establish the convergence of the iterative method.
Example 2.8. Consider $A = \begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix}$, which is strictly diagonally dominant by rows. Since the conditions of
Proposition 2.7 are met, both the Jacobi and Gauss–Seidel methods are convergent for every $x^{(0)} \in \mathbb{R}^2$ to the
solution $x \in \mathbb{R}^2$ of the linear system $A x = b$, regardless of $b \in \mathbb{R}^2$. The hypotheses of Propositions 2.8 and 2.9
are satisfied too. We verify the result by means of the necessary and sufficient condition of Proposition 2.6. For the
Jacobi method, we have $P_J = \begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix}$ and $B_J = I - P_J^{-1} A = \begin{bmatrix} 0 & -\frac{1}{3} \\ -\frac{1}{2} & 0 \end{bmatrix}$, hence $\rho(B_J) = \frac{1}{\sqrt{6}} < 1$.
For the Gauss–Seidel method, we have $P_{GS} = \begin{bmatrix} 3 & 0 \\ 1 & 2 \end{bmatrix}$ and $B_{GS} = I - P_{GS}^{-1} A = \begin{bmatrix} 0 & -\frac{1}{3} \\ 0 & \frac{1}{6} \end{bmatrix}$, from which
$\rho(B_{GS}) = \frac{1}{6} < 1$; the method converges faster than the Jacobi method.
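Example 2.9. We consider the non–singular matrix $A = \begin{bmatrix} 1 & 0 & -1 \\ 3 & 2 & 0 \\ -1 & -1 & 2 \end{bmatrix}$ for the linear system $A x = b$. None
of the sufficient conditions of Propositions 2.7, 2.8, or 2.9 are satisfied. Therefore, we need to verify the necessary
and sufficient condition of Proposition 2.6. For the Jacobi method, we have $P_J = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{bmatrix}$ and
$B_J = I - P_J^{-1} A = \begin{bmatrix} 0 & 0 & 1 \\ -\frac{3}{2} & 0 & 0 \\ \frac{1}{2} & \frac{1}{2} & 0 \end{bmatrix}$, for which $\rho(B_J) \approx \frac{109}{100} > 1$. Therefore, the Jacobi method does not converge for
every choice of $x^{(0)} \in \mathbb{R}^3$ to $x \in \mathbb{R}^3$. For the Gauss–Seidel method, we have $P_{GS} = \begin{bmatrix} 1 & 0 & 0 \\ 3 & 2 & 0 \\ -1 & -1 & 2 \end{bmatrix}$ and
$B_{GS} = I - P_{GS}^{-1} A = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & -\frac{3}{2} \\ 0 & 0 & -\frac{1}{4} \end{bmatrix}$, from which $\rho(B_{GS}) = \frac{1}{4} < 1$. Therefore, the Gauss–Seidel method
converges to $x$ for every choice of $x^{(0)} \in \mathbb{R}^3$.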
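\[
\text{given } x^{(0)} \in \mathbb{R}^n, \qquad
\text{solve } P z^{(k)} = r^{(k)} \ \text{ and set } \ x^{(k+1)} = x^{(k)} + \alpha_k z^{(k)} \quad \text{for } k = 0, 1, \ldots, \tag{2.19}
\]
Let us focus on the (preconditioned) stationary Richardson method, whose algorithm reads:
\[
B_\alpha = I - \alpha P^{-1} A.
\]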
Therefore, the convergence properties depend on the spectral radius of Bα , that is on ρ (Bα ). We notice
that the Richardson method is specified by the choice of the parameter α and the preconditioner P .
Let us consider the convergence conditions of the Richardson method. We introduce the following
definition.
Definition 2.23. The energy norm of a vector $v \in \mathbb{R}^n$ with respect to a symmetric and positive definite
matrix $A \in \mathbb{R}^{n \times n}$ is defined as:
\[
\| v \|_A = \sqrt{v^T A v}.
\]
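Proposition 2.10. If the matrices $A$ and $P \in \mathbb{R}^{n \times n}$ are symmetric and positive definite, then the stationary
Richardson method converges to $x \in \mathbb{R}^n$ for every choice of $x^{(0)} \in \mathbb{R}^n$ if and only if
\[
0 < \alpha < \frac{2}{\lambda_{\max}(P^{-1} A)},
\]
where $\lambda_{\max}(P^{-1} A)$ is the largest eigenvalue of $P^{-1} A$. Moreover, the spectral radius of the iteration
matrix $B_\alpha$ is minimized by the choice $\alpha = \alpha_{opt}$, where
\[
\alpha_{opt} := \frac{2}{\lambda_{\min}(P^{-1} A) + \lambda_{\max}(P^{-1} A)},
\]
and, for $\alpha = \alpha_{opt}$, the following error estimate holds:
\[
\| e^{(k)} \|_A \le d^k\, \| e^{(0)} \|_A \quad \text{for } k = 0, 1, \ldots, \tag{2.20}
\]
with $d := \frac{K(P^{-1} A) - 1}{K(P^{-1} A) + 1}$, where $K(P^{-1} A) = \frac{\lambda_{\max}(P^{-1} A)}{\lambda_{\min}(P^{-1} A)}$ is the spectral condition number of
$P^{-1} A$.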
Under the assumptions of Proposition 2.10, an optimal choice for the parameter in a stationary Richardson
method is available. On the other hand, the result (2.20) also indicates that the closer the preconditioning
matrix $P$ is to the matrix $A$, the closer the spectral condition number of the matrix $P^{-1} A$ is to one, and
the faster the method converges; however, in this case, solving the linear system $P z^{(k)} = r^{(k)}$ could be
relatively complex. In particular, for $P = A$, we have $\alpha_{opt} = 1$ and $d = 0$, meaning that the convergence
occurs in a single iteration. Conversely, if $P = I$, we have $\alpha_{opt} = \frac{2}{\lambda_{\min}(A) + \lambda_{\max}(A)}$ and $d = \frac{K(A) - 1}{K(A) + 1}$;
in this case, the convergence of the iterative method can be slow if $K(A) \gg 1$, since $d \lesssim 1$.
In general, the closer $K(P^{-1} A)$ is to one, the faster the method converges.
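Example 2.10. Consider $A = \begin{bmatrix} 4 & 1 \\ 1 & 2 \end{bmatrix}$ and the preconditioner $P = \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix}$, both of which are symmetric and
positive definite. Therefore, to study the convergence properties of the stationary Richardson method, we can use
the results of Proposition 2.10. These depend on the matrix $P^{-1} A = \begin{bmatrix} 1 & 1/4 \\ 1/4 & 1/2 \end{bmatrix}$ and its eigenvalues
$\lambda_{\min} = \lambda_{\min}(P^{-1} A) = \frac{3}{4} - \frac{\sqrt{2}}{4}$ and $\lambda_{\max} = \lambda_{\max}(P^{-1} A) = \frac{3}{4} + \frac{\sqrt{2}}{4}$. In particular, the stationary Richardson
method converges to the solution $x \in \mathbb{R}^2$ of a linear system associated with $A$ for every $x^{(0)} \in \mathbb{R}^2$ if and only
if $0 < \alpha < \frac{2}{\lambda_{\max}} = \frac{8}{3 + \sqrt{2}}$. Furthermore, the optimal parameter $\alpha_{opt} = \frac{2}{\lambda_{\min} + \lambda_{\max}} = \frac{4}{3}$ minimizes the
spectral radius among the iteration matrices $B_\alpha$; specifically, $B_{\alpha_{opt}} = I - \alpha_{opt} P^{-1} A = \begin{bmatrix} -1/3 & -1/3 \\ -1/3 & 1/3 \end{bmatrix}$ and
$\rho(B_{\alpha_{opt}}) = \frac{\sqrt{2}}{3} < 1$. From Eq. (2.20), it follows that $d = \frac{K(P^{-1} A) - 1}{K(P^{-1} A) + 1} = \frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}} = \rho(B_{\alpha_{opt}}) = \frac{\sqrt{2}}{3}$;
that is, the error in the $A$–energy norm is reduced by a factor of at least $d = \frac{\sqrt{2}}{3}$ at each iteration.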
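The effect of the parameter $\alpha$ can be checked numerically on the matrices of Example 2.10. The sketch below assumes a diagonal preconditioner stored as the list `P_diag`, a right-hand side built from the solution $(1, 1)$, and the hypothetical function name `richardson`; it compares the iteration counts obtained with $\alpha_{opt}$ and with $\alpha = 1$, which also lies inside the convergence interval.

```python
import math


def richardson(A, P_diag, b, alpha, x0, tol=1e-12, kmax=1000):
    """Stationary preconditioned Richardson iteration with a diagonal preconditioner."""
    n = len(b)
    x = list(x0)
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    for k in range(kmax):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if math.sqrt(sum(ri * ri for ri in r)) / norm_b < tol:
            return x, k
        z = [r[i] / P_diag[i] for i in range(n)]          # solve P z = r
        x = [x[i] + alpha * z[i] for i in range(n)]
    return x, kmax


A = [[4.0, 1.0], [1.0, 2.0]]
P_diag = [4.0, 4.0]
b = [5.0, 3.0]  # assumed right-hand side, so that x = (1, 1)

lam_min = 0.75 - math.sqrt(2.0) / 4.0
lam_max = 0.75 + math.sqrt(2.0) / 4.0
alpha_opt = 2.0 / (lam_min + lam_max)

x_opt, k_opt = richardson(A, P_diag, b, alpha_opt, [0.0, 0.0])
x_one, k_one = richardson(A, P_diag, b, 1.0, [0.0, 0.0])
print(alpha_opt, k_opt, k_one)
```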
In general, for a preconditioned stationary Richardson method, determining the optimal parameter
αopt can be computationally expensive, as it is related to the eigenvalues of P −1 A. To avoid explicitly
computing these eigenvalues for the determination of the parameter α, one can appropriately use a dynamic
preconditioned Richardson method.
where P is a symmetric and positive definite matrix, and z(k) is the preconditioned residual.
Similarly, the gradient method is an unpreconditioned dynamic Richardson method with parameters $\alpha_k$
chosen as:
\[
\alpha_k = \frac{\left( r^{(k)} \right)^T r^{(k)}}{\left( r^{(k)} \right)^T A\, r^{(k)}} \quad \text{for } k = 0, 1, \ldots. \tag{2.21}
\]
It is possible to obtain the gradient method from the preconditioned gradient method by choosing $P = I$;
in this case, $z^{(k)} \equiv r^{(k)}$ for every $k = 0, 1, \ldots$. For the gradient method, the residual vector $r^{(k)}$ represents
the descent direction for the error at iteration $k = 0, 1, \ldots$, and if $A$ is symmetric and positive definite, the
choice of $\alpha_k$ given by (2.21) minimizes the error $\| e^{(k+1)} \|_A$ along the direction $r^{(k)}$. The algorithm reads:
We provide the following interpretation of the gradient method ($P = I$). Let $A \in \mathbb{R}^{n \times n}$ be symmetric
and positive definite. Then, the vector $x \in \mathbb{R}^n$ is a solution to the linear system $A x = b$ if and only
if the system energy function $\Phi : \mathbb{R}^n \to \mathbb{R}$, defined as $\Phi(y) = \frac{1}{2} y^T A y - y^T b$, achieves its minimum
at $y = x$. By rewriting $\Phi$ as
\[
\Phi(y) = \frac{1}{2} y^T A y - y^T b = \frac{1}{2} \sum_{i,j=1}^{n} y_i\, a_{ij}\, y_j - \sum_{i=1}^{n} b_i\, y_i,
\]
we have
\[
\frac{\partial \Phi}{\partial y_m} = \frac{1}{2} \left( \sum_{j=1}^{n} a_{mj}\, y_j + \sum_{i=1}^{n} y_i\, a_{im} \right) - b_m
\]
or, grouping all components and since $A$ is symmetric,
\[
\nabla \Phi(y) = \frac{1}{2} (A^T + A)\, y - b = A y - b.
\]
Let us assume that $y$ now corresponds to the $k$–th iterate $x^{(k)}$ of the method, with $k = 0, 1, \ldots$,
meaning that $\Phi(x^{(k)}) = \Phi(x) + \frac{1}{2} \| e^{(k)} \|_A^2$ and $\nabla \Phi(x^{(k)}) = -r^{(k)}$. The goal is to determine $x^{(k+1)}$ such
that $\Phi(x^{(k+1)}) \le \Phi(x^{(k)})$, so we can choose the descent direction
\[
-\nabla \Phi(x^{(k)}) = r^{(k)},
\]
which is the opposite of the gradient of $\Phi$, giving the method its name. Therefore, we write
x(k+1) = x(k) − αk ∇Φ(x(k) ) = x(k) + αk r(k) ,
with αk ∈ R to be determined. Once the descent direction r(k) is determined, the parameter αk expresses
the step size to travel along r(k) to find x(k+1) . The intersection of Φ(y) with the hyperplane passing
through (x(k) , Φ(x(k) )) and orthogonal to Rn defines the function
ϕ(α) = Φ(x(k) + α r(k) ).
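We determine the step size $\alpha_k$ such that, once the descent direction $r^{(k)}$ is chosen, we achieve the maximum
decrease of $\Phi$. In other words, we want $\varphi(\alpha)$ to have a minimum at $\alpha = \alpha_k$. We have that
\[
\varphi'(\alpha) = \nabla \Phi(x^{(k)} + \alpha\, r^{(k)}) \cdot r^{(k)} = \left( A (x^{(k)} + \alpha\, r^{(k)}) - b \right) \cdot r^{(k)}.
\]
By imposing that $\alpha_k$ is such that $\varphi'(\alpha_k) = 0$, we obtain $(A(x^{(k)} + \alpha_k r^{(k)}) - b) \cdot r^{(k)} = 0$, that is,
$(-r^{(k)} + \alpha_k A r^{(k)}) \cdot r^{(k)} = 0$, from which we get the expression for $\alpha_k$ of Eq. (2.21).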
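The update $x^{(k+1)} = x^{(k)} + \alpha_k r^{(k)}$ with the step size of Eq. (2.21) can be sketched as follows. The function name `gradient_method`, the defaults, and the small symmetric positive definite test system are assumptions of this illustration; the residual is updated with the recurrence $r^{(k+1)} = r^{(k)} - \alpha_k A r^{(k)}$ to avoid a second matrix-vector product.

```python
import math


def gradient_method(A, b, x0, tol=1e-10, kmax=1000):
    """Gradient (steepest descent) method for A symmetric positive definite."""
    n = len(b)
    x = list(x0)
    r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    for k in range(kmax):
        rr = sum(ri * ri for ri in r)
        if math.sqrt(rr) / norm_b < tol:
            return x, k
        Ar = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        alpha = rr / sum(r[i] * Ar[i] for i in range(n))   # Eq. (2.21)
        x = [x[i] + alpha * r[i] for i in range(n)]
        r = [r[i] - alpha * Ar[i] for i in range(n)]
    return x, kmax


# Hypothetical SPD test system with exact solution (1, 1).
A = [[6.0, -1.0], [-1.0, 2.0]]
b = [5.0, 1.0]
x, k = gradient_method(A, b, [0.0, 0.0])
print(x, k)
```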
Finally, we observe that, by virtue of the choice of the descent direction and the parameter $\alpha_k$, the
descent directions turn out to be pairwise orthogonal, that is:
\[
r^{(k+1)} \cdot r^{(k)} = 0 \quad \text{for } k = 0, 1, \ldots.
\]
Proposition 2.11. If the matrices A and P ∈ Rn×n are symmetric and positive definite, the preconditioned
gradient method converges to the solution x ∈ Rn for all choices of x(0) ∈ Rn and
ke(k) kA ≤ dk ke(0) kA
for k = 0, 1, . . . , (2.22)
K P −1 A − 1 λmax P −1 A
−1
is the spectral condition number of P −1 A .
where d := −1
; K P A = −1
K (P A) + 1 λmin (P A)
The previous result can be applied to the (non–preconditioned) gradient method by setting P = I. Moreover, the error estimate (2.22) can be used to predict in advance the number of iterations needed by the preconditioned gradient method to converge to the solution with the desired tolerance. The previous result also highlights the role of the preconditioner P, which should act in such a manner that $K(P^{-1}A)$ gets closer to 1.
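As a minimal sketch of how estimate (2.22) can be used in advance, the function below returns the smallest k for which $d^k \le tol$, so that the bound guarantees $\|e^{(k)}\|_A \le tol\,\|e^{(0)}\|_A$. The function name `predicted_iterations` and the sample values K = 3 and tol = 1e-3 are illustrative assumptions, not part of the notes. Since (2.22) is an upper bound, the method may converge in fewer iterations than predicted.

```python
import math

def predicted_iterations(cond, tol):
    """Smallest k such that d**k <= tol, with d = (cond - 1) / (cond + 1).

    cond is the spectral condition number K(P^{-1} A) >= 1 and tol in (0, 1)
    is the requested reduction factor of the error in the A-norm.
    """
    d = (cond - 1.0) / (cond + 1.0)
    if d == 0.0:
        return 1  # K = 1: the bound vanishes after a single iteration
    return math.ceil(math.log(tol) / math.log(d))

# Hypothetical values: K = 3 gives d = 0.5
k_min = predicted_iterations(3.0, 1e-3)
```

A smaller condition number gives a smaller d and hence a smaller predicted k, which is the quantitative reason for choosing P so that $K(P^{-1}A)$ is close to 1.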
p gives $p \cdot (b - A x^{(k+1)}) = p \cdot (r^{(k)} - A q) = 0$, which implies $p \cdot (A q) = 0$. Thus, the directions p and q must be A–conjugate.
Thus, for a symmetric and positive definite matrix A ∈ Rn×n , the conjugate gradient method mini-
mizes the error ke(k+1) kA at each iteration k = 0, 1, . . ., along the descent direction p(k) ∈ Rn , which is
A–conjugate to all previously calculated descent directions p(j) for j = 0, . . . , k − 1.
The conjugate gradient method is not part of the dynamic Richardson methods, as it requires determining
two parameters, αk and βk , at each iteration.
Proposition 2.12. If $A \in \mathbb{R}^{n\times n}$ is symmetric and positive definite, the conjugate gradient method converges to $x \in \mathbb{R}^n$ for any choice of $x^{(0)} \in \mathbb{R}^n$ in at most n iterations (in exact arithmetic), and
$\|e^{(k)}\|_A \le \frac{2 c^k}{1 + c^{2k}} \|e^{(0)}\|_A$ for k = 0, 1, . . . , (2.23)
where $c := \frac{\sqrt{K(A)} - 1}{\sqrt{K(A)} + 1}$ and K(A) is the spectral condition number of A.
The conjugate gradient method can be interpreted as a direct method since the convergence to $x \in \mathbb{R}^n$ occurs in at most n iterations in exact arithmetic. However, typically the algorithm is stopped before the n iterations are completed. For sufficiently large k, the term $\frac{2 c^k}{1 + c^{2k}}$ in the error estimate (2.23) decreases like $2 c^k$. Therefore, if A is symmetric and positive definite, the conjugate gradient method converges more rapidly than the gradient method, since $2 c^k < d^k$ for “sufficiently” large k.
Example 2.11. We visualize in the following figure the convergence paths of the gradient and conjugate gradient methods (both without preconditioning), when applied to the linear system Ax = b, with $A = \begin{bmatrix} 6 & -1 \\ -1 & 2 \end{bmatrix}$ and $b = (5, 1)^T$, whose solution is $x = (1, 1)^T$.
As can be seen, the orthogonality of the descent directions (taken pairwise) in the case of the gradient method
leads to a slower convergence compared to the conjugate gradient method; conversely, the conjugate gradient method,
although it considers the same descent direction in the first iteration, arrives at the exact solution (up to round-off errors
of order one) by the second iteration, as stated in Proposition 2.12.
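The comparison of Example 2.11 can be reproduced with the following sketch in plain Python. The notes do not list the conjugate gradient algorithm in this section, so the update $\beta_k = \frac{r^{(k+1)} \cdot r^{(k+1)}}{r^{(k)} \cdot r^{(k)}}$ used here is an assumption (a standard form, equivalent in exact arithmetic to other common expressions); the initial guess $x^{(0)} = (0, 0)^T$, the tolerance, and all function names are likewise illustrative choices.

```python
def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gradient_method(A, b, x0, tol=1e-10, kmax=1000):
    """Gradient method with the optimal step alpha_k = (r.r) / (r.Ar)."""
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
    k = 0
    while dot(r, r) ** 0.5 > tol and k < kmax:
        Ar = matvec(A, r)
        alpha = dot(r, r) / dot(r, Ar)
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
        r = [ri - alpha * ari for ri, ari in zip(r, Ar)]
        k += 1
    return x, k

def conjugate_gradient(A, b, x0, tol=1e-10, kmax=1000):
    """Conjugate gradient method (assumed beta_k, see the lead-in)."""
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
    p = list(r)  # the first descent direction coincides with the residual
    k = 0
    while dot(r, r) ** 0.5 > tol and k < kmax:
        Ap = matvec(A, p)
        alpha = dot(p, r) / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r_new = [ri - alpha * api for ri, api in zip(r, Ap)]
        beta = dot(r_new, r_new) / dot(r, r)
        p = [rn + beta * pi for rn, pi in zip(r_new, p)]
        r = r_new
        k += 1
    return x, k

A = [[6.0, -1.0], [-1.0, 2.0]]
b = [5.0, 1.0]
x_grad, k_grad = gradient_method(A, b, [0.0, 0.0])
x_cg, k_cg = conjugate_gradient(A, b, [0.0, 0.0])
```

With n = 2, the conjugate gradient loop stops after two iterations, in agreement with Proposition 2.12, while the gradient method needs many more iterations to reach the same residual tolerance.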
Given a preconditioning matrix P ∈ Rn×n that is symmetric and positive definite, it is possible to de-
fine the preconditioned conjugate gradient (PCG) method by generalizing the conjugate gradient method.
We do not present the algorithm here, but we highlight the following result.
Proposition 2.13. If A and $P \in \mathbb{R}^{n\times n}$ are symmetric and positive definite matrices, the preconditioned conjugate gradient method converges to $x \in \mathbb{R}^n$ for every choice of $x^{(0)} \in \mathbb{R}^n$. The error $\|e^{(k)}\|_A$ satisfies the estimate (2.23), where now $c = \frac{\sqrt{K(P^{-1}A)} - 1}{\sqrt{K(P^{-1}A)} + 1}$.
The goal is to numerically approximate the zero α ∈ R of a function f (x) in the interval I = (a, b) ⊆ R.
The problem is commonly referred to as the numerical solution of a nonlinear equation.
We also want to approximate the solution of systems of nonlinear equations. Given F : Rn → Rn , for
some n ≥ 1, the problem consists of finding the (zeros) vector α ∈ Rn such that F(α) = 0. More specifi-
cally, we have:
$\mathbf{x} = (x_1, \ldots, x_n)^T$ and $\mathbf{F}(\mathbf{x}) = (f_1(\mathbf{x}), \ldots, f_n(\mathbf{x}))^T = (f_1(x_1, \ldots, x_n), \ldots, f_n(x_1, \ldots, x_n))^T$.
We focus on Newton and fixed point iteration methods. We omit in this presentation the bisection
method, which is often used for continuous functions. A common feature of these numerical methods is
that they are iterative methods.
$x^{(k+1)} = x^{(k)} - \frac{f(x^{(k)})}{f'(x^{(k)})}$ for all k ≥ 0, (3.1)
given the initial iterate $x^{(0)}$ and assuming that $f'(x^{(k)}) \ne 0$ for all k ≥ 0. Eq. (3.1) is called the Newton iterate. The zero α is obtained as the limit of the sequence of iterates $\{x^{(k+1)}\}_{k=0}^{+\infty}$ that solve the tangent line equation to the curve (x, f(x)) evaluated at each iterate $\{x^{(k)}\}_{k=0}^{+\infty}$.
Example 3.1. We graphically illustrate Newton method in the following pictures, where the first two iterations of the
method are highlighted.
[Figure: first two iterations of the Newton method; panels "Step 1" and "Step 2".]
The Newton method is applicable to a function $f \in C^0(I)$ that is differentiable in I; furthermore, given $x^{(0)} \in I$, Newton method consists of sequentially applying the Newton iterate (3.1), provided that $f'(x^{(k)}) \ne 0$ for all k ≥ 0.
Assuming that $f \in C^2(I)$, the Taylor series expansion of f(x) around $x^{(k)}$ is written as $f(x^{(k+1)}) = f(x^{(k)}) + f'(x^{(k)})\,\delta^{(k)} + O((\delta^{(k)})^2)$, where $\delta^{(k)} := x^{(k+1)} - x^{(k)}$ for k ≥ 0 is the difference between successive iterates. Setting $f(x^{(k+1)}) = 0$ and neglecting the term $O((\delta^{(k)})^2)$ yields the Newton iterate (3.1), so that Newton method represents the first–order approximation of the Taylor series expansion of f(x) around $x^{(k)}$; neglecting the second–order term is justified if $\delta^{(k)}$ is “small.”
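The Newton iterate (3.1) can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm listing of the notes: the function name `newton`, the stopping test on the difference between successive iterates, and the test function $f(x) = x^2 - 2$ with $x^{(0)} = 1$ are assumptions made for the example.

```python
def newton(f, df, x0, tol=1e-12, kmax=50):
    """Apply x^(k+1) = x^(k) - f(x^(k)) / f'(x^(k)) until |delta^(k)| < tol.

    Returns the last iterate and the number of iterations performed.
    """
    x = x0
    for k in range(kmax):
        dfx = df(x)
        if dfx == 0.0:
            raise ZeroDivisionError("f'(x^(k)) = 0: the Newton iterate is undefined")
        x_new = x - f(x) / dfx
        if abs(x_new - x) < tol:
            return x_new, k + 1
        x = x_new
    return x, kmax

# Hypothetical test: zero alpha = sqrt(2) of f(x) = x^2 - 2
root, iterations = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```

The explicit check on `dfx` mirrors the requirement $f'(x^{(k)}) \ne 0$ stated with Eq. (3.1).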
The choice of the initial iterate $x^{(0)}$ is crucial for the success of Newton method. In fact, it is necessary to choose $x^{(0)}$ “sufficiently” close to the zero α. In practice, the sequence of Newton iterates $\{x^{(k+1)}\}_{k=0}^{+\infty}$ may diverge instead of converging to α if the initial iterate $x^{(0)}$ is not “sufficiently” close to the zero α. Since the zero α is unknown, the choice of $x^{(0)}$ may not be trivial; from this perspective, the graph of the function or the use of the bisection method can be extremely helpful in selecting values of $x^{(0)}$ “sufficiently” close to α and thus initializing the Newton method.
Example 3.2. The following example illustrates how Newton iterates do not converge to the zero α: this is due to the
fact that x(0) is not “sufficiently” close to α.
For affine (or linear) functions, that is, functions of the form f(x) = cx + d with c, d ∈ R and c ≠ 0, Newton method converges to the zero $\alpha = -\frac{d}{c}$ in a single iteration, regardless of the choice of $x^{(0)}$. Indeed, from Eq. (3.1), we obtain that $x^{(1)} = x^{(0)} - \frac{f(x^{(0)})}{f'(x^{(0)})} = x^{(0)} - \frac{c\,x^{(0)} + d}{c} = -\frac{d}{c} = \alpha$, for any $x^{(0)} \in \mathbb{R}$.
Below, we characterize the convergence properties of Newton method. To this end, let us introduce the following definition of the convergence order of an iterative method.
Definition 3.1. An iterative method for approximating the zero α of the function f(x) is convergent with order p if and only if
$\lim_{k\to+\infty} \frac{|x^{(k+1)} - \alpha|}{|x^{(k)} - \alpha|^p} = \mu$, (3.2)
where µ > 0 is a real number independent of k, called the asymptotic convergence factor. In the case of linear convergence, i.e., for p = 1, it is necessary that 0 < µ < 1.
We illustrate in a typical graph the sequence of errors $e^{(k)} := |x^{(k)} - \alpha|$ against the number of iterations
k for hypothetical iterative methods with convergence orders p = 1 and 2. A logarithmic scale is used on
the error axis, while a linear scale is used on the number of iterations axis. We note that linear convergence
(p = 1) is graphically represented by a straight line whose slope depends on the asymptotic convergence
factor µ. A parabola is obtained in the case of quadratic convergence (p = 2).
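When the zero α is known, the convergence order of Definition 3.1 can be estimated from three consecutive errors: if $e^{(k+1)} \simeq \mu\,(e^{(k)})^p$, then $p \simeq \frac{\log(e^{(k+1)}/e^{(k)})}{\log(e^{(k)}/e^{(k-1)})}$. The sketch below applies this to a geometric sequence and to Newton iterates for the hypothetical test function $f(x) = x^2 - 2$; the function name `estimated_orders` is an assumption for illustration.

```python
import math

def estimated_orders(errors):
    """Return p_k = log(e_{k+1}/e_k) / log(e_k/e_{k-1}) for consecutive triples.

    All errors are assumed to be strictly positive.
    """
    orders = []
    for e_prev, e_curr, e_next in zip(errors, errors[1:], errors[2:]):
        orders.append(math.log(e_next / e_curr) / math.log(e_curr / e_prev))
    return orders

# Linear convergence with mu = 0.5: the estimated order is 1
linear_orders = estimated_orders([1.0, 0.5, 0.25, 0.125])

# Newton iterates for f(x) = x^2 - 2 starting from x^(0) = 1
alpha = math.sqrt(2.0)
x = 1.0
errors = []
for _ in range(4):
    errors.append(abs(x - alpha))
    x = x - (x * x - 2.0) / (2.0 * x)
newton_orders = estimated_orders(errors)
```

The last entry of `newton_orders` is close to 2, consistent with quadratic convergence to a simple zero.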
$\lim_{k\to+\infty} \frac{x^{(k+1)} - \alpha}{(x^{(k)} - \alpha)^2} = \frac{1}{2}\frac{f''(\alpha)}{f'(\alpha)}$;
based on Eq. (3.2), p = 2 is the convergence order and $\mu = \frac{1}{2}\frac{f''(\alpha)}{f'(\alpha)}$ is the asymptotic convergence factor.
Definition 3.2. Let $f \in C^m(I_\alpha)$, with m ∈ N such that m ≥ 1. The zero $\alpha \in I_\alpha$ is a zero of multiplicity m if $f^{(i)}(\alpha) = 0$ for every i = 0, . . . , m − 1 and $f^{(m)}(\alpha) \ne 0$. If the previous condition is satisfied for m = 1, the zero α is called simple; otherwise, it is called multiple.
Proposition 3.2 (Convergence order of Newton method, zero of multiplicity m). If $f \in C^2(I_\alpha) \cap C^m(I_\alpha)$ and $x^{(0)}$ is “sufficiently” close to the zero α of multiplicity m > 1, then Newton’s method is convergent with order 1 (linearly) to α, provided that $f'(x^{(k)}) \ne 0$ for all k ≥ 0. In particular, based on Eq. (3.2), we have:
$\lim_{k\to+\infty} \frac{x^{(k+1)} - \alpha}{x^{(k)} - \alpha} = 1 - \frac{1}{m}$,
with p = 1 as the convergence order and $\mu = 1 - \frac{1}{m} \in (0, 1)$ as the asymptotic convergence factor.
If the zero α is simple, i.e., m = 1, Newton method converges at least quadratically based on Proposi-
tion 3.1. On the contrary, if the zero α is multiple (m > 1), Newton method converges only linearly based
on Proposition 3.2. We observe that, in general, the higher the order of convergence p, the fewer iterations
will be needed to achieve a desired error value, meaning the method will be more efficient.
$x^{(k+1)} = x^{(k)} - m\,\frac{f(x^{(k)})}{f'(x^{(k)})}$ for every k ≥ 0, (3.3)
given the initial iterate $x^{(0)}$ and provided that $f'(x^{(k)}) \ne 0$ for all k ≥ 0. Based on Algorithm 3.1, we obtain the following for the modified Newton method.
The convergence properties of the modified Newton method are characterized as follows.
Proposition 3.3 (Convergence order of the modified Newton method). If $f \in C^2(I_\alpha) \cap C^m(I_\alpha)$, where m ≥ 1 is the multiplicity of the zero $\alpha \in I_\alpha$, and $x^{(0)}$ is “sufficiently” close to α, then the modified Newton method is convergent with order 2 (quadratically) to α, provided that $f'(x^{(k)}) \ne 0$ for all k ≥ 0.
The modified Newton method requires prior knowledge of the multiplicity m of the zero α. Alternatively,
m can be estimated based on appropriate numerical methods or adaptive techniques.
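The difference between Propositions 3.2 and 3.3 can be observed numerically with the sketch below, which records the errors of the iteration (3.3) for m = 1 (Newton) and m = 2 (modified Newton). The test function $f(x) = (x-1)^2 e^x$, which has the double zero α = 1, the initial iterate $x^{(0)} = 1.5$, and the function name `newton_errors` are assumptions chosen for illustration.

```python
import math

def newton_errors(f, df, x0, alpha, m=1, n_iter=6):
    """Errors |x^(k) - alpha| of x^(k+1) = x^(k) - m f(x^(k)) / f'(x^(k))."""
    x = x0
    errors = [abs(x - alpha)]
    for _ in range(n_iter):
        dfx = df(x)
        if dfx == 0.0:  # iterate undefined: stop early
            break
        x = x - m * f(x) / dfx
        errors.append(abs(x - alpha))
    return errors

# Hypothetical test function with a zero of multiplicity m = 2 at alpha = 1
f = lambda x: (x - 1.0) ** 2 * math.exp(x)
df = lambda x: (x - 1.0) * (x + 1.0) * math.exp(x)

err_newton = newton_errors(f, df, 1.5, 1.0, m=1, n_iter=20)
err_modified = newton_errors(f, df, 1.5, 1.0, m=2, n_iter=4)
ratio = err_newton[-1] / err_newton[-2]  # tends to 1 - 1/m = 0.5
```

For the Newton method the ratio of successive errors approaches $1 - \frac{1}{m} = \frac{1}{2}$, while the modified method reaches an error far below that of the Newton method within four iterations.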
This criterion is satisfactory if the zero α is simple; this can be shown by interpreting Newton method as a
fixed point iteration method. In particular, we have that the error estimator:
where $\tilde{e}^{(k+1)} = |\delta^{(k)}| = |x^{(k+1)} - x^{(k)}|$ and m ≥ 1 is the multiplicity of the zero α.
Another stopping criterion is based on the residual (absolute), defined as follows: $\tilde{e}^{(k)} = r^{(k)} := |f(x^{(k)})| < tol$ for k ≥ 0.
This criterion is satisfactory if $|f'(x)| \simeq 1$ for $x \in I_\alpha$, where $I_\alpha$ is a neighborhood of the zero α. In this case, $\tilde{e}^{(k)} \simeq e^{(k)}$. On the other hand, the criterion is unsatisfactory if $|f'(x)| \gg 1$ or if $|f'(x)| \simeq 0$ for $x \in I_\alpha$. Specifically, if $|f'(x)| \gg 1$ for $x \in I_\alpha$, then the error is overestimated by the error estimator ($\tilde{e}^{(k)} \gg e^{(k)}$), leading to more Newton iterations than necessary and thus making the criterion inefficient. Conversely, if $|f'(x)| \simeq 0$ for $x \in I_\alpha$, then the error is underestimated by the error estimator ($\tilde{e}^{(k)} \ll e^{(k)}$), causing the Newton iterations to terminate prematurely, as the actual error is larger than indicated by the estimator.
Example 3.3. The following examples graphically illustrate situations where the stopping criterion based on the resid-
ual is either satisfactory or unsatisfactory.
[Figure: three panels. Satisfactory, $\tilde{e}^{(k)} \simeq e^{(k)}$; unsatisfactory, $\tilde{e}^{(k)} \gg e^{(k)}$ (the error is overestimated); unsatisfactory, $\tilde{e}^{(k)} \ll e^{(k)}$ (the error is underestimated).]
A popular variant of the criterion is that of the relative residual, such that the error estimator is:
$\tilde{e}^{(k)} = \frac{r^{(k)}}{r^{(0)}}$ for k ≥ 0.
$x^{(k+1)} = x^{(k)} - \frac{f(x^{(k)})}{q^{(k)}}$ for every k ≥ 0;
the choice of $q^{(k)}$ determines the specific method. Consider the following cases:
• for $q^{(k)} \equiv f'(x^{(k)})$, we obtain the Newton method;
• for $q^{(k)} \equiv \frac{f'(x^{(k)})}{m}$, we obtain the modified Newton method (where m is the multiplicity of α);
• for $q^{(k)} = q_R = \frac{f(b) - f(a)}{b - a}$ for every k ≥ 0, with α ∈ (a, b), we obtain the rope method;
• for $q^{(k)} = \frac{f(x^{(k)}) - f(x^{(k-1)})}{x^{(k)} - x^{(k-1)}}$ for every k ≥ 1, we obtain the secant method (for the secant method, $q^{(0)}$ can be chosen as for the rope method).
Footnote 1: The terminology used here for inexact and quasi–Newton methods is not entirely precise.
The rope method converges linearly (p = 1) if the zero α is simple and under certain conditions on $q_R$, while it may either converge or not if the zero α is multiple. The secant method converges with order $p \simeq 1.6$ if the zero α is simple, while it converges linearly (p = 1) if the zero α is multiple (m > 1).
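The rope and secant methods differ only in the choice of $q^{(k)}$, as the following sketch shows. It is an illustration under stated assumptions: the function names, the stopping test on the difference between successive iterates, and the test problem $f(x) = x^2 - 2$ on (a, b) = (1, 2) with $x^{(0)} = 1$ are not taken from the notes.

```python
def rope(f, a, b, x0, tol=1e-10, kmax=200):
    """Rope method: constant slope q_R = (f(b) - f(a)) / (b - a)."""
    q = (f(b) - f(a)) / (b - a)
    x = x0
    for k in range(kmax):
        x_new = x - f(x) / q
        if abs(x_new - x) < tol:
            return x_new, k + 1
        x = x_new
    return x, kmax

def secant(f, a, b, x0, tol=1e-10, kmax=200):
    """Secant method; q^(0) is chosen as in the rope method."""
    q = (f(b) - f(a)) / (b - a)
    x_prev, x = x0, x0 - f(x0) / q
    k = 1
    while abs(x - x_prev) >= tol and k < kmax:
        df = f(x) - f(x_prev)
        if df == 0.0:  # slope undefined: stop
            break
        q = df / (x - x_prev)
        x_prev, x = x, x - f(x) / q
        k += 1
    return x, k

# Hypothetical test: simple zero alpha = sqrt(2) of f(x) = x^2 - 2
f = lambda x: x * x - 2.0
x_rope, k_rope = rope(f, 1.0, 2.0, 1.0)
x_secant, k_secant = secant(f, 1.0, 2.0, 1.0)
```

Both methods avoid evaluating $f'(x)$; the secant method updates the slope at every iteration, which is what gives it a convergence order larger than 1 for a simple zero.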
Example 3.4. The following examples graphically illustrate the rope and secant methods for the generic iterate x(k) .
solve $J_F(\mathbf{x}^{(k)})\,\boldsymbol{\delta}^{(k)} = -\mathbf{F}(\mathbf{x}^{(k)})$ and set $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \boldsymbol{\delta}^{(k)}$ for every k ≥ 0, (3.4)
provided that $\det(J_F(\mathbf{x}^{(k)})) \ne 0$ for every k ≥ 0. Note that the Newton iterate (3.4) can be obtained as a first–order Taylor series expansion of F(x) around $\mathbf{x}^{(k)}$, that is, as $\mathbf{F}(\mathbf{x}^{(k)}) + J_F(\mathbf{x}^{(k)})\,(\mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}) = \mathbf{0}$.
The Newton method must employ an appropriate stopping criterion. As seen in the case of a nonlinear function, the criterion based on the difference between successive iterates, namely $\tilde{e}^{(k)} = \|\boldsymbol{\delta}^{(k-1)}\| < tol$ for k ≥ 1, or the criterion based on the residual, namely $\tilde{e}^{(k)} = r^{(k)} = \|\mathbf{F}(\mathbf{x}^{(k)})\| < tol$ for k ≥ 0, can be used, where tol is a given tolerance. In the latter case, a popular stopping criterion is based on the relative residual, for which $\tilde{e}^{(k)} = \frac{r^{(k)}}{r^{(0)}} = \frac{\|\mathbf{F}(\mathbf{x}^{(k)})\|}{\|\mathbf{F}(\mathbf{x}^{(0)})\|} < tol$ for k ≥ 0.
We present the following result concerning the convergence of Newton method for systems of nonlinear
equations, which generalizes the case of a nonlinear function from Proposition 3.1.
$\lim_{k\to+\infty} \frac{\|\mathbf{x}^{(k+1)} - \boldsymbol{\alpha}\|}{\|\mathbf{x}^{(k)} - \boldsymbol{\alpha}\|^2} = \mu$.
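Example 3.5. Consider the system of nonlinear equations $\mathbf{F}(\mathbf{x}) = \begin{pmatrix} \sin(x_1 x_2) - x_2 \\ x_1 + x_2 - \frac{1}{2} e^{-x_1 x_2} \end{pmatrix}$ and the zero $\boldsymbol{\alpha} = \left(\frac{1}{2}, 0\right)^T$. Its Jacobian matrix is $J_F(\mathbf{x}) = \begin{bmatrix} x_2 \cos(x_1 x_2) & x_1 \cos(x_1 x_2) - 1 \\ 1 + \frac{x_2}{2} e^{-x_1 x_2} & 1 + \frac{x_1}{2} e^{-x_1 x_2} \end{bmatrix}$.
Since $\det(J_F(\boldsymbol{\alpha})) = \frac{1}{2} \ne 0$, we expect at least quadratic convergence to α if $\mathbf{x}^{(0)}$ is “sufficiently” close to α.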
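The Newton iterate (3.4) applied to the system of Example 3.5 can be sketched as follows. Since n = 2, the linear system is solved here by Cramer's rule, which is a simplification chosen for the example rather than the approach of Chapter 2; the initial iterate $\mathbf{x}^{(0)} = (0.4, 0.1)^T$, the tolerance, and the function names are assumptions.

```python
import math

def F(x1, x2):
    """Nonlinear system of Example 3.5."""
    return (math.sin(x1 * x2) - x2,
            x1 + x2 - 0.5 * math.exp(-x1 * x2))

def JF(x1, x2):
    """Jacobian matrix of F, stored by rows."""
    c = math.cos(x1 * x2)
    e = math.exp(-x1 * x2)
    return ((x2 * c, x1 * c - 1.0),
            (1.0 + 0.5 * x2 * e, 1.0 + 0.5 * x1 * e))

def newton_system(x1, x2, tol=1e-12, kmax=50):
    """Newton method for the 2x2 system; stops when ||delta^(k)|| < tol."""
    for k in range(kmax):
        f1, f2 = F(x1, x2)
        (a, b), (c, d) = JF(x1, x2)
        det = a * d - b * c
        if det == 0.0:
            raise ZeroDivisionError("singular Jacobian matrix")
        # Solve J_F delta = -F by Cramer's rule (valid for n = 2 only)
        d1 = (-f1 * d + b * f2) / det
        d2 = (-a * f2 + c * f1) / det
        x1, x2 = x1 + d1, x2 + d2
        if math.hypot(d1, d2) < tol:
            return (x1, x2), k + 1
    return (x1, x2), kmax

solution, iterations = newton_system(0.4, 0.1)
```

For larger n, the Cramer's rule step would be replaced by one of the linear solvers of Chapter 2.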
At each iterate of Newton method (3.4), it is necessary to assemble and solve a linear system, unless n = 1, in which case $J_F(x^{(k)}) \equiv f'(x^{(k)})$. For $n \gg 1$, these operations can prove to be computationally expensive, which necessitates the use of numerical methods for solving linear systems discussed in Chapter 2. In addition, the assembly of the Jacobian matrix $J_F(\mathbf{x}^{(k)})$ can be computationally expensive too,
especially if n is large. Roughly speaking, the computational cost of the Newton method consists of the
number of operations given for the assembly and solving of the linear system at each iteration times the
number of iterations performed. Inexact and quasi–Newton methods aim at building an approximation of
the Jacobian matrix JF (x(k) ), and possibly solving the corresponding linear system, at a reduced cost.
A possible strategy consists in assembling the Jacobian matrix only every l ∈ N iterations, thereby
reducing assembly costs. Additionally, since the Jacobian matrix remains fixed for l iterations, the LU
factorization method can be employed, allowing the construction of the L and U matrices (the most com-
putationally demanding operation) only every l iterations.
The Broyden method is an inexact Newton method that generalizes the secant method for n ≥ 1.
It consists in selecting an approximation matrix B0 ∈ Rn×n of the Jacobian matrix JF (x(0) ), and then
sequentially building approximations of the Jacobian matrices for every k ≥ 0. The algorithm reads as
follows.
Since the method is inexact, we cannot expect the same convergence order of the Newton method (p = 2).
However, the convergence order p is in general superlinear, since p ∈ (1, 2).
Definition 3.4. Given the iteration function φ : [a, b] ⊆ R → R, we say that α ∈ R is a fixed point of φ if
and only if φ(α) ≡ α.
Example 3.6. Let us graphically illustrate the fixed points of some iteration functions.
The goal is to find the zero α of the nonlinear function f (x). We transform this problem into a fixed
point iteration problem by appropriately selecting an iteration function φ(x) such that f (α) = 0 if and only
if φ(α) = α for α ∈ [a, b]. Note that there are several iteration functions φ(x) that can perform this task
and multiple ways to derive them.
The simplest way to obtain φ(x) from f (x) is based on the following steps. Since f (α) = 0, we have
f (α) + α = α, so we can set φ(x) = f (x) + x. However, this is often an inadequate choice for the iteration
function, as we will see later.
Example 3.7. Let us consider $f(x) = 2x^2 - x - 1$, for which we are interested in the zero α = 1. One possibility is to set $\varphi_1(x) = f(x) + x = 2x^2 - 1$. A second choice can be derived by setting f(x) = 0, which gives $x^2 = \frac{x+1}{2}$ and thus $x = \pm\sqrt{\frac{x+1}{2}}$; in this case, we can take $\varphi_2(x) = \sqrt{\frac{x+1}{2}}$.
Example 3.8. We graphically illustrate the fixed point iteration algorithm below. Let us consider the case $\varphi(x) = \cos(x)$, with a = 0.1, b = 1.1, and $x^{(0)} = 0.2$, where we observe that the algorithm converges to $\alpha = \cos(\alpha) \simeq 0.7391$.
Let us now consider $\varphi(x) = 2x^2 - 1$, with a = 0.5, b = 2, and $x^{(0)} = 1.1$, where the algorithm diverges from the fixed point α = 1. We observe that φ(x) corresponds to the iteration function $\varphi_1(x)$ from Example 3.7.
Finally, let us consider $\varphi(x) = \sqrt{\frac{1+x}{2}}$, with a = 0.5, b = 2, and $x^{(0)} = 1.9$, where the algorithm converges to α = 1; in this case, φ(x) corresponds to $\varphi_2(x)$ from Example 3.7.
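The three cases of Example 3.8 can be reproduced with the minimal sketch below of the iteration $x^{(k+1)} = \varphi(x^{(k)})$. The function name `fixed_point`, the stopping test on the difference between successive iterates, and the threshold `1e6` used to flag divergence are assumptions made for illustration.

```python
import math

def fixed_point(phi, x0, tol=1e-10, kmax=200, blowup=1e6):
    """Iterate x^(k+1) = phi(x^(k)).

    Returns (last iterate, iterations performed, converged flag).
    The iteration is declared divergent when |x| exceeds `blowup`
    (an arbitrary threshold chosen for this sketch).
    """
    x = x0
    for k in range(kmax):
        x_new = phi(x)
        if abs(x_new - x) < tol:
            return x_new, k + 1, True
        if abs(x_new) > blowup:
            return x_new, k + 1, False
        x = x_new
    return x, kmax, False

x_cos, k_cos, ok_cos = fixed_point(math.cos, 0.2)
x_1, k_1, ok_1 = fixed_point(lambda x: 2.0 * x * x - 1.0, 1.1)
x_2, k_2, ok_2 = fixed_point(lambda x: math.sqrt((1.0 + x) / 2.0), 1.9)
```

The second call is flagged as not converged: the iterates of $\varphi_1$ move away from α = 1, whereas $\varphi_2$ converges to the same fixed point.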
We now state the properties that we must require from the iteration function φ(x) to ensure the existence
and uniqueness of the fixed point α within a given interval; furthermore, we will discuss the convergence
properties of the fixed point iteration method.
Proposition 3.5 (Global convergence in the interval). If $\varphi \in C^1([a, b])$, $\varphi(x) \in [a, b]$ for every x ∈ [a, b], and $|\varphi'(x)| < 1$ for every x ∈ [a, b], then there exists a unique fixed point α ∈ [a, b], and the fixed point iteration method converges for every $x^{(0)} \in [a, b]$ with order at least equal to 1 (linearly), that is:
$\lim_{k\to+\infty} \frac{x^{(k+1)} - \alpha}{x^{(k)} - \alpha} = \varphi'(\alpha)$.
Let us introduce the constant $L = \max_{x\in[a,b]} |\varphi'(x)|$. Then, under the hypothesis of Proposition 3.5, we observe that the error satisfies
$e^{(k+1)} = |x^{(k+1)} - \alpha| = |\varphi(x^{(k)}) - \varphi(\alpha)| \le L\,|x^{(k)} - \alpha| = L\,e^{(k)}$ for every k ≥ 0.
By recursion, we thus have
$e^{(k)} \le L^k e^{(0)}$ for every k ≥ 0.
Since L < 1, we obtain $\lim_{k\to+\infty} e^{(k)} = 0$, which means that the method is convergent for every $x^{(0)} \in [a, b]$.
Example 3.9. We illustrate the results of Proposition 3.5 through the following examples.
We illustrate some results on the local convergence to the fixed point α, that is, in the vicinity of α.
Proposition 3.6 (Ostrowski, local convergence in the vicinity of the fixed point). If $\varphi \in C^1(I_\alpha)$, with $I_\alpha$ being a neighborhood of the fixed point α of φ(x), and $|\varphi'(\alpha)| < 1$, then, if the initial iterate $x^{(0)}$ is “sufficiently” close to α, the fixed point iteration method converges with order at least equal to 1 (linearly), that is:
$\lim_{k\to+\infty} \frac{x^{(k+1)} - \alpha}{x^{(k)} - \alpha} = \varphi'(\alpha)$.
• if $|\varphi'(\alpha)| < 1$, the fixed point iteration method converges to α with an order of at least 1, provided that $x^{(0)}$ is “sufficiently” close to α;
• if $|\varphi'(\alpha)| = 1$, the convergence of the method to α depends on the properties of φ(x) in the neighborhood $I_\alpha$ of α and on the choice of the initial iterate $x^{(0)}$ (in fact, the method may or may not converge);
• if $|\varphi'(\alpha)| > 1$, the convergence of the method to α is impossible, unless $x^{(0)} = \alpha$.
Proposition 3.7 (Local convergence in the vicinity of the fixed point). If $\varphi \in C^2(I_\alpha)$, with $I_\alpha$ being a neighborhood of the fixed point α of φ(x), $\varphi'(\alpha) = 0$, and $\varphi''(\alpha) \ne 0$, then, if the initial iterate $x^{(0)}$ is “sufficiently” close to α, the fixed point iteration method converges with order 2 (quadratically), that is:
$\lim_{k\to+\infty} \frac{x^{(k+1)} - \alpha}{(x^{(k)} - \alpha)^2} = \frac{1}{2}\varphi''(\alpha)$,
where $\frac{1}{2}\varphi''(\alpha)$ is the asymptotic convergence factor.
where tol > 0 is an appropriate tolerance. The fixed point iteration algorithm stops at the first iteration k such that $\tilde{e}^{(k)} < tol$ or when $k = k_{max}$, with $k_{max}$ being the maximum allowed number of iterations.
We can determine that $x^{(k)} - \alpha = -\frac{1}{1 - \varphi'(\xi^{(k)})}\,\delta^{(k)}$, for some $\xi^{(k)}$ between $x^{(k)}$ and α. We use it to determine whether the stopping criterion is satisfactory or not. If $\varphi'(x) \simeq 0$ in a neighborhood of α ($\varphi'(\alpha) \simeq 0$), the criterion is satisfactory since $e^{(k)} \simeq \tilde{e}^{(k+1)}$. If $\varphi'(x) > -1$, but $\varphi'(x) \simeq -1$ in a neighborhood of α, the criterion is still satisfactory since $e^{(k)} \simeq \frac{1}{2}\tilde{e}^{(k+1)}$ (the error is overestimated by the estimator by a factor of 2). Conversely, if $\varphi'(x) < 1$, but $\varphi'(x) \simeq 1$ in a neighborhood of α, the criterion is unsatisfactory since $\tilde{e}^{(k+1)} \ll e^{(k)}$, i.e. the error is underestimated by the error estimator.
can be reformulated as a fixed point iteration method via the iteration function φN (x) such that φN (α) = α.
From Eqs. (3.1) and (3.5), we deduce that the iteration function associated with Newton method is:
$\varphi_N(x) = x - \frac{f(x)}{f'(x)}$.
It follows that the properties of Newton method, including convergence to α, can be inferred from those of
the iteration function φN (x).
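As a worked illustration of this inference, the derivatives of the iteration function $\varphi_N$ can be computed directly. The derivation below is a sketch that assumes a simple zero α (so that $f'(\alpha) \ne 0$) and that f is smooth enough for the derivatives written to exist.

```latex
\varphi_N'(x) = 1 - \frac{f'(x)^2 - f(x)\,f''(x)}{f'(x)^2}
             = \frac{f(x)\,f''(x)}{f'(x)^2}
\quad\Longrightarrow\quad
\varphi_N'(\alpha) = 0 \quad \text{since } f(\alpha) = 0,

\varphi_N''(x) = \frac{f'(x)\,f''(x) + f(x)\,f'''(x)}{f'(x)^2}
               - \frac{2\,f(x)\,f''(x)^2}{f'(x)^3}
\quad\Longrightarrow\quad
\frac{1}{2}\,\varphi_N''(\alpha) = \frac{1}{2}\,\frac{f''(\alpha)}{f'(\alpha)}.
```

Since $\varphi_N'(\alpha) = 0$, Proposition 3.7 gives quadratic convergence whenever $f''(\alpha) \ne 0$, and the resulting asymptotic convergence factor $\frac{1}{2}\frac{f''(\alpha)}{f'(\alpha)}$ agrees with the one stated for Newton method at a simple zero.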
Similarly, to the modified Newton method (Sec. 3.1.2) based on the iterate of Eq. (3.3), we associate the iteration function $\varphi_{mN}(x)$ defined as:
$\varphi_{mN}(x) = x - m\,\frac{f(x)}{f'(x)}$.
As with Newton method, the rope method can also be interpreted as a fixed point iteration method with the iteration function (see footnote 2):
$\varphi_R(x) = x - \frac{f(x)}{q_R}$,
where $q_R = \frac{f(b) - f(a)}{b - a}$.
Remark 3.1. Instead, the secant method cannot be interpreted as a fixed-point iteration method.
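given the initial iterate $\mathbf{x}^{(0)} \in \mathbb{R}^n$. As a stopping criterion, we can use one based on the difference between successive iterates similarly to Section 3.2.2, that is, $\tilde{e}^{(k)} = \|\boldsymbol{\delta}^{(k-1)}\| < tol$ for k ≥ 1, with tol a prescribed tolerance and $\boldsymbol{\delta}^{(k)} = \mathbf{x}^{(k+1)} - \mathbf{x}^{(k)}$.
Let $\boldsymbol{\varphi} : \mathbb{R}^n \to \mathbb{R}^n$ be differentiable in $I_x \subseteq \mathbb{R}^n$, a neighborhood of $\mathbf{x} \in \mathbb{R}^n$. Then its Jacobian matrix at x is $J_\varphi : \mathbb{R}^n \to \mathbb{R}^{n\times n}$ such that $(J_\varphi(\mathbf{x}))_{ij} = \frac{\partial \varphi_i}{\partial x_j}(\mathbf{x})$ for i, j = 1, . . . , n. We also denote by $\rho(J_\varphi(\mathbf{x}))$ the spectral radius of $J_\varphi(\mathbf{x})$ at $\mathbf{x} \in \mathbb{R}^n$. We then have the following convergence result.
Proposition 3.8. If $\boldsymbol{\varphi} \in C^1(I_\alpha)$, with $I_\alpha \subseteq \mathbb{R}^n$ a neighborhood of the fixed point $\boldsymbol{\alpha} \in \mathbb{R}^n$ of φ(x), and the initial iterate $\mathbf{x}^{(0)} \in \mathbb{R}^n$ is “sufficiently” close to α, and $\rho(J_\varphi(\boldsymbol{\alpha})) < 1$, then the fixed point iteration method converges with order at least equal to 1 (linearly), that is:
$\lim_{k\to+\infty} \frac{\|\mathbf{x}^{(k+1)} - \boldsymbol{\alpha}\|}{\|\mathbf{x}^{(k)} - \boldsymbol{\alpha}\|} = \rho(J_\varphi(\boldsymbol{\alpha}))$.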
Footnote 2: Using the result from Proposition 3.6, the convergence of the method is guaranteed if $|\varphi_R'(\alpha)| < 1$, since $\varphi_R'(x) = 1 - \frac{f'(x)}{q_R}$, for $x^{(0)}$ “sufficiently” close to α. That is, the following conditions on $q_R$ hold: $q_R > \frac{1}{2} f'(\alpha)$ if $f'(\alpha) > 0$, while $q_R < \frac{1}{2} f'(\alpha)$ if $f'(\alpha) < 0$. If the zero α is multiple, then $\varphi_R'(\alpha) = 1$, so the convergence of the rope method is no longer guaranteed. In general, if $|\varphi_R'(\alpha)| < 1$, the rope method converges linearly to α with an asymptotic convergence factor of $1 - \frac{f'(\alpha)}{q_R}$, for $x^{(0)}$ “sufficiently” close to α, as deduced from Proposition 3.6. Finally, if $q_R = f'(\alpha)$, we have $\varphi_R'(\alpha) = 0$, so the rope method, under the assumptions of Proposition 3.7, converges quadratically to α (with order p = 2) with an asymptotic convergence factor of $-\frac{f''(\alpha)}{2\,q_R}$.
Let us consider the approximation of functions and data, particularly by means of interpolation methods
and approximation in the sense of least squares.
Example 4.2. Let us assume that the function f(x) is known only through its evaluations at a set of (n + 1) nodes $\{x_i\}_{i=0}^n$, that is, the data pairs $\{(x_i, f(x_i))\}_{i=0}^n$ are given. We might therefore be interested in defining an approximating function $\tilde{f}(x)$ for the unknown function f(x).
However, the approximation of f(x) by $\tilde{f}(x)$ presents some issues. First, it is necessary to compute and evaluate n derivatives of the function f(x), operations that can be computationally expensive. Furthermore, the Taylor expansion is accurate only in a neighborhood $I_{x_0}$ of $x_0$, while it is generally quite inaccurate outside of this neighborhood $I_{x_0}$.
Example 4.4. We consider the Taylor expansion of order n of $f(x) = \frac{1}{x}$, which is denoted as $\tilde{f}_n(x)$, around $x_0 = 1$.
4.2 Interpolation
We introduce the concept of interpolation and classify the various interpolation methods.
Definition 4.1. Let us consider a set of (n + 1) data pairs $\{(x_i, y_i)\}_{i=0}^n$, where $\{x_i\}_{i=0}^n$ are (n + 1) distinct nodes, i.e., such that $x_i \ne x_j$ for every $i \ne j$ with i, j = 0, . . . , n; if the function f(x) is known, we set $y_i = f(x_i)$ for each i = 0, . . . , n. Interpolating the data pairs $\{(x_i, y_i)\}_{i=0}^n$ means determining an approximating function $\tilde{f}(x)$ such that $\tilde{f}(x_i) = y_i$ for every i = 0, . . . , n, or if f(x) is known, such that $\tilde{f}(x_i) = f(x_i)$ for every i = 0, . . . , n. The function $\tilde{f}(x)$ is called the interpolant of the data at the nodes.
• Rational interpolation, such that $\tilde{f}(x) = \frac{a_0 + a_1 x + \cdots + a_k x^k}{a_{k+1} + a_{k+2} x + \cdots + a_{k+n+1} x^n}$ for suitable coefficients $a_0, a_1, \ldots \in \mathbb{R}$ with k, n ≥ 0.
• Trigonometric interpolation, such that $\tilde{f}(x) = \sum_{j=-M}^{M} a_j e^{\iota j x}$, where ι is the imaginary unit ($\iota^2 = -1$) and $e^{\iota j x} = \cos(j x) + \iota \sin(j x)$, for some M and complex coefficients $a_j$.
In this course, we will focus on polynomial interpolation and piecewise polynomial interpolation.
Example 4.5. We illustrate two cases for which n = 1 (left) and n = 2 (right).
The goal is to determine the expression of the interpolating polynomial $\Pi_n(x)$ (or $\Pi_n f(x)$), that is, $\Pi_n(x) = a_0 + a_1 x + \cdots + a_n x^n$; to this end, it is necessary to calculate the coefficients $\{a_i\}_{i=0}^n$ of this polynomial of degree n. For this purpose, let us consider a special family of polynomials associated with the (n + 1) distinct nodes $\{x_i\}_{i=0}^n$.
Definition 4.2. Given a set of (n + 1) distinct nodes $\{x_i\}_{i=0}^n$, the Lagrange characteristic function associated with the node $x_k$, denoted as $\phi_k \in P_n$, is a polynomial of degree n such that $\phi_k(x_i) = \delta_{ki}$ for every i = 0, . . . , n, where $\delta_{ki} = 0$ if $i \ne k$ and $\delta_{ki} = 1$ if i = k. Furthermore, it is defined as:
$\phi_k(x) = \prod_{i=0,\, i \ne k}^{n} \frac{x - x_i}{x_k - x_i}$.
The set of polynomials $\{\phi_k(x)\}_{k=0}^n$ represents the basis of the Lagrange characteristic polynomials.
Example 4.6. Let us illustrate the basis of the Lagrange characteristic polynomials for n = 1 and n = 2.
n = 1:
$\phi_0(x) = \frac{x - x_1}{x_0 - x_1} \in P_1$
$\phi_1(x) = \frac{x - x_0}{x_1 - x_0} \in P_1$
n = 2:
$\phi_0(x) = \frac{x - x_1}{x_0 - x_1}\,\frac{x - x_2}{x_0 - x_2} \in P_2$
$\phi_1(x) = \frac{x - x_0}{x_1 - x_0}\,\frac{x - x_2}{x_1 - x_2} \in P_2$
$\phi_2(x) = \frac{x - x_0}{x_2 - x_0}\,\frac{x - x_1}{x_2 - x_1} \in P_2$
Definition 4.3. Given the basis of Lagrange characteristic polynomials $\{\phi_k(x)\}_{k=0}^n$ associated with the (n + 1) distinct nodes $\{x_i\}_{i=0}^n$, the Lagrange interpolating polynomial of the data pairs $\{(x_i, y_i)\}_{i=0}^n$ can be expressed as:
$\Pi_n(x) = \sum_{k=0}^{n} y_k\,\phi_k(x)$.
If the function f(x) is given and continuous, the Lagrange interpolating polynomial of the function f(x) at the nodes $\{x_i\}_{i=0}^n$ is expressed as:
$\Pi_n f(x) = \sum_{k=0}^{n} f(x_k)\,\phi_k(x)$.
The Lagrange interpolating polynomial $\Pi_n(x)$ interpolates the data at the nodes; in fact,
$\Pi_n(x_i) = \sum_{k=0}^{n} y_k\,\phi_k(x_i) = \sum_{k=0}^{n} y_k\,\delta_{ki} = y_i$ for every i = 0, . . . , n.
Example 4.7. Let us construct the Lagrange polynomial interpolant for the data pairs (1, 3), (2, 2), and (4, 6), where n = 2. With $x_0 = 1$, $x_1 = 2$, and $x_2 = 4$, we have: $\phi_0(x) = \frac{1}{3}(x - 2)(x - 4) = \frac{1}{3}x^2 - 2x + \frac{8}{3}$, $\phi_1(x) = -\frac{1}{2}(x - 1)(x - 4) = -\frac{1}{2}x^2 + \frac{5}{2}x - 2$, $\phi_2(x) = \frac{1}{6}(x - 1)(x - 2) = \frac{1}{6}x^2 - \frac{1}{2}x + \frac{1}{3}$. The Lagrange interpolating polynomial of degree n = 2 is given by $\Pi_2(x) = y_0\,\phi_0(x) + y_1\,\phi_1(x) + y_2\,\phi_2(x) = x^2 - 4x + 6$, where $y_0 = 3$, $y_1 = 2$, and $y_2 = 6$.
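The construction of Example 4.7 can be checked numerically with a direct evaluation of the Lagrange form. The following is a minimal sketch; the function names `lagrange_basis` and `lagrange_interpolant` are illustrative.

```python
def lagrange_basis(nodes, k, x):
    """Evaluate the Lagrange characteristic polynomial phi_k at x."""
    value = 1.0
    for i, xi in enumerate(nodes):
        if i != k:
            value *= (x - xi) / (nodes[k] - xi)
    return value

def lagrange_interpolant(nodes, values, x):
    """Evaluate Pi_n(x) = sum_k y_k phi_k(x)."""
    return sum(y * lagrange_basis(nodes, k, x) for k, y in enumerate(values))

# Data of Example 4.7
nodes = [1.0, 2.0, 4.0]
values = [3.0, 2.0, 6.0]
p_at_3 = lagrange_interpolant(nodes, values, 3.0)
```

Evaluating at x = 3 gives the same value as $x^2 - 4x + 6$ at that point, and the interpolant reproduces the data at the three nodes.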
Definition 4.4. For a continuous function f(x) and the interval I = [a, b] partitioned by (n + 1) ordered nodes as $a = x_0 < x_1 < \cdots < x_n = b$, we define the error function $E_n f(x) := f(x) - \Pi_n f(x)$ associated with the interpolating polynomial $\Pi_n f(x)$. The error is given by $e_n(f) := \max_{x \in I} |E_n f(x)|$.
Proposition 4.2. Consider (n + 1) distinct nodes $\{x_i\}_{i=0}^n$ in an interval I = [a, b] such that $a = x_0 < x_1 < \cdots < x_n = b$ and the interpolating polynomial $\Pi_n f(x)$ of a function f(x) at those nodes. If $f \in C^{n+1}(I)$, then for every x ∈ I there exists ξ = ξ(x) ∈ I such that: $E_n f(x) = \frac{1}{(n+1)!} f^{(n+1)}(\xi(x))\,\omega_n(x)$, where $\omega_n(x) := \prod_{i=0}^{n} (x - x_i)$. Moreover, the error $e_n(f)$ is bounded by the error estimator $\tilde{e}_n(f)$ as follows:
$e_n(f) \le \tilde{e}_n(f) := \frac{1}{(n+1)!} \max_{x\in I} |f^{(n+1)}(x)| \, \max_{x\in I} |\omega_n(x)|$. (4.1)
Proposition 4.3. Consider (n + 1) equally spaced nodes $\{x_i\}_{i=0}^n$ in the interval I = [a, b] such that $x_i = x_0 + i\,h$ for i = 0, . . . , n, with $x_0 = a$, $x_n = b$, and $h = \frac{b - a}{n}$. Then the function $\omega_n(x)$ from Proposition 4.2 satisfies:
$\max_{x\in I} |\omega_n(x)| \le \frac{n!}{4} h^{n+1} = \frac{n!}{4} \left(\frac{b - a}{n}\right)^{n+1}$.
Therefore, we deduce from Eq. (4.1) the following estimate of the error $e_n(f)$:
$e_n(f) \le \tilde{e}_n(f) := \frac{h^{n+1}}{4(n+1)} \max_{x\in I} |f^{(n+1)}(x)| = \frac{1}{4(n+1)} \left(\frac{b - a}{n}\right)^{n+1} \max_{x\in I} |f^{(n+1)}(x)|$. (4.2)
If the (n + 1) nodes are equally spaced in I, the error $e_n(f)$ may tend to zero or not as n → +∞, depending on the function f(x) being interpolated. From Eq. (4.2), we observe that $\lim_{n\to+\infty} \frac{h^{n+1}}{4(n+1)} = 0$. On the other hand, $\max_{x\in I} |f^{(n+1)}(x)|$ may grow as n increases; indeed, there exist functions for which $\lim_{n\to+\infty} \max_{x\in I} |f^{(n+1)}(x)| = +\infty$. In such cases, the growth of $\max_{x\in I} |f^{(n+1)}(x)|$ may not be compensated by the decrease of $\frac{h^{n+1}}{4(n+1)}$ with n, leading to $\lim_{n\to+\infty} \tilde{e}_n(f) = +\infty$; thus, the error estimator $\tilde{e}_n(f)$ “explodes,” and typically, the error $e_n(f)$ too. The so-called Runge phenomenon is an example of this behavior, where the error function $E_n f(x)$ tends to “explode” for increasing values of n near the endpoints of the interval I when equally spaced nodes are used for polynomial interpolation.
Example 4.9. Consider the polynomial interpolation of the Runge function $f(x) = \frac{1}{1 + x^2}$ on (n + 1) equally spaced nodes in the interval I = [−5, 5].
[Figure: interpolating polynomials for n = 8 (left) and n = 10 (right).]
In this case, the interpolating polynomial $\Pi_n f(x)$ of f(x) exhibits the so-called Runge phenomenon for increasing values of n, as can be observed near the endpoints of the interval I. Moreover, $\lim_{n\to+\infty} e_n(f) = +\infty$.
A remedy to mitigate the Runge phenomenon related to polynomial interpolation is to use nodes that
are not equidistant within the interval I. The following definition provides a special family of nodes that
can be used for polynomial interpolation.
$x_i = \frac{a + b}{2} + \frac{b - a}{2}\,\hat{x}_i$, i = 0, . . . , n.
Example 4.10. We graphically highlight the (n + 1) Chebyshev–Gauss–Lobatto nodes $\{\hat{x}_i\}_{i=0}^n$ in the reference interval $\hat{I} = [-1, 1]$ for n = 4 (left) and n = 9 (right).
Proposition 4.4. If $f \in C^{n+1}(I)$ and the (n + 1) Chebyshev–Gauss–Lobatto nodes are used in I = [a, b], then $\lim_{n\to+\infty} \Pi_n f(x) = f(x)$ for every x ∈ I, that is, $\lim_{n\to+\infty} e_n(f) = 0$.
Example 4.11. Referring to Example 4.9, in addition to considering the same data, we will show that the use of
Chebyshev–Gauss–Lobatto (CGL) nodes prevents the onset of the Runge phenomenon.
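The comparison between equally spaced and CGL nodes for the Runge function can be sketched as follows. The reference nodes are taken as $\hat{x}_i = -\cos(\pi i / n)$, the usual expression for Chebyshev–Gauss–Lobatto nodes, which is stated here as an assumption since the corresponding definition does not appear in the text above; the degree n = 10, the sampling grid of 1001 points used to approximate the maximum error, and the function names are also illustrative choices.

```python
import math

def interpolate(nodes, values, x):
    """Evaluate the Lagrange interpolating polynomial at x."""
    total = 0.0
    for k, yk in enumerate(values):
        term = yk
        for i, xi in enumerate(nodes):
            if i != k:
                term *= (x - xi) / (nodes[k] - xi)
        total += term
    return total

def max_error(f, nodes, a, b, samples=1001):
    """Approximate e_n(f) = max |f - Pi_n f| on a uniform sampling grid."""
    values = [f(x) for x in nodes]
    grid = [a + (b - a) * j / (samples - 1) for j in range(samples)]
    return max(abs(f(x) - interpolate(nodes, values, x)) for x in grid)

runge = lambda x: 1.0 / (1.0 + x * x)
a, b, n = -5.0, 5.0, 10

equispaced = [a + (b - a) * i / n for i in range(n + 1)]
# CGL nodes mapped from the reference interval [-1, 1] to [a, b]
cgl = [(a + b) / 2 + (b - a) / 2 * (-math.cos(math.pi * i / n)) for i in range(n + 1)]

err_equispaced = max_error(runge, equispaced, a, b)
err_cgl = max_error(runge, cgl, a, b)
```

For this degree the sampled error with CGL nodes is markedly smaller than with equally spaced nodes, where the oscillations near the endpoints dominate.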
The Lagrange interpolating polynomial $\Pi_n(x) \in P_n$ uses the basis of Lagrange characteristic polynomials $\{\phi_k(x)\}_{k=0}^n$ to determine the coefficients $\{a_i\}_{i=0}^n$ of this polynomial, namely $\Pi_n(x) = \sum_{k=0}^{n} y_k\,\phi_k(x) = a_0 + a_1 x + \cdots + a_n x^n$. An alternative approach to polynomial interpolation constructed via the Lagrange basis is to directly determine the n + 1 coefficients $\mathbf{a} = (a_0, a_1, \ldots, a_n)^T \in \mathbb{R}^{n+1}$ by imposing the (n + 1) interpolation constraints
$\Pi_n(x_i) = a_0 + a_1 x_i + \cdots + a_n x_i^n = y_i$ for each i = 0, . . . , n.
In this way, the problem reduces to solving the following linear system:
$V \mathbf{a} = \mathbf{y}$ (4.3)
where $V \in \mathbb{R}^{(n+1)\times(n+1)}$ is the Vandermonde matrix, with $V_{ij} = (x_{i-1})^{j-1}$ for i, j = 1, . . . , n + 1, and $\mathbf{y} = (y_0, y_1, \ldots, y_n)^T \in \mathbb{R}^{n+1}$. The linear system (4.3) admits a unique solution if and only if $\det(V) \ne 0$, that is, if and only if the n + 1 nodes $\{x_i\}_{i=0}^n$ are distinct. However, despite being intuitive, this approach based on solving the linear system (4.3) may suffer from stability issues even for relatively “small” values of n. This is due to the fact that the condition number of the matrix V is generally very high, that is, $K_2(V) \gg 1$. Consequently, the solution a computed numerically will generally be subject to significant errors (see Chapter 2).
Remark 4.1. Polynomial interpolation is generally not suitable for extrapolating information outside the
interval I containing the nodes (see Example 4.8).
Definition 4.6. Let us consider $(n+1)$ distinct nodes $\{x_i\}_{i=0}^{n}$ in the interval $I = [a, b]$ such that $a = x_0 < x_1 < \dots < x_n = b$, which delimit $n$ subintervals $I_i = [x_i, x_{i+1}]$ for $i = 0, \dots, n-1$. We define
$$H := \max_{i=0,\dots,n-1} |I_i| = \max_{i=0,\dots,n-1} (x_{i+1} - x_i)$$
as the characteristic size of these subintervals.
Given the set of data pairs $\{(x_i, y_i)\}_{i=0}^{n}$, the piecewise linear polynomial interpolant $\Pi_1^H(x)$ of the data is a piecewise polynomial of degree 1 such that $\Pi_1^H(x) \in \mathbb{P}_1$ for every $x \in I_i$ and $i = 0, \dots, n-1$ (that is, $\Pi_1^H(x)|_{I_i} \in \mathbb{P}_1$ for each $i = 0, \dots, n-1$), with:
$$\Pi_1^H(x) = y_i + \frac{y_{i+1} - y_i}{x_{i+1} - x_i}\,(x - x_i) \quad \text{for } x \in I_i, \text{ for each } i = 0, \dots, n-1.$$
If the function $f \in C^0(I)$ is known, then the piecewise linear interpolating polynomial $\Pi_1^H f(x)$ of the function $f(x)$ at the nodes is such that $\Pi_1^H f(x)|_{I_i} \in \mathbb{P}_1$ for each $i = 0, \dots, n-1$, with:
$$\Pi_1^H f(x) = f(x_i) + \frac{f(x_{i+1}) - f(x_i)}{x_{i+1} - x_i}\,(x - x_i) \quad \text{for } x \in I_i, \text{ for each } i = 0, \dots, n-1.$$
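The definition above translates almost directly into code. The sketch below is one possible way to evaluate the piecewise linear interpolant of a data set at a point; the function name `piecewise_linear` is an assumption, and the nodes are assumed to be sorted in increasing order as in Definition 4.6.

```python
from bisect import bisect_right


def piecewise_linear(x_nodes, y_vals, x):
    """Evaluate the piecewise linear interpolant of (x_nodes, y_vals) at x.

    x_nodes must be strictly increasing; x must lie in [x_nodes[0], x_nodes[-1]].
    """
    if not (x_nodes[0] <= x <= x_nodes[-1]):
        raise ValueError("x lies outside the interpolation interval")
    # Index i of the subinterval I_i = [x_i, x_{i+1}] containing x
    i = min(bisect_right(x_nodes, x) - 1, len(x_nodes) - 2)
    slope = (y_vals[i + 1] - y_vals[i]) / (x_nodes[i + 1] - x_nodes[i])
    return y_vals[i] + slope * (x - x_nodes[i])


nodes = [0.0, 1.0, 3.0]
values = [0.0, 2.0, 0.0]
print(piecewise_linear(nodes, values, 0.5))
print(piecewise_linear(nodes, values, 2.0))
```

The `min(...)` guard only matters when `x` coincides with the last node, in which case the last subinterval is used.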
Example 4.12. We present the piecewise linear interpolants of the $(n+1)$ data pairs, denoted as $\Pi_1^H(x)$, and of a function $f(x)$, denoted as $\Pi_1^H f(x)$, at the $n+1$ nodes in the interval $I$. Specifically, we consider $n = 4$ (that is, we have $n+1 = 5$ nodes). The characteristic size of the subintervals $\{I_i\}_{i=0}^{3}$ is $H = \max_{i=0,1,2,3} |I_i|$.
[Figure panels: $\Pi_1^H(x)$; $\Pi_1^H f(x)$.]
Definition 4.7. If the function $f \in C^0(I)$ is known, we define the error associated with the piecewise linear interpolating polynomial $\Pi_1^H f(x)$ as $e_1^H(f) := \max_{x \in I} \left| f(x) - \Pi_1^H f(x) \right|$.
$$e_1^H(f) \le \tilde{e}_1^H(f) := \frac{H^2}{8} \max_{x \in I} |f''(x)|;$$
it follows that the error converges to zero with order 2 in $H$ (quadratically).
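The quadratic convergence can be observed numerically. The following sketch, under the assumption that $f$ is twice continuously differentiable (here $f(x) = \sin x$ on $[0, \pi]$, for which $\max |f''| = 1$), samples the interpolation error on uniform grids and compares it with the bound $H^2/8 \, \max |f''|$. The function name `interp_error` and the sampling density are illustrative choices, and the sampled maximum only approximates the true maximum from below.

```python
import math


def interp_error(f, a, b, n, samples_per_interval=50):
    """Return (H, sampled max error) for the piecewise linear interpolant of f
    on n uniform subintervals of [a, b]."""
    H = (b - a) / n
    err = 0.0
    for i in range(n):
        x_left, x_right = a + i * H, a + (i + 1) * H
        f_left, f_right = f(x_left), f(x_right)
        for s in range(samples_per_interval + 1):
            x = x_left + (x_right - x_left) * s / samples_per_interval
            p = f_left + (f_right - f_left) / (x_right - x_left) * (x - x_left)
            err = max(err, abs(f(x) - p))
    return H, err


results = []
for n in (4, 8, 16):
    H, err = interp_error(math.sin, 0.0, math.pi, n)
    bound = H**2 / 8.0 * 1.0  # max |f''| = 1 for f = sin
    results.append((n, err, bound))
    print(f"n={n:2d}  error={err:.3e}  bound={bound:.3e}")
```

Halving $H$ should reduce the sampled error by a factor close to 4, consistent with order 2.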
Similarly to $\Pi_1^H(x)$, it is possible to define the piecewise quadratic interpolating polynomial $\Pi_2^H(x)$ such that $\Pi_2^H(x)|_{I_i} \in \mathbb{P}_2$ for each subinterval $I_i$ of $I$ for $i = 0, \dots, n-1$. If $f \in C^0(I)$ is known, we use the notation $\Pi_2^H f(x)$.
In a similar way, it is possible to define the piecewise polynomial interpolant of degree $r \ge 1$, denoted as $\Pi_r^H(x)$, such that $\Pi_r^H(x)|_{I_i} \in \mathbb{P}_r$ for every $i = 0, \dots, n-1$ (or $\Pi_r^H f(x)$ if $f \in C^0(I)$ is known).
Example 4.13. Let us consider the piecewise quadratic interpolating polynomial of a continuous function $f(x)$, denoted as $\Pi_2^H f(x)$, on $(n+1)$ nodes in the interval $I$.
Specifically, let us set $n = 4$. The piecewise quadratic interpolant $\Pi_2^H f(x)$ interpolates $f(x)$ at the $(n+1) = 5$ nodes, as well as at intermediate points within each subinterval of $I$, such as the midpoints of the subintervals.
$$e_r^H(f) \le \tilde{e}_r^H(f) := C_r\, H^{r+1} \max_{x \in I} \left| f^{(r+1)}(x) \right|,$$
where $C_r$ is a positive constant. Thus, we deduce that the error converges to zero with order $(r+1)$ in $H$.
For the composite polynomial interpolant $\Pi_r^H(x)$, the $r+1$ nodes within each subinterval $I_i$ can be chosen to be equally spaced or as the Chebyshev–Gauss–Lobatto nodes.
Remark 4.2. The piecewise polynomial interpolants $\Pi_r^H f(x)$ of any degree $r \ge 1$ are only $C^0$-continuous between one subinterval and another (across the external nodes of each subinterval), as can be seen in Example 4.13.
$$\sum_{i=0}^{n} \left( y_i - \tilde{f}_m(x_i) \right)^2 \le \sum_{i=0}^{n} \left( y_i - p_m(x_i) \right)^2 \quad \text{for every } p_m \in \mathbb{P}_m.$$
If $\tilde{f}_m \in \mathbb{P}_m$ exists, then it is called the polynomial of degree $m$ approximating the data in the least squares sense (or the function $f(x)$ if known).
By convention, we assume that the nodes $\{x_i\}_{i=0}^{n}$ are distinct and that $0 \le m \le n$. In a typical scenario for using the least squares method, we have $0 \le m \ll n$.
Remark 4.3. The least squares approximating polynomial $\tilde{f}_m(x)$ does not generally interpolate the data (or the function $f(x)$) at the nodes. In particular, it is only when $m = n$ that we can ensure $\tilde{f}_m(x_i) = y_i$ (or $\tilde{f}_m(x_i) = f(x_i)$) for every $i = 0, \dots, n$. In fact, in this case, $\tilde{f}_m(x)$ coincides with the interpolating polynomial $\Pi_n(x)$ of degree $n$ (or $\Pi_n f(x)$).
Example 4.14. Let us graphically illustrate least squares approximating polynomials $\tilde{f}_m(x)$ for relatively large datasets $\{(x_i, y_i)\}_{i=0}^{n}$, with $n = 100$. Consider $m = 1$ and $2$ (on the left) and $m = 2$ and $3$ (on the right).
As with polynomial interpolation, determining least squares approximating polynomials $\tilde{f}_m(x)$ of degree $m$ consists of finding the $(m+1)$ coefficients $\{a_i\}_{i=0}^{m}$; indeed, $\tilde{f}_m(x) = a_0 + a_1 x + \dots + a_m x^m$. To this end, we define the vector $\mathbf{a} = (a_0, a_1, \dots, a_m)^T \in \mathbb{R}^{m+1}$ and the function $\Phi : \mathbb{R}^{m+1} \to \mathbb{R}$ as:
$$\Phi(\mathbf{b}) = \sum_{i=0}^{n} \left[ y_i - \left( b_0 + b_1 x_i + \dots + b_m x_i^m \right) \right]^2,$$
which is associated with the dataset $\{(x_i, y_i)\}_{i=0}^{n}$ for a generic vector $\mathbf{b} = (b_0, b_1, \dots, b_m)^T \in \mathbb{R}^{m+1}$.
The least squares method consists of determining the coefficients $\mathbf{a}$ of the polynomial $\tilde{f}_m(x)$ such that:
$$\Phi(\mathbf{a}) = \min_{\mathbf{b} \in \mathbb{R}^{m+1}} \Phi(\mathbf{b}).$$
Given that $\Phi$ is differentiable and convex, the previous minimization problem corresponds to solving the following differential problem:
$$\text{find } \mathbf{a} \in \mathbb{R}^{m+1} : \quad \frac{\partial \Phi}{\partial b_j}(\mathbf{a}) = 0 \quad \text{for every } j = 0, \dots, m, \qquad (4.4)$$
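Definition 4.9. Based on Definition 4.8, the least squares approximating polynomial $\tilde{f}_1(x)$ of degree $m = 1$ is called a regression line or a least squares line.
We illustrate the derivation of the system (4.5) for $\tilde{f}_1(x)$ (i.e., for the regression line, $m = 1$) with $n \ge 1$. In this case, the function $\Phi(\mathbf{b})$ can be expressed as:
$$\Phi(\mathbf{b}) = \sum_{i=0}^{n} \left[ y_i - (b_0 + b_1 x_i) \right]^2 = \sum_{i=0}^{n} \left( y_i^2 + b_0^2 + b_1^2 x_i^2 - 2 b_0 y_i - 2 b_1 x_i y_i + 2 b_0 b_1 x_i \right).$$
At this point, the problem (4.4) can be written as the linear system (4.5) with $A \in \mathbb{R}^{2 \times 2}$ and $\mathbf{q} \in \mathbb{R}^2$, where:
$$A = \begin{bmatrix} (n+1) & \sum_{i=0}^{n} x_i \\ \sum_{i=0}^{n} x_i & \sum_{i=0}^{n} x_i^2 \end{bmatrix} \quad \text{and} \quad \mathbf{q} = \begin{bmatrix} \sum_{i=0}^{n} y_i \\ \sum_{i=0}^{n} x_i y_i \end{bmatrix},$$
respectively. We observe that, in this case, the Vandermonde matrix $V \in \mathbb{R}^{(n+1) \times 2}$ can be expressed as:
$$V = \begin{bmatrix} 1 & x_0 \\ 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}.$$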
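For the regression line, the $2 \times 2$ system with the matrix $A$ and right-hand side $\mathbf{q}$ given above can be solved in closed form. The sketch below assumes that system (4.5) reads $A \mathbf{a} = \mathbf{q}$ and solves it with Cramer's rule; the function name `regression_line` is hypothetical.

```python
def regression_line(xs, ys):
    """Return (a0, a1) of the least squares line a0 + a1*x for the data (xs, ys)."""
    count = len(xs)                      # this is n + 1 in the notation of the notes
    sx = sum(xs)                         # sum of x_i
    sxx = sum(x * x for x in xs)         # sum of x_i^2
    sy = sum(ys)                         # sum of y_i
    sxy = sum(x * y for x, y in zip(xs, ys))  # sum of x_i * y_i
    det = count * sxx - sx * sx          # determinant of A
    if det == 0:
        raise ValueError("at least two distinct nodes are required")
    a0 = (sxx * sy - sx * sxy) / det
    a1 = (count * sxy - sx * sy) / det
    return a0, a1


# Data lying exactly on the line y = 2 + 3x: the regression line recovers it
a0, a1 = regression_line([0.0, 1.0, 2.0, 3.0], [2.0, 5.0, 8.0, 11.0])
print(a0, a1)
```

For noisy data the returned line no longer passes through the points, in agreement with Remark 4.3.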
This chapter considers numerical methods for approximating the derivatives of a function f (x). This
approach is known as numerical differentiation or derivation.
Next, we discuss numerical methods for approximating the definite integral of the function, commonly
referred to as numerical integration, which is typically performed using the so–called quadrature formulas.
Definition 5.1. Given a function $f(x)$ and a step size $h > 0$, the approximation of $f'(x)$ at some $x \in (a, b) \subseteq \mathbb{R}$ using the forward finite difference scheme is defined as:
$$\delta_+ f(x) := \frac{f(x + h) - f(x)}{h},$$
while the approximation using the backward finite difference scheme is:
$$\delta_- f(x) := \frac{f(x) - f(x - h)}{h}.$$
Example 5.1. We illustrate graphically the forward finite difference scheme (on the left) and the backward finite difference scheme (on the right) for approximating $f'(x)$ for some $h > 0$. To this end, we plot the straight lines passing through the points $(x, f(x))$ and $(x \pm h, f(x \pm h))$, with slopes $\delta_+ f(x)$ and $\delta_- f(x)$, respectively.
Proposition 5.1. If $f \in C^2((a, b))$ and $x \in (a, b)$, the error $E_+ f(x)$ associated with the forward finite difference scheme is:
$$E_+ f(x) := f'(x) - \delta_+ f(x) = -\frac{1}{2}\, h\, f''(\xi_+) \quad \text{for some } \xi_+ \in [x, x + h],$$
while the error $E_- f(x)$ associated with the backward finite difference scheme is:
$$E_- f(x) := f'(x) - \delta_- f(x) = \frac{1}{2}\, h\, f''(\xi_-) \quad \text{for some } \xi_- \in [x - h, x].$$
Proof. Forward finite difference scheme. Consider the Taylor series expansion of $f(x + h)$ around $x$, that is, $f(x + h) = f(x) + f'(x)\, h + \frac{1}{2} f''(\xi_+)\, h^2$ for some $\xi_+ \in [x, x + h]$. The result is obtained using the definition of the error $E_+ f(x)$.
In a completely analogous manner, the proof for the backward finite difference scheme is obtained.
The forward and backward finite difference schemes are methods of order of accuracy 1. In fact, the errors $E_+ f(x)$ and $E_- f(x)$ converge to zero with order 1 with respect to the step size $h$. If $f \in \mathbb{P}_1$, we have $\delta_+ f(x) \equiv \delta_- f(x) \equiv f'(x)$ for every $x \in \mathbb{R}$.
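The first-order behaviour can be seen on a simple test. The following sketch evaluates the forward and backward differences of $f(x) = \sin x$ at $x = 1$ (an arbitrary choice made for illustration) and prints the errors $E_+ f(x)$ and $E_- f(x)$ as defined in Proposition 5.1; the function names are hypothetical.

```python
import math


def forward_diff(f, x, h):
    """Forward finite difference approximation of f'(x)."""
    return (f(x + h) - f(x)) / h


def backward_diff(f, x, h):
    """Backward finite difference approximation of f'(x)."""
    return (f(x) - f(x - h)) / h


x0 = 1.0
steps = [1e-1, 1e-2, 1e-3]
# Errors E+ and E- for f = sin, whose exact derivative is cos
err_fwd = [math.cos(x0) - forward_diff(math.sin, x0, h) for h in steps]
err_bwd = [math.cos(x0) - backward_diff(math.sin, x0, h) for h in steps]
for h, ef, eb in zip(steps, err_fwd, err_bwd):
    print(f"h={h:.0e}  E+={ef:+.3e}  E-={eb:+.3e}")
```

Reducing $h$ by a factor of 10 reduces both errors by roughly the same factor, and the two errors have opposite signs, as the formulas in Proposition 5.1 suggest when $f''$ does not change sign near $x$.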
Another scheme for approximating the first derivative of a function is the one based on central finite
differences.
Definition 5.2. Given a function $f(x)$ and a step size $h > 0$, the approximation of $f'(x)$ at some $x \in (a, b) \subseteq \mathbb{R}$ using the central finite difference scheme is defined as:
$$\delta_c f(x) := \frac{f(x + h) - f(x - h)}{2h}.$$
Example 5.2. We illustrate graphically the central finite difference scheme for approximating $f'(x)$ for some $h > 0$; specifically, we plot the straight line with slope $\delta_c f(x)$.
[Figure: central finite difference, $\delta_c f(x)$.]
Proposition 5.2. If $f \in C^3((a, b))$ and $x \in (a, b)$, the error $E_c f(x)$ associated with the central finite difference scheme is:
$$E_c f(x) := f'(x) - \delta_c f(x) = -\frac{1}{12}\, h^2 \left[ f'''(\xi_+) + f'''(\xi_-) \right]$$
for some $\xi_+ \in [x, x + h]$ and $\xi_- \in [x - h, x]$.
Proof. Consider the Taylor series expansion of $f(x + h)$ around $x$, which gives us $f(x + h) = f(x) + f'(x)\, h + \frac{1}{2} f''(x)\, h^2 + \frac{1}{6} f'''(\xi_+)\, h^3$ for some $\xi_+ \in [x, x + h]$; similarly, the Taylor expansion of $f(x - h)$ around $x$ is $f(x - h) = f(x) - f'(x)\, h + \frac{1}{2} f''(x)\, h^2 - \frac{1}{6} f'''(\xi_-)\, h^3$ for some $\xi_- \in [x - h, x]$. By applying the definition of the error $E_c f(x)$, we obtain the result.
The central finite difference scheme is a method with order of accuracy 2. In fact, the error $E_c f(x)$ converges to zero with order 2 with respect to the step size $h$. We observe that, for $f \in \mathbb{P}_2$, it holds that $\delta_c f(x) \equiv f'(x)$ for every $x \in \mathbb{R}$.
It is possible to show that $\delta_c f(x) = \Pi'_{2,\{x-h,\,x,\,x+h\}} f(x)$, where $\Pi_{2,\{x-h,\,x,\,x+h\}} f(x)$ is the polynomial of degree 2 that interpolates the function $f(x)$ at the nodes $x - h$, $x$, and $x + h$.
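A short numerical check of these two statements (order 2, and exactness on polynomials of degree 2) might look as follows. The test function $\sin x$, the evaluation points, and the name `central_diff` are illustrative assumptions.

```python
import math


def central_diff(f, x, h):
    """Central finite difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


x0 = 1.0
steps = [1e-1, 1e-2]
# Error E_c for f = sin: dividing h by 10 should divide the error by about 100
err_c = [math.cos(x0) - central_diff(math.sin, x0, h) for h in steps]
print(err_c)


def quadratic(t):
    """A polynomial of degree 2, with exact derivative 6t + 2."""
    return 3.0 * t**2 + 2.0 * t + 1.0


# On a polynomial of degree 2 the scheme is exact up to rounding
err_quad = (6.0 * 0.7 + 2.0) - central_diff(quadratic, 0.7, 0.1)
print(err_quad)
```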
Definition 5.3. Given a function $f(x)$ and a step size $h > 0$, the approximation of $f''(x)$ at $x \in (a, b) \subseteq \mathbb{R}$ using the central finite differences is defined as:
$$\delta_c^2 f(x) := \frac{f(x + h) - 2 f(x) + f(x - h)}{h^2}.$$
Proposition 5.3. If $f \in C^4((a, b))$ and $x \in (a, b)$, the error $E_c^2 f(x)$ associated with the central finite difference scheme is given by:
$$E_c^2 f(x) := f''(x) - \delta_c^2 f(x) = -\frac{1}{24}\, h^2 \left[ f^{(4)}(\xi_+) + f^{(4)}(\xi_-) \right]$$
for some $\xi_+ \in [x, x + h]$ and $\xi_- \in [x - h, x]$.
Proof. Consider the Taylor expansion for $f(x + h)$ around $x$, which gives us $f(x + h) = f(x) + f'(x)\, h + \frac{1}{2} f''(x)\, h^2 + \frac{1}{6} f'''(x)\, h^3 + \frac{1}{24} f^{(4)}(\xi_+)\, h^4$ for some $\xi_+ \in [x, x + h]$; similarly, the Taylor expansion of $f(x - h)$ around $x$ is $f(x - h) = f(x) - f'(x)\, h + \frac{1}{2} f''(x)\, h^2 - \frac{1}{6} f'''(x)\, h^3 + \frac{1}{24} f^{(4)}(\xi_-)\, h^4$ for some $\xi_- \in [x - h, x]$. At this point, by applying the definition of the error $E_c^2 f(x)$ and substituting the Taylor expansions into $\delta_c^2 f(x)$, we obtain the result.
The central finite difference scheme for the approximation of $f''(x)$ is a method of order 2. We observe that if $f \in \mathbb{P}_3$, then $\delta_c^2 f(x) \equiv f''(x)$ for every $x \in \mathbb{R}$.
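The same kind of check applies to the second derivative. In this sketch the test function $e^x$ at $x = 0$ and the cubic $t^3$ are arbitrary choices, and `second_central_diff` is a hypothetical name. Very small step sizes are avoided on purpose, since the division by $h^2$ amplifies rounding errors.

```python
import math


def second_central_diff(f, x, h):
    """Central finite difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2


steps = [0.1, 0.05]
# Error E_c^2 for f = exp at x = 0, where f''(0) = 1
err_2 = [math.exp(0.0) - second_central_diff(math.exp, 0.0, h) for h in steps]
print(err_2)  # halving h should reduce the error by a factor close to 4

# Exactness (up to rounding) on a polynomial of degree 3: (t^3)'' = 6t
cubic_check = second_central_diff(lambda t: t**3, 0.5, 0.1)
print(cubic_check)
```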
using suitable quadrature formulas, denoted by $I_q(f)$, such that $I_q(f) \simeq I(f)$. We can classify quadrature formulas into two main categories: simple formulas and composite formulas.
The simple quadrature formulas are based on a global approximation of the function $f(x)$ over the interval $[a, b]$ using functions $\tilde{f}(x)$ that are "simple" to integrate over $[a, b]$ (that is, they can be integrated explicitly, or in closed form); this means that
$$I_q(f) = I(\tilde{f}) = \int_a^b \tilde{f}(x)\, dx,$$
where $\tilde{f}(x)$ is an approximation of $f(x)$ for $x \in [a, b]$. Typically, for simple quadrature formulas, $\tilde{f}(x)$ is a polynomial of degree $n$ that interpolates $f(x)$ at $n + 1$ nodes in $[a, b]$; such formulas are referred to as interpolatory quadrature formulas.
The composite quadrature formulas are based on dividing the interval $[a, b]$ into $M$ subintervals, possibly of the same size $H = \frac{b - a}{M}$, on which the function $f(x)$ is approximated by a piecewise function $\tilde{f}(x)$. Denoting the $M + 1$ nodes $\{x_k\}_{k=0}^{M}$ as $x_k = a + k H$ for $k = 0, \dots, M$, with $x_0 = a$ and $x_M = b$, we recall that
$$I(f) = \sum_{k=1}^{M} \int_{x_{k-1}}^{x_k} f(x)\, dx.$$
In general, for composite quadrature formulas, $\tilde{f}(x)$ is a piecewise polynomial of degree $n$ that interpolates $f(x)$ at $n + 1$ nodes in each of the subintervals $\{[x_{k-1}, x_k]\}_{k=1}^{M}$ of $[a, b]$.
Example 5.3. The difference between a simple quadrature formula and a composite one for approximating the integral $I(f) = \int_a^b f(x)\, dx$ of a generic function $f(x)$ over the interval $[a, b]$ is shown in the following figure; we denote the approximated integral as $I(\tilde{f})$.
[Figure panels: Simple; Composite ($M = 3$).]
Definition 5.4. The degree of exactness of a quadrature formula is the largest integer r ≥ 0 such that
all polynomials of degree less than or equal to r are exactly integrated by the formula, that is, such that
Iq (p) ≡ I(p) for every p ∈ Pr .
Definition 5.5. The order of convergence of a composite quadrature formula (also called the order of
accuracy) is the order of convergence of the associated error with respect to H, the size of the subintervals.
Definition 5.6. Let $f(x) \in C^0([a, b])$. The simple midpoint quadrature formula is defined as:
$$I_{mp}(f) := I(\Pi_0 f) = (b - a)\, f\!\left( \frac{a + b}{2} \right),$$
where $\Pi_0 f(x)$ is the polynomial of degree 0 that interpolates $f(x)$ at the midpoint $\bar{x} = \frac{a + b}{2}$ of the interval $[a, b]$. The composite midpoint quadrature formula is defined as:
$$I_{mp}^c(f) := I(\Pi_0^H f) = H \sum_{k=1}^{M} f(\bar{x}_k), \qquad (5.1)$$
where $\Pi_0^H f(x)$ is the piecewise polynomial of degree 0 that interpolates $f(x)$ at the midpoints $\{\bar{x}_k\}_{k=1}^{M}$ of the $M$ subintervals of length $H = \frac{b - a}{M}$ into which $[a, b]$ is divided, with $\bar{x}_k = \frac{x_{k-1} + x_k}{2}$ for each $k = 1, \dots, M$.
Example 5.4. We graphically illustrate the simple midpoint formula (on the left) and the composite midpoint formula (on the right) for approximating the integral $I(f)$ of a generic function $f(x)$.
[Figure panels: Simple, $I_{mp}(f)$; Composite, $I_{mp}^c(f)$ ($M = 3$).]
Proposition 5.4. If $f \in C^2([a, b])$, the error $e_{mp}(f)$ associated with the simple midpoint quadrature formula is given by
$$e_{mp}(f) := I(f) - I_{mp}(f) = \frac{(b - a)^3}{24}\, f''(\xi) \quad \text{for some } \xi \in [a, b],$$
while the error $e_{mp}^c(f)$ associated with the composite midpoint quadrature formula is given by
$$e_{mp}^c(f) := I(f) - I_{mp}^c(f) = \frac{b - a}{24}\, H^2 f''(\xi) \quad \text{for some } \xi \in [a, b].$$
Proof. Simple formula. Letting the midpoint of $[a, b]$ be $\bar{x} = \frac{a + b}{2}$, we consider the Taylor expansion of $f(x)$ around $\bar{x}$, given by $f(x) = f(\bar{x}) + f'(\bar{x})\,(x - \bar{x}) + \frac{1}{2} f''(\eta(x))\,(x - \bar{x})^2$ for some $\eta(x) \in [a, b]$. Integrating this expression, we obtain
$$I(f) = I_{mp}(f) + f'(\bar{x}) \int_a^b (x - \bar{x})\, dx + \frac{1}{2} f''(\xi) \int_a^b (x - \bar{x})^2\, dx$$
for some $\xi \in [a, b]$, by virtue of the integral mean theorem$^1$. Since $\int_a^b (x - \bar{x})\, dx = 0$ and $\int_a^b (x - \bar{x})^2\, dx = \frac{(b - a)^3}{12}$, we obtain the desired result.
The midpoint quadrature formulas have a degree of exactness of 1; in fact, the errors $e_{mp}(f)$ and $e_{mp}^c(f)$ are identically zero for all polynomials of degree less than or equal to 1 (if $f \in \mathbb{P}_1$, then $f''(\xi) = 0$ for every $\xi \in \mathbb{R}$).
The composite midpoint formula has an order of convergence (or order of accuracy) of 2; in fact, the error $e_{mp}^c(f)$ is proportional to $H^2$.
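A minimal sketch of the composite midpoint formula (5.1) on $M$ uniform subintervals is given below, together with a check of the two properties just stated. The function name `composite_midpoint` and the test integrands are illustrative assumptions.

```python
import math


def composite_midpoint(f, a, b, M):
    """Composite midpoint formula on M uniform subintervals of [a, b]."""
    H = (b - a) / M
    # The midpoint of the k-th subinterval is a + (k - 1/2) * H, k = 1, ..., M
    return H * sum(f(a + (k - 0.5) * H) for k in range(1, M + 1))


# Degree of exactness 1: a polynomial of degree 1 is integrated exactly
linear_value = composite_midpoint(lambda x: 2.0 * x + 1.0, 0.0, 2.0, 3)
print(linear_value)  # exact integral is 6

# Order of convergence 2: doubling M divides the error by about 4
exact = math.e - 1.0  # integral of exp over [0, 1]
err_mp = [exact - composite_midpoint(math.exp, 0.0, 1.0, M) for M in (10, 20)]
print(err_mp)
```

With $M = 1$ the function reduces to the simple midpoint formula.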
Definition 5.7. Let $f(x) \in C^0([a, b])$. The simple trapezoidal quadrature formula is defined as:
$$I_t(f) := I(\Pi_1 f) = (b - a)\, \frac{f(a) + f(b)}{2}, \qquad (5.2)$$
where $\Pi_1 f(x)$ is the polynomial of degree 1 that interpolates $f(x)$ at the nodes $a$ and $b$.
The composite trapezoidal quadrature formula is defined as:
$$I_t^c(f) := I(\Pi_1^H f) = \frac{H}{2} \sum_{k=1}^{M} \left[ f(x_{k-1}) + f(x_k) \right] = \frac{H}{2} \left[ f(x_0) + f(x_M) \right] + H \sum_{k=1}^{M-1} f(x_k),$$
where $\Pi_1^H f(x)$ is the piecewise polynomial of degree 1 interpolating $f(x)$ at the nodes $\{x_k\}_{k=0}^{M}$ of the $M$ subintervals of length $H = \frac{b - a}{M}$ in which the interval $[a, b]$ is divided.
$^1$ Integral Mean Theorem. Let $f, g \in C^0([a, b])$ and let $g(x)$ have a constant sign on $[a, b]$. Then there exists a point $c \in [a, b]$ such that $\int_a^b f(x)\, g(x)\, dx = f(c) \int_a^b g(x)\, dx$.
Example 5.5. We illustrate graphically the simple trapezoidal formula (on the left) and the composite trapezoidal formula (on the right) for the approximation of the integral $I(f)$ of a generic function $f(x)$.
[Figure panels: Simple, $I_t(f)$; Composite, $I_t^c(f)$ ($M = 3$).]
Proposition 5.5. If $f \in C^2([a, b])$, the error $e_t(f)$ associated with the simple trapezoidal quadrature formula is given by
$$e_t(f) := I(f) - I_t(f) = -\frac{(b - a)^3}{12}\, f''(\xi) \quad \text{for some } \xi \in [a, b],$$
while the error $e_t^c(f)$ associated with the composite trapezoidal quadrature formula is given by
$$e_t^c(f) := I(f) - I_t^c(f) = -\frac{b - a}{12}\, H^2 f''(\xi) \quad \text{for some } \xi \in [a, b].$$
Proof. Simple formula. Since $e_t(f) = I(f) - I(\Pi_1 f)$, we have that $e_t(f) = \int_a^b \left( f(x) - \Pi_1 f(x) \right) dx$. By recalling the error function $E_1 f(x)$ in the case of polynomial interpolation of degree 1 (see Proposition 4.2), we have that $E_1 f(x) = \frac{1}{2} f''(\eta(x))\, \omega_1(x)$ for some $\eta(x) \in [a, b]$, with $\omega_1(x) = (x - a)(x - b)$. Since $\omega_1(x) \le 0$ does not change sign on $[a, b]$, the integral mean theorem (footnote 1) gives
$$e_t(f) = \frac{1}{2} \int_a^b f''(\eta(x))\, \omega_1(x)\, dx = \frac{1}{2} f''(\xi) \int_a^b \omega_1(x)\, dx, \quad \text{for some } \xi \in [a, b].$$
Thus, since $\int_a^b \omega_1(x)\, dx = -\frac{(b - a)^3}{6}$, we obtain the desired result.
The trapezoidal quadrature formulas have a degree of exactness $r = 1$; in fact, the errors $e_t(f)$ and $e_t^c(f)$ are identically zero for all polynomials of degree less than or equal to 1 (if $f \in \mathbb{P}_1$, then $f''(\xi) = 0$ for every $\xi \in \mathbb{R}$).
The composite trapezoidal formula has an order of convergence (or order of accuracy) of 2; in fact, the error $e_t^c(f)$ is proportional to $H^2$.
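The composite trapezoidal formula can be sketched in the same way, using the second form of the sum (endpoints weighted by $H/2$, interior nodes by $H$). The name `composite_trapezoid` and the test integrands are assumptions made for illustration.

```python
import math


def composite_trapezoid(f, a, b, M):
    """Composite trapezoidal formula on M uniform subintervals of [a, b]."""
    H = (b - a) / M
    interior = sum(f(a + k * H) for k in range(1, M))  # nodes x_1, ..., x_{M-1}
    return 0.5 * H * (f(a) + f(b)) + H * interior


# Degree of exactness 1: a polynomial of degree 1 is integrated exactly
linear_value = composite_trapezoid(lambda x: 2.0 * x + 1.0, 0.0, 2.0, 3)
print(linear_value)  # exact integral is 6

# Order of convergence 2: doubling M divides the error by about 4
exact = math.e - 1.0  # integral of exp over [0, 1]
err_t = [exact - composite_trapezoid(math.exp, 0.0, 1.0, M) for M in (10, 20)]
print(err_t)
```

For the convex integrand $e^x$ the errors are negative, in agreement with the sign in the error formula of the composite rule.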
Example 5.6. It is possible to use the error estimates for composite formulas to determine the minimum number of intervals needed to achieve a certain accuracy in the approximation of an integral. Consider, for example, the composite trapezoidal formula for approximating the integral $\int_{-2}^{2} e^x\, dx$. If we want the quadrature error to be less than a tolerance $\varepsilon > 0$, we will impose that
$$|e_t^c(f)| = \left| -\frac{b - a}{12}\, H^2 f''(\xi) \right| \le \frac{b - a}{12}\, H^2 \max_{x \in [a, b]} |f''(x)| < \varepsilon.$$
Since $f''(x) = e^x$, we have $\max_{x \in [-2, 2]} |f''(x)| = e^2$, leading to $\frac{b - a}{12} \left( \frac{b - a}{M} \right)^2 \max_{x \in [a, b]} |f''(x)| < \varepsilon$ and then $M^2 > \frac{(b - a)^3}{12\, \varepsilon} \max_{x \in [a, b]} |f''(x)|$.
Choosing, for example, $\varepsilon = 10^{-4}$, we obtain $M > \sqrt{\frac{4^3}{12 \cdot 10^{-4}}\, e^2} \simeq 627.76$, which means $M \ge 628$.
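The computation in this example can be reproduced with a few lines of code. The sketch below evaluates the lower bound on $M$ and then checks that the composite trapezoidal formula with that $M$ indeed meets the tolerance; the function names are hypothetical, and the a priori estimate is generally pessimistic because it uses the maximum of $|f''|$.

```python
import math


def min_subintervals_trapezoid(a, b, max_abs_f2, eps):
    """Smallest integer M with (b-a)^3 * max|f''| / (12 * M^2) < eps."""
    lower_bound = math.sqrt((b - a) ** 3 * max_abs_f2 / (12.0 * eps))
    return math.floor(lower_bound) + 1  # first integer strictly above the bound


def composite_trapezoid(f, a, b, M):
    """Composite trapezoidal formula on M uniform subintervals of [a, b]."""
    H = (b - a) / M
    return 0.5 * H * (f(a) + f(b)) + H * sum(f(a + k * H) for k in range(1, M))


a, b, eps = -2.0, 2.0, 1e-4
M = min_subintervals_trapezoid(a, b, math.exp(2.0), eps)
exact = math.exp(2.0) - math.exp(-2.0)
actual_error = abs(exact - composite_trapezoid(math.exp, a, b, M))
print(M, actual_error)
```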
Definition 5.8. Let $f(x) \in C^0([a, b])$. The simple Simpson's quadrature formula is defined as:
$$I_s(f) := I(\Pi_2 f) = \frac{b - a}{6} \left[ f(a) + 4 f\!\left( \frac{a + b}{2} \right) + f(b) \right], \qquad (5.3)$$
where $\Pi_2 f(x)$ is the polynomial of degree 2 that interpolates $f(x)$ at the nodes $a$, $b$, and at the midpoint $(a + b)/2$.
The composite Simpson's quadrature formula is defined as:
$$I_s^c(f) := I(\Pi_2^H f) = \frac{H}{6} \sum_{k=1}^{M} \left[ f(x_{k-1}) + 4 f(\bar{x}_k) + f(x_k) \right],$$
where $\Pi_2^H f(x)$ is the piecewise polynomial of degree 2 that interpolates $f(x)$ at the nodes $\{x_k\}_{k=0}^{M}$ and at the midpoints $\{\bar{x}_k\}_{k=1}^{M}$ of the $M$ subintervals of length $H = \frac{b - a}{M}$ in which $[a, b]$ is divided; the midpoints of the subintervals are defined as $\bar{x}_k = \frac{x_{k-1} + x_k}{2}$ for each $k = 1, \dots, M$.
Example 5.7. We illustrate graphically the simple Simpson’s formula (on the left) and the composite Simpson’s
formula (on the right) for the approximation of the integral I(f ) of a generic function f (x).
Proposition 5.6. If $f \in C^4([a, b])$, the error $e_s(f)$ associated with the simple Simpson's quadrature formula is given by
$$e_s(f) := I(f) - I_s(f) = -\frac{(b - a)^5}{2880}\, f^{(4)}(\xi) \quad \text{for some } \xi \in [a, b],$$
while the error $e_s^c(f)$ associated with the composite Simpson's quadrature formula is given by
$$e_s^c(f) := I(f) - I_s^c(f) = -\frac{b - a}{2880}\, H^4 f^{(4)}(\xi) \quad \text{for some } \xi \in [a, b].$$
The Simpson's quadrature formulas have degree of exactness $r = 3$; in fact, the errors $e_s(f)$ and $e_s^c(f)$ are identically zero for all polynomials of degree less than or equal to 3 (if $f \in \mathbb{P}_3$, then $f^{(4)}(\xi) = 0$ for every $\xi \in \mathbb{R}$).
The composite Simpson's formula has an order of convergence (or order of accuracy) of 4; indeed, the error $e_s^c(f)$ is proportional to $H^4$.
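A possible implementation of the composite Simpson's formula, with a check of the degree of exactness and of the order of convergence, is sketched below. The name `composite_simpson` and the test integrands are illustrative assumptions.

```python
import math


def composite_simpson(f, a, b, M):
    """Composite Simpson's formula on M uniform subintervals of [a, b]."""
    H = (b - a) / M
    total = 0.0
    for k in range(1, M + 1):
        x_left = a + (k - 1) * H
        x_right = a + k * H
        x_mid = 0.5 * (x_left + x_right)  # midpoint of the k-th subinterval
        total += f(x_left) + 4.0 * f(x_mid) + f(x_right)
    return H / 6.0 * total


# Degree of exactness 3: x^3 on [0, 2] is integrated exactly even with M = 1
cubic_value = composite_simpson(lambda x: x**3, 0.0, 2.0, 1)
print(cubic_value)  # exact integral is 4

# Order of convergence 4: doubling M divides the error by about 16
exact = math.e - 1.0  # integral of exp over [0, 1]
err_s = [exact - composite_simpson(math.exp, 0.0, 1.0, M) for M in (2, 4)]
print(err_s)
```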
Definition 5.9. Let $f(x) \in C^0([a, b])$. An interpolatory quadrature formula (simple) is defined as
$$I_{q,n}(f) := I(\tilde{f}) = \sum_{j=0}^{n} \alpha_j\, f(y_j), \qquad (5.4)$$
where $\tilde{f}(x)$ is an interpolating function for $f(x)$ at $n + 1$ quadrature nodes $\{y_j\}_{j=0}^{n}$ in $[a, b]$ and $\{\alpha_j\}_{j=0}^{n}$ are the corresponding quadrature weights, with $n \ge 0$.
The interpolating function $\tilde{f}(x)$ should be "simple" to integrate, and it can be obtained in many different ways; its choice determines the method (or family of formulas) of numerical quadrature.
$[a, b] \subset \mathbb{R}$ in question. To provide general quadrature formulas that can be applied to functions $f(x)$ defined on any interval $[a, b]$, the nodes and quadrature weights are specified for the reference interval $[-1, 1]$ and denoted as $\{\bar{y}_j\}_{j=0}^{n}$ and $\{\bar{\alpha}_j\}_{j=0}^{n}$, respectively. The quadrature nodes and weights for the general interval $[a, b]$ can then be obtained as follows$^2$:
$$y_j = \frac{a + b}{2} + \frac{b - a}{2}\, \bar{y}_j \quad \text{for } j = 0, \dots, n,$$
and
$$\alpha_j = \frac{b - a}{2}\, \bar{\alpha}_j \quad \text{for } j = 0, \dots, n.$$
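The change of interval is a one-line affine map for the nodes and a rescaling for the weights. The following sketch applies it to Simpson's formula written on the reference interval $[-1, 1]$ (nodes $-1, 0, 1$ and weights $1/3, 4/3, 1/3$, which follow from (5.3) with $a = -1$, $b = 1$); the helper names `map_rule` and `apply_rule` are hypothetical.

```python
def map_rule(ref_nodes, ref_weights, a, b):
    """Map quadrature nodes and weights from [-1, 1] to [a, b]."""
    nodes = [(a + b) / 2.0 + (b - a) / 2.0 * y for y in ref_nodes]
    weights = [(b - a) / 2.0 * w for w in ref_weights]
    return nodes, weights


def apply_rule(f, nodes, weights):
    """Evaluate the quadrature sum of weights times f at the nodes."""
    return sum(w * f(y) for y, w in zip(nodes, weights))


# Simpson's formula on the reference interval, mapped to [1, 4]
nodes, weights = map_rule([-1.0, 0.0, 1.0], [1.0 / 3.0, 4.0 / 3.0, 1.0 / 3.0], 1.0, 4.0)
approx = apply_rule(lambda x: x**2, nodes, weights)
print(nodes, weights, approx)  # exact integral of x^2 over [1, 4] is 21
```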
$$L_0(x) = 1, \quad L_1(x) = x, \quad \text{and} \quad L_{k+1}(x) = \frac{2k + 1}{k + 1}\, x\, L_k(x) - \frac{k}{k + 1}\, L_{k-1}(x) \quad \text{for } k = 1, \dots, n;$$
these polynomials are orthogonal in the sense that $\int_{-1}^{1} L_{n+1}(x)\, L_k(x)\, dx = 0$ for every $k = 0, \dots, n$. In particular, we have $L_0(x) = 1$, $L_1(x) = x$, $L_2(x) = \frac{3}{2}\, x\, L_1(x) - \frac{1}{2}\, L_0(x) = \frac{3}{2} x^2 - \frac{1}{2}$, etc.
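Definition 5.10. Let $f(x) \in C^0([a, b])$. The Gauss–Legendre quadrature formula for $n \ge 0$ on the reference interval $[-1, 1]$ is
$$I_{GL,n}(f) = \sum_{j=0}^{n} \alpha_j^{GL}\, f\!\left( y_j^{GL} \right),$$
$$y_j^{GL} := \text{zeros of } L_{n+1}(x) \quad \text{for each } j = 0, \dots, n,$$
$$\alpha_j^{GL} := \frac{2}{\left[ 1 - \left( y_j^{GL} \right)^2 \right] \left[ L'_{n+1}\!\left( y_j^{GL} \right) \right]^2} \quad \text{for each } j = 0, \dots, n.$$
$$\begin{array}{c|c|c|c}
n & \{y_j^{GL}\}_{j=0}^{n} & \{\alpha_j^{GL}\}_{j=0}^{n} & r \\ \hline
0 & \{0\} & \{2\} & 1 \ \text{(midpoint formula)} \\
1 & \left\{ -\frac{1}{\sqrt{3}},\ +\frac{1}{\sqrt{3}} \right\} & \{1,\ 1\} & 3 \\
2 & \left\{ -\frac{\sqrt{15}}{5},\ 0,\ +\frac{\sqrt{15}}{5} \right\} & \left\{ \frac{5}{9},\ \frac{8}{9},\ \frac{5}{9} \right\} & 5
\end{array}$$
We observe that the Gauss–Legendre quadrature formula for $n = 0$ coincides with the simple midpoint formula.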
Example 5.8. We can verify that the Gauss–Legendre formula for $n = 1$ has a degree of exactness $r = 3$, meaning that $I_{GL,1}(f) = I(f)$ for every $f \in \mathbb{P}_3$. Considering a generic polynomial of degree 3, say $f(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3$ for certain $c_0, c_1, c_2, c_3 \in \mathbb{R}$, for example on the reference interval $[-1, 1]$, we have
$$I(f) = \int_{-1}^{1} f(x)\, dx = 2 c_0 + \frac{2}{3} c_2.$$
Requiring that $I_{GL,1}(f) = I(f)$ for every $c_0, c_1, c_2, c_3 \in \mathbb{R}$ leads to the following constraints:
$$\begin{cases}
\alpha_0^{GL} + \alpha_1^{GL} = 2, \\
\alpha_0^{GL}\, y_0^{GL} + \alpha_1^{GL}\, y_1^{GL} = 0, \\
\alpha_0^{GL} \left( y_0^{GL} \right)^2 + \alpha_1^{GL} \left( y_1^{GL} \right)^2 = \frac{2}{3}, \\
\alpha_0^{GL} \left( y_0^{GL} \right)^3 + \alpha_1^{GL} \left( y_1^{GL} \right)^3 = 0.
\end{cases}$$
Thus, we obtain the quadrature nodes $y_0^{GL} = -\frac{1}{\sqrt{3}}$ and $y_1^{GL} = +\frac{1}{\sqrt{3}}$, with the corresponding weights $\alpha_0^{GL} = \alpha_1^{GL} = 1$; we conclude that the Gauss–Legendre formula $I_{GL,1}(f)$ integrates polynomials of degree 3 exactly, regardless of the values of the coefficients $c_0, c_1, c_2, c_3 \in \mathbb{R}$. Therefore, we have verified that the formula has a degree of exactness equal to 3.
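The same verification can be carried out numerically on monomials. The sketch below hard-codes the Gauss–Legendre nodes and weights for $n = 0, 1, 2$ on $[-1, 1]$ as listed in the table of Definition 5.10 and compares the quadrature of $x^p$ with the exact value ($0$ for odd $p$, $2/(p+1)$ for even $p$). The names `GL_RULES`, `gauss_legendre`, and `exact_monomial_integral` are hypothetical.

```python
import math

# Nodes and weights on the reference interval [-1, 1] for n = 0, 1, 2
GL_RULES = {
    0: ([0.0], [2.0]),
    1: ([-1.0 / math.sqrt(3.0), 1.0 / math.sqrt(3.0)], [1.0, 1.0]),
    2: ([-math.sqrt(15.0) / 5.0, 0.0, math.sqrt(15.0) / 5.0],
        [5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0]),
}


def gauss_legendre(f, n):
    """Gauss-Legendre approximation of the integral of f over [-1, 1]."""
    nodes, weights = GL_RULES[n]
    return sum(w * f(y) for y, w in zip(nodes, weights))


def exact_monomial_integral(p):
    """Exact integral of x**p over [-1, 1]."""
    return 0.0 if p % 2 else 2.0 / (p + 1)


for n in (0, 1, 2):
    errors = [abs(gauss_legendre(lambda x, p=p: x**p, n) - exact_monomial_integral(p))
              for p in range(2 * n + 3)]
    print(n, ["%.1e" % e for e in errors])
```

For each $n$, the errors vanish (up to rounding) for the monomials up to degree $2n + 1$ and become nonzero at degree $2n + 2$, consistent with the values of $r$ in the table.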
The Gauss–Legendre quadrature formulas maximize the degree of exactness $r$ for every given value of $n \ge 0$, but the corresponding quadrature nodes all lie strictly inside the reference interval $[-1, 1]$. However, in some situations, one might want to include the endpoints $\{-1, 1\}$ of the interval in the set of quadrature nodes. The Gauss–Legendre–Lobatto quadrature formulas maximize the degree of exactness under the constraint that the endpoints $\{-1, 1\}$ are included in the set of quadrature nodes.
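Definition 5.11. Let $f(x) \in C^0([a, b])$. Then, the Gauss–Legendre–Lobatto quadrature formula for $n \ge 1$ on the reference interval $[-1, 1]$ is
$$I_{GLL,n}(f) = \sum_{j=0}^{n} \alpha_j^{GLL}\, f\!\left( y_j^{GLL} \right),$$
$$y_0^{GLL} := -1, \quad y_n^{GLL} := +1, \quad \text{and} \quad y_j^{GLL} := \text{zeros of } L'_n(x) \quad \text{for each } j = 1, \dots, n - 1,$$
$$\alpha_j^{GLL} := \frac{2}{n(n + 1)}\, \frac{1}{\left[ L_n\!\left( y_j^{GLL} \right) \right]^2} \quad \text{for each } j = 0, \dots, n.$$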
We observe that the Gauss–Legendre–Lobatto quadrature formulas for n = 1 and n = 2 coincide with the
simple trapezoidal and Simpson’s formulas, respectively.
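This observation can be checked from the weight formula of Definition 5.11. The sketch below evaluates the Legendre polynomial $L_n$ with the three-term recurrence and computes the Gauss–Legendre–Lobatto weights for $n = 1$ (nodes $\pm 1$) and $n = 2$ (nodes $-1, 0, 1$, where $0$ is the zero of $L'_2(x) = 3x$). The function names are hypothetical, and the nodes are supplied by hand rather than computed.

```python
def legendre(n, x):
    """Evaluate the Legendre polynomial L_n(x) with the three-term recurrence."""
    if n == 0:
        return 1.0
    prev, curr = 1.0, x  # L_0(x) and L_1(x)
    for k in range(1, n):
        prev, curr = curr, ((2 * k + 1) * x * curr - k * prev) / (k + 1)
    return curr


def gll_weights(n, nodes):
    """Gauss-Legendre-Lobatto weights on [-1, 1] for the given n + 1 nodes."""
    return [2.0 / (n * (n + 1) * legendre(n, y) ** 2) for y in nodes]


weights_1 = gll_weights(1, [-1.0, 1.0])
weights_2 = gll_weights(2, [-1.0, 0.0, 1.0])
print(weights_1)  # trapezoidal weights on [-1, 1]: (b - a)/2 at each endpoint
print(weights_2)  # Simpson weights on [-1, 1]: (b - a)/6 times 1, 4, 1
```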
[1] A. Quarteroni. Numerical Models for Differential Problems. Springer, Cham, 2017.
[2] A. Quarteroni, F. Saleri, and P. Gervasio. Scientific Computing with MATLAB and Octave. Springer,
2014.
[3] A. Quarteroni, R. Sacco, and F. Saleri. Numerical Mathematics. Springer, Berlin and Heidelberg, 2007.
[4] https://www.mathworks.com/products/matlab.html