Numerical Methods and Optimization: An Introduction
CHAPMAN & HALL/CRC
Numerical Analysis and Scientific Computing
Editors
Choi-Hong Lai
School of Computing and Mathematical Sciences
University of Greenwich

Frédéric Magoulès
Applied Mathematics and Systems Laboratory
Ecole Centrale Paris
Proposals for the series should be submitted to one of the series editors above or directly to:
CRC Press, Taylor & Francis Group
3 Park Square, Milton Park
Abingdon, Oxfordshire OX14 4RN
UK
Published Titles
Classical and Modern Numerical Analysis: Theory, Methods and Practice
Azmy S. Ackleh, Edward James Allen, Ralph Baker Kearfott, and Padmanabhan Seshaiyer
Cloud Computing: Data-Intensive Computing and Scheduling
Frédéric Magoulès, Jie Pan, and Fei Teng
Computational Fluid Dynamics
Frédéric Magoulès
A Concise Introduction to Image Processing using C++
Meiqing Wang and Choi-Hong Lai
Coupled Systems: Theory, Models, and Applications in Engineering
Juergen Geiser
Decomposition Methods for Differential Equations: Theory and Applications
Juergen Geiser
Designing Scientific Applications on GPUs
Raphaël Couturier
Desktop Grid Computing
Christophe Cérin and Gilles Fedak
Discrete Dynamical Systems and Chaotic Machines: Theory and Applications
Jacques M. Bahi and Christophe Guyeux
Discrete Variational Derivative Method: A Structure-Preserving Numerical Method for
Partial Differential Equations
Daisuke Furihata and Takayasu Matsuo
Fundamentals of Grid Computing: Theory, Algorithms and Technologies
Frédéric Magoulès
Grid Resource Management: Toward Virtual and Services Compliant Grid Computing
Frédéric Magoulès, Thi-Mai-Huong Nguyen, and Lei Yu
Handbook of Sinc Numerical Methods
Frank Stenger
Introduction to Grid Computing
Frédéric Magoulès, Jie Pan, Kiat-An Tan, and Abhinit Kumar
Iterative Splitting Methods for Differential Equations
Juergen Geiser
Mathematical Objects in C++: Computational Tools in a Unified Object-Oriented Approach
Yair Shapira
Numerical Linear Approximation in C
Nabih N. Abdelmalek and William A. Malek
Numerical Methods and Optimization: An Introduction
Sergiy Butenko and Panos M. Pardalos
Numerical Techniques for Direct and Large-Eddy Simulations
Xi Jiang and Choi-Hong Lai
Parallel Algorithms
Henri Casanova, Arnaud Legrand, and Yves Robert
Parallel Iterative Algorithms: From Sequential to Grid Computing
Jacques M. Bahi, Sylvain Contassot-Vivier, and Raphaël Couturier
Particle Swarm Optimisation: Classical and Quantum Perspectives
Jun Sun, Choi-Hong Lai, and Xiao-Jun Wu
XML in Scientific Computing
C. Pozrikidis
Sergiy Butenko
Texas A&M University
College Station, USA
Panos M. Pardalos
University of Florida
Gainesville, USA
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does
not warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MAT-
LAB® software or related products does not constitute endorsement or sponsorship by The MathWorks
of a particular pedagogical approach or particular use of the MATLAB® software.
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2014 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information stor-
age or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copy-
right.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222
Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that pro-
vides licenses and registration for a variety of users. For organizations that have been granted a pho-
tocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Dedicated to the memory of our grandmothers,
Uliana and Sophia,
who taught us how to count.
Preface
Sergiy Butenko
Panos Pardalos
Contents
I Basics 1
1 Preliminaries 3
1.1 Sets and Functions . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Fundamental Theorem of Algebra . . . . . . . . . . . . . . . 6
1.3 Vectors and Linear (Vector) Spaces . . . . . . . . . . . . . . 7
1.3.1 Vector norms . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Matrices and Their Properties . . . . . . . . . . . . . . . . . 12
1.4.1 Matrix addition and scalar multiplication . . . . . . . 12
1.4.2 Matrix multiplication . . . . . . . . . . . . . . . . . . 13
1.4.3 The transpose of a matrix . . . . . . . . . . . . . . . . 14
1.4.4 Triangular and diagonal matrices . . . . . . . . . . . . 15
1.4.5 Determinants . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.6 Trace of a matrix . . . . . . . . . . . . . . . . . . . . . 17
1.4.7 Rank of a matrix . . . . . . . . . . . . . . . . . . . . . 18
1.4.8 The inverse of a nonsingular matrix . . . . . . . . . . 18
1.4.9 Eigenvalues and eigenvectors . . . . . . . . . . . . . . 19
1.4.10 Quadratic forms . . . . . . . . . . . . . . . . . . . . . 22
1.4.11 Matrix norms . . . . . . . . . . . . . . . . . . . . . . . 24
1.5 Preliminaries from Real and Functional Analysis . . . . . . . 25
1.5.1 Closed and open sets . . . . . . . . . . . . . . . . . . . 26
1.5.2 Sequences . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.5.3 Continuity and differentiability . . . . . . . . . . . . . 27
1.5.4 Big O and little o notations . . . . . . . . . . . . . . . 30
1.5.5 Taylor’s theorem . . . . . . . . . . . . . . . . . . . . . 31
4 Solving Equations 87
4.1 Fixed Point Method . . . . . . . . . . . . . . . . . . . . . . 88
4.2 Bracketing Methods . . . . . . . . . . . . . . . . . . . . . . . 92
4.2.1 Bisection method . . . . . . . . . . . . . . . . . . . . . 93
4.2.1.1 Convergence of the bisection method . . . . 93
4.2.1.2 Intervals with multiple roots . . . . . . . . . 95
4.2.2 Regula-falsi method . . . . . . . . . . . . . . . . . . . 96
4.2.3 Modified regula-falsi method . . . . . . . . . . . . . . 98
4.3 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . 99
4.3.1 Convergence rate of Newton’s method . . . . . . . . . 103
4.4 Secant Method . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.5 Solution of Nonlinear Systems . . . . . . . . . . . . . . . . . 106
4.5.1 Fixed point method for systems . . . . . . . . . . . . . 106
4.5.2 Newton’s method for systems . . . . . . . . . . . . . . 107
Bibliography 389
Index 391
Part I
Basics
Chapter 1
Preliminaries
TABLE 1.1: Common set-theoretic notations, terminology, and the corresponding definitions.

A × B   Cartesian product    {(a, b) : a ∈ A, b ∈ B}
IR      reals                set of all real numbers
Z       integers             set of all integer numbers
Z+      positive integers    set of all positive integer numbers
Example 1.1 The set Z of all integers is countable. This can be shown using
the following one-to-one correspondence:

f(n) = (n − 1)/2 if n is odd, and f(n) = −n/2 if n is even, ∀n ∈ Z+.
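The correspondence in Example 1.1 can be checked directly; the following sketch (the function name `f` is ours) lists the first few images, which enumerate the integers 0, −1, 1, −2, 2, …

```python
# Bijection from Z+ onto Z from Example 1.1:
# odd n maps to (n - 1)/2, even n maps to -n/2.
def f(n):
    return (n - 1) // 2 if n % 2 == 1 else -(n // 2)

values = [f(n) for n in range(1, 8)]
# the first few images are 0, -1, 1, -2, 2, -3, 3: every integer appears once
```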
Example 1.2 To prove that the set [0, 1] of all reals between 0 and 1 is un-
countable, assume that there is a function f : Z+ → [0, 1] defining a one-to-one
correspondence between Z+ and [0, 1]. Let f (n) = αn = 0.an1 an2 . . . ann . . . be
the decimal representation of the image of n ∈ Z+ , where ank = 0 if αn
requires less than k digits to be represented exactly. We construct number
β = 0.b1 b2 . . . bn . . . by assigning bn = 2 if ann = 1 and bn = 1 otherwise.
Then for any n we have f(n) ≠ β, since the decimal representations differ in
the nth digit; thus β cannot be counted using f (i.e., β ∈ [0, 1] does not
have a preimage under f). This contradicts the assumption that a one-to-one
correspondence exists.
Example 1.4 We find the roots of the polynomial p(x) = x3 − 6x2 + 10x − 4.
First, we look for integer roots of p(x). If p(x) has an integer root x1 , then
this root has to be equal to one of the integers that divide −4, i.e., ±1, ±2, ±4.
It is easy to check that p(2) = 0, so x1 = 2. Next, we can divide p(x) by x − 2,
obtaining p(x) = (x − 2)(x² − 4x + 2). To find the two other roots, we need to
solve the quadratic equation x² − 4x + 2 = 0, yielding x_{2,3} = 2 ± √2.
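A numerical companion to Example 1.4 (a sketch; the helper names are ours): test the divisors of the constant term for an integer root, then solve the deflated quadratic with the quadratic formula.

```python
import math

def p(x):
    return x**3 - 6*x**2 + 10*x - 4

# candidate integer roots: the divisors of the constant term -4
x1 = next(d for d in (1, -1, 2, -2, 4, -4) if p(d) == 0)

# dividing p(x) by (x - 2) leaves x^2 - 4x + 2; apply the quadratic formula
disc = math.sqrt((-4)**2 - 4 * 1 * 2)
x2 = (4 + disc) / 2   # 2 + sqrt(2)
x3 = (4 - disc) / 2   # 2 - sqrt(2)
```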
Let a set V be given, and let the operations of addition and scalar mul-
tiplication be defined on V , such that for any x, y ∈ V and any scalar α we
have x + y ∈ V and αx ∈ V .
6. (α + β)x = αx + βx.
7. (αβ)x = α(βx).
8. 1 · x = x.
In this text we deal primarily with n-dimensional real vectors, defined next.
8 Numerical Methods and Optimization: An Introduction
x = [x_i]_{i=1}^n.
For the sake of clarity, unless otherwise specified, by a vector we will mean a
column vector. For a vector x, the corresponding row vector will be denoted
by xT , which represents the transpose of x (see Definition 1.25 at page 14).
x + y = [x_i + y_i]_{i=1}^n.
c1 v^(1) + . . . + ck v^(k) = 0

For example, the vectors v^(1) = [1, 2]^T and v^(2) = [1, 3]^T are linearly
independent, since

c1 v^(1) + c2 v^(2) = 0 ⇔ (c1 + c2 = 0 and 2c1 + 3c2 = 0) ⇔ c1 = c2 = 0.

The vectors v^(1) = [0, 2]^T and v^(2) = [0, 3]^T are linearly dependent,
because for c1 = −3 and c2 = 2 we have c1 v^(1) + c2 v^(2) = 0.
2. ‖αx‖ = |α| ‖x‖;

3. ‖x + y‖ ≤ ‖x‖ + ‖y‖.

The p-norm of x ∈ IRⁿ is given by ‖x‖_p = (Σ_{i=1}^n |x_i|^p)^{1/p}.
The most commonly used values for p are 1, 2, and ∞, and the corresponding
norms are

• 1-norm: ‖x‖₁ = Σ_{i=1}^n |x_i|;

• 2-norm: ‖x‖₂ = (Σ_{i=1}^n |x_i|²)^{1/2};

• ∞-norm: ‖x‖_∞ = max_{1≤i≤n} |x_i|.
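The three common norms above can be computed with a few lines of Python (a sketch; the function name `p_norm` is ours):

```python
# p-norm of a real vector: (sum |x_i|^p)^(1/p); p = inf gives max |x_i|
def p_norm(x, p):
    if p == float("inf"):
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

x = [3.0, -4.0]
n1 = p_norm(x, 1)                 # |3| + |-4| = 7
n2 = p_norm(x, 2)                 # sqrt(9 + 16) = 5
ninf = p_norm(x, float("inf"))    # max(3, 4) = 4
```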
We write A = [a_ij]_{m×n} for a matrix of m rows and n columns, and we say
that the matrix A is of order m × n.
Next we list the algebraic rules of matrix addition and scalar multiplication.
AB = C = [c_ij]_{m×q},

where

c_ij = Σ_{k=1}^n a_ik b_kj,   i = 1, 2, . . . , m, j = 1, 2, . . . , q.
In other words, the identity matrix is a square matrix which has ones on
the main diagonal and zeros elsewhere.
Example 1.8

I₃ = ⎡ 1 0 0 ⎤        ⎡ 1 2 3 ⎤ ⎡ 1 0 0 ⎤   ⎡ 1 2 3 ⎤
     ⎢ 0 1 0 ⎥ ;      ⎢ 4 5 6 ⎥ ⎢ 0 1 0 ⎥ = ⎢ 4 5 6 ⎥ .
     ⎣ 0 0 1 ⎦        ⎣ 7 8 9 ⎦ ⎣ 0 0 1 ⎦   ⎣ 7 8 9 ⎦
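The multiplication rule c_ij = Σ_k a_ik b_kj and the identity-matrix property of Example 1.8 can be checked with a short pure-Python sketch (the helper name `matmul` is ours):

```python
# c_ij = sum over k of a_ik * b_kj
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# multiplying by the identity on either side leaves A unchanged
```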
Theorem 1.5 Let α be a scalar, and assume that A, B, C and the iden-
tity matrix I are matrices such that the following sums and products are
defined. Then the following properties hold.
(i) (AB)C = A(BC) associativity
(ii) IA = AI = A identity matrix
(iii) A(B + C) = AB + AC left distributive property
(iv) (A + B)C = AC + BC right distributive property
(v) α(AB) = (αA)B = A(αB) scalar associative property
1.4.5 Determinants
For a square matrix A ∈ IRⁿˣⁿ the determinant of A is a real number,
which is denoted by det(A). If

A = [a_ij]_{n×n} = ⎡ a11 a12 · · · a1n ⎤
                   ⎢ a21 a22 · · · a2n ⎥
                   ⎢  ..   ..  . .  .. ⎥
                   ⎣ an1 an2 · · · ann ⎦ ,
• If n ≥ 2, then

det(A) = Σ_{j=1}^n (−1)^{i+j} a_ij · det(A_ij),   (1.3)

where A_ij is the matrix obtained from A by removing its ith row and jth
column. Note that i in the above formula is chosen arbitrarily, and (1.3)
is called the ith row expansion. We can similarly write the jth column
expansion as

det(A) = Σ_{i=1}^n (−1)^{i+j} a_ij · det(A_ij).   (1.4)
In particular, if A is a triangular matrix, then

det(A) = Π_{i=1}^n a_ii.
tr(A) = a11 + . . . + ann = Σ_{i=1}^n a_ii.
Theorem 1.7 Let A = [aij ]n×n and B = [bij ]n×n be n × n matrices and
let α be a scalar. Then
is 2, since the first two rows are linearly independent, implying that
rank(A) ≥ 2, while the third row is the sum of the first two rows, implying
that rank(A) < 3.
Note that A⁻¹ is unique for any nonsingular A ∈ IRⁿˣⁿ, since if we assume
that B and C are both inverses of A, then we have B = B(AC) = (BA)C = C.
A scalar λ is called an eigenvalue of A with corresponding eigenvector
v ≠ 0 if

Av = λv.
Hence, the eigenvalues of a matrix A ∈ IRn×n are the roots of the charac-
teristic polynomial p(λ) = det(A − λIn ).
(a) The determinant of any matrix is equal to the product of its eigenvalues.
(b) The trace of any matrix is equal to the sum of its eigenvalues.
(c) The eigenvalues of a nonsingular matrix A are nonzero, and the
eigenvalues of A−1 are reciprocals of the eigenvalues of A.
det(A) = 1 · 1 · 5 = 5,
tr(A) = 1 + 1 + 5 = 7,
and the eigenvalues of A−1 are μ1 = μ2 = 1, μ3 = 1/5.
Theorem 1.12 The eigenvalues of a real symmetric matrix are all real
numbers.
A = RDRT ,
v^(2) = [1/√2, 0, −1/√2]^T.

Solving the corresponding system for the eigenvalue λ₃ = 4 similarly, we
obtain the normalized eigenvector

v^(3) = [1/√3, 1/√3, 1/√3]^T.
Now we are ready to write matrices D and R from the eigenvalue decomposition
theorem. The matrix D is a diagonal matrix containing eigenvalues of A,

D = ⎡ 1 0 0 ⎤
    ⎢ 0 3 0 ⎥
    ⎣ 0 0 4 ⎦ ,

and R has corresponding normalized eigenvectors as its columns,

R = ⎡  1/√6   1/√2   1/√3 ⎤
    ⎢ −2/√6    0     1/√3 ⎥
    ⎣  1/√6  −1/√2   1/√3 ⎦ .
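The decomposition can be verified numerically. The sketch below (pure Python; with NumPy one would use `numpy.linalg.eigh`) rebuilds R D Rᵀ from the listed eigenvalues and eigenvectors and checks that the result is symmetric with trace 1 + 3 + 4 = 8:

```python
import math

s6, s2, s3 = math.sqrt(6), math.sqrt(2), math.sqrt(3)
# columns of R are the normalized eigenvectors listed above
R = [[ 1/s6,  1/s2, 1/s3],
     [-2/s6,  0.0,  1/s3],
     [ 1/s6, -1/s2, 1/s3]]
D = [[1, 0, 0], [0, 3, 0], [0, 0, 4]]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

Rt = [list(row) for row in zip(*R)]     # R transpose
A = matmul(matmul(R, D), Rt)            # A = R D R^T
trace = A[0][0] + A[1][1] + A[2][2]     # equals the sum of the eigenvalues
```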
Q(x) = Σ_{i=1}^n Σ_{j=1}^n a_ij x_i x_j,

where a_ij, i, j = 1, . . . , n are the coefficients of the quadratic form,
and x = [x1, . . . , xn]^T is the vector of variables. Using the eigenvalue
decomposition Q = R^T D R and denoting y = Rx, we obtain

Q(x) = x^T Q x = x^T R^T D R x = y^T D y = Σ_{i=1}^n λ_i y_i².
λ1 = 1, λ2 = 3 and λ3 = 4. In general,

λ_min(Q) ‖x‖² ≤ x^T Q x ≤ λ_max(Q) ‖x‖²,

where λ_min(Q) and λ_max(Q) are the smallest and the largest eigenvalues
of Q, respectively.
‖A‖_p = max_{x≠0} ‖Ax‖_p / ‖x‖_p.   (1.6)
‖A‖₁ = max_{1≤j≤n} Σ_{i=1}^m |a_ij|

and the ∞-norm,

‖A‖_∞ = max_{1≤i≤m} Σ_{j=1}^n |a_ij|,
we have

‖A‖₁ = max{1 + 3, 6 + 2} = 8;

‖A‖₂ = √(λ_max(AᵀA)) = √(λ_max ⎡ 10  0 ⎤ ) = √40 = 2√10;
                              ⎣  0 40 ⎦

‖A‖_∞ = max{1 + 6, 3 + 2} = 7;

‖A‖_F = √(1 + 36 + 9 + 4) = √50.
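A sketch of the four computations, assuming the matrix A = [[1, 6], [−3, 2]] (an assumption on our part: its column sums, row sums, squared entries, and AᵀA are consistent with the numbers above):

```python
import math

A = [[1.0, 6.0], [-3.0, 2.0]]

one_norm = max(sum(abs(A[i][j]) for i in range(2)) for j in range(2))  # max column sum
inf_norm = max(sum(abs(A[i][j]) for j in range(2)) for i in range(2))  # max row sum
fro_norm = math.sqrt(sum(A[i][j] ** 2 for i in range(2) for j in range(2)))

# 2-norm: sqrt of the largest eigenvalue of A^T A; for the 2x2 case the
# characteristic polynomial is solved with the quadratic formula
B = [[sum(A[k][i] * A[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
tr = B[0][0] + B[1][1]
det = B[0][0] * B[1][1] - B[0][1] * B[1][0]
two_norm = math.sqrt((tr + math.sqrt(tr * tr - 4 * det)) / 2)
```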
Example 1.18 For x̄ ∈ IRⁿ and ε > 0, the open ε-neighborhood of x̄ or the
open ε-ball centered at x̄, defined by

B(x̄, ε) = {x ∈ IRⁿ : ‖x − x̄‖ < ε},

is an open set.

Example 1.19 For x̄ ∈ IRⁿ and ε > 0, the closed ε-neighborhood of x̄ or the
closed ε-ball centered at x̄, given by

B̄(x̄, ε) = {x ∈ IRⁿ : ‖x − x̄‖ ≤ ε},

is a closed set.
1.5.2 Sequences
A sequence in IRn is given by a set {xk : k ∈ I}, where I is a countable set
of indices. We will write {xk : k ≥ 1} or simply {xk } if I is the set of positive
integers. As in Definition 1.35, we call x̄ a limit (cluster, accumulation) point
of a sequence {xk } if every ε-neighborhood of x̄ contains infinitely many points
of the sequence. If a sequence {x_k} has a unique limit point x̄, we call x̄
the limit of the sequence. In other words, a point x̄ ∈ IRⁿ is said to be the
limit of a sequence {x_k} ⊂ IRⁿ if for every ε > 0 there exists an integer K
such that for any k ≥ K we have ‖x_k − x̄‖ < ε. In this case we write

x_k → x̄, k → ∞   or   lim_{k→∞} x_k = x̄
We say that the rate (order) of convergence of {x_k} to x* is R if there
exists a constant C such that, for all sufficiently large k,

‖x_{k+1} − x*‖ / ‖x_k − x*‖^R ≤ C.

• If lim_{k→∞} ‖x_{k+1} − x*‖ / ‖x_k − x*‖ = 0, the rate of convergence is
superlinear.

• If lim_{k→∞} ‖x_{k+1} − x*‖ / ‖x_k − x*‖ = μ does not hold for any μ < 1,
then the rate of convergence is sublinear.
From a geometric perspective, the mean value theorem states that under
the given conditions there always exists a point c between a and b such that
the tangent line to f at c is parallel to the line passing through the points
(a, f (a)) and (b, f (b)). This is illustrated in Figure 1.1. For the function f in
this figure, there are two points c1 , c2 ∈ (a, b) that satisfy (1.7).
Now let X ⊆ IRⁿ. Then a function f = [f1, . . . , fm]^T : X → IRᵐ is called
continuous at a point x̄ ∈ X if lim_{x→x̄} f(x) = f(x̄). The function is
continuous on X, denoted by f ∈ C(X), if it is continuous at every point of
X. In this case the first-order derivative at x̄ ∈ IRⁿ, called the Jacobian,
is given by the following m × n matrix of partial derivatives:

J_f(x̄) = ⎡ ∂f1(x̄)/∂x1  · · ·  ∂f1(x̄)/∂xn ⎤
          ⎢      ..       . .       ..     ⎥
          ⎣ ∂fm(x̄)/∂x1  · · ·  ∂fm(x̄)/∂xn ⎦ .
FIGURE 1.1: Illustration of the mean value theorem; the tangent lines at c1
and c2 are parallel to the line through (a, f(a)) and (b, f(b)).
If all partial derivatives exist and are continuous on an open set containing x̄,
then f is continuously differentiable. If m = 1, i.e., we have f : IRn → IR, the
Jacobian at x̄ ∈ IRⁿ is given by the vector of partial derivatives

∇f(x̄) = [∂f(x̄)/∂x1, . . . , ∂f(x̄)/∂xn]^T

and is called the gradient of f at x̄.
For f : IRⁿ → IR and x̄, d ∈ IRⁿ, the directional derivative of f(x) at x̄ in
the direction d is defined as

∂f(x̄)/∂d = lim_{α→0} (f(x̄ + αd) − f(x̄)) / α.

If ‖d‖ = 1, then ∂f(x̄)/∂d is also called the rate of increase of f at x̄ in
the direction d.
Assume that f is differentiable. Denoting φ(α) = f(x̄ + αd), we have

∂f(x̄)/∂d = lim_{α→0} (f(x̄ + αd) − f(x̄)) / α
          = lim_{α→0} (φ(α) − φ(0)) / α
          = φ′(0).

On the other hand, using the chain rule, according to which the derivative of
the composite function f(g(α)) is equal to the product of the derivatives of f
and g, we have

φ′(α) = ∇f(x̄ + αd)^T d  ⇒  φ′(0) = ∇f(x̄)^T d.
So, we obtain

∂f(x̄)/∂d = ∇f(x̄)^T d.
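The identity ∂f(x̄)/∂d = ∇f(x̄)ᵀd is easy to verify with a finite-difference check; the function below is an illustrative choice of ours, not one from the text:

```python
import math

def f(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def grad_f(x):
    return [2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]]

x_bar = [1.0, 2.0]
d = [1.0 / math.sqrt(2), 1.0 / math.sqrt(2)]   # unit direction, so this is a rate of increase

# analytic directional derivative: gradient dotted with d
analytic = sum(g * di for g, di in zip(grad_f(x_bar), d))

# numerical approximation of the defining limit with a small alpha
alpha = 1e-7
numeric = (f([x_bar[i] + alpha * d[i] for i in range(2)]) - f(x_bar)) / alpha
```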
For f : IRⁿ → IR, the second-order derivative of f at x̄ is given by the
matrix of second-order partial derivatives of f at x̄,

∇²f(x̄) = ⎡ ∂²f(x̄)/∂x1²      · · ·  ∂²f(x̄)/∂x1∂xn ⎤
          ⎢       ..           . .        ..       ⎥
          ⎣ ∂²f(x̄)/∂xn∂x1    · · ·  ∂²f(x̄)/∂xn²   ⎦ .
It is easy to show that if lim_{x→x̄} |f(x)| / |g(x)| = c, where c < ∞ is a
positive number, then f(x) = O(g(x)) and g(x) = O(f(x)) as x → x̄.
Example 1.20 For f(x) = 10x² + 100x³ and g(x) = x², we have f(x) = O(g(x))
as x → 0 since

lim_{x→0} f(x)/g(x) = lim_{x→0} (10x² + 100x³) / x² = 10.

For f(x) = 6x³ − 10x² + 15x and g(x) = x³, we have f(x) = O(g(x)) as
x → ∞ since

lim_{x→∞} f(x)/g(x) = lim_{x→∞} (6x³ − 10x² + 15x) / x³ = 6.
For f (x) = x2 and g(x) = x, we have f (x) = o(g(x)) as x → 0.
and

R_n(x) = f^{(n+1)}(ξ) / (n + 1)! · (x − x0)^{n+1}.   (1.10)

The polynomial P_n(x) is called the Taylor polynomial of degree n.
For f : IRn → IR, we will use Taylor’s theorem for multivariate functions
in the following forms.
Example 1.21 Let f(x) = x1³ + x2³ + 3x1x2 + x1 − x2 + 1, x̄ = [0, 0]^T. Then

∇f(x) = ⎡ 3x1² + 3x2 + 1 ⎤ ,   ∇²f(x) = ⎡ 6x1   3  ⎤ ,
        ⎣ 3x2² + 3x1 − 1 ⎦              ⎣  3   6x2 ⎦

and

f(x) = 1 + [1, −1] ⎡ x1 ⎤ + ½ [x1, x2] ⎡ 6x̃1   3  ⎤ ⎡ x1 ⎤
                   ⎣ x2 ⎦              ⎣  3   6x̃2 ⎦ ⎣ x2 ⎦
     = 1 + x1 − x2 + 3x̃1x1² + 3x̃2x2² + 3x1x2
for some x̃ between x and 0, i.e., x̃ = αx + (1 − α)0 = αx for some α ∈ (0, 1).
For x = [1, 2]T , we have f (x) = 15 and we can find x̃ by solving the following
equation for α:
3α + 24α + 6 = 15 ⇔ α = 1/3.
Thus, x̃ = [1/3, 2/3]T and, according to (1.12),
at x = [1, 2]T .
The linear approximation in (1.13) and (1.14) works well if x is very close
to x̄. For example, for x = [0.1, 0.1]^T with ‖x − x̄‖ = 0.1 × √2 ≈ 0.1414,
we have f(x) = 1.032, while the linear approximation gives f(x) ≈ 1 with the
error R1(x) = 0.032. Using the quadratic approximation (1.15) for x̄ = 0, we
have

f(x) ≈ f(x̄) + ∇f(x̄)^T(x − x̄) + ½ (x − x̄)^T ∇²f(x̄)(x − x̄)
     = 1 + x1 − x2 + 3x1x2,

and for x = [0.1, 0.1]^T we obtain f(x) ≈ 1.03 with the error R2(x) = 0.002.
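The numbers above are easy to reproduce directly:

```python
# f(x) = x1^3 + x2^3 + 3*x1*x2 + x1 - x2 + 1, approximated at x_bar = 0
def f(x1, x2):
    return x1**3 + x2**3 + 3*x1*x2 + x1 - x2 + 1

x1 = x2 = 0.1
exact = f(x1, x2)                     # 1.032
linear = 1 + x1 - x2                  # first-order approximation: 1.0
quadratic = 1 + x1 - x2 + 3*x1*x2     # second-order approximation: 1.03
r1 = exact - linear                   # error 0.032
r2 = exact - quadratic                # error 0.002
```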
Exercises
1.1. Let f : X → Y be an arbitrary mapping and X′, X″ ⊆ X, Y′, Y″ ⊆ Y.
Prove that

(a) f⁻¹(Y′ ∪ Y″) = f⁻¹(Y′) ∪ f⁻¹(Y″);
(b) f⁻¹(Y′ ∩ Y″) = f⁻¹(Y′) ∩ f⁻¹(Y″);
(c) f(X′ ∪ X″) = f(X′) ∪ f(X″);
(d) f(X′ ∩ X″) may not be equal to f(X′) ∩ f(X″).
1.2. Prove that the following sets are countable:
(a) the set of all odd integers;
(b) the set of all even integers;
(c) the set {2, 4, 8, 16, . . . , 2ⁿ, . . .} of powers of 2.
1.3. Show that
(a) every infinite subset of a countable set is countable;
(b) the union of a countable family of countable sets A1 , A2 , . . . is
countable;
(c) every infinite set has a countable subset.
1.4. Show that each of the following sets satisfies axioms from Definition 1.4
and thus is a linear space.
(a) IRn with operations of vector addition and scalar multiplication.
(b) IRm×n with operations of matrix addition and scalar multiplication.
(c) C[a, b]–the set of all continuous real-valued functions defined on the
interval [a, b], with addition and scalar multiplication defined as
(f + g)(x) = f(x) + g(x) and (αf)(x) = αf(x), x ∈ [a, b], for all
f, g ∈ C[a, b] and any scalar α, respectively.

(d) Pn–the set of all polynomials of degree at most n, with addition and
scalar multiplication defined as for the functions in (c).

1.5. Show that if the vectors p^(0), . . . , p^(k−1) ∈ IRⁿ are linearly
independent, then the set of vectors defined by

d^(0) = p^(0);

d^(s) = p^(s) − Σ_{i=0}^{s−1} (p^(s)T d^(i)) / (d^(i)T d^(i)) · d^(i),
s = 1, . . . , k − 1

is orthogonal.
1.6. Show that for p = 1, 2, ∞ the norm ‖·‖_p is compatible, that is, for any
A ∈ IRᵐˣⁿ and x ∈ IRⁿ:

‖Ax‖_p ≤ ‖A‖_p ‖x‖_p.
1.7. Let

a = [a1, . . . , an]^T   and   b = [b1, . . . , bn]^T

be two vectors in IRⁿ.
1.8. Show that a square matrix A is orthogonal if and only if both the
columns and rows of A form sets of orthonormal vectors.
1.9. A company manufactures three different products using the same machine.
The selling price, production cost, and machine time required to produce a
unit of each product are given in the following table.
The table below represents a two-week plan for the number of units of
each product to be produced.
Week 1 Week 2
Product 1 120 130
Product 2 100 80
Product 3 50 60
Use a matrix product to compute, for each week, the revenue received
from selling all items manufactured in a given week, the total production
cost for each week, and the total machine time spent each week. Present
your answers in a table.
1 yard = 3 feet
1 foot = 12 inches
where all ai ∈ {0, 1, . . . , 9}. The radix point is called the decimal point in
the decimal system.
However, the base 10 is not the only choice, and other bases can be used.
The majority of modern computers use the base-2 or binary system.
where all ai ∈ {0, 1}. In this case the radix point is called the binary
point.
(10110.1)₂ = 1 × 2⁴ + 0 × 2³ + 1 × 2² + 1 × 2¹ + 0 × 2⁰ + 1 × 2⁻¹.
N = aₙβⁿ + . . . + a₀β⁰ ≤ (β − 1) Σ_{i=0}^n βⁱ = βⁿ⁺¹ − 1,
(10)₂ = 1 × 2¹ + 0 × 2⁰ = 2.
bn = an ;
bn−1 = an−1 + bn β = an−1 + an β;
bn−2 = an−2 + bn−1 β = an−2 + an−1 β + an β 2 ;
.. ..
. .
b0 = a 0 + b1 β = a0 + a1 β + a2 β 2 + . . . + an β n .
Algorithm 2.1 can be used to convert integers from the binary to the
decimal system.
i = 3 : N = 0 + 1 × 2 = 2;
i = 2 : N = 1 + 2 × 2 = 5;
i = 1 : N = 1 + 5 × 2 = 11;
i = 0 : N = 1 + 11 × 2 = 23.

Thus, (10111)₂ = 1 × 2⁴ + 0 × 2³ + 1 × 2² + 1 × 2¹ + 1 × 2⁰ = 23.
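Algorithm 2.1 (Horner's scheme) takes one line per digit; a Python sketch (the function name is ours, digits are given most-significant first):

```python
# Horner's scheme: b_i = a_i + b_{i+1} * beta, accumulated left to right
def to_decimal(digits, base=2):
    n = 0
    for a in digits:
        n = a + n * base
    return n

value = to_decimal([1, 0, 1, 1, 1])   # (10111)_2 = 23
```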
b₀ = N − β⌊N/β⌋,

where ⌊N/β⌋ is the integer part of N/β. If we apply the same procedure to
⌊N/β⌋, we can compute b₁, and so on, as in Algorithm 2.2.
Example 2.5 We convert the decimal 195 to the binary form using Algorithm 2.2.
We have:

i = 0 : N₀ = 195, b₀ = 195 − 2⌊195/2⌋ = 1;
i = 1 : N₀ = 97, b₁ = 97 − 2⌊97/2⌋ = 1;
i = 2 : N₀ = 48, b₂ = 48 − 2⌊48/2⌋ = 0;
i = 3 : N₀ = 24, b₃ = 24 − 2⌊24/2⌋ = 0;
i = 4 : N₀ = 12, b₄ = 12 − 2⌊12/2⌋ = 0;
i = 5 : N₀ = 6, b₅ = 6 − 2⌊6/2⌋ = 0;
i = 6 : N₀ = 3, b₆ = 3 − 2⌊3/2⌋ = 1;
i = 7 : N₀ = 1, b₇ = 1 − 2⌊1/2⌋ = 1.

Thus (195)₁₀ = (11000011)₂.
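Algorithm 2.2 is equally short in code (a sketch; the function name is ours):

```python
# each binary digit is the remainder N - 2*floor(N/2); then N becomes floor(N/2)
def to_binary(n):
    bits = []
    while n > 0:
        bits.append(n - 2 * (n // 2))   # equivalently n % 2
        n //= 2
    return bits[::-1]                   # most-significant digit first

bits = to_binary(195)   # [1, 1, 0, 0, 0, 0, 1, 1], i.e., (11000011)_2
```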
x = x_I + x_F,

where x_I = ⌊x⌋ is the integer part of x, and x_F = {x} is the fractional
part of x. For example, for x = 5.3 we have x_I = 5 and x_F = 0.3. We can
convert x_I and x_F to another base separately and then combine the results.
Note that the x_F part can be written in the form

x_F = Σ_{k=1}^∞ a₋ₖ 10⁻ᵏ,
i = 2: D = (1 + 0.3125)/2 = 0.65625;
i = 1: D = (1 + 0.65625)/2 = 0.828125.
Thus, D = 0.828125.
and a−2 is the integer part of F2 , etc. In summary, we obtain the procedure
described in Algorithm 2.4, where we assume that the given decimal fraction
xF terminates and use the same variable F for all Fi , i ≥ 1.
We see that F = 0.2 assumes the same value as for i = 1; therefore, a cycle
occurs. Hence we have

0.1 = (0.00011)₂ with the digit group 0011 repeating,

i.e., 0.1 = (0.00011 0011 0011 . . . 0011 . . .)₂.
Note that if we had to convert (0.00011 0011 . . .)₂ to a decimal, we could
use a trick called shifting, which works as follows. If we multiply
(0.00011 0011 . . .)₂ by 2, we obtain (0.0011 0011 . . .)₂. Because of the
cycle, multiplying (0.00011 0011 . . .)₂ by 2⁵ = 32 gives a number with the
same fractional part, (11.0011 0011 . . .)₂. Hence, if we denote the value of
(0.00011 0011 . . .)₂ by A, we have

32A = 3 + 2A  ⇒  30A = 3  ⇒  A = 1/10.
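The shifting trick, and its floating-point consequence, can be checked with exact rational arithmetic (a sketch using Python's `fractions` module):

```python
from fractions import Fraction

# shifting: A = (0.00011 0011 ...)_2 satisfies 32A = 3 + 2A, so A = 3/30
A = Fraction(3, 30)

# since 1/10 has an infinite binary expansion, the double written as 0.1
# is only the nearest representable value, not exactly 1/10
stored = Fraction(0.1)            # exact rational value of the float 0.1
error = stored - Fraction(1, 10)  # small but nonzero
```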
A floating-point number is represented in the form

x = m × βᵉ,

where

m = ±(0.d₁d₂ . . . dₙ)_β,

and m is called the mantissa or significand; β is the base; n is the length or
precision; and e is the exponent, where s < e < S (usually s = −S).
The mantissa is usually normalized, meaning that its leading digit d₁ is
nonzero. The next example shows why.
0.02564 · 100 .
Obviously, the inclusion of the useless zero to the right of the decimal point
leads to dropping of the digit 1 in the sixth decimal place. But if we normalize
the number by representing it as
0.25641 · 10−1 ,
|x_T − x_C| / |x_T|.
Example 2.11 Consider a problem, in which one wants to find the smallest
value taken by some function y = f (x) defined over the interval [0, 1] (Figure
2.2). The set of all values taken by f (x) is given by X = {f (x), x ∈ [0, 1]}. Ob-
viously, the set X is infinite, and most of its elements are irrational numbers,
requiring infinitely many digits to be represented exactly.
FIGURE 2.2: A function over [0, 1] with exact minimizer x∗, approximate
minimizer x̂, and mesh points x₀, x₁, . . . , x₆.
In reality, any digital computer is finite by its nature. Thus, the infinite set
of values should be approximated by a finite set, in which, in its turn, each ele-
ment should be approximated by a numerical value representable with a finite
number of digits. This leads to two types of errors in numerical computation:
the round-off and truncation error.
Example 2.12 In the problem from Example 2.11, we can consider the set of
n + 1 mesh points {xᵢ = i/n, i = 0, 1, . . . , n} and the finite set
Xₙ = {f(xᵢ), i = 0, 1, . . . , n}.
Example 2.13 Assume that we want to write our answer f˘ to the problem in
the previous example as a five-digit decimal number instead of fˆ = 4/9. Then
f˘ = 0.4444 and the absolute round-off error is |fˆ − f˘| = 4/9 − 4444/10000 =
1/22500 ≈ 0.000044444, where the last number is an approximation of the
absolute round-off error.
The maximum possible |δ| is called the unit round-off. The machine epsilon is
defined as the smallest positive number ε such that fl(1 + ε) ≠ 1.
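A standard way to estimate the machine epsilon is successive halving, stopping when adding ε/2 to 1 no longer changes the stored value:

```python
# halve eps until 1 + eps/2 is indistinguishable from 1 in floating point
eps = 1.0
while 1.0 + eps / 2 != 1.0:
    eps /= 2
# for IEEE 754 double precision this yields eps = 2**(-52)
```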
Then the machine operation ⊛ corresponding to an exact arithmetic operation ∘
satisfies

x ⊛ y = fl(x ∘ y).
and the relative round-off error is |z − z_C| / |z| = 0.3324 = 33.24%.
Exercises
2.1. Find the binary representations of the numbers
(a) a = 32170;
(b) b = 7/16;
(c) c = 75/128.
2.2. Find infinite repeating binary representations of the following numbers:
(a) a = 1/3;
(b) b = 1/10;
(c) c = 1/7.
2.3. Find the binary representations of 10⁶ and 4³.
2.4. Convert (473)10 into a number with the base
(a) 2;
(b) 6;
(c) 8.
2.5. Find the decimal representations of the following numbers:
(a) a = (101011010101)2 ;
(b) b = (16341)8 ;
(c) c = (4523)6 .
2.6. Prove that any number 2⁻ⁿ, where n is a positive integer, can be
represented as an n-digit decimal number 0.a₁a₂ . . . aₙ.
2.7. Prove that 2 = 1.999999 . . ..
2.8. For a decimal integer with m digits, how many bits are needed for its
binary representation?
2.9. Create a hypothetical binary floating-point number set consisting of 7-
bit words, in which the first three bits are used for the sign and the
magnitude of the exponent, and the last four are used for the sign and
magnitude of the mantissa. On the sign place, 0 indicates that the quan-
tity is positive and 1, negative. For example, the number represented in
1
the following figure is −(.100)2 × 2(1×2 ) = −(1 × 2−1 ) × 4 = −2.
± 21 20 ± 2−1 2−2 2−3
0 1 0 1 1 0 0
exponent mantissa
2.10. Consider the quadratic equation ax² + bx + c = 0. Its roots can be
found using the formula

x_{1,2} = (−b ± √(b² − 4ac)) / (2a).

Let a = 1, b = 100 + 10⁻¹⁴, and c = 10⁻¹². Then the exact roots of the
considered equation are x₁ = −10⁻¹⁴ and x₂ = −100.

(a) Use the above formula to compute the first root on a computer.
What is the relative round-off error?

(b) Use the following equivalent formula for the first root:

x₁ = −2c / (b + √(b² − 4ac)).

What do you observe?
What do you observe?
2.11. Use a computer to perform iterations in the form
x_{k+1} = (0.1x_k − 0.2) · 30, k ≥ 0, starting with x₀ = 3. You could use,
e.g., a spreadsheet application. What do you observe after (a) 50 iterations;
(b) 100 iterations? Explain.
Part II
Chapter 3
Elements of Numerical Linear Algebra
In this system, the first two equations express the goal of obtaining the return
of 1.2 under scenarios I and II, respectively. The third equation indicates that
the total amount of the investment to be made is $1 million.
Example 3.2 The system from Example 3.1 has the following matrix form:

⎡ 1.4 0.8 1.2 ⎤ ⎡ x1 ⎤   ⎡ 1.2 ⎤
⎢ 0.9  2  1.0 ⎥ ⎢ x2 ⎥ = ⎢ 1.2 ⎥ .
⎣  1   1   1  ⎦ ⎣ x3 ⎦   ⎣  1  ⎦

It has the solution x1* = 1/2, x2* = 1/4, x3* = 1/4, thus it is consistent.
(iii) det(A) ≠ 0.
Some direct methods proceed by first reducing the given system to a spe-
cial form, which is known to be easily solvable, and then solving the resulting
“easy” system. Triangular systems of linear equations, discussed next, repre-
sent one such easily solvable case.
Without loss of generality we can assume that a_{11}^{(1)} ≠ 0 (if this is
not the case, we can switch rows so that a_{11}^{(1)} ≠ 0). In the Gaussian
elimination method, we first "eliminate" the elements a_{i1}^{(1)},
i = 2, . . . , m in the first column, by turning them into 0. This can be
done by replacing the ith row, i = 2, . . . , m, with the row which is
obtained by multiplying the first row by −a_{i1}^{(1)}/a_{11}^{(1)} and
adding the result to the ith row, thus obtaining

a_{ij}^{(2)} = a_{ij}^{(1)} − (a_{i1}^{(1)}/a_{11}^{(1)}) a_{1j}^{(1)},
i = 2, . . . , n, j = 1, 2, . . . , n, n + 1.
Next, assuming that a_{22}^{(2)} ≠ 0, we use the second row to obtain zeros
in rows 3, . . . , n of the second column of matrix M^{(2)}:

a_{ij}^{(3)} = a_{ij}^{(2)} − (a_{i2}^{(2)}/a_{22}^{(2)}) a_{2j}^{(2)},
i = 3, . . . , n, j = 2, . . . , n, n + 1.
Given two augmented matrices A1 and A2 , the fact that they correspond
row of M^{(1)}, we multiply the first row by −5 and add the result to the
second row to obtain the second row of M^{(2)}. This can be written as

⎡ 1  2  −4  −4 ⎤                            ⎡ 1  2 −4 −4 ⎤
⎢ 5 11 −21 −22 ⎥ ∼ (R2^{(2)} = −5R1^{(1)} + R2^{(1)}) ⎢ 0  1 −1 −2 ⎥ .
⎣ 3 −2   3  11 ⎦                            ⎣ 3 −2  3 11 ⎦
      M^{(1)}

We proceed by eliminating the first element of the third row, thus obtaining
matrix M^{(2)}:

⎡ 1  2 −4 −4 ⎤                            ⎡ 1  2 −4 −4 ⎤
⎢ 0  1 −1 −2 ⎥ ∼ (R3^{(2)} = −3R1^{(1)} + R3^{(1)}) ⎢ 0  1 −1 −2 ⎥ .
⎣ 3 −2  3 11 ⎦                            ⎣ 0 −8 15 23 ⎦
                                                M^{(2)}

Finally,

⎡ 1  2 −4 −4 ⎤                           ⎡ 1  2 −4 −4 ⎤
⎢ 0  1 −1 −2 ⎥ ∼ (R3^{(3)} = 8R2^{(2)} + R3^{(2)}) ⎢ 0  1 −1 −2 ⎥ = [Ā | b̄].
⎣ 0 −8 15 23 ⎦                           ⎣ 0  0  7  7 ⎦
      M^{(2)}                                  M^{(3)}
which has the solution [x1, x2] = [5, 1]. We use three-digit arithmetic to
solve this system using Gaussian elimination. We have

⎡ 0.102   2.45  2.96 ⎤ ∼ (R2^{(2)} = R2^{(1)} − 198·R1^{(1)}) ⎡ 0.102   2.45   2.96 ⎤ ,
⎣ 20.2  −11.4  89.6 ⎦                              ⎣ 0      −497   −496  ⎦
as the pivotal element by switching the kth and lth rows. Then we have

| a_{pk}^{(k)} / a_{lk}^{(k)} | ≤ 1,   ∀p ≥ k.
Thus the solution can be extracted from the right-hand-side vector of the last
matrix: [x1 , x2 , x3 ] = [2, −1, 1].
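The elimination-plus-back-substitution procedure can be sketched compactly in Python (our own compact implementation, not the book's pseudocode), applied to the 3 × 3 system used above, with coefficient matrix [[1, 2, −4], [5, 11, −21], [3, −2, 3]] and right-hand side [−4, −22, 11]:

```python
def solve(A, b):
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]          # augmented matrix
    for k in range(n):
        # partial pivoting: bring the largest |entry| in column k to row k
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):                         # eliminate below pivot
            m = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= m * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                        # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

A = [[1.0, 2.0, -4.0], [5.0, 11.0, -21.0], [3.0, -2.0, 3.0]]
b = [-4.0, -22.0, 11.0]
x = solve(A, b)   # approximately [2, -1, 1]
```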
such that
AX = In ,
where In denotes the n × n identity matrix. Then in order to find the first
column of X, we can solve the system
⎡ a11 a12 · · · a1n ⎤ ⎡ x11 ⎤   ⎡ 1 ⎤
⎢ a21 a22 · · · a2n ⎥ ⎢ x21 ⎥   ⎢ 0 ⎥
⎢  ..  ..  . .  ..  ⎥ ⎢  ..  ⎥ = ⎢ .. ⎥ .
⎣ an1 an2 · · · ann ⎦ ⎣ xn1 ⎦   ⎣ 0 ⎦
When the coefficient matrix is reduced to the identity matrix using elementary
row operations, the right-hand side of the resulting augmented matrix will
contain the inverse of A. So, starting with [A|In ], we use the Gauss-Jordan
method to obtain [In |A−1 ].
Let us eliminate the elements in rows 2 and 3 of the first column using \(a_{11} = 1\) as the pivotal element:
\[
\left[\begin{array}{rrr|rrr} 1 & 2 & -4 & 1 & 0 & 0 \\ 0 & 1 & -1 & -5 & 1 & 0 \\ 0 & -8 & 15 & -3 & 0 & 1 \end{array}\right].
\]
We can use the second row to eliminate non-diagonal elements in the second column:
\[
\left[\begin{array}{rrr|rrr} 1 & 0 & -2 & 11 & -2 & 0 \\ 0 & 1 & -1 & -5 & 1 & 0 \\ 0 & 0 & 7 & -43 & 8 & 1 \end{array}\right].
\]
Finally, we can divide row 3 by 7 and use this row to eliminate the coefficients for \(x_3\) in the other rows:
\[
\left[\begin{array}{rrr|rrr} 1 & 0 & 0 & -9/7 & 2/7 & 2/7 \\ 0 & 1 & 0 & -78/7 & 15/7 & 1/7 \\ 0 & 0 & 1 & -43/7 & 8/7 & 1/7 \end{array}\right] = [\,I_3\,|\,A^{-1}\,].
\]
Now to find the solution of (3.2) all we need to do is to multiply \(A^{-1}\) by the right-hand-side vector b:
\[
x = A^{-1} b = \begin{bmatrix} -9/7 & 2/7 & 2/7 \\ -78/7 & 15/7 & 1/7 \\ -43/7 & 8/7 & 1/7 \end{bmatrix}\begin{bmatrix} -4 \\ -22 \\ 11 \end{bmatrix} = \begin{bmatrix} 2 \\ -1 \\ 1 \end{bmatrix}.
\]
66 Numerical Methods and Optimization: An Introduction
from Example 3.5 (page 61). In this example, we replaced the second row of
M (1) by its sum with the first row multiplied by −5 to obtain
\[
C = \left[\begin{array}{rrr|r} 1 & 2 & -4 & -4 \\ 0 & 1 & -1 & -2 \\ 3 & -2 & 3 & 11 \end{array}\right].
\]
It is easy to check that the same outcome results from performing the following matrix multiplication:
\[
\underbrace{\begin{bmatrix} 1 & 0 & 0 \\ -5 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{E_{21}^{(-5)}}
\underbrace{\left[\begin{array}{rrr|r} 1 & 2 & -4 & -4 \\ 5 & 11 & -21 & -22 \\ 3 & -2 & 3 & 11 \end{array}\right]}_{M^{(1)}}
= \underbrace{\left[\begin{array}{rrr|r} 1 & 2 & -4 & -4 \\ 0 & 1 & -1 & -2 \\ 3 & -2 & 3 & 11 \end{array}\right]}_{C}.
\]
Next, in Example 3.5, we replaced the third row of the current matrix C by its sum with the first row of C multiplied by −3 to obtain \(M^{(2)}\), that is,
\[
M^{(2)} = E_{31}^{(-3)} C = E_{31}^{(-3)} E_{21}^{(-5)} M^{(1)}.
\]
Finally, we replaced the third row of \(M^{(2)}\) by its sum with the second row of \(M^{(2)}\) multiplied by 8 to obtain \(M^{(3)}\):
\[
M^{(3)} = E_{32}^{(8)} M^{(2)} = E_{32}^{(8)} E_{31}^{(-3)} E_{21}^{(-5)} M^{(1)}.
\]
In the process of obtaining \(M^{(3)} = [A^{(3)}|b^{(3)}]\), the coefficient matrix A has undergone the following transformations to become an upper triangular matrix \(U \equiv A^{(3)}\):
\[
U = E_{32}^{(8)} E_{31}^{(-3)} E_{21}^{(-5)} A.
\]
Hence,
\[
A = \left(E_{21}^{(-5)}\right)^{-1} \left(E_{31}^{(-3)}\right)^{-1} \left(E_{32}^{(8)}\right)^{-1} U = LU,
\]
where \(L = \left(E_{21}^{(-5)}\right)^{-1} \left(E_{31}^{(-3)}\right)^{-1} \left(E_{32}^{(8)}\right)^{-1}\). Next we show that L is a lower triangular matrix. Note that \(\left(E_{kl}^{(m_{kl})}\right)^{-1} = E_{kl}^{(-m_{kl})}\) and
\[
E_{21}^{(-m_{21})} E_{31}^{(-m_{31})} E_{32}^{(-m_{32})} = \begin{bmatrix} 1 & 0 & 0 \\ -m_{21} & 1 & 0 \\ -m_{31} & -m_{32} & 1 \end{bmatrix}.
\]
Hence,
\[
L = \left(E_{21}^{(-5)}\right)^{-1} \left(E_{31}^{(-3)}\right)^{-1} \left(E_{32}^{(8)}\right)^{-1}
= E_{21}^{(5)} E_{31}^{(3)} E_{32}^{(-8)}
= \begin{bmatrix} 1 & 0 & 0 \\ 5 & 1 & 0 \\ 3 & -8 & 1 \end{bmatrix}.
\]
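The same bookkeeping — storing each multiplier \(m_{kl}\) in the position it eliminated — gives a direct LU factorization routine. The following Python sketch (an illustration added here, without pivoting) reproduces the L and U obtained above:

```python
import numpy as np

def lu_no_pivot(A):
    """Doolittle-style LU factorization without pivoting: A = L U.
    L holds the elimination multipliers below a unit diagonal."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L, U = np.eye(n), A.copy()
    for k in range(n - 1):
        for i in range(k + 1, n):
            m = U[i, k] / U[k, k]   # multiplier used to zero out U[i, k]
            L[i, k] = m             # stored with a plus sign, as derived above
            U[i] -= m * U[k]
    return L, U

A = np.array([[1.0, 2, -4], [5, 11, -21], [3, -2, 3]])
L, U = lu_no_pivot(A)
# L carries the multipliers 5, 3, -8 below the diagonal, matching the text
```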
and we first exchange the first two rows of \(M^{(1)}\). The permutation corresponding to this row exchange is
\[
p_1 = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 1 & 3 \end{pmatrix}.
\]
For each permutation p we can define the corresponding permutation matrix \(P = [p_{ij}]_{n\times n}\) such that
\[
p_{ij} = \begin{cases} 1, & \text{if } p(i) = j, \\ 0, & \text{otherwise,} \end{cases}
\]
which is exactly \(M^{(1)}\) after exchange of its first two rows. Next, we eliminate the below-diagonal elements in the first column:
\[
E_{31}^{(-3/5)} E_{21}^{(-1/5)} P_1 M^{(1)}
= \left[\begin{array}{rrr|r} 5 & 11 & -21 & -22 \\ 0 & -1/5 & 1/5 & 2/5 \\ 0 & -43/5 & 78/5 & 121/5 \end{array}\right] = M^{(2)}.
\]
we have
P A = LU ⇐⇒ A = P T LU.
Consider a system Ax = b and suppose that a triangular factorization
A = LU of A is known. Then
Ax = b ⇐⇒ LU x = b ⇐⇒ Ly = b, where U x = y.
Ax = b ⇐⇒ (P A)x = P b ⇐⇒ LU x = b ,
Then the absolute error after k iterations of the scheme, expressed in terms of the norm \(\|\cdot\|_V\), satisfies the following inequality:
\[
\|x^{(k)} - x^*\|_V \le \|I_n - CA\|_M \cdot \|x^{(k-1)} - x^*\|_V.
\]
Since
\[
\|x^{(k-1)} - x^*\|_V \le \|I_n - CA\|_M \cdot \|x^{(k-2)} - x^*\|_V,
\]
we have
\[
\|x^{(k)} - x^*\|_V \le \|I_n - CA\|_M^2 \cdot \|x^{(k-2)} - x^*\|_V.
\]
Continuing likewise and noting that the error bound is multiplied by the factor of \(\|I_n - CA\|_M\) each time, we obtain the following upper bound on the error:
\[
\|x^{(k)} - x^*\|_V \le \|I_n - CA\|_M^{k} \cdot \|x^{(0)} - x^*\|_V.
\]
Therefore, if \(\|I_n - CA\|_M < 1\), then \(\|x^{(k)} - x^*\|_V \to 0\) as \(k \to \infty\), and the method converges. A matrix C such that \(\|I_n - CA\|_M < 1\) for some matrix norm is called an approximate inverse of A.
Example 3.9 If we choose \(C = A^{-1}\), then for any initial \(x^{(0)}\) we have \(x^{(1)} = x^{(0)} + A^{-1}(b - Ax^{(0)}) = A^{-1}b = x^*\), i.e., the method converges in a single step.

Recall that when \(\|I_n - D^{-1}A\|_M < 1\) for some matrix norm, the method is guaranteed to converge. We have
\[
I_n - D^{-1}A = \begin{bmatrix}
0 & -\frac{a_{12}}{a_{11}} & \cdots & -\frac{a_{1n}}{a_{11}} \\
-\frac{a_{21}}{a_{22}} & 0 & \cdots & -\frac{a_{2n}}{a_{22}} \\
\vdots & \vdots & \ddots & \vdots \\
-\frac{a_{n1}}{a_{nn}} & -\frac{a_{n2}}{a_{nn}} & \cdots & 0
\end{bmatrix}.
\]
Recall that A is strictly diagonally dominant by rows if
\[
|a_{ii}| > \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}|, \quad i = 1, 2, \ldots, n,
\]
and strictly diagonally dominant by columns if
\[
|a_{jj}| > \sum_{\substack{i=1 \\ i \ne j}}^{n} |a_{ij}|, \quad j = 1, 2, \ldots, n.
\]
In the first case,
\[
\|I_n - D^{-1}A\|_\infty = \max_{1 \le i \le n} \sum_{\substack{j=1 \\ j \ne i}}^{n} \frac{|a_{ij}|}{|a_{ii}|} < 1,
\]
while the corresponding 1-norm expression is
\[
\|I_n - D^{-1}A\|_1 = \max_{1 \le j \le n} \sum_{\substack{i=1 \\ i \ne j}}^{n} \frac{|a_{ij}|}{|a_{ii}|} < 1,
\]
The true solution is \(x^* = [1, 1]^T\). We apply the Jacobi method three times with \(x^{(0)} = [0, 0]^T\) to find an approximate solution of this system. We have
\[
D^{-1} = \begin{bmatrix} \tfrac{1}{2} & 0 \\ 0 & \tfrac{1}{5} \end{bmatrix}.
\]
Hence, using the Jacobi iteration (3.4) we obtain:
\[
x^{(1)} = \begin{bmatrix} 1/2 & 0 \\ 0 & 1/5 \end{bmatrix}\begin{bmatrix} 3 \\ 8 \end{bmatrix} = \begin{bmatrix} 3/2 \\ 8/5 \end{bmatrix};
\]
\[
x^{(2)} = \begin{bmatrix} 3/2 \\ 8/5 \end{bmatrix} + \begin{bmatrix} 1/2 & 0 \\ 0 & 1/5 \end{bmatrix}\left(\begin{bmatrix} 3 \\ 8 \end{bmatrix} - \begin{bmatrix} 23/5 \\ 25/2 \end{bmatrix}\right) = \begin{bmatrix} 7/10 \\ 7/10 \end{bmatrix};
\]
\[
x^{(3)} = \begin{bmatrix} 7/10 \\ 7/10 \end{bmatrix} + \begin{bmatrix} 1/2 & 0 \\ 0 & 1/5 \end{bmatrix}\begin{bmatrix} 9/10 \\ 12/5 \end{bmatrix} = \begin{bmatrix} 23/20 \\ 59/50 \end{bmatrix}.
\]
Notice how the error decreases:
\[
\|x^{(1)} - x^*\|_\infty = 0.60; \qquad
\|x^{(2)} - x^*\|_\infty = 0.30; \qquad
\|x^{(3)} - x^*\|_\infty = 0.18.
\]
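The computation above can be checked with a short Python sketch of the Jacobi iteration. The 2×2 system used here, \(2x_1 + x_2 = 3\), \(3x_1 + 5x_2 = 8\), is inferred from \(D^{-1}\) and the iterates shown, so treat it as an assumption of this sketch:

```python
import numpy as np

def jacobi(A, b, x0, iters):
    """Jacobi iteration x^(k+1) = x^(k) + D^-1 (b - A x^(k))."""
    A, b = np.array(A, float), np.array(b, float)
    D_inv = np.diag(1.0 / np.diag(A))
    x = np.array(x0, float)
    for _ in range(iters):
        x = x + D_inv @ (b - A @ x)
    return x

# System consistent with the worked iterates above (an assumption of this sketch)
A = np.array([[2.0, 1], [3, 5]])
b = np.array([3.0, 8])
x3 = jacobi(A, b, [0, 0], 3)   # -> [1.15, 1.18], i.e., [23/20, 59/50]
```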
In general,
\[
x_i^{(k+1)} = \frac{-\sum\limits_{j<i} a_{ij} x_j^{(k+1)} - \sum\limits_{j>i} a_{ij} x_j^{(k)} + b_i}{a_{ii}}, \quad i = 1, 2, \ldots, n. \tag{3.5}
\]
Hence, computing \(x_i^{(k+1)}\) requires the values of \(x_1^{(k+1)}, x_2^{(k+1)}, \ldots, x_{i-1}^{(k+1)}\).

For k = 0 we have:
\[
2x_1^{(1)} = 3 \;\Rightarrow\; x_1^{(1)} = 1.5, \qquad
5x_2^{(1)} = -\tfrac{9}{2} + 8 \;\Rightarrow\; x_2^{(1)} = 0.7.
\]
For k = 1 we obtain:
\[
2x_1^{(2)} = -\tfrac{7}{10} + 3 \;\Rightarrow\; x_1^{(2)} = \tfrac{23}{20} = 1.15, \qquad
5x_2^{(2)} = -\tfrac{69}{20} + 8 \;\Rightarrow\; x_2^{(2)} = \tfrac{91}{100} = 0.91.
\]
Finally, for k = 2:
\[
2x_1^{(3)} = -\tfrac{91}{100} + 3 \;\Rightarrow\; x_1^{(3)} = \tfrac{209}{200} = 1.045, \qquad
5x_2^{(3)} = -\tfrac{627}{200} + 8 \;\Rightarrow\; x_2^{(3)} = \tfrac{973}{1000} = 0.973.
\]
Recall that the true solution of the considered system is \(x^* = [1, 1]^T\); therefore the errors for each step are
\[
\|x^{(1)} - x^*\|_\infty = 0.500; \qquad
\|x^{(2)} - x^*\|_\infty = 0.150; \qquad
\|x^{(3)} - x^*\|_\infty = 0.045.
\]
We can observe that for this system, the Gauss-Seidel method converges faster
than the Jacobi method.
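A matching sketch of the Gauss-Seidel iteration (3.5), applied to the same assumed 2×2 system, reproduces the iterates computed above; the only change from Jacobi is that each freshly updated component is used immediately:

```python
import numpy as np

def gauss_seidel(A, b, x0, iters):
    """Gauss-Seidel iteration (3.5): use updated components as soon as available."""
    A, b = np.array(A, float), np.array(b, float)
    x = np.array(x0, float)
    n = len(b)
    for _ in range(iters):
        for i in range(n):
            s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (b[i] - s) / A[i, i]
    return x

# Same assumed system as in the Jacobi example above
A = np.array([[2.0, 1], [3, 5]])
b = np.array([3.0, 8])
x3 = gauss_seidel(A, b, [0.0, 0.0], 3)   # -> [1.045, 0.973]
```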
In the open model there is also the open sector demand for di ≥ 0 units of
product i, whereas in the closed model the open sector is ignored (d = 0).
Denote by xi the output of the ith product needed to meet the total demand.
Then the output of xj units of product j will require the input of aij xj units
of product i. Since the volume of production of the ith product should be
equal to the total demand for the product i, we have:
and, taking into account that aij > 0 for i, j = 1, 2, . . . , n, we conclude that
In − A is strictly diagonally dominant by columns. Hence, for any initial guess
x(0) , the Jacobi method applied to (3.7) will converge to the unique solution
\[
x_i^{(k+1)} = \frac{d_i + \sum\limits_{\substack{j=1 \\ j \ne i}}^{n} a_{ij}\, x_j^{(k)}}{1 - a_{ii}} \quad \text{for } i = 1, 2, \ldots, n,
\]
The following theorem yields a method for finding a least squares solution of
an overdetermined system of linear equations.
\[
A^T A x = A^T b,
\]
Proof. We need to show that \(\|b - Ax^*\|_2 \le \|b - Ax\|_2\) for any \(x \in \mathbb{R}^n\). For an arbitrary \(x \in \mathbb{R}^n\), we have:
\[
\|b - Ax\|_2^2 = \|(b - Ax^*) + (Ax^* - Ax)\|_2^2
= \|b - Ax^*\|_2^2 + \underbrace{2(x^* - x)^T (A^T b - A^T A x^*)}_{=0} + \underbrace{\|A(x^* - x)\|_2^2}_{\ge 0}
\ge \|b - Ax^*\|_2^2,
\]
which proves that x∗ is a least squares solution of the system Ax = b.
Example 3.12 Find the least squares solution of the system
x1 + x2 = 1
x1 − x2 = 3
x1 + 2x2 = 2.
First, we show that the system has no exact solution. Solving the first two equations, we find that \(x_1 = 2\), \(x_2 = -1\). Substituting these values into the third equation, we obtain 0 = 2, implying that the system has no solution.
To find the least squares solution, we solve the system AT Ax = AT b, where
\[
A = \begin{bmatrix} 1 & 1 \\ 1 & -1 \\ 1 & 2 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}.
\]
We have
\[
A^T A x = A^T b \iff \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 6 \\ 2 \end{bmatrix}.
\]
This system has a unique solution
\[
\begin{bmatrix} x_1^* \\ x_2^* \end{bmatrix} = \begin{bmatrix} 16/7 \\ -3/7 \end{bmatrix}.
\]
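The least squares computation of Example 3.12 can be verified numerically by forming and solving the normal equations; the NumPy calls below are an illustrative sketch:

```python
import numpy as np

A = np.array([[1.0, 1], [1, -1], [1, 2]])
b = np.array([1.0, 3, 2])

# Solve the normal equations A^T A x = A^T b from Example 3.12
x_star = np.linalg.solve(A.T @ A, A.T @ b)   # -> [16/7, -3/7]
```

In practice one would prefer `np.linalg.lstsq(A, b)` (a QR/SVD-based solver) over forming \(A^T A\) explicitly, since the normal equations square the conditioning of the problem.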
Such a line is called the linear regression line for the points [x1 , y1 ]T , . . . , [xn , yn ]T .
Clearly, this problem is equivalent to the problem of finding a least squares
solution of the following overdetermined linear system with n equations and
two unknowns (a and b):
ax1 + b = y1
···
axn + b = yn .
The solution can be obtained by solving the system \(A^T A z = A^T y\), where
\[
z = \begin{bmatrix} a \\ b \end{bmatrix}, \qquad y = [y_1, \ldots, y_n]^T, \qquad
A^T = \begin{bmatrix} x_1 & \ldots & x_n \\ 1 & \ldots & 1 \end{bmatrix}.
\]
We have:
\[
A^T A z = A^T y \iff
\begin{cases}
\left(\sum\limits_{i=1}^{n} x_i^2\right) a + \left(\sum\limits_{i=1}^{n} x_i\right) b = \sum\limits_{i=1}^{n} x_i y_i, \\[6pt]
\left(\sum\limits_{i=1}^{n} x_i\right) a + n\,b = \sum\limits_{i=1}^{n} y_i.
\end{cases}
\]
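The 2×2 normal equations above translate directly into code. The helper below is a hypothetical illustration, and the data set is a made-up example on which the fit is exact:

```python
import numpy as np

def regression_line(x, y):
    """Least squares line y ~ a x + b via the 2x2 normal equations above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    M = np.array([[np.sum(x**2), np.sum(x)],
                  [np.sum(x),    n        ]])
    rhs = np.array([np.sum(x * y), np.sum(y)])
    a, b = np.linalg.solve(M, rhs)
    return a, b

a, b = regression_line([0, 1, 2, 3], [1, 3, 5, 7])   # data lie exactly on y = 2x + 1
```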
Since \(|\lambda_i/\lambda_1| < 1\), the direction of \(x^{(k)}\) tends to that of \(v_1\) as \(k \to \infty\) (assuming \(c_1 \ne 0\)). Also, for
\[
\mu^{(k)} = \frac{(x^{(k)})^T A x^{(k)}}{(x^{(k)})^T x^{(k)}},
\]
we have \(\mu^{(k)} \to \lambda_1\) as \(k \to \infty\) according to (3.10).

To ensure the convergence of \(x^{(k)}\) to a nonzero vector of bounded length, we normalize the iterates:
\[
v^{(k)} = \frac{x^{(k)}}{\|x^{(k)}\|_2}, \qquad x^{(k+1)} = A v^{(k)}, \quad k \ge 0. \tag{3.11}
\]
Also,
\[
\mu^{(k)} = \frac{(v^{(k)})^T A v^{(k)}}{(v^{(k)})^T v^{(k)}} = (v^{(k)})^T x^{(k+1)}. \tag{3.12}
\]
In summary, starting with an initial guess \(x^{(0)}\), we proceed by computing \(\{v^{(k)} : k \ge 1\}\) and \(\{\mu^{(k)} : k \ge 1\}\) using (3.11)–(3.12) to find approximations of \(v_1\) and \(\lambda_1\), respectively.
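The normalized power iteration (3.11)–(3.12) is only a few lines of code; the test matrix below is an illustrative choice with known dominant eigenvalue 3 (eigenvector along [1, 1]):

```python
import numpy as np

def power_method(A, x0, iters):
    """Normalized power iteration (3.11)-(3.12); returns (mu, v)."""
    A = np.array(A, float)
    x = np.array(x0, float)
    for _ in range(iters):
        v = x / np.linalg.norm(x)   # v^(k) = x^(k) / ||x^(k)||_2
        x = A @ v                   # x^(k+1) = A v^(k)
        mu = v @ x                  # mu^(k) = (v^(k))^T x^(k+1)
    return mu, v

A = [[2.0, 1], [1, 2]]                    # eigenvalues 3 and 1
mu, v = power_method(A, [1.0, 0.0], 50)   # mu -> 3
```

The starting vector must have a nonzero component along \(v_1\) (the \(c_1 \ne 0\) condition above); here [1, 0] does.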
\[
w = Aw, \tag{3.14}
\]
Note that A is a column-stochastic matrix (that is, all its elements are
nonnegative and each column sums up to 1), and as such is guaranteed to have
Exercises
3.1. Solve the following systems of linear algebraic equations using
(i) x1 + x2 + 2x3 = 1
2x1 + x2 − 3x3 = 0
−3x1 − x2 + x3 = 1
(ii) x1 + (1/2)x2 + (1/3)x3 = 1
(1/2)x1 + (1/3)x2 + (1/4)x3 = 0
(1/3)x1 + (1/4)x2 + (1/5)x3 = 0
3.4. How many multiplication and division operations are required to solve
a linear system using
3.5. Let
\[
A = \begin{bmatrix} 3 & 2 \\ 1 & 2 \end{bmatrix}, \qquad b = \begin{bmatrix} 1 \\ 0 \end{bmatrix}.
\]
3.6. For the systems in Exercise 3.2, use x(0) = [−1, 1, 2]T as the initial guess
and do the following:
3.7. Let
\[
A = \begin{bmatrix} 3 & -1 & 1 \\ 0 & 2 & 1 \\ -1 & 1 & 4 \end{bmatrix}; \qquad b = \begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix}.
\]
3.8. Given
(i)
\[
A = \begin{bmatrix} -3 & 1 & 1 \\ 1 & 2 & 0 \\ 1 & -1 & 3 \end{bmatrix}, \quad b = \begin{bmatrix} 1 \\ -1 \\ 0 \end{bmatrix}, \quad x^{(0)} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix};
\]
(ii)
\[
A = \begin{bmatrix} 2 & 0 & 1 \\ -1 & 3 & 1 \\ 0 & 1 & -2 \end{bmatrix}, \quad b = \begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix}, \quad x^{(0)} = \begin{bmatrix} 1 \\ 2 \\ -1 \end{bmatrix};
\]
(iii)
\[
A = \begin{bmatrix} 3 & 1 & 1 \\ -1 & 2 & 0 \\ 2 & -1 & -4 \end{bmatrix}, \quad b = \begin{bmatrix} -1 \\ 0 \\ 2 \end{bmatrix}, \quad x^{(0)} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix};
\]
(iv)
\[
A = \begin{bmatrix} 4 & 3 & -2 \\ -2 & 5 & 0 \\ 1 & -1 & -4 \end{bmatrix}, \quad b = \begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix}, \quad x^{(0)} = \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix};
\]
(v)
\[
A = \begin{bmatrix} 5 & 2 & -1 & 0 \\ 1 & 7 & 2 & -3 \\ 0 & 3 & -6 & 1 \\ -2 & 4 & -2 & 9 \end{bmatrix}, \quad b = \begin{bmatrix} 4 \\ 0 \\ -5 \\ 5 \end{bmatrix}, \quad x^{(0)} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix};
\]
Use x(0) as an initial guess. What can you say about the convergence of
the Jacobi and Gauss-Seidel methods in each case?
Example 4.1 John saves money by making regular monthly deposits in the
amount p to his account. Assume that the annual interest rate is r. Then the
total amount, A, that John will have after n payments is
\[
A = p + p\left(1+\frac{r}{12}\right) + p\left(1+\frac{r}{12}\right)^2 + \ldots + p\left(1+\frac{r}{12}\right)^{n-1}. \tag{4.2}
\]
In this expression, the term \(p\left(1+\frac{r}{12}\right)^{i-1}\) is the contribution of the (n − i + 1)-th deposit toward the total sum; that is, the contribution of the last, month-n payment is p, the contribution of the payment made the month before is \(p\left(1+\frac{r}{12}\right)\) (the deposit plus the interest for one month), etc.; and the contribution of the first payment is \(p\left(1+\frac{r}{12}\right)^{n-1}\) (the deposit plus the interest for n − 1 months). The right-hand side of Equation (4.2) is the sum of n terms of a geometric series, hence
\[
A = \sum_{i=1}^{n} p\left(1+\frac{r}{12}\right)^{i-1} = p\,\frac{\left(1+\frac{r}{12}\right)^n - 1}{\left(1+\frac{r}{12}\right) - 1},
\]
upon retirement is A. Then in order to find an interest rate that would be high
enough to achieve his target, John needs to solve the following equation with
respect to r:
\[
F(r) = \frac{p}{r/12}\left[\left(1+\frac{r}{12}\right)^n - 1\right] - A = 0. \tag{4.4}
\]
There are only a few types of equations that can be solved by simple
analytical methods. These include linear and quadratic equations
\[
ax + b = 0, \qquad ax^2 + bx + c = 0.
\]
In general, the problem of finding a solution of a nonlinear equation is not easy. For example, for a polynomial of degree n,
\[
p(x) = a_n x^n + a_{n-1} x^{n-1} + a_{n-2} x^{n-2} + \ldots + a_1 x + a_0,
\]
with n > 4, there are no general formulas for solving the equation p(x) = 0. Moreover, in 1823 the Norwegian mathematician Niels Henrik Abel (1802–1829) showed that no such formulas can be developed for general polynomials of degree higher than four.
When direct solution methods are not available, numerical (iterative) tech-
niques are used, which start with an initial solution estimate x0 and proceed
by recursively computing improved estimates x1 , x2 , . . . , xn until a certain
stopping criterion is satisfied.
Numerical methods typically give only an approximation to the exact so-
lution. However, this approximation can be of a very good (predefined) accu-
racy, depending on the amount of computational effort one is willing to invest.
One of the advantages of numerical methods is their simplicity: they can be
concisely expressed in algorithmic form and can be easily implemented on a
computer. The main drawback is that in some cases numerical methods fail.
We will illustrate this point when discussing specific methods.
such that (4.6) is equivalent to (4.5) on the interval (a, b), i.e., such that \(x^* \in (a, b)\) solves (4.5) if and only if \(x^*\) solves (4.6).

Example 4.2 Put \(f(x) = x - \alpha(x)F(x)\), where \(\alpha(x) \ne 0\) for \(x \in (a, b)\). Then, obviously, \(f(x) = x \Leftrightarrow \alpha(x)F(x) = 0 \Leftrightarrow F(x) = 0\).
To find a root x∗ ∈ (a, b) of Equation (4.6), we take some starting point
x0 ∈ (a, b) and sequentially calculate x1 , x2 , . . . using the formula
xk+1 = f (xk ), k = 0, 1, . . . . (4.7)
In the above, we assume that f (xk ) ∈ (a, b) for all k, so xk is always in the
domain of F .
Definition 4.3 The iteration xk+1 = f (xk ) is called a fixed point iter-
ation.
Example 4.3 For f (x) = exp(−x) and x0 = −1, four fixed point iterations
produce the following points:
x1 = f (x0 ) ≈ 2.718282,
x2 = f (x1 ) ≈ 0.065988,
x3 = f (x2 ) ≈ 0.936142,
x4 = f (x3 ) ≈ 0.392138.
FIGURE 4.1: Four fixed point iterations applied to f(x) = exp(−x) starting with x0 = −1.
The iterations are illustrated in Figure 4.1. Observe that applying the fixed
point iteration recursively will produce a sequence converging to the fixed point
x∗ ≈ 0.567143.
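The fixed point iteration (4.7) can be sketched in a few lines; the example below reuses f(x) = exp(−x) from Example 4.3:

```python
import math

def fixed_point(f, x0, iters):
    """Repeatedly apply x_{k+1} = f(x_k), as in iteration (4.7)."""
    x = x0
    for _ in range(iters):
        x = f(x)
    return x

# Example 4.3: starting from x0 = -1, the iterates approach x* ~ 0.567143
x = fixed_point(lambda t: math.exp(-t), -1.0, 50)
```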
The following example shows that a repeated application of the fixed point
iteration may not produce a convergent sequence.
x1 = f (x0 ) = 1/4,
x2 = f (x1 ) = 4 = x0 ,
hence, if we continue, we will cycle between the same two points, x0 and x1 .
See Figure 4.2 for an illustration.
FIGURE 4.2: Fixed point iterations applied to f(x) = 1/x starting with x0 = 4 lead to a cycle x0, x1, x0, x1, ..., where x1 = 1/4.
Proof. Let \(x^* \in (a, b)\) be an arbitrary fixed point of f, and let \(x_k \in (a, b)\). According to the mean value theorem, there exists \(c_k\) between \(x_k\) and \(x^*\) in (a, b) such that \(f(x_k) - f(x^*) = f'(c_k)(x_k - x^*)\), therefore

Hence,
Example 4.5 Assume that John from Example 4.1 can make the monthly
payment of $300, and we want to find an interest rate r which would yield
$1,000,000 in 30 years. We will use the annuity-due equation given in (4.4):
\[
\frac{p}{r/12}\left[\left(1+\frac{r}{12}\right)^n - 1\right] - A = 0. \tag{4.9}
\]
Rearrange the annuity-due equation in the following way:
\[
\left(1+\frac{r}{12}\right)^n - 1 - \frac{Ar}{12p} = 0, \tag{4.10}
\]
so,
\[
r = 12\left(\frac{Ar}{12p} + 1\right)^{1/n} - 12.
\]
TABLE 4.1: Fixed point iterations for finding the interest rate with two different choices for r0.

  Iteration      Approximate interest rate, rk
  number, k      r0 = 0.1      r0 = 0.2
      1          0.112511      0.135264
      2          0.116347      0.122372
      3          0.117442      0.119091
      4          0.117747      0.118203
      5          0.117832      0.117958
      6          0.117856      0.117891
      7          0.117862      0.117872
      8          0.117864      0.117867
      9          0.117865      0.117865
     10          0.117865      0.117865
[Figure: the bisection method; the bracketing intervals [a_k, b_k] and their midpoints c_k close in on the root x*.]
\[
|x^* - c_n| \le \frac{1}{2}(b_n - a_n).
\]
Also,
\[
b_n - a_n = \frac{1}{2}(b_{n-1} - a_{n-1}) = \frac{1}{2^2}(b_{n-2} - a_{n-2}) = \ldots = \frac{1}{2^n}(b_0 - a_0).
\]
Hence,
\[
|x^* - c_n| \le \frac{1}{2^{n+1}}(b_0 - a_0), \tag{4.11}
\]
and the bisection method converges. Moreover, (4.11) can be used to determine the number of iterations required to achieve a given precision \(\epsilon\) by solving the inequality
\[
\frac{1}{2^{n+1}}(b_0 - a_0) < \epsilon,
\]
which is equivalent to
\[
n > \frac{\ln\!\left(\frac{b_0 - a_0}{\epsilon}\right)}{\ln 2} - 1.
\]
In particular, for \(b_0 - a_0 = 1\) and \(\epsilon = 10^{-5}\) this gives \(n > \frac{5\ln(10)}{\ln 2} - 1 \approx 15.6\), and the number of iterations guaranteeing the precision is 16.
A pseudo-code for the bisection method is presented in Algorithm 4.1. We stop if \(f(c_{k-1}) = 0\) at some iteration (line 6 of the algorithm), meaning that the algorithm found the exact root of f, or when the search interval is small enough to guarantee that \(|c_n - x^*| < \epsilon\) (line 14). We could also add the requirement that \(|f(\bar{x})| < \delta\) to the stopping criterion to make sure that not only is our approximate solution near the real root, but also that the corresponding function value is very close to 0.
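Algorithm 4.1 can be sketched in Python as follows (a simplified variant keeping only the exact-root and interval-width stopping tests):

```python
def bisection(f, a, b, eps):
    """Bisection: halve [a, b] until |c - x*| < eps is guaranteed by (4.11)."""
    if f(a) * f(b) > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    while (b - a) / 2 >= eps:
        c = (a + b) / 2
        if f(c) == 0:
            return c          # exact root found
        if f(a) * f(c) < 0:
            b = c             # root lies in [a, c]
        else:
            a = c             # root lies in [c, b]
    return (a + b) / 2

# f(x) = x^3 + x - 1 has a single root in [0, 1]
root = bisection(lambda x: x**3 + x - 1, 0.0, 1.0, 1e-6)
```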
therefore we can say that there exists at least one root within [0, 1]. In fact, we can show that this interval contains exactly one root of f. Indeed, suppose that there are at least two roots \(\hat{x}, x^* \in [0, 1]\), such that \(f(\hat{x}) = f(x^*) = 0\). Then by the mean value theorem there exists \(\gamma\) between \(\hat{x}\) and \(x^*\) such that \(f'(\gamma) = 0\). But \(f'(x) = 3x^2 + 1 \ge 1\) for all x in [0, 1], leading to a contradiction. Therefore, the assumption that f has more than one root in [0, 1] is not correct.
Using Algorithm 4.1 we obtain:
1. if f (a)f (b) > 0, then there is an even number of roots of f in [a, b];
2. if f (a)f (b) < 0, then there is an odd number of roots of f in [a, b].
and since f (a0 )f (c0 ) < 0 and f (b0 )f (c0 ) < 0, both [−2, 0] and [0, 2] contain
at least one root of f . Hence, we can apply the bisection method for each of
these intervals to find the corresponding roots. For [a1 , b1 ] = [0, 2] we have
c1 = (0 + 2)/2 = 1, f (c1 ) = 0,
and we found another root \(x^* = -1\). Since \(f'(x) = 6x^5 + 16x^3 + 2x\), we have \(f'(x) < 0\) for all x < 0 and \(f'(x) > 0\) for all x > 0, implying that the function is decreasing on \((-\infty, 0)\) and increasing on \((0, +\infty)\). Hence, f can have at most one negative and one positive root, and it has no real roots other than \(x^* = -1\) and \(\hat{x} = 1\).
\[
y - f(b) = \frac{f(b) - f(a)}{b - a}(x - b),
\]
so for y = 0,
\[
\frac{f(b) - f(a)}{b - a} = -\frac{f(b)}{x - b} = \frac{f(b)}{b - x},
\]
implying
\[
x = b - f(b)\,\frac{b - a}{f(b) - f(a)} = \frac{a f(b) - b f(a)}{f(b) - f(a)}.
\]
Thus, we have the following expression for \(c_k\):
\[
c_k = \frac{a_k f(b_k) - b_k f(a_k)}{f(b_k) - f(a_k)}.
\]
[Figure: the regula-falsi method; each c_k is the x-intercept of the chord through (a_k, f(a_k)) and (b_k, f(b_k)).]
search interval converges to the root x∗ , whereas the other endpoint of the
search interval always remains the same. For example, if we apply the regula-
falsi method to f (x) = 2x3 − 4x2 + 3x on [−1, 1], we observe that the left
endpoint is always −1. At the same time, the right endpoint approaches 0,
which is the root (Exercise 4.4). Thus, the length of the interval is always at
least 1 in this case.
[Figure: the modified regula-falsi method; when an endpoint is retained, its stored function value is halved, which forces the bracket [a_k, b_k] to shrink.]
This simple modification guarantees that the width of the bracket tends to
zero. Moreover, the modified regula-falsi method often converges faster than
the original regula-falsi method. A pseudo-code for the modified regula-falsi
method is given in Algorithm 4.3.
Example 4.8 Let f (x) = x3 + x − 1. We use the modified regula-falsi method
to compute an approximate root of f on [0, 1].
We have a0 = 0, b0 = 1, f (a0 ) = −1, f (b0 ) = 1, and f (a0 )f (b0 ) < 0. We
compute
\[
c_0 = \frac{a_0 f(b_0) - b_0 f(a_0)}{f(b_0) - f(a_0)} = \frac{1}{2} = 0.5, \qquad f(c_0) = -0.375.
\]
about \(x_k\):
\[
f(x) \approx f(x_k) + f'(x_k)(x - x_k).
\]
Instead of solving f(x) = 0, we solve the linear equation \(f(x_k) + f'(x_k)(x - x_k) = 0\), whose solution yields the next iterate
\[
x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}.
\]
Equivalently,
\[
x_k = x_{k-1} - \frac{f(x_{k-1})}{f'(x_{k-1})}, \quad k \ge 1. \tag{4.13}
\]
Figure 4.6 illustrates the steps of Newton’s method geometrically. For the
function f in this illustration, the method quickly converges to the root x∗ of
[Figures 4.7 and 4.8: examples of functions and starting points for which Newton's method fails to converge to a root.]
f . However, as can be seen from examples in Figures 4.7 and 4.8, sometimes
Newton’s method fails to converge to a root.
A pseudo-code of Newton’s method is presented in Algorithm 4.4. Besides
terminating when we reach a point xk with |f (xk )| < or when the step
size becomes very small (|xk − xk−1 | < δ), we included additional stopping
criteria, such as allowing no more than N iterations and stopping whenever
we encounter a point xk with f (xk ) = 0. This makes sure that the algorithm
always terminates, however, as illustrated in Figures 4.7 and 4.8, it may still
produce an erroneous output. On the other hand, if the method does converge
to a root, it typically has a very good speed of convergence.
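A minimal Python sketch of the Newton iteration, keeping the guard against \(f'(x_k) = 0\) discussed above (the iteration cap and tolerance are illustrative defaults, not the book's Algorithm 4.4 verbatim):

```python
def newton(f, fprime, x0, eps=1e-12, max_iter=50):
    """Newton's method (4.13): stop on small |f(x_k)| or after max_iter steps."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < eps:
            break
        d = fprime(x)
        if d == 0:
            raise ZeroDivisionError("f'(x_k) = 0: Newton step undefined")
        x = x - fx / d
    return x

# Root of f(x) = x^3 + x - 1, as in the bisection example
root = newton(lambda x: x**3 + x - 1, lambda x: 3 * x**2 + 1, 1.0)
```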
Denote by
\[
f(r) = p\left[\left(1+\frac{r}{12}\right)^n - 1\right] - \frac{Ar}{12};
\]
then
\[
f'(r) = \frac{pn}{12}\left(1+\frac{r}{12}\right)^{n-1} - \frac{A}{12},
\]
and an iteration of Newton's method is
\[
r_k = r_{k-1} - \frac{12p\left[\left(1+\frac{r_{k-1}}{12}\right)^n - 1\right] - A\,r_{k-1}}{pn\left(1+\frac{r_{k-1}}{12}\right)^{n-1} - A}, \quad k \ge 1. \tag{4.15}
\]
Like in Example 4.5, assume that the monthly payment is $300, and we want
to find an interest rate r which would yield $1,000,000 in 30 years. We have
p = 300, A = 1, 000, 000, and n = 12 · 30 = 360. We also need to define
the starting point r0 . Table 4.2 contains the results of applying 7 iterations of
Newton’s method with two different initial guesses: r0 = 0.1 and r0 = 0.2. In
both cases the method converges to the same solution r ≈ 0.117865, or 11.79%.
Example 4.10 We can use Newton's method for the equation
\[
f(x) = x^2 - \alpha = 0 \tag{4.16}
\]
to approximate the square root \(x = \sqrt{\alpha}\) of a positive number \(\alpha\). Newton's step for the above equation is given by
\[
x_k = x_{k-1} - \frac{f(x_{k-1})}{f'(x_{k-1})}
= x_{k-1} - \frac{x_{k-1}^2 - \alpha}{2x_{k-1}}
= \frac{2x_{k-1}^2 - x_{k-1}^2 + \alpha}{2x_{k-1}}
= \frac{1}{2}\left(x_{k-1} + \frac{\alpha}{x_{k-1}}\right).
\]
TABLE 4.2: Newton's iterations for finding the interest rate with two different choices of r0.

  Iteration      Approximate interest rate, rk
  number, k      r0 = 0.1      r0 = 0.2
      1          0.128616      0.170376
      2          0.119798      0.145308
      3          0.117938      0.127727
      4          0.117865      0.119516
      5          0.117865      0.117919
      6          0.117865      0.117865
      7          0.117865      0.117865
For example, to find \(\sqrt{7}\), we use \(f(x) = x^2 - 7 = 0\) and the iteration
\[
x_k = \frac{1}{2}\left(x_{k-1} + \frac{7}{x_{k-1}}\right).
\]
Similarly, to approximate \(\sqrt[p]{\alpha}\) we can apply Newton's method to \(f(x) = x^p - \alpha\), which gives the iteration
\[
x_k = x_{k-1} - \frac{x_{k-1}^p - \alpha}{p\,x_{k-1}^{p-1}}
= \frac{1}{p}\left((p-1)\,x_{k-1} + \frac{\alpha}{x_{k-1}^{p-1}}\right), \quad k \ge 1.
\]
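Both special cases reduce to one routine for \(f(x) = x^p - \alpha\); the sketch below is illustrative (the starting point and iteration count are arbitrary choices):

```python
def nth_root(alpha, p, x0=1.0, iters=60):
    """Newton's iteration for f(x) = x^p - alpha; for p = 2 this is the
    familiar averaging formula x <- (x + alpha/x) / 2."""
    x = x0
    for _ in range(iters):
        x = ((p - 1) * x + alpha / x**(p - 1)) / p
    return x

s = nth_root(7, 2)     # approximates sqrt(7) ~ 2.6457513
c = nth_root(9, 3)     # approximates 9^(1/3) ~ 2.0800838
```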
\[
x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)},
\]
which is a fixed point iteration for the function
\[
\phi(x) = x - \frac{f(x)}{f'(x)},
\]
therefore
\[
x_{k+1} - x^* = \frac{f(\xi_k)\, f''(\xi_k)}{(f'(\xi_k))^2}\,(x_k - x^*) \tag{4.17}
\]
for some \(\nu_k\) lying between \(\xi_k\) and \(x^*\). Recall that \(\xi_k\) is located between \(x_k\) and \(x^*\), thus \(|\xi_k - x^*| \le |x_k - x^*|\) and
\[
|x_{k+1} - x^*| \le \frac{|f''(\xi_k)\, f'(\nu_k)|}{(f'(\xi_k))^2}\,|x_k - x^*|^2. \tag{4.20}
\]
If
\[
\frac{|f''(\xi)\, f'(\nu)|}{(f'(\xi))^2} \le C
\]
for all \(\xi, \nu\) under consideration, then
\[
|x_{k+1} - x^*| \le C\,|x_k - x^*|^2.
\]
[Figure: the secant method; x_{k+1} is the x-intercept of the secant line through (x_{k-1}, f(x_{k-1})) and (x_k, f(x_k)).]
\[
x_{k+1} = x_k - f(x_k)\,\frac{x_k - x_{k-1}}{f(x_k) - f(x_{k-1})}, \quad k \ge 1. \tag{4.21}
\]
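The secant iteration (4.21) in code; compared with the Newton sketch earlier, only the derivative is replaced by a difference quotient through the last two iterates:

```python
def secant(f, x0, x1, iters=20):
    """Secant method (4.21); stops early if the difference quotient vanishes."""
    for _ in range(iters):
        f0, f1 = f(x0), f(x1)
        if f1 - f0 == 0:
            break                                  # converged or flat secant line
        x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
    return x1

# Same test function as before: f(x) = x^3 + x - 1 on two starting points
root = secant(lambda x: x**3 + x - 1, 0.0, 1.0)
```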
To find a root (x∗ , y ∗ ) ∈ (a, b) × (c, d), we take some [x0 , y0 ]T ∈ (a, b) × (c, d)
and then recursively calculate (xk+1 , yk+1 ) by the formula
\[
\begin{aligned}
x_{k+1} &= f_1(x_k, y_k) \\
y_{k+1} &= f_2(x_k, y_k)
\end{aligned} \tag{4.23}
\]
for k ≥ 0. In the above, we assume that [f1 (xk , yk ), f2 (xk , yk )]T ∈ (a, b)×(c, d)
for all [xk , yk ]T , so the next vector [xk+1 , yk+1 ]T in the sequence is in the
domain of fi (x, y), i = 1, 2.
Similar to the one-dimensional case, we call \([x^*, y^*]^T\) a fixed point of a function
\[
f(x, y) = \begin{bmatrix} f_1(x, y) \\ f_2(x, y) \end{bmatrix}
\quad\text{if}\quad
f(x^*, y^*) = \begin{bmatrix} x^* \\ y^* \end{bmatrix},
\]
that is, \(f_1(x^*, y^*) = x^*\) and \(f_2(x^*, y^*) = y^*\). Obviously, if \([x^*, y^*]^T\) is a fixed point of f(x, y), then it is a root of the system (4.22). The iteration (4.23) is called a fixed point iteration.
Assume that we are given \([x_k, y_k]^T \in \mathbb{R}^2\). Similar to the one-dimensional case, we use the Taylor approximation of F(x, y) about \([x_k, y_k]^T\):
\[
\begin{bmatrix} F_1(x, y) \\ F_2(x, y) \end{bmatrix}
\approx
\begin{bmatrix} F_1(x_k, y_k) \\ F_2(x_k, y_k) \end{bmatrix}
+
\begin{bmatrix} \frac{\partial F_1(x_k, y_k)}{\partial x} & \frac{\partial F_1(x_k, y_k)}{\partial y} \\[4pt] \frac{\partial F_2(x_k, y_k)}{\partial x} & \frac{\partial F_2(x_k, y_k)}{\partial y} \end{bmatrix}
\begin{bmatrix} x - x_k \\ y - y_k \end{bmatrix}.
\]
Setting the right-hand side to zero, solving with respect to x and y, and denoting the solution by \([x_{k+1}, y_{k+1}]^T\), we obtain
\[
\begin{bmatrix} x_{k+1} \\ y_{k+1} \end{bmatrix}
=
\begin{bmatrix} x_k \\ y_k \end{bmatrix}
-
\begin{bmatrix} \frac{\partial F_1(x_k, y_k)}{\partial x} & \frac{\partial F_1(x_k, y_k)}{\partial y} \\[4pt] \frac{\partial F_2(x_k, y_k)}{\partial x} & \frac{\partial F_2(x_k, y_k)}{\partial y} \end{bmatrix}^{-1}
\begin{bmatrix} F_1(x_k, y_k) \\ F_2(x_k, y_k) \end{bmatrix}.
\]
\[
\begin{aligned}
(x-1)^2 + (y-1)^2 - 1 &= 0 \\
x + y - 2 &= 0.
\end{aligned}
\]
We obtain
\[
\begin{bmatrix} x_1 \\ y_1 \end{bmatrix} = \begin{bmatrix} 0 \\ 2 \end{bmatrix} - J_F^{-1}(0,2)\begin{bmatrix} F_1(0,2) \\ F_2(0,2) \end{bmatrix} = \begin{bmatrix} 1/4 \\ 7/4 \end{bmatrix};
\]
\[
\begin{bmatrix} x_2 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1/4 \\ 7/4 \end{bmatrix} - J_F^{-1}(1/4, 7/4)\begin{bmatrix} F_1(1/4, 7/4) \\ F_2(1/4, 7/4) \end{bmatrix} = \begin{bmatrix} 7/24 \\ 41/24 \end{bmatrix} \approx \begin{bmatrix} 0.2917 \\ 1.7083 \end{bmatrix}.
\]
Note that geometrically the first equation is described by a circle, and the second equation is given by a line (see Figure 4.12). The roots are at points
\[
\hat{x} = \begin{bmatrix} 1 - 1/\sqrt{2} \\ 1 + 1/\sqrt{2} \end{bmatrix} \approx \begin{bmatrix} 0.2929 \\ 1.7071 \end{bmatrix}
\quad\text{and}\quad
\tilde{x} = \begin{bmatrix} 1 + 1/\sqrt{2} \\ 1 - 1/\sqrt{2} \end{bmatrix} \approx \begin{bmatrix} 1.7071 \\ 0.2929 \end{bmatrix},
\]
which are located at the intersections of the line and the circle. The steps
of Newton’s method in this example correspond to movement from the point
[0, 2]T along the line given by the second equation toward the root x̂.
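Newton's method for this 2×2 system can be sketched with NumPy. Rather than forming \(J_F^{-1}\) explicitly, each step solves the linear system \(J_F(x_k)\,s = -F(x_k)\), which is the numerically preferable formulation of the same update:

```python
import numpy as np

def newton_2d(F, J, xy0, iters):
    """Newton's method for a 2x2 system: solve J s = -F, then step x + s."""
    xy = np.array(xy0, float)
    for _ in range(iters):
        s = np.linalg.solve(J(xy), -F(xy))
        xy = xy + s
    return xy

# The circle/line system from the example above
F = lambda v: np.array([(v[0] - 1)**2 + (v[1] - 1)**2 - 1, v[0] + v[1] - 2])
J = lambda v: np.array([[2 * (v[0] - 1), 2 * (v[1] - 1)], [1.0, 1.0]])

x1 = newton_2d(F, J, [0.0, 2.0], 1)   # -> [0.25, 1.75], as computed in the text
xh = newton_2d(F, J, [0.0, 2.0], 8)   # converges toward the root x_hat
```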
Exercises
4.1. Let \(f(x) = 5 - x^6\).
[Figure 4.12: the circle and the line, their intersection points x̂ and x̃, and the Newton steps from [x0, y0]^T = [0, 2]^T.]
(a) Prove that f has a unique root x∗ in the interval [1, 2].
(b) Apply the bisection method twice.
(c) How many iterations of the bisection method are required to guarantee an approximation of the root within \(10^{-10}\)? (Note: In 1225 Leonardo of Pisa computed the root \(x^* \approx 1.368808107\). This was a remarkable result for his time.)
4.4. Plot f (x) = 2x3 − 4x2 + 3x and illustrate three steps of the regula-falsi
method applied to f on the interval [−1, 1] graphically. What do you
observe?
(c) Use Newton’s method twice using your answer in (b) as the starting
point.
4.12. Approximate \(\sqrt[3]{2}\) as follows.
(a) Use the bisection method twice for f (x) = x3 − 2 in the interval
[1, 2].
(b) Use Newton’s method twice starting with the point obtained using
bisection in (a).
4.13. Find an approximation of \(\sqrt[3]{9}\) by finding the root of the function \(f(x) = x^3 - 9\) using two steps of the modified regula-falsi method. Choose an appropriate initial interval with integer endpoints.
4.14. Use three iterations of Newton’s method with the initial guess x0 = 2
to approximate
(a) \(\sqrt{3}\);
(b) \(\sqrt{5}\);
(c) \(3^{2/5}\).
(x − 1)2 + (y − 1)2 − 1 = 0
x+y−1 = 0.
(a) Draw the sets of points on the plane that satisfy each equation,
and indicate the solutions of the system.
(b) Solve this system exactly.
(c) Apply Newton’s method twice with [x0 , y0 ]T = [1/2, 1/2]T . Illus-
trate the corresponding steps geometrically.
x2 + y 2 − 9 = 0
x+y−1 = 0.
Apply Newton’s method twice with starting point [1, 0]T . Show the exact
solutions and the steps of Newton’s method graphically.
4.20. Use Newton’s method twice with initial guess [x0 , y0 ]T = [1/2, 0]T to
solve the nonlinear system
x2 − x − y = 0
x+y−2 = 0.
\[
\begin{aligned}
x^4 + y^4 - 3 &= 0 \\
x^3 - 3xy^2 + 1 &= 0.
\end{aligned}
\]
Apply Newton’s method twice with the starting point [1, 1]T .
Chapter 5
Polynomial Interpolation
x0 x1 ··· xn
y0 y1 ··· yn
Given such a table, our goal is to approximate f (x) with a function, say
a polynomial p(x), such that p(xi ) = f (xi ), i = 0, 1, . . . , n. Then for any
x ∈ [a, b] we can approximate the value f (x) with p(x): f (x) ≈ p(x). If
x0 < x < xn , the approximation p(x) is called an interpolated value, otherwise
it is an extrapolated value.
[Figure 5.1: f(x) = cos(x) over [−π/2, π/2] together with the Taylor polynomial p(x) and the interpolating polynomial q(x) of Example 5.2.]
Example 5.2 Figure 5.1 shows two approximations to f(x) = cos(x) over \([-\pi/2, \pi/2]\): the Taylor polynomial \(p(x) = 1 - \frac{x^2}{2}\) of degree 2 (computed as in Example 5.1), and the quadratic polynomial \(q(x) = 1 - \frac{4}{\pi^2}x^2\) passing through the points given by the table

  x | −π/2   0   π/2
  y |   0    1    0

Note that |f(x) − p(x)| increases significantly when x moves away from the origin. At the same time, the polynomial q(x) passing through the three equidistant points in \([-\pi/2, \pi/2]\) provides a better approximation of f(x).
\[
p(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n.
\]
\[
p(x) = a_0 + a_1(x - c_1) + a_2(x - c_1)(x - c_2) + \cdots + a_n(x - c_1)\cdots(x - c_n)
= a_0 + \sum_{k=1}^{n} a_k \prod_{i=1}^{k}(x - c_i).
\]
Example 5.3 Find p(5) for p(x) = 1 + 2(x − 1) + 3(x − 1)(x − 2) + 4(x −
1)(x − 2)(x − 3) + 5(x − 1)(x − 2)(x − 3)(x − 4).
We have a0 = 1, a1 = 2, a2 = 3, a3 = 4, a4 = 5; c1 = 1, c2 = 2, c3 = 3, c4 =
4, and x = 5. Applying Algorithm 5.1 we obtain:
i = 4 : a4 = 5
i = 3 : a3 = 4 + (5 − 4)5 = 9
i = 2 : a2 = 3 + (5 − 3)9 = 21
i = 1 : a1 = 2 + (5 − 2)21 = 65
i = 0 : a0 = 1 + (5 − 1)65 = 261.
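The nested evaluation of Algorithm 5.1 can be sketched as follows (the helper name is hypothetical); each center \(c_i\) multiplies everything accumulated from the coefficients \(a_{i+1}, \ldots, a_n\):

```python
def eval_newton_form(a, c, x):
    """Evaluate a0 + a1(x-c1) + ... + an(x-c1)...(x-cn) by nested
    multiplication, working from the innermost coefficient outward."""
    result = a[-1]
    for i in range(len(a) - 2, -1, -1):
        result = a[i] + (x - c[i]) * result   # c[i] pairs with coefficient a[i+1]
    return result

# Example 5.3: coefficients (1, 2, 3, 4, 5), centers (1, 2, 3, 4), x = 5
val = eval_newton_form([1, 2, 3, 4, 5], [1, 2, 3, 4], 5)   # -> 261
```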
\[
p(x) = y_0 l_0(x) + y_1 l_1(x) + \cdots + y_n l_n(x)
= \sum_{j=0}^{n} y_j \prod_{\substack{i=0 \\ i \ne j}}^{n} \frac{x - x_i}{x_j - x_i}. \tag{5.1}
\]
Now, assume that there are two polynomials of degree ≤ n, p(x) and q(x),
such that p(xi ) = q(xi ) = yi , i = 0, . . . , n. Then r(x) = p(x) − q(x) is a
polynomial of degree ≤ n, which has n + 1 distinct roots x0 , x1 , . . . , xn . We
obtain a contradiction with the fundamental theorem of algebra. Thus, we
proved the following theorem.
\[
p(x) = y_0 l_0(x) + y_1 l_1(x) + \cdots + y_n l_n(x)
= \sum_{j=0}^{n} y_j \prod_{\substack{i=0 \\ i \ne j}}^{n} \frac{x - x_i}{x_j - x_i}.
\]
Example 5.4 Use the Lagrange method to find the second-degree polynomial
p(x) interpolating the data
  x | −1   0   1
  y |  3   5   2

We have:
\[
l_0(x) = \frac{(x-0)(x-1)}{(-1-0)(-1-1)} = \frac{x(x-1)}{2}, \qquad
l_1(x) = \frac{(x+1)(x-1)}{(0+1)(0-1)} = 1 - x^2, \qquad
l_2(x) = \frac{(x+1)(x-0)}{(1+1)(1-0)} = \frac{x(x+1)}{2}.
\]
Hence,
\[
p(x) = \sum_{j=0}^{2} y_j l_j(x)
= 3\cdot\frac{x(x-1)}{2} + 5\cdot(1 - x^2) + 2\cdot\frac{x(x+1)}{2}
= -\frac{5}{2}x^2 - \frac{1}{2}x + 5.
\]
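Formula (5.1) translates directly into a double loop; the sketch below evaluates the interpolant of Example 5.4 at a point:

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the Lagrange form (5.1) of the interpolating polynomial at x."""
    total = 0.0
    n = len(xs)
    for j in range(n):
        lj = 1.0
        for i in range(n):
            if i != j:
                lj *= (x - xs[i]) / (xs[j] - xs[i])   # basis polynomial l_j(x)
        total += ys[j] * lj
    return total

# Data from Example 5.4; the interpolant is p(x) = -5/2 x^2 - 1/2 x + 5
val = lagrange_eval([-1, 0, 1], [3, 5, 2], 2.0)   # p(2) = -6
```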
\[
\begin{aligned}
a_0 + a_1 x_0 + a_2 x_0^2 + \cdots + a_n x_0^n &= y_0 \\
a_0 + a_1 x_1 + a_2 x_1^2 + \cdots + a_n x_1^n &= y_1 \\
&\;\;\vdots \\
a_0 + a_1 x_n + a_2 x_n^2 + \cdots + a_n x_n^n &= y_n.
\end{aligned}
\]

  x | −1   0   1
  y |  3   5   2
\[
p_{n+1}(x) = \underbrace{\sum_{j=0}^{n} a_j \prod_{i=0}^{j-1}(x - x_i)}_{q(x)} + \underbrace{a_{n+1}\prod_{i=0}^{n}(x - x_i)}_{r(x)} = q(x) + r(x),
\]
where
\[
q(x) = \sum_{j=0}^{n} a_j \prod_{i=0}^{j-1}(x - x_i) = a_0 + a_1(x - x_0) + \ldots + a_n(x - x_0)\cdot\ldots\cdot(x - x_{n-1})
\]
and
\[
r(x) = a_{n+1}\prod_{i=0}^{n}(x - x_i) = a_{n+1}(x - x_0)\cdot\ldots\cdot(x - x_n).
\]
We can show that \(q(x) = p_n(x)\). Indeed, by definition of \(p_n(x)\) and \(p_{n+1}(x)\),
\[
p_{n+1}(x) = p_n(x) + a_{n+1}\prod_{i=0}^{n}(x - x_i).
\]
In view of the fact that \(p_{n+1}(x_{n+1}) = y_{n+1}\), this implies that
\[
y_{n+1} = p_{n+1}(x_{n+1}) = p_n(x_{n+1}) + a_{n+1}\prod_{i=0}^{n}(x_{n+1} - x_i)
\;\Rightarrow\;
a_{n+1} = \frac{y_{n+1} - p_n(x_{n+1})}{\prod_{i=0}^{n}(x_{n+1} - x_i)}.
\]
Example 5.6 Assume that a new data point (x3 , y3 ) = (2, 4) is added to the
data in Example 5.4:
  x | −1   0   1   2
  y |  3   5   2   4

From Example 5.4 we know that the quadratic interpolating polynomial for the first three data points is given by \(p_2(x) = -\frac{5}{2}x^2 - \frac{1}{2}x + 5\). We use Newton's method to find the 3rd degree interpolating polynomial for the given data. We have:
\[
p_3(x) = p_2(x) + a_3(x+1)(x)(x-1), \qquad p_2(2) = -6, \qquad p_3(2) = 4.
\]
Hence, \(a_3 = \frac{4+6}{3\cdot 2\cdot 1} = \frac{5}{3}\), and
\[
p_3(x) = -\frac{5}{2}x^2 - \frac{1}{2}x + 5 + \frac{5}{3}(x+1)(x)(x-1)
= \frac{5}{3}x^3 - \frac{5}{2}x^2 - \frac{13}{6}x + 5.
\]
Theorem 5.2 Let \(f \in C^{(n+1)}([x_0, x_n])\). Then there exists \(\xi \in [x_0, x_n]\) such that for all \(x \in [x_0, x_n]\), the error
\[
e_n(x) = f(x) - p_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\prod_{i=0}^{n}(x - x_i). \tag{5.2}
\]
Note that \(\xi\) depends on x and its value is not available explicitly. However, even if \(f^{(n+1)}(\xi)\) is not known, it may be possible to obtain an upper bound \(c_n\) such that
\[
|f^{(n+1)}(x)| \le c_n \quad \forall x \in [x_0, x_n].
\]
Then
\[
|e_n(x)| \le \frac{c_n}{(n+1)!}\prod_{i=0}^{n}|x - x_i| \quad \forall x \in [x_0, x_n]. \tag{5.3}
\]
\[
|e_2(x)| \le \frac{\pi^3}{128} \approx 0.2422365.
\]
The actual error of interpolation is \(e_2(\pi/4) = 1/\sqrt{2} - 3/4 \approx -0.0428932\).
Note that the upper bound on the error in (5.3) depends on the choice of
the interpolation nodes x1 , . . . , xn−1 . Our next goal is to try to select these
nodes in such a way that the largest possible value of the upper bound in (5.3)
is as small as possible.
Without loss of generality, we can assume that [x0 , xn ] = [−1, 1]. Indeed,
consider an arbitrary interval [x0 , xn ] within the domain of f (x). Denoting by
\[
F(x) = f\!\left(\frac{x(x_n - x_0) + x_0 + x_n}{2}\right),
\]
we transform the function f(x) defined over \([x_0, x_n]\) into the function F(x) with the domain [−1, 1].
Denote by
\[
R_{n+1}(x) = \prod_{k=0}^{n}(x - x_k) = (x - x_0)(x - x_1)\cdot\ldots\cdot(x - x_n).
\]
\(T_0(x) = 1\), \(T_1(x) = x\), and for \(n \ge 2\), \(T_n(x)\) can be found recursively from the following relation:
\[
T_n(x) = 2x\,T_{n-1}(x) - T_{n-2}(x).
\]
Proof. We have that \(T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x)\) is a polynomial of degree equal to the degree of \(2x\,T_n(x)\), which is n + 1. Therefore, by induction the statement is correct for any integer \(n \ge 0\).

Table 5.1 shows the first six Chebyshev polynomials.
Table 5.1 shows the first six Chebyshev polynomials.
T0 (x) = 1
T1 (x) = x
T2 (x) = 2x2 − 1
T3 (x) = 4x3 − 3x
T4 (x) = 8x4 − 8x2 + 1
T5 (x) = 16x5 − 20x3 + 5x
FIGURE 5.2: Polynomial approximations to \(y = f(x) = \frac{1}{1+10x^2}\) over [−1, 1] (shown with a solid line) based on 11 equally spaced nodes (dotted) and based on 11 Chebyshev nodes (dashed).
4. \(T_n(x)\) is an even function for n = 2k, and an odd function for n = 2k + 1; that is, \(T_n(-x) = (-1)^n T_n(x)\).

Theorem 5.4 implies that among all choices of nodes \(x_k \in [-1, 1]\), k = 0, 1, \ldots, n, the maximum deviation of \(R_{n+1}(x)\) from zero, \(\max_{-1 \le x \le 1} |R_{n+1}(x)|\), is minimized if \(x_k\), k = 0, 1, \ldots, n, are chosen as the roots of \(T_{n+1}(x)\), given by (5.7).

Example 5.8 Figure 5.2 shows interpolating polynomials for \(f(x) = \frac{1}{1+10x^2}\) over [−1, 1] based on two distinct sets of nodes. One is based on 11 equally spaced nodes, whereas the other one is based on the same number of Chebyshev nodes. As can be seen from the figure, the polynomial based on Chebyshev nodes provides a much better overall approximation of f(x) over [−1, 1].
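The Chebyshev nodes can be generated from the standard root formula \(\cos\frac{(2k+1)\pi}{2(n+1)}\), k = 0, \ldots, n (assumed here to be the formula the text labels (5.7)), with an affine map to a general interval [a, b]:

```python
import math

def chebyshev_nodes(n, a=-1.0, b=1.0):
    """Roots of T_{n+1} mapped from [-1, 1] to [a, b]; these n+1 nodes
    minimize max |R_{n+1}(x)| among all node choices."""
    roots = [math.cos((2 * k + 1) * math.pi / (2 * (n + 1))) for k in range(n + 1)]
    return [((b - a) * t + a + b) / 2 for t in roots]

nodes = chebyshev_nodes(10)   # the 11 Chebyshev nodes used in Figure 5.2
```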
Exercises
5.1. Use Algorithm 5.1 to evaluate p(3), where
5.2. Consider the following polynomial given in the shifted power form: \(p(x) = 1 + (x - 5555.5)^2\). The power form representation of p(x) is \(p(x) = 30863581.25 - 11111x + x^2\). For each of the two representations, evaluate p(5555) using 6-digit floating-point arithmetic.
5.3. Find the quadratic polynomial p2(x) that interpolates the data

    (i)   x:  1  2  4      (ii)  x: −1  2  5
          y: −1 −1  2            y:  1  0 −2

    (iii) x: −2  0  2      (iv)  x:  0  3  5
          y:  9  7  9            y: −1  8 −4
5.4. Add the data point (x3 , y3 ) = (6, 1) to each table in Exercise 5.3.
5.6. Find the cubic interpolating polynomial p3 (x) for f (x) = cos x using
x0 = 0.3, x1 = 0.4, x2 = 0.5, and x3 = 0.6. For x = 0.44,
(a) compute e3 (x) = f (x) − p3 (x);
(b) estimate the error e3 (x) using the bound in Equation (5.3) at
page 120.
5.7. Find the quartic interpolating polynomial p4(x) for f(x) = e^{3x} using
x0 = −1, x1 = −0.5, x2 = 0, x3 = 0.5, and x4 = 1. For x = 0.8,
(a) compute e4 (x) = f (x) − p4 (x);
(b) estimate the error e4 (x) using the bound in Equation (5.3) at
page 120.
Chapter 6
Numerical Integration
In this chapter we will deal with the problem of integration of a function over
an interval,
    ∫_a^b f(x) dx.

    P(1.985 ≤ ξ ≤ 2.02) = ∫_{1.985}^{2.02} p(x) dx
                        = ∫_{1.985}^{2.02} (1/(0.01 √(2π))) exp( −(x − 2)^2 / (2 · 0.01^2) ) dx,

which is equal to the area enclosed between the x-axis and the plot of p(x),
where x ∈ [1.985, 2.02], as shown in Figure 6.1. This integral cannot be
computed analytically, and numerical methods are required to approximate its
value.
    pn(x) = a0 l0(x) + a1 l1(x) + · · · + an ln(x) = Σ_{i=0}^{n} ai li(x),
FIGURE 6.1: The probability P(1.985 ≤ ξ ≤ 2.02) = ∫_{1.985}^{2.02} p(x) dx as
the area under the density p(x) between x = 1.985 and x = 2.02.
where

    lk(x) = ∏_{i=0, i≠k}^{n} (x − xi)/(xk − xi),  k = 0, 1, . . . , n.
Consider the interpolating polynomial of f(x) in the form (6.1). Since the
operation of integration is linear, we have

    ∫_a^b pn(x) dx = ∫_a^b ( Σ_{i=0}^{n} f(xi) li(x) ) dx = Σ_{i=0}^{n} f(xi) ∫_a^b li(x) dx,   (6.2)

where ∫_a^b li(x) dx does not depend on f for i = 0, . . . , n. Thus, for a function
f : [a, b] → IR the integral ∫_a^b f(x) dx is approximated by

    Qn(f) = Σ_{i=0}^{n} Ai f(xi),   (6.3)

where Ai = ∫_a^b li(x) dx. Qn(f) is called the quadrature formula or numerical
integration formula.

Taking n = 1 and n = 2 in (6.2) we obtain the trapezoidal and Simpson's
rules, respectively, which are discussed next.
FIGURE 6.2: The trapezoidal rule: f(x) is approximated by the linear
interpolant p1(x) through (a, f(a)) and (b, f(b)).

Trapezoidal Rule:

    ∫_a^b f(x) dx ≈ (b − a)/2 (f(a) + f(b)).   (6.4)

Note that (6.4) is just the area of the trapezoid defined by the points
(a, 0), (a, f(a)), (b, f(b)), and (b, 0), as shown in Figure 6.2.
Example 6.2 Using the trapezoidal rule to approximate the integral
∫_0^1 exp(x^2) dx, we obtain:

    ∫_0^1 exp(x^2) dx ≈ (1/2)(exp(0) + exp(1)) = (1 + e)/2 ≈ 1.85914.

For f(x) = (sin x)/x and [a, b] = [1, 5] the trapezoidal rule gives

    ∫_1^5 (sin x)/x dx ≈ (5 − 1)/2 ( (sin 1)/1 + (sin 5)/5 ) = 2 ( sin 1 + (sin 5)/5 ) ≈ 1.29937.
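The trapezoidal rule is a one-liner in code. The sketch below (pure Python; the function name is ours) reproduces the two approximations from Example 6.2:

```python
import math

def trapezoid(f, a, b):
    """Trapezoidal rule (6.4): the area of the trapezoid through
    (a, f(a)) and (b, f(b))."""
    return (b - a) / 2 * (f(a) + f(b))

print(trapezoid(lambda x: math.exp(x**2), 0, 1))   # about 1.85914
print(trapezoid(lambda x: math.sin(x) / x, 1, 5))  # about 1.29937
```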
FIGURE 6.3: Simpson's rule: f(x) is approximated by the quadratic
interpolant p2(x) through (a, f(a)), ((a + b)/2, f((a + b)/2)), and (b, f(b)).

Simpson's Rule:

    ∫_a^b f(x) dx ≈ (b − a)/6 ( f(a) + 4 f((a + b)/2) + f(b) ).   (6.6)
Example 6.3 For the same integrals as in Example 6.2, Simpson's rule gives

    ∫_0^1 exp(x^2) dx ≈ (1/6)( exp(0) + 4 exp(1/4) + exp(1) ) ≈ 1.47573,

    ∫_1^5 (sin x)/x dx ≈ (5 − 1)/6 ( sin 1 + 4 (sin 3)/3 + (sin 5)/5 ) ≈ 0.55856.
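Simpson's rule (6.6) is just as short in code; a sketch with our own naming, checked against Example 6.3:

```python
import math

def simpson(f, a, b):
    """Simpson's rule (6.6): integrate the quadratic interpolant of f
    through a, (a + b)/2, and b."""
    return (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))

print(simpson(lambda x: math.exp(x**2), 0, 1))   # about 1.47573
print(simpson(lambda x: math.sin(x) / x, 1, 5))  # about 0.55856
```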
Clearly m ≥ n when n + 1 points are used, since in this case pn(x) = f(x)
for any polynomial f(x) of degree at most n. For example, for the trapezoidal
rule we have

    f(x) = 1:  ∫_a^b 1 dx = (b − a) = Q1(f);

    f(x) = x:  ∫_a^b x dx = (b^2 − a^2)/2 = ((b − a)/2)(b + a) = Q1(f).

To find the precision of a quadrature, we use it to compute the integrals
∫_a^b x^k dx for k ≥ n + 1 until we encounter k for which the error of approximation
is nonzero. Using the trapezoidal rule for f(x) = x^2, we obtain:

    ∫_a^b x^2 dx = (b^3 − a^3)/3,
    Q1(f) = ((b − a)/2)(a^2 + b^2) = (1/2)(b^3 + a^2 b − a b^2 − a^3),

and the error of approximation is

    ∫_a^b x^2 dx − Q1(f) = (1/3)(b^3 − a^3) − (1/2)(b^3 + a^2 b − a b^2 − a^3) = −(1/6)(b − a)^3,

thus ∫_a^b f(x) dx ≠ Q1(f) for f(x) = x^2 whenever a ≠ b. Hence the trapezoidal
rule has the precision 1.
Since n = 2 for Simpson's rule, it is precise for all polynomials of degree
up to 2. For f(x) = x^3 we have

    ∫_a^b x^3 dx = (b^4 − a^4)/4,

    Q2(f) = ((b − a)/6) ( a^3 + 4((a + b)/2)^3 + b^3 ) = (b^4 − a^4)/4,

and Simpson's rule is exact. For f(x) = x^4, we can use [a, b] = [0, 1] to show
that Simpson's rule does not give the exact answer. Indeed,

    ∫_0^1 x^4 dx = 1/5 ≠ 5/24 = (1/6)( 0 + 4(1/2)^4 + 1 ) = Q2(f).
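These precision claims can be verified numerically. The sketch below (our own helper names) integrates the monomials x^k over [0, 1] and reports the largest degree up to which each rule is exact:

```python
def trapezoid(f, a, b):
    return (b - a) / 2 * (f(a) + f(b))

def simpson(f, a, b):
    return (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))

def precision(rule, tol=1e-12):
    """Smallest k at which `rule` fails on x^k over [0, 1], minus one."""
    k = 0
    while abs(rule(lambda x: x**k, 0, 1) - 1 / (k + 1)) < tol:
        k += 1
    return k - 1

print(precision(trapezoid))  # 1
print(precision(simpson))    # 3
```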
    e1(x) = f(x) − p1(x) = (f''(ξ)/2)(x − a)(x − b).

Therefore, the error of approximation of the trapezoidal rule is

    eT = ∫_a^b e1(x) dx = (f''(ξ)/2) ∫_a^b (x − a)(x − b) dx = −f''(ξ)(b − a)^3/12.

For Simpson's rule, it can be shown that there exists ξ ∈ [a, b] such that its
error of approximation is

    eS = −(f^(4)(ξ)/90) ((b − a)/2)^5.
In summary, we have

    eT = −(f''(ξ)/12)(b − a)^3  for some ξ ∈ [a, b]   (6.7)

for the trapezoidal rule, and

    eS = −(f^(4)(ξ)/90) ((b − a)/2)^5  for some ξ ∈ [a, b]   (6.8)

for Simpson's rule.

To improve the accuracy of approximation, we can partition [a, b] into N
smaller intervals and apply the rules to the integrals defined on the N smaller
intervals, where

    h = (b − a)/N,  xi = a + ih,  i = 0, . . . , N

(see Figure 6.4). Therefore,

    ∫_a^b f(x) dx = Σ_{i=0}^{N−1} ∫_{xi}^{xi+1} f(x) dx = Σ_{i=0}^{N−1} ( (h/2)(f(xi) + f(xi+1)) − f''(ξi) h^3/12 )

for some ξi ∈ [xi, xi+1]. Hence, we obtain the following formulas for the
composite trapezoidal rule:
    ∫_a^b f(x) dx ≈ TN(f),
FIGURE 6.4: The composite trapezoidal rule with N = 4: nodes a = x0, x1,
x2, x3, x4 = b and values f(x0), . . . , f(x4).
where

    TN(f) = Σ_{i=0}^{N−1} (h/2)( f(xi) + f(xi+1) ) = h ( Σ_{i=1}^{N−1} f(xi) + (1/2)( f(x0) + f(xN) ) ),

    h = (b − a)/N,  xi = a + ih,  i = 0, . . . , N,

with the error

    eT^N = −Σ_{i=0}^{N−1} f''(ξi) h^3/12  for some ξi ∈ [xi, xi+1],  i = 0, . . . , N − 1.
Similarly, for the composite Simpson's rule,

    ∫_a^b f(x) dx ≈ SN(f),

where

    SN(f) = Σ_{i=0}^{N−1} (h/6) ( f(xi) + 4 f((xi + xi+1)/2) + f(xi+1) ),

    h = (b − a)/N,  xi = a + ih,  i = 0, . . . , N,

with the error

    eS^N = −Σ_{i=0}^{N−1} (f^(4)(ξi)/90) (h/2)^5  for some ξi ∈ [xi, xi+1],  i = 0, . . . , N − 1.
Note that the precision (as described in Definition 6.1) of the composite
rules does not change with increasing N, i.e., the precision of the composite
trapezoidal rule is 1, and the precision of the composite Simpson's rule is 3.
However, since h = (b − a)/N, we have

    lim_{N→∞} eT^N = lim_{N→∞} eS^N = 0,

and

    lim_{N→∞} TN(f) = lim_{N→∞} SN(f) = ∫_a^b f(x) dx.
using the composite trapezoidal rule and the composite Simpson's rule with
N = 3.

Applying the composite trapezoidal rule, we obtain

    ∫_1^4 (1/x) dx = Σ_{i=1}^{3} ∫_i^{i+1} (1/x) dx
                   ≈ Σ_{i=1}^{3} (1/2)( 1/i + 1/(i + 1) )
                   = (1/2)(1 + 1/2) + (1/2)(1/2 + 1/3) + (1/2)(1/3 + 1/4)
                   = 35/24 ≈ 1.4583.

Note that the exact value of the integral is ∫_1^4 (1/x) dx = ln 4 ≈ 1.3863, therefore the
absolute error is |35/24 − ln 4| ≈ 0.0720.
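The composite rules are easy to script; a sketch in pure Python (function names are ours), reproducing the computation above:

```python
import math

def composite_trapezoid(f, a, b, N):
    """Composite trapezoidal rule with N subintervals of width h."""
    h = (b - a) / N
    x = [a + i * h for i in range(N + 1)]
    return h * (sum(f(xi) for xi in x[1:-1]) + (f(x[0]) + f(x[-1])) / 2)

def composite_simpson(f, a, b, N):
    """Composite Simpson's rule with N subintervals."""
    h = (b - a) / N
    total = 0.0
    for i in range(N):
        xi, xj = a + i * h, a + (i + 1) * h
        total += h / 6 * (f(xi) + 4 * f((xi + xj) / 2) + f(xj))
    return total

T3 = composite_trapezoid(lambda x: 1 / x, 1, 4, 3)
print(T3, abs(T3 - math.log(4)))  # 35/24 ≈ 1.4583, error ≈ 0.0720
```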
Exercises
6.1. Estimate the integral ∫_{−1}^{1} x^5 cos x dx as follows.

    (a) Compute the integral by the trapezoidal rule and the composite
        trapezoidal rule with N = 2 and N = 4, respectively. Compute the
        absolute error of approximation.

    (b) Compute the integral by Simpson's rule and the composite
        Simpson's rule with N = 2 and N = 4, respectively. Compute the
        absolute error of approximation.

    (c) Summarize the results you obtained in (a) and (b) in a table. Which
        method is more accurate?
6.5. Consider the integral ∫_1^{2001} 1/(x + 11) dx. Use the composite trapezoidal
     rule to estimate the sum Σ_{n=1}^{2000} 1/(n + 11).

6.6. Find N such that Σ_{n=1}^{N} 1/n ≈ 2015 using the composite trapezoidal rule.
Chapter 7
Numerical Solution of Differential
Equations
    y0(t) = y(t0);
    yk+1(t) = y(t0) + ∫_{t0}^{t} f(u, yk(u)) du,  k ≥ 0.   (7.8)

    y1(t) = y(0) + ∫_0^t 3u^2 y0(u)^2 du = −1 + ∫_0^t 3u^2 du = t^3 − 1;

    y2(t) = −1 + ∫_0^t 3u^2 y1(u)^2 du = −1 + ∫_0^t 3u^2 (u^3 − 1)^2 du = (1/3)t^9 − t^6 + t^3 − 1.
Example 7.5 We solve the following IVP using the Taylor series and
Picard's method:

    y' = y,  y(0) = 1.

The Taylor series of the solution is

    y(t) = 1 + t + t^2/2! + t^3/3! + · · · + t^k/k! + · · · = exp(t).

Picard's method gives

    y0(t) = 1;
    y1(t) = 1 + ∫_0^t y0(u) du = 1 + t;
    y2(t) = 1 + ∫_0^t y1(u) du = 1 + ∫_0^t (1 + u) du = 1 + t + t^2/2;
    y3(t) = 1 + ∫_0^t y2(u) du = 1 + ∫_0^t (1 + u + u^2/2) du = 1 + t + t^2/2 + t^3/3!;
    ...
    yk(t) = 1 + ∫_0^t yk−1(u) du = 1 + ∫_0^t ( 1 + u + u^2/2! + · · · + u^{k−1}/(k−1)! ) du
          = 1 + t + t^2/2! + t^3/3! + · · · + t^{k−1}/(k−1)! + t^k/k!.

As one can see, in this example Picard's method also generates the Taylor
series for y(t) = exp(t), which is the exact solution of the given IVP.
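Picard's iteration can also be carried out numerically: replace the exact integral in (7.8) by a running composite trapezoidal sum on a grid. The sketch below (pure Python; the names are ours) applies this to y' = y, y(0) = 1 and recovers exp(t) at t = 1 up to the quadrature error:

```python
import math

def picard(f, t0, y0, T, m=100, iters=25):
    """Numerical Picard iteration for y' = f(t, y), y(t0) = y0 on [t0, T]:
    y_{k+1}(t) = y0 + integral from t0 to t of f(u, y_k(u)) du,
    with the integral approximated by a running trapezoidal sum."""
    h = (T - t0) / m
    t = [t0 + i * h for i in range(m + 1)]
    y = [y0] * (m + 1)                     # y_0(t) = y0, a constant
    for _ in range(iters):
        g = [f(ti, yi) for ti, yi in zip(t, y)]
        z, acc = [y0], 0.0
        for i in range(m):                 # accumulate the trapezoidal sum
            acc += h * (g[i] + g[i + 1]) / 2
            z.append(y0 + acc)
        y = z
    return t, y

t, y = picard(lambda u, v: v, 0.0, 1.0, 1.0)
print(y[-1])  # close to e ≈ 2.71828
```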
    y(t) = y(t0) + y'(t0)(t − t0) + y''(ct)(t − t0)^2/2   (7.10)

for some ct ∈ [t0, t], where t ∈ [t0, b]. Note that y'(t0) can be replaced by
f(t0, y(t0)). Then for t = t1, where t1 = t0 + h, h > 0, we have

    y(t1) = y(t0) + h f(t0, y(t0)) + y''(ct1) h^2/2.   (7.11)

If h is chosen small enough, then the term involving h^2 is close to 0 and thus
can be neglected, yielding the following formula:

    y1 = y0 + h f(t0, y0).

In the last equation we obtained the formula of a single step of the so-called
Euler's method.
Figure 7.1 illustrates Euler’s method geometrically: starting from the point
(t0 , y0 ), we obtain an approximation y1 to the solution y(t) in point t1 by
moving along the tangent line to y(t) at t0 until we reach the point (t1 , y1 ).
Suppose that, given the IVP (7.9), we want to find a numerical solution
over interval [a, b], where a ≡ t0, using Euler's method. First we divide [a, b]
into n equal subintervals, each of length h = (b − a)/n, thus obtaining a set of n + 1
mesh points tk = a + kh, k = 0, 1, . . . , n. The value h is called the step size.
Using Euler's method we find an approximate solution in the mesh points by
following the scheme described in Algorithm 7.1.
Algorithm 7.1 Euler's method for solving the IVP y' = f(t, y) with y(a) = y0
on the interval [a, b].
 1: Input: function f, interval [a, b], y0 = y(a), and the number of steps n
 2: Output: (tk, yk) such that yk ≈ y(tk), k = 1, . . . , n
 3: h = (b − a)/n, t0 = a
 4: for k = 1, . . . , n do
 5:   tk = tk−1 + h
 6:   yk = yk−1 + h f(tk−1, yk−1)
 7: end for
 8: return (tk, yk), k = 1, . . . , n
FIGURE 7.1: One step of Euler's method: starting from (t0, y0), the
approximation y1 to y(t1) is obtained by moving along the tangent line to
y = y(t) at t0.
Example 7.6 Suppose that we have a 30-year 10% coupon bond with yield to
maturity λ0 = 10% and price P0 = 100. We are interested to know the price of
this bond when its yield changes to λ1 = 11%. We can use the price sensitivity
formula mentioned in Example 7.1:

    dP/dλ = −P DM,  with P(λ0) = P0.
Let us use Euler’s method to solve the above equation. We have
f (t, y) = t + y,
ti = 0.05 · i, i = 0, 1, 2, . . . .
ek = yk − y(tk ) (7.13)
Assume that the value yk of the approximate solution at point tk has been
found, and we want to find a proper value yk+1 of the numerical solution at
the next point tk+1 = tk + h. Then Euler’s method can be generalized in the
following way:
yk+1 = yk + hG(tk , yk , h), (7.16)
FIGURE 7.2: Local and global errors of a numerical method for the exact
solution y = y(t): the local/global error e1 at t1, the local error and global
error e2 at t2, and the final global error e3 at t3.
    G(tk, yk, h) = p1 f1 + p2 f2 + . . . + pl fl.   (7.17)

In this expression, l is some positive integer called the order of the method,
pi, i = 1, . . . , l are constants, and fi, i = 1, . . . , l are given by

    f1 = f(tk, yk),
    f2 = f(tk + α1 h, yk + α1 h f1),
    f3 = f(tk + α2 h, yk + α2 h f2),   (7.18)
    . . .
    fl = f(tk + αl−1 h, yk + αl−1 h fl−1).
where

    f1 = f(tk, yk),
    f2 = f(tk + α1 h, yk + α1 h f1).   (7.20)
Let y(t) be the solution of Equation (7.15); then for the second-order derivative
of y(t) we obtain

    d^2y/dt^2 = (d/dt) f(t, y(t)) = ∂f/∂t + (∂f/∂y)(dy/dt) = ∂f/∂t + (∂f/∂y) f.   (7.21)

Using the Taylor expansion

    y(tk + h) = y(tk) + h y'(tk) + (h^2/2) y''(tk) + O(h^3),

and using the fact that y(t) is the solution of (7.15), we obtain:

    ( y(tk + h) − y(tk) )/h = [ f + (h/2)( ∂f/∂t + (∂f/∂y) f ) ]_{t=tk, y=y(tk)} + O(h^2).   (7.22)
    f2 = f(tk + α1 h, yk + α1 h f1)
       = [ f + α1 h ∂f/∂t + α1 h (∂f/∂y) f ]_{t=tk, y=y(tk)} + O(h^2),   (7.23)
    p1 + p2 = 1,

and

    p2 α1 = 1/2.

Therefore, for a fixed α1, substituting p2 = 1/(2α1) and p1 = (2α1 − 1)/(2α1) in (7.19) we
obtain a general scheme for the second-order RK methods:

    yk+1 = yk + h ( ((2α1 − 1)/(2α1)) f1 + (1/(2α1)) f2 ),   (7.25)
where

    f1 = f(tk, yk),
    f2 = f(tk + α1 h, yk + α1 h f1).   (7.26)
In general, there are infinitely many values for α1 to choose from, each
yielding a different second-order RK method (although all of them give exactly
the same result if the solution to the IVP is a polynomial of degree ≤ 2). We
mention three of the most popular versions:
• Heun Method with a Single Corrector: α1 = 1;
• The Improved Polygon Method: α1 = 1/2;
• Ralston's Method: α1 = 3/4.
A general scheme of the second-order Runge-Kutta method is summarized in
Algorithm 7.2.
Example 7.8 Consider the logistic model (also called the Verhulst-Pearl
model) for population dynamics. According to this model, the population P(t)
of a certain species as a function of time t satisfies the following equation:

    dP/dt = cP(1 − P/M),

where c is a growth rate coefficient, M is a limiting size for the population,
and both c and M are constants.

Assuming that P(0) = P0 is given, the exact solution to this equation is

    P(t) = M P0 / ( P0 + (M − P0) exp(−ct) ).

Putting c = 0.1, M = 500, and P(0) = 300, we obtain the following IVP:

    dP/dt = 0.1 P (1 − P/500),  P(0) = 300.
Next we approximate P(1) using one step of the Heun method, the improved
polygon method, and Ralston's method, and compare absolute errors for each
method. We have h = 1, f(t, P) = 0.1 P (1 − P/500), t0 = 0, P0 = 300, and the
exact solution P(t) = 150000/(300 + 200 exp(−0.1t)), so P(1) ≈ 311.8713948706.

Using the Heun method, we obtain

    f1 = f(t0, P0) = 30 (1 − 300/500) = 12,
    f2 = f(t0 + h, P0 + h f1) = f(1, 312) = 11.7312,
    P1 = P0 + h ((1/2) f1 + (1/2) f2) = 300 + 0.5 (12 + 11.7312) = 311.8656.

Using the improved polygon method, we obtain

    f1 = f(t0, P0) = 12,
    f2 = f(t0 + 0.5h, P0 + 0.5h f1) = f(0.5, 306) = 11.8728,
    P1 = P0 + h f2 = 300 + 11.8728 = 311.8728.

Using Ralston's method, we obtain

    f1 = f(t0, P0) = 12,
    f2 = f(t0 + 0.75h, P0 + 0.75h f1) = f(0.75, 309) = 11.8038,
    P1 = P0 + h ((1/3) f1 + (2/3) f2) = 300 + 4 + 7.8692 = 311.8692.
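The three computations can be checked with a small script implementing the generic second-order RK step (7.25)-(7.26); the function name is ours:

```python
def rk2_step(f, t, y, h, alpha):
    """One step of the second-order Runge-Kutta scheme (7.25)-(7.26)
    with parameter alpha = alpha_1."""
    f1 = f(t, y)
    f2 = f(t + alpha * h, y + alpha * h * f1)
    return y + h * ((2 * alpha - 1) / (2 * alpha) * f1 + f2 / (2 * alpha))

f = lambda t, P: 0.1 * P * (1 - P / 500)
for name, alpha in [("Heun", 1.0), ("improved polygon", 0.5), ("Ralston", 0.75)]:
    print(name, rk2_step(f, 0.0, 300.0, 1.0, alpha))
# Heun 311.8656, improved polygon 311.8728, Ralston 311.8692
```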
1. x'(t) = (1/10) exp(t) − (1/10) t exp(t) = 3x(t) − y(t) for all t;
   y'(t) = (3/10) exp(t) − (1/5) t exp(t) = 4x(t) − y(t).

2. x(0) = 1/5 = 0.2;  y(0) = 1/2 = 0.5.
Algorithm 7.4 Euler's method for the system (7.27) on the interval [a, b].
 1: Input: f1, f2, [a, b], x0 = x(a), y0 = y(a), n
 2: Output: (tk, xk, yk) such that xk ≈ x(tk), yk ≈ y(tk), k = 1, . . . , n
 3: h = (b − a)/n, t0 = a
 4: for k = 1, . . . , n do
 5:   tk = tk−1 + h
 6:   xk = xk−1 + h f1(tk−1, xk−1, yk−1)
 7:   yk = yk−1 + h f2(tk−1, xk−1, yk−1)
 8: end for
 9: return (tk, xk, yk), k = 1, . . . , n
Algorithm 7.5 Runge-Kutta method for the system (7.27) on [a, b].
 1: Input: f1, f2, [a, b], x0 = x(a), y0 = y(a), n
 2: Output: (tk, xk, yk) such that xk ≈ x(tk), yk ≈ y(tk), k = 1, . . . , n
 3: h = (b − a)/n, t0 = a
 4: for k = 1, . . . , n do
 5:   tk = tk−1 + h
 6:   f11 = f1(tk−1, xk−1, yk−1)
 7:   f21 = f2(tk−1, xk−1, yk−1)
 8:   f12 = f1(tk−1 + 0.5h, xk−1 + 0.5h f11, yk−1 + 0.5h f21)
 9:   f22 = f2(tk−1 + 0.5h, xk−1 + 0.5h f11, yk−1 + 0.5h f21)
10:   f13 = f1(tk−1 + 0.5h, xk−1 + 0.5h f12, yk−1 + 0.5h f22)
11:   f23 = f2(tk−1 + 0.5h, xk−1 + 0.5h f12, yk−1 + 0.5h f22)
12:   f14 = f1(tk−1 + h, xk−1 + h f13, yk−1 + h f23)
13:   f24 = f2(tk−1 + h, xk−1 + h f13, yk−1 + h f23)
14:   xk = xk−1 + h(f11 + 2f12 + 2f13 + f14)/6
15:   yk = yk−1 + h(f21 + 2f22 + 2f23 + f24)/6
16: end for
17: return (tk, xk, yk), k = 1, . . . , n
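Algorithm 7.5 in Python (a sketch with our own names), tested on the linear system x' = 3x − y, y' = 4x − y, x(0) = 0.2, y(0) = 0.5, whose exact solution is x(t) = exp(t)(1/5 − t/10), y(t) = exp(t)(1/2 − t/5):

```python
import math

def rk4_system(f1, f2, a, b, x0, y0, n):
    """Classical 4th-order Runge-Kutta (Algorithm 7.5) for the system
    x' = f1(t, x, y), y' = f2(t, x, y) on [a, b]."""
    h = (b - a) / n
    t, x, y = a, x0, y0
    for _ in range(n):
        f11, f21 = f1(t, x, y), f2(t, x, y)
        f12 = f1(t + h/2, x + h/2*f11, y + h/2*f21)
        f22 = f2(t + h/2, x + h/2*f11, y + h/2*f21)
        f13 = f1(t + h/2, x + h/2*f12, y + h/2*f22)
        f23 = f2(t + h/2, x + h/2*f12, y + h/2*f22)
        f14 = f1(t + h, x + h*f13, y + h*f23)
        f24 = f2(t + h, x + h*f13, y + h*f23)
        x += h * (f11 + 2*f12 + 2*f13 + f14) / 6
        y += h * (f21 + 2*f22 + 2*f23 + f24) / 6
        t += h
    return x, y

x1, y1 = rk4_system(lambda t, x, y: 3*x - y, lambda t, x, y: 4*x - y,
                    0, 1, 0.2, 0.5, 100)
print(x1, y1)  # close to 0.1*e ≈ 0.27183 and 0.3*e ≈ 0.81548
```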
Thus, there will be approximately 741 rabbits and 889 foxes after 1 year.
Table 7.1 shows the result of application of 8 steps of the Runge-Kutta
method to the considered problem.
    y'' = f(t, y, y'),   (7.30)

can be reduced to a system of two first-order equations by introducing z = y':

    y' = z,
    z' = f(t, y, z).   (7.31)

After that, the techniques for systems, such as those discussed in Section 7.5,
can be applied. For example, the second-order IVP

    y'' = (1 − y^2) y' − y,  y(0) = 0.2,  y'(0) = 0.2,

is equivalent to the first-order system

    y' = z,               y(0) = 0.2,
    z' = (1 − y^2) z − y,  z(0) = 0.2.
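For instance, one Euler step of size h = 0.1 applied to this system looks as follows (a sketch; the names are ours):

```python
def second_order_euler_step(f, t, y, z, h):
    """One Euler step for y'' = f(t, y, y') rewritten as y' = z, z' = f(t, y, z)."""
    return y + h * z, z + h * f(t, y, z)

f = lambda t, y, z: (1 - y**2) * z - y
y1, z1 = second_order_euler_step(f, 0.0, 0.2, 0.2, 0.1)
print(y1, z1)  # 0.22 and 0.1992
```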
Exercises
7.1. Consider the IVP y' = −xy, y(1) = 2.

    (a) Verify that y(x) = 2 exp( (1 − x^2)/2 ) is the solution.
    (b) Apply Picard's method three times.
    y'' − 0.1(1 − y^2) y' + y = 0,  y(0) = 0,  y'(0) = 1.
y = x2 y + y 2 y ,
Introduction to
Optimization
Chapter 8
Basic Concepts
FIGURE 8.1: A cylindrical can with diameter d and height h. Its volume is
πd^2 h/4 and its surface area is πd^2/2 + πdh.
• the height of the can must be at least 50% greater than its diameter,
• the can’s height cannot be more than twice its diameter, and
• the volume of the can must be at least 330 ml.
The goal is to design a can that would require the minimum possible amount
of aluminum.
We start formulating an optimization model by introducing the decision
variables, which are the parameters whose values need to be determined in
order to solve the problem. Clearly, in this example the decision variables are
given by the parameters that define a can’s design:
h = the can’s height, in cm
d = the can’s diameter, in cm.
Next we need to state our objective function, i.e., the quantity that we need
to optimize, in mathematical terms, as a function of the decision variables. It
is reasonable to assume that the aluminum sheets used have a fixed thickness.
Then the amount of aluminum used is determined by the surface area of the
cylinder can, which consists of two disks of diameter d (top and bottom) and
a rectangular side of height h and width given by the circumference of the
circle of diameter d, which is πd. Hence, the total surface area is given by:
    f(h, d) = 2 (πd^2/4) + πdh = πd^2/2 + πdh.   (8.1)

The function f(h, d) is the objective function that we want to minimize.
The function f (h, d) is the objective function that we want to minimize.
Finally, we need to specify the constraints, i.e., the conditions that the
design parameters are required to satisfy. The first requirement states that
the height of the can must be at least 50% greater than its diameter, which
can be expressed as

    h ≥ 1.5d  ⇔  1.5d − h ≤ 0.   (8.2)

According to the second requirement, the can's height cannot be more than
twice its diameter, i.e.,

    h ≤ 2d  ⇔  −2d + h ≤ 0.   (8.3)

The third requirement is that the volume of the can must be at least 330 ml:

    πd^2 h/4 ≥ 330   (8.4)

(note that 1 ml = 1 cm^3).
It is also important to note that the can’s height and diameter must always
be nonnegative, i.e., h, d ≥ 0, however, it is easy to see that this is guaranteed
by the constraints (8.2) and (8.3): if we add these constraints together, we
obtain d ≥ 0, and from (8.2) we have h ≥ 1.5d ≥ 0. Thus, nonnegativity
constraints are redundant and do not need to be included in the model.
In summary, we have the following optimization problem:

    minimize    πd^2/2 + πdh
    subject to  1.5d − h ≤ 0
                −2d + h ≤ 0                  (8.5)
                πd^2 h/4 ≥ 330.
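Problem (8.5) is simple enough to explore by brute force. In the sketch below (our own code, not from the text), we use the observation that for a fixed diameter d the surface area is increasing in h, so the best feasible height is the smallest h satisfying both h ≥ 1.5d and πd^2 h/4 ≥ 330, provided it does not exceed 2d:

```python
import math

def best_can(d_lo=1.0, d_hi=20.0, steps=20000):
    """Grid search for problem (8.5): minimize pi*d^2/2 + pi*d*h subject to
    1.5d <= h <= 2d and pi*d^2*h/4 >= 330."""
    best = (float("inf"), None, None)
    for i in range(steps + 1):
        d = d_lo + (d_hi - d_lo) * i / steps
        h = max(1.5 * d, 4 * 330 / (math.pi * d**2))  # smallest feasible height
        if h <= 2 * d:                                # otherwise d is infeasible
            area = math.pi * d**2 / 2 + math.pi * d * h
            if area < best[0]:
                best = (area, d, h)
    return best

area, d, h = best_can()
print(round(d, 2), round(h, 2), round(area, 1))  # about d = 6.54, h = 9.81, area = 268.9
```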
Example 8.2 In computational physics, one is interested in locating a set of
n points {p(i) = [x1(i), x2(i), x3(i)]^T : i = 1, . . . , n} on a unit sphere in IR^3 with
the goal of minimizing the Lennard-Jones potential

    Σ_{1≤i<j≤n} ( 1/dij^12 − 2/dij^6 ),

where dij = ||p(i) − p(j)|| is the Euclidean distance between p(i) and p(j), i.e.,

    dij = sqrt( (x1(i) − x1(j))^2 + (x2(i) − x2(j))^2 + (x3(i) − x3(j))^2 ).

In this case, we have 3n decision variables describing the coordinates of n
points in IR^3, and putting the sphere's center in the origin of the coordinate
system in IR^3, the only constraints are

    ||p(i)|| = 1,  i = 1, . . . , n.

Thus, the problem can be formulated as follows:

    minimize    Σ_{1≤i<j≤n} ( ||p(i) − p(j)||^{−12} − 2 ||p(i) − p(j)||^{−6} )
    subject to  ||p(i)|| = 1,  p(i) ∈ IR^3,  i = 1, . . . , n.

Here p(i) = [x1(i), x2(i), x3(i)]^T for i = 1, . . . , n.
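Evaluating this objective is straightforward; a sketch in pure Python (names ours). For two antipodal points on the unit sphere, d12 = 2 and the potential equals 2^{−12} − 2 · 2^{−6}:

```python
import math

def lj_potential(points):
    """Lennard-Jones potential: sum over pairs i < j of dij^-12 - 2*dij^-6."""
    total = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            total += d**-12 - 2 * d**-6
    return total

poles = [(0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
print(lj_potential(poles))  # 2**-12 - 2*2**-6 ≈ -0.03101
```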
are index sets for the equality and inequality constraints, respectively. Note
that the description above covers a wide variety of possible scenarios. For
example, in some problems the decision variables are required to be binary:
xj ∈ {0, 1}. We can easily represent this constraint in the equivalent form
xj (1 − xj ) = 0.
Also, if we are given an inequality constraint in the form g1 (x) ≥ g2 (x), we
can convert it into an equivalent inequality in the form g3 (x) ≤ 0 by letting
g3 (x) = g2 (x) − g1 (x). Obviously,
g1 (x) ≥ g2 (x) ⇔ g2 (x) − g1 (x) ≤ 0.
If the feasible region X is described by a set of equality and inequality con-
straints,
X = {x ∈ IRn : hi (x) = 0, gj (x) ≤ 0, i ∈ E, j ∈ I},
then, obviously, a point x̄ is feasible if and only if it satisfies each of the
equality and inequality constraints defining the feasible region X. For a feasible
x̄ ∈ X, and an inequality constraint gj (x) ≤ 0, we may have a strict inequality,
gj (x̄) < 0 or an equality gj (x̄) = 0 satisfied.
FIGURE 8.2: The feasible region X given by the intersection of two disks,
with points x̂, x̃, and x̄.
Example 8.3 Consider the feasible set X defined by two inequality constraints,

    X = {x ∈ IR^2 : x1^2 + x2^2 ≤ 1, (x1 − 1)^2 + x2^2 ≤ 1}.

Geometrically, this set is given by the points in the intersection of two disks
corresponding to the constraints, as illustrated in Figure 8.2. In this figure,
none of the constraints is active at x̂; only the first constraint is active at x̃,
and both constraints are active at x̄.
FIGURE 8.3: Examples of interior points (x and x′), boundary points (y
and y′), feasible directions (d and d′), and infeasible directions (h and h′).
    minimize   f(x)
    subject to x ∈ X.   (8.7)

FIGURE 8.4: Examples of functions that do not attain a global minimum on
their domains: f(x) = 1/x, f(x) = exp(x), and f(x) = ln(x).
A global minimizer does not always exist, even if the function is bounded
from below. Consider, for example, the function f(x) = exp(x), x ∈ IR. It is
bounded from below, f(x) ≥ 0 for all x ∈ IR; however, there does not exist a point
x ∈ IR such that exp(x) = 0. Since lim_{x→−∞} exp(x) = 0, c = 0 is the greatest
lower bound for f(x) = exp(x), which is referred to as the infimum of f.
FIGURE 8.5: A function with two strict local minimizers (x1 and x2), one
strict global minimizer (x2), and infinitely many local minimizers that are not
strict (all points in the interval [x3, x5]).
Theorem 8.1 (Weierstrass) The problems min f (x) and max f (x),
x∈X x∈X
where X ⊂ IRn is a compact set and f : X → IR is a continuous function,
have global optimal solutions.
In other words, a continuous function attains its minimum and maximum over
a compact set.
Another case where a global minimum is guaranteed to exist is given by
continuous coercive functions.
FIGURE 8.7: The level set, lower level set, and upper level set of f(x) =
x1^2 + x2^2 at the level c = 1.

Given c ∈ IR, the set {x ∈ IR^n : f(x) = c} is called the level set of f at
the level c; the set {x ∈ IR^n : f(x) ≤ c} is called the lower level set of f at
the level c; and the set {x ∈ IR^n : f(x) ≥ c} is called the upper level set
of f at the level c.
Example 8.4 For f(x) = x1^2 + x2^2, the level set, lower level set, and upper
level set at the level c = 1 are given by {x ∈ IR^2 : x1^2 + x2^2 = 1}, {x ∈ IR^2 :
x1^2 + x2^2 ≤ 1}, and {x ∈ IR^2 : x1^2 + x2^2 ≥ 1}, respectively. Geometrically, the level
set is the unit radius circle centered at the origin, the lower level set consists
of the circle and its interior, and the upper level set consists of the circle and
its exterior. See Figure 8.7 for an illustration.
    dg(t̄)/dt = 0.   (8.10)

On the other hand, using the chain rule,

    dg(t̄)/dt = df(x(t̄))/dt = ∇f(x(t̄))^T x′(t̄).   (8.11)
FIGURE 8.8: The gradient ∇f(x̄) of y = f(x1, x2) at a point x̄ in the level
set {x : f(x) = c} is orthogonal to the tangent x′(t̄) of a curve x = x(t) passing
through x̄ within the level set.

So, denoting x̄ = x(t̄), from the last two equations we obtain

    ∇f(x̄)^T x′(t̄) = 0.

Geometrically, this means that the vectors ∇f(x̄) and x′(t̄) are orthogonal. Note
that x′(t̄) represents the tangent line to x(t) at x̄. Thus, we have the following
property. Let f(x) be a continuously differentiable function, and let x(t) be a
continuously differentiable curve passing through x̄ in the level set of f(x) at
the level c = f(x̄), where x(t̄) = x̄, x′(t̄) ≠ 0. Then the gradient of f at x̄ is
orthogonal to the tangent line of x(t) at x̄. This is illustrated in Figure 8.8.
Example 8.5 For f(x) = x1^2 + x2^2 and x̃ = [1/√2, 1/√2]^T, ∇f(x̃) =
[√2, √2]^T. The gradient ∇f(x̃) is illustrated in Figure 8.9, which also shows
the level set of f at the level f(x̃) = 1. Clearly, the gradient is orthogonal to
the level set.
Example 8.6 Let f(x) = −x1^2 − 4x2^2. Then ∇f(x) = [−2x1, −8x2]^T. Consider
two different points,

    x̃ = [2, 0]^T and x̄ = [√3, 1/2]^T.
FIGURE 8.9: Gradient ∇f(x̃) of f(x) = x1^2 + x2^2 at x̃ = [1/√2, 1/√2]^T.
Then

    ∇f(x̃) = [−4, 0]^T and ∇f(x̄) = [−2√3, −4]^T,

and

    ∇f(x̃)^T x′(t̃) = ∇f(x̄)^T x′(t̄) = 0.
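The orthogonality property is easy to test numerically: parameterize the level curve and compare the gradient with the curve's tangent. A sketch (our own code) for Example 8.5, where the level set of f(x) = x1^2 + x2^2 at c = 1 is the unit circle x(t) = (cos t, sin t):

```python
import math

def grad_f(x1, x2):
    """Gradient of f(x) = x1^2 + x2^2."""
    return (2 * x1, 2 * x2)

t = math.pi / 4                        # x(t) = (1/sqrt(2), 1/sqrt(2))
x = (math.cos(t), math.sin(t))
tangent = (-math.sin(t), math.cos(t))  # x'(t) on the unit circle
g = grad_f(*x)
dot = g[0] * tangent[0] + g[1] * tangent[1]
print(g, dot)  # gradient ≈ (sqrt(2), sqrt(2)), dot product ≈ 0
```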
FIGURE 8.10: Examples of convex sets (X1, X2, with points x, x̂, x̃, x̄, x̌)
and nonconvex sets (X3, X4).
For example, geometrically, the convex combination of two points is the line
segment between these two points, and the convex combination of three
noncollinear points is a triangle.
In other words, X is convex if all points located on the line segment connecting
x and y are in X. See Figure 8.10 for examples.
X = {x ∈ IRn : A1 x = b1 , A2 x ≤ b2 },
174 Numerical Methods and Optimization: An Introduction
For example, the quadrangle X2 in Figure 8.10 is a polyhedral set that can
be described by four linear inequalities, each defining a side of the quadrangle.
Example 8.8 For the convex set X1 in Figure 8.10, any point x that lies
on the curve between x̄ and x̃, inclusive, is an extreme point, thus X1 has
infinitely many extreme points. The other convex set in this figure, X2 , has
only 4 extreme points, x̃, x̂, x̄, and x̌.
FIGURE 8.11: A convex function f: for any x, y and α ∈ (0, 1), the chord
value αf(x) + (1 − α)f(y) lies above f(αx + (1 − α)y).
    f(αx + (1 − α)y) < αf(x) + (1 − α)f(y)  ∀x, y ∈ X, y ≠ x, α ∈ (0, 1).   (8.14)

For example, f(x) = 1/x is strictly convex on (0, +∞): for any x, y ∈
(0, +∞) such that x ≠ y and any α ∈ (0, 1), the following inequality holds:

    1/(αx + (1 − α)y) < α/x + (1 − α)/y.   (8.15)
Theorem 8.4 Any local minimizer of a convex problem is its global min-
imizer.
Thus, f (x∗ +α(x̂−x∗ )) < f (x∗ ) for any α ∈ (0, 1), which contradicts the local
optimality of x∗ , since any ε-ball centered at x∗ contains a point x∗ +α(x̂−x∗ )
for some α ∈ (0, 1). Therefore, x̂ ∈ X with f (x̂) < f (x∗ ) cannot exist, and
f (x∗ ) ≤ f (x) for any x ∈ X, meaning that x∗ is a global minimizer of the
considered problem.
Proof. We first show that if f is convex, then (8.16) holds. Consider arbitrary
x, y ∈ X and the direction d = y − x. The directional derivative of f at x in
the direction d is

    ∇f(x)^T (y − x) = lim_{α→0+} ( f(x + α(y − x)) − f(x) )/α
                    = lim_{α→0+} ( f(αy + (1 − α)x) − f(x) )/α
                    ≤ lim_{α→0+} ( αf(y) + (1 − α)f(x) − f(x) )/α
                    = f(y) − f(x).

To obtain the inequality above, we used the definition of convexity for f. Thus,
we have proved that if f is convex then (8.16) holds.

To prove the other direction, we assume that (8.16) holds. We need to
show that this yields the convexity of f on X. For an arbitrary α ∈ (0, 1) we
denote z = αx + (1 − α)y ∈ X. Applying (8.16) with the pairs (z, x) and
(z, y), we obtain

    f(x) ≥ f(z) + ∇f(z)^T (x − z),   (8.17)
    f(y) ≥ f(z) + ∇f(z)^T (y − z).   (8.18)

FIGURE 8.12: First-order characterization of convexity: the graph of a
convex function lies above its tangent line, f(y) ≥ f(x) + ∇f(x)^T (y − x).

Multiplying (8.17) by α and (8.18) by (1 − α), and adding the resulting
inequalities, we obtain

    αf(x) + (1 − α)f(y) ≥ f(z) + ∇f(z)^T (αx + (1 − α)y − z) = f(αx + (1 − α)y).

So, f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y), and f(x) is convex on X by the
definition of a convex function.
where λmin is the smallest eigenvalue of ∇²f((1 − α)x + αy). From (8.22)
and (8.23) we obtain

    f(y) − f(x) − ∇f(x)^T (y − x) ≥ (1/2) λmin ||y − x||^2 ≥ 0.

Since x, y ∈ X were chosen arbitrarily, this inequality holds for any x, y ∈ X.
Thus, by the first-order characterization of a convex function, f is convex on
X.
Exercises
8.1. Consider the problem minn f (x), where f : IRn → IR is an arbitrary
x∈IR
continuous function. Prove or disprove each of the following independent
statements concerning this problem.
(a) If the considered problem has a local minimizer that is not strict,
then it has infinitely many local minimizers.
(b) If x∗ is a strict local minimizer and a global minimizer for the
considered problem, then x∗ is a strict global minimizer for the
same problem.
(c) If x∗ is a strict global minimizer of f , then x∗ is a strict local
minimizer of f.
(d) If there exists a constant C such that f (x) ≥ C for all x ∈ IRn ,
then f has at least one local minimizer.
(e) If f is convex, it cannot have local minimizers x̄ and x̂ such that
f (x̂) < f (x̄).
(f) If f has a unique global minimizer and no other local minimizers,
then f is convex.
(g) If f is strictly convex, then any local minimizer is its strict global
minimizer.
respect to each variable (that is, fixing the values of n−1 variables turns
f (x) into a linear function of a single variable), then the considered
problem is equivalent to the problem min n f (x) of minimizing f (x)
x∈[0,1]
over the n-dimensional unit hypercube [0, 1]n .
8.3. In graph theory, a simple, undirected graph is a pair G = (V, E), where
Basic Concepts 181
where e = [1, . . . , 1]T ∈ IRn , and AG = [aij ]n×n with aij = 1 if {i, j} ∈ E
and aij = 0 otherwise. Hint: observe that the objective function of the
above problem is linear with respect to each variable.
    (a) f(x) = (x^2 − x + 2)/(x^2 + 2x + 2),  X = IR;
    (b) f(x) = (x1 + 2x2 − 4x3 + 1)^2 + (2x1 − x2 + 1)^4 + (x1 − 3x2 + 2)^6,
        X = IR^3.
8.5. Does the function f(x) = exp(x1^2 + x2^2 + x3^2) − x1^3 − x2^4 − x3^6 have a global
minimizer in IR^3? Why?
8.6. Find all local and global minimizers, if any, for the following problems
graphically.
    (a) minimize   2x − x^2
        subject to 0 ≤ x ≤ 3.
    (b) minimize   −(x1^2 + x2^2)
        subject to x1 ≤ 1.
    (c) minimize   x1 − (x2 − 2)^3 + 1
        subject to x1 ≥ 1.
    (d) minimize   x1^2 + x2^2
        subject to x1^2 + 9x2^2 = 9.
    (e) minimize   x1^2 + x2^2
        subject to x1^2 − x2 ≤ 4
                   x2 − x1 ≤ 2.
    (f) minimize   x2 − x1
        subject to x1^2 − x2^3 = 0.
8.7. Let f : IRn → IR be an arbitrary convex function. Show that for any
constant c the set X = {x ∈ IRn : f (x) ≤ c} is convex.
is also convex.
8.9. Let c ∈ IR^n \ {0} and b ∈ IR. Show that the hyperplane

    H = {x ∈ IR^n : c^T x = b}

and the half-spaces

    H+ = {x ∈ IR^n : c^T x ≥ b},
    H− = {x ∈ IR^n : c^T x ≤ b}

are convex sets.
8.10. (Jensen’s inequality) Let f : X → IR, where X ⊆ IRn is a convex set.
Prove that f is a convex function if and only if for any x(1) , . . . , x(k) ∈ X
and coefficients α1 , . . . , αk such that
k
αi = 1, αi ≥ 0, i = 1, . . . , m, (8.24)
i=1
(b) If we denote by
3 4
k
k
Δ = Conv{x (1)
,...,x (k)
}= x= αi x (i)
: αi = 1, αi ≥ 0 ,
i=1 i=1
then
max f (x) = max f (x(i) ).
x∈Δ 1≤i≤k
(a) Sketch epi(f ) for f (x) = x2 , X = IR and for f (x) = x21 + x22 ,
X = IR2 .
(b) Show that f is a convex function if and only if epi(f ) is a convex
set.
8.15. Let f : IRm → IR be a convex function. For an m × n matrix A and a
vector b ∈ IRm , show that g : IRn → IR defined by g(x) = f (Ax + b) is
a convex function.
8.16. Let x = [xj]_{j=1}^{n}. Consider the problem

    minimize    Σ_{j=1}^{m0} α0j ∏_{r=1}^{n} xr^{σ0j(r)}

    subject to  Σ_{j=1}^{mi} αij ∏_{r=1}^{n} xr^{σij(r)} ≤ 1,  i = 1, . . . , m,

                xj > 0,  j = 1, . . . , n,
Algorithm 9.1 A naive algorithm for computing the nth power of an integer.
1: Input: a, n
2: Output: an
3: answer = 1
4: for i = 1, . . . , n do
5: answer=answer×a
6: end for
7: return answer
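In Python, Algorithm 9.1 reads (a sketch; the function name is ours):

```python
def naive_power(a, n):
    """Algorithm 9.1: compute a**n with n multiplications."""
    answer = 1
    for _ in range(n):
        answer *= a
    return answer

print(naive_power(3, 5))  # 243
```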
Example 9.3 Assume that the input of a problem is given by a single integer.
The number of symbols required to represent an integer n in a base-β
arithmetic system is ⌈log_β n⌉, where β ≥ 2. Hence, the size of the problem is
Θ(log n).
where P is the product of all nonzero entries of A. Since the number of bits
allocated to numbers of the same type in modern computers can be treated as
Obviously, the size of the problem input depends on the data structures
used to represent the problem input.
Alternatively, the same graph can be represented by its adjacency lists, where
for each vertex v ∈ V we record the set of vertices adjacent to it. In adjacency
lists, there are 2|E| elements, each requiring O(log |V |) bits. Hence, the total
space required is O(|E| log |V |) and O(|E|) space is a reasonable approximation
of the size of a graph.
The next two examples analyze the running time of two classical algorithms
for the sorting problem, which, given n integers asks to sort them in a non-
decreasing order. In other words, given an input list of numbers a1 , a2 , . . . , an ,
we need to output these numbers in a sorted list as1 , as2 , . . . , asn , where as1 ≤
as2 ≤ . . . ≤ asn and {s1 , s2 , . . . , sn } = {1, 2, . . . , n}.
1. Pick an element, called a pivot, from the list. We can pick, e.g., the first
element in the list as the pivot.
2. Reorder the list so that all elements which are less than the pivot come
before the pivot and all elements greater than the pivot come after it
(equal values can go either way). Note that after this partitioning, the
pivot assumes its final position in the sorted list.
3. Recursively sort the sub-list of lesser elements and the sub-list of greater
elements.
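The three steps above can be sketched in Python as follows (a compact, non-in-place version for illustration; a production implementation would partition in place):

```python
def quicksort(lst):
    """Sort lst in nondecreasing order using the pivot scheme described above."""
    if len(lst) <= 1:
        return lst
    pivot = lst[0]                                   # step 1: first element as pivot
    lesser = [x for x in lst[1:] if x < pivot]       # step 2: partition around pivot
    greater = [x for x in lst[1:] if x >= pivot]
    # step 3: recursively sort the two sub-lists; the pivot lands in its final place
    return quicksort(lesser) + [pivot] + quicksort(greater)
```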
$$T(n) = (n - 1) + (n - 2) + (n - 3) + \ldots + 1 = \frac{n(n - 1)}{2} = \Theta(n^2).$$
Thus, Quicksort is an $\Omega(n^2)$ algorithm.
Note that the merge operation involves $n - 1$ comparisons. Let $T(n)$ denote the run time of Mergesort applied to a list of $n$ numbers. Then $T(1) = 0$ and
$$\begin{aligned}
T(n) &\le 2T(n/2) + n - 1 \\
&\le 2(2T(n/4) + n/2 - 1) + n - 1 \\
&= 4T(n/4) + (n - 2) + (n - 1) \\
&\;\;\cdots \\
&\le 2^k T(n/2^k) + \sum_{i=0}^{k-1} (n - 2^i) \\
&= 2^k T(n/2^k) + kn - (2^k - 1).
\end{aligned}$$
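A Python sketch of Mergesort (our transcription); the counter confirms that each merge of two sorted lists of total length n uses at most n − 1 comparisons:

```python
def merge(left, right, counter):
    """Merge two sorted lists, counting element comparisons in counter[0]."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        counter[0] += 1                      # one comparison per loop iteration
        if left[i] <= right[j]:
            result.append(left[i]); i += 1
        else:
            result.append(right[j]); j += 1
    return result + left[i:] + right[j:]     # append the leftover tail

def mergesort(lst, counter):
    """Recursively split, sort, and merge."""
    if len(lst) <= 1:
        return lst
    mid = len(lst) // 2
    return merge(mergesort(lst[:mid], counter),
                 mergesort(lst[mid:], counter), counter)
```

For $n = 2^k$ the total comparison count is bounded by $kn - (2^k - 1) = n \log_2 n - n + 1$, matching the recurrence above.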
Note that $q_i$ and $q_j$ are compared if and only if none of the numbers $q_{i+1}, \ldots, q_{j-1}$ appears before both $q_i$ and $q_j$ in the list $a_1, \ldots, a_n$. The probability of this happening is
$$p_{ij} = \frac{2}{j - i + 1}.$$
Hence,
$$\begin{aligned}
A(n) &= \sum_{i=1}^{n} \sum_{j=i+1}^{n} p_{ij} = \sum_{i=1}^{n} \sum_{j=i+1}^{n} \frac{2}{j - i + 1} \\
&= \sum_{i=1}^{n} \left( \frac{2}{2} + \frac{2}{3} + \frac{2}{4} + \ldots + \frac{2}{n - i + 1} \right) \\
&< \sum_{i=1}^{n} 2 \left( \frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \ldots + \frac{1}{n} \right) \\
&= 2n \sum_{i=2}^{n} \frac{1}{i} \le 2n \ln n = O(n \log n).
\end{aligned}$$
Algorithm 9.2 A Las Vegas algorithm for finding the repeated element.
1: Input: a1 , a2 , . . . , an , where n/2 of the values are identical
2: Output: the repeated element
3: stop = 0
4: while stop = 0 do
5: randomly pick two elements ai and aj
6: if (i ≠ j) and (ai = aj ) then
7: stop = 1
8: return ai
9: end if
10: end while
ensure the correct answer with high probability. The probability of success in a single try is
$$P_s = \frac{\frac{n}{2}\left(\frac{n}{2} - 1\right)}{n^2} = \frac{\frac{n^2}{4} - \frac{n}{2}}{n^2} = \frac{1}{4} - \frac{1}{2n} > \frac{1}{8} \quad \text{if } n > 4.$$
Hence, the probability of failure in a single try is
Pf = 1 − Ps < 7/8,
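Algorithm 9.2 can be sketched in Python as follows (with the loop simply running until success; `random.randrange` picks the two positions uniformly):

```python
import random

def find_repeated(a):
    """Las Vegas search for the element repeated n/2 times in a."""
    n = len(a)
    while True:
        i = random.randrange(n)          # randomly pick two positions
        j = random.randrange(n)
        if i != j and a[i] == a[j]:      # distinct positions, equal values
            return a[i]
```

Since each try succeeds with probability greater than 1/8 for n > 4, the expected number of iterations is bounded by a constant.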
Alan M. Turing showed that there cannot exist an algorithm that would always solve the halting problem. The mathematical formalism he introduced in the 1930s to describe a computer implementation of algorithms is called a Turing
machine and is a foundational concept in computational complexity theory.
Example 9.13 Consider a lottery, where some black box contains a set W of
n/2 distinct winning numbers randomly chosen from the set N = {1, 2, . . . , n}
of n integers. The problem is to guess all the elements of W by picking n/2
numbers from N . As soon as the pick is made, we can easily verify whether the
set of numbers picked coincides with W . Winning such a lottery (i.e., guessing
all the winning numbers correctly) is not easy, but not impossible.
9.4.1 Class N P
We will focus on decision problems that we can potentially solve by pro-
viding an efficiently verifiable certificate proving that the answer is “yes” (as-
suming that we are dealing with a yes instance of the problem). Such a class
of problems is defined next.
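As an illustration of such a certificate check, consider the clique problem of Example 9.14: a certificate for a yes instance is simply a vertex subset, and verifying it takes polynomial time. A sketch (the representation of the graph as a set of edge pairs is our assumption):

```python
from itertools import combinations

def verify_clique(edges, certificate, k):
    """Polynomial-time check that `certificate` is a clique of size >= k."""
    if len(set(certificate)) < k:
        return False
    # every pair of certificate vertices must be adjacent
    return all((u, v) in edges or (v, u) in edges
               for u, v in combinations(certificate, 2))
```

The check examines at most $\binom{|certificate|}{2}$ pairs, so it runs in time polynomial in the input size, as the definition of N P requires.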
9.4.2 P vs. N P
Note that any problem from P is also in N P (i.e., P ⊆ N P), so there are easy problems in N P. Thus, our original question (9.1) of distinguishing between "easy" and "hard" problems now becomes:
Π1 ∝ Π2 and Π2 ∝ Π3 ⇒ Π1 ∝ Π3 . (9.4)
Indeed, let A1 and A2 be the two polynomial time algorithms used to reduce Π1 to Π2, and let A′1 and A′2 be the two polynomial time algorithms used to reduce Π2 to Π3. To reduce Π1 to Π3, we first convert the instance π1 of Π1 into an instance π2 of Π2 using A1, and then convert π2 into an instance π3 of Π3 using A′1. Then we solve Π3 and transform the answer into the answer to Π2 using A′2, which we then convert into the answer to Π1 using A2.
1. Π ∈ N P, and
$$F = \bigwedge_{i=1}^{m} C_i,$$
where x1 , . . . , xk−3 are new variables. The new clauses are satisfiable if
and only if Ci is.
Example 9.16 Next, we will show that the recognition version of the max-
imum clique problem, clique, described in Example 9.14, is N P-complete.
Recall that clique is defined as follows. Given a simple undirected graph
G = (V, E) and an integer k, does there exist a clique of size ≥ k in G?
• Two vertices $(C_i, \ell_{ij})$ and $(C_p, \ell_{pq})$ are connected by an edge if and only if $i \ne p$ and $\ell_{ij} \ne \bar{\ell}_{pq}$.
Then F is satisfiable if and only if G has a clique of size m.
Note that many optimization problems of interest, such as the optimization
version of the maximum clique problem (see Example 9.14), are unlikely to
belong to class N P. Hence, we cannot claim that the optimization version of
the maximum clique problem is N P-complete, despite the fact that it is at
least as hard as clique (see Exercise 9.9). If we drop the first requirement
(Π ∈ N P), we obtain the class of problems called N P-hard.
Proof: We will use a polynomial time reduction from 3-SAT to problem (9.5).
Let
$$F = \bigwedge_{i=1}^{m} C_i,$$
where $C_i = \ell_{i1} \vee \ell_{i2} \vee \ell_{i3}$ and each literal $\ell_{ij}$, $i = 1, \ldots, m$, $j = 1, 2, 3$, is either some variable $x_k$ or its negation $\bar{x}_k$, $k = 1, \ldots, n$, be an arbitrary instance of
the 3-SAT problem. For each such F , we construct an instance of a quadratic
optimization problem with the real vector of variables x = [x0 , x1 , . . . , xn ]T as
follows. For each clause $C_i = \ell_{i1} \vee \ell_{i2} \vee \ell_{i3}$, we construct one linear inequality constraint. The left-hand side of this inequality is given by the sum of four terms, one of which is $x_0$, and the remaining three depend on the three literals of $C_i$. Namely, if $\ell_{ij} = x_k$ for some $k$, then we add $x_k$ to the inequality, and if $\ell_{ij} = \bar{x}_k$, we add $(1 - x_k)$ to the inequality. The sum of these four terms is required to be at least $\frac{3}{2}$ to obtain the inequality corresponding to $C_i$. If we denote by $I_i$ the set of indices of variables that appear as literals in $C_i$, and by $\bar{I}_i$ the set of indices of variables whose negations are literals in $C_i$, then we have the following correspondence:

Literal in clause $C_i$ ←→ linear term
$\ell_{ij} = x_k$, $k \in I_i$ ←→ $x_k$
$\ell_{ij} = \bar{x}_k$, $k \in \bar{I}_i$ ←→ $1 - x_k$,
and the linear inequality representing $C_i$ is given by
$$x_0 + \sum_{k \in I_i} x_k + \sum_{k \in \bar{I}_i} (1 - x_k) \ge \frac{3}{2}.$$
$$\left.\begin{array}{ll}
A'_F x \ge \frac{3}{2} + c & \\[2pt]
x_0 - x_i \ge -\frac{1}{2}, & i = 1, \ldots, n \\[2pt]
x_0 + x_i \ge \frac{1}{2}, & i = 1, \ldots, n
\end{array}\right\} \;\equiv\; A_F x \ge b_F. \tag{9.6}$$
We will use the following indefinite quadratic functions in the proof. Let
$$f_1(x) = \sum_{i=1}^{n} (x_0 + x_i - 1/2)(x_0 - x_i + 1/2)$$
and
$$f_2(x) = \sum_{i=1}^{n} (x_0 + x_i - 1/2)(x_0 - x_i + 1/2) - \frac{1}{2n} \sum_{i=1}^{n} (x_i - 1/2)^2.$$
Note that $f_1(x) = n x_0^2 - \sum_{i=1}^{n} (x_i - 1/2)^2$, i.e., it is a separable indefinite quadratic function with one convex term $n x_0^2$ and $n$ concave terms $-(x_i - 1/2)^2$, $i = 1, \ldots, n$. In addition,
$$f_2(x) = f_1(x) + q(x), \quad \text{where} \quad q(x) = -\frac{1}{2n} \sum_{i=1}^{n} (x_i - 1/2)^2.$$
Also, we will use a vector x̂ in the proof, which is defined as follows for an
arbitrary x0 and Boolean x1 , . . . , xn satisfying F :
$$\hat{x}_i = \begin{cases} 1/2 - x_0 & \text{if } x_i = 0 \\ 1/2 + x_0 & \text{if } x_i = 1, \end{cases} \qquad i = 1, \ldots, n. \tag{9.7}$$
minimize f2 (x)
subject to AF x ≥ bF (9.8)
x ≥ 0.
then having one k ∈ {1, 2, 3} such that yk > 1/2 is sufficient for x
in (9.10) to satisfy Ci . Next, we show that such k always exists if y
satisfies (9.9). We use contradiction.
Assume, by contradiction, that $y_k \le 1/2$, $k = 1, 2, 3$, for the clause $C_i$ above. Then the inequality corresponding to $C_i$ in $A_F y \ge b_F$ can be written as
$$y_0 + y_1 + y_2 + y_3 \ge \frac{3}{2}.$$
For example, for the clause $\bar{x}_1 \vee x_2 \vee x_3$ we have
$$y_0 + y_1 + y_2 + y_3 = y_0 + (1 - y_1) + y_2 + y_3 \ge \frac{3}{2}.$$
For this inequality to hold, we must have $y_k \ge \frac{1}{2} - \frac{y_0}{3}$ for at least one $k \in \{1, 2, 3\}$. Without loss of generality, consider the case $k = 1$ (other cases are established analogously). By our assumption, $y_1 \le 1/2$, so
$$\frac{1}{2} - \frac{y_0}{3} \le 1 - y_1 \le \frac{1}{2} \quad\Rightarrow\quad -\frac{y_0}{3} \le (1 - y_1) - \frac{1}{2} \le 0.$$
Hence,
$$(y_1 - 1/2)^2 \le \frac{y_0^2}{9}. \tag{9.11}$$
Also, since y is feasible for (9.8), it must satisfy the constraints in (9.6),
so
$$y_0 - y_i \ge -\frac{1}{2}, \quad y_0 + y_i \ge \frac{1}{2} \quad\Rightarrow\quad -y_0 \le y_i - \frac{1}{2} \le y_0, \quad i = 1, \ldots, n,$$
and thus
$$y_0^2 - (y_i - 1/2)^2 \ge 0, \quad i = 1, \ldots, n. \tag{9.12}$$
Therefore, using (9.12) and then (9.11), we obtain:
$$\begin{aligned}
f_1(y) &= \sum_{i=1}^{n} (y_0 + y_i - 1/2)(y_0 - y_i + 1/2) \\
&= \sum_{i=1}^{n} \left[ y_0^2 - (y_i - 1/2)^2 \right] \\
&\ge y_0^2 - (y_1 - 1/2)^2 \\
&\ge \frac{8}{9} y_0^2.
\end{aligned}$$
Also, by (9.12),
$$q(y) = -\frac{1}{2n} \sum_{i=1}^{n} (y_i - 1/2)^2 \ge -\frac{1}{2} y_0^2.$$
Hence
$$f_2(y) = f_1(y) + q(y) \ge \frac{8}{9} y_0^2 - \frac{1}{2} y_0^2 = \frac{7}{18} y_0^2 \ge 0,$$
a contradiction with (9.9).
(b) We associate the following indefinite quadratic problem with the given
instance F of 3-SAT:
minimize f1 (x)
subject to AF x ≥ bF (9.13)
x ≥ 0.
Suppose now that $x^* = [0, 1/2, \ldots, 1/2]^T$ is not a strict local minimum, that is, there exists $y \ne x^*$ such that $f_1(y) = f_1(x^*) = 0$; therefore, $y_i \in \{1/2 - y_0, 1/2 + y_0\}$, $i = 1, \ldots, n$. Then the variables $x_i$, $i = 1, \ldots, n$, defined by
$$x_i(y) = \begin{cases} 0 & \text{if } y_i = 1/2 - y_0 \\ 1 & \text{if } y_i = 1/2 + y_0 \end{cases}$$
satisfy F.
(c) Finally, to prove (c), note that if we fix $x_0 = 1/2$ in the above indefinite quadratic problem, then the objective function $f_1(x)$ is concave with $x^*$ as the global minimum.
where
Bn = [0, 1]n = {x ∈ IRn : 0 ≤ xi ≤ 1, i = 1, . . . , n}
and f is Lipschitz continuous on $B_n$ with respect to the infinity norm:
$$|f(x) - f(y)| \le L \|x - y\|_\infty \quad \text{for all } x, y \in B_n.$$
Theorem 9.5 Let $\varepsilon < L/2$. Then for any zero-order method for solving problem (9.14) with an accuracy better than $\varepsilon$, there exists an instance of this problem that will require at least $\left\lfloor \frac{L}{2\varepsilon} \right\rfloor^n$ objective function evaluations.
Proof: Let $p = \lfloor L/(2\varepsilon) \rfloor$; then $p \ge 1$. Assume that there is a zero-order method M that needs $N < p^n$ function evaluations to solve any instance of our problem (9.14) approximately. We will use the so-called resisting strategy to construct a function f such that f(x) = 0 at any test point x used by the method M, so that the method can only find $\tilde{x} \in B_n$ with $f(\tilde{x}) = 0$. Note that splitting [0, 1] into p equal segments with the mesh points $\{i/p,\ i = 0, \ldots, p\}$ for each coordinate axis defines a uniform grid partitioning of $B_n$ into $p^n$ equal hypercubes with side 1/p. Since the method M uses $N < p^n$ function evaluations, at least one of these hypercubes does not contain any of the points used by the method in its interior. Let $\hat{x}$ be the center of such a hypercube.
We consider the function $\bar{f}(x) = \min\{0,\ L\|x - \hat{x}\|_\infty - \varepsilon\}$.
It is easy to check (Exercise 9.12) that $\bar{f}(x)$ is Lipschitz continuous with the constant L in the infinity norm, its global optimal value is $-\varepsilon$, and it differs from zero only inside the box $B = \{x : \|x - \hat{x}\|_\infty \le \varepsilon/L\}$, which, since $2p \le L/\varepsilon$, is a part of $B' \equiv \{x : \|x - \hat{x}\|_\infty \le \frac{1}{2p}\}$. Thus, $\bar{f}(x) = 0$ at all test points of the method M. We conclude that the accuracy of the result of our method cannot be better than $\varepsilon$ if the number of objective function evaluations is less than $p^n$.
of the $p^n$ constructed points, the uniform grid algorithm simply picks one, $\bar{x}$, that minimizes f(x) over $X_p$. Let $x^*$ and $f^* = f(x^*)$ be a global optimum and the optimal value of problem (9.14), respectively. Then $x^*$ must belong to one of the hypercubes $B_{i_1 i_2 \ldots i_n}$. Let $\tilde{x}$ be the center of that hypercube. Then
$$f(\bar{x}) - f^* \le f(\tilde{x}) - f^* \le L \|\tilde{x} - x^*\|_\infty \le \frac{L}{2p}.$$
Thus, if our goal is to find $\bar{x} \in B_n$ with $f(\bar{x}) - f^* \le \varepsilon$, then we need to choose p such that
$$\frac{L}{2p} \le \varepsilon \quad\Leftrightarrow\quad p \ge \left\lceil \frac{L}{2\varepsilon} \right\rceil.$$
The resulting uniform grid method is summarized in Algorithm 9.3.
This is a zero-order method, since it uses only the function value information. As we have shown, the uniform grid algorithm outputs an $\varepsilon$-approximate solution by performing a total of $\lceil L/(2\varepsilon) \rceil^n$ function evaluations.
Let the desired accuracy $\varepsilon$ be given, and let $L/2 > \varepsilon' > \varepsilon$ be such that
[Figure: the uniform grid $X_p$ for $n = 2$ and $p = 5$, consisting of the 25 points $x(i_1, i_2)$ with coordinates $\left( \frac{2i_1 - 1}{2p}, \frac{2i_2 - 1}{2p} \right)$, $i_1, i_2 \in \{1, \ldots, 5\}$.]
Algorithm 9.3 The uniform grid method.
1: $p = \lceil L/(2\varepsilon) \rceil$
2: $x(i_1, i_2, \ldots, i_n) = \left[ \frac{2i_1 - 1}{2p}, \frac{2i_2 - 1}{2p}, \ldots, \frac{2i_n - 1}{2p} \right]^T$ for all $i_k \in \{1, \ldots, p\}$, $k = 1, \ldots, n$
3: $\bar{x} = \arg\min_{x \in X_p} f(x)$, where $X_p = \{x(i_1, \ldots, i_n) : i_k \in \{1, \ldots, p\},\ k = 1, \ldots, n\}$
4: return $\bar{x}$
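Algorithm 9.3 can be sketched in Python for arbitrary n (the function name is ours; `itertools.product` enumerates the $p^n$ grid points):

```python
import math
from itertools import product

def uniform_grid_min(f, n, L, eps):
    """Return a grid point whose value is within eps of the global minimum of f
    on [0,1]^n, assuming f is L-Lipschitz in the infinity norm."""
    p = math.ceil(L / (2 * eps))
    best_x, best_val = None, float("inf")
    for idx in product(range(1, p + 1), repeat=n):    # all p**n grid points
        x = [(2 * i - 1) / (2 * p) for i in idx]      # hypercube centers
        val = f(x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val
```

For example, $f(x) = |x_1 - 0.3| + |x_2 - 0.7|$ is 2-Lipschitz in the infinity norm with optimal value 0, so the returned value is guaranteed to be at most $\varepsilon$.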
$$\left\lfloor \frac{L}{2\varepsilon'} \right\rfloor \ge \left\lfloor \frac{L}{2\varepsilon} \right\rfloor - 1$$
(note that such $\varepsilon'$ can always be selected). Recall from Theorem 9.5 that achieving accuracy better than $\varepsilon'$ using a zero-order method will require at least
$$L(n) = \left\lfloor \frac{L}{2\varepsilon'} \right\rfloor^n \ge \left( \left\lfloor \frac{L}{2\varepsilon} \right\rfloor - 1 \right)^n$$
objective function evaluations. Using the uniform grid method, we guarantee accuracy $\varepsilon$ using
$$U(n) = \left\lceil \frac{L}{2\varepsilon} \right\rceil^n \le \left( \left\lfloor \frac{L}{2\varepsilon} \right\rfloor + 1 \right)^n$$
objective function evaluations. If we select $\varepsilon \le L/(2n + 2)$, then the ratio of the upper and lower complexity bounds is
$$\frac{U(n)}{L(n)} \le \frac{\left( \left\lfloor \frac{L}{2\varepsilon} \right\rfloor + 1 \right)^n}{\left( \left\lfloor \frac{L}{2\varepsilon} \right\rfloor - 1 \right)^n} = \left( 1 + \frac{2}{\left\lfloor \frac{L}{2\varepsilon} \right\rfloor - 1} \right)^n \le \left( 1 + \frac{2}{n} \right)^n \to \exp(2), \quad n \to \infty.$$
This implies that the uniform grid method is an optimal zero-order method
for the considered problem.
Exercises
9.1. Given positive integer numbers a and n, develop an algorithm that com-
putes an in O(log n) time.
9.2. Prove or disprove:
(a) $\sum_{i=0}^{n} i^5 = \Theta(n^6)$.
(b) $n^{2.001} = \Theta(n^2)$.
(c) $n^5 + 10^{12} n^4 \log n = \Theta(n^5)$.
(d) $n 3^n + 2^n = O(n^3 2^n)$.
(e) If $f_1(n) = O(g_1(n))$ and $f_2(n) = O(g_2(n))$, then $\frac{f_1(n)}{f_2(n)} = O\left( \frac{g_1(n)}{g_2(n)} \right)$.
9.3. Consider an infinite array in which the first n cells contain integers in
sorted order, and the rest of the cells are filled with ∞. Propose an
algorithm that takes an integer k as input and finds the position of k
in the array or reports that k is not in the array in Θ(log n) time. Note
that the value of n is not given.
9.4. The following master theorem is used to solve recurrence relations. Let
a ≥ 1 and b > 1 be constants. Let f(n) be a function and let T(n) be defined on the nonnegative integers by the recurrence of the form
$$T(n) = aT(n/b) + f(n),$$
where we interpret $n/b$ to mean either $\lfloor n/b \rfloor$ or $\lceil n/b \rceil$. Then T(n) can be bounded asymptotically as follows.
1. If $f(n) = O(n^{\log_b a - \varepsilon})$ for some $\varepsilon > 0$, then $T(n) = \Theta(n^{\log_b a})$;
2. If $f(n) = \Theta(n^{\log_b a})$, then $T(n) = \Theta(n^{\log_b a} \log n)$;
3. If $f(n) = \Omega(n^{\log_b a + \varepsilon})$ for some $\varepsilon > 0$, and if $af(n/b) \le cf(n)$ for some constant $c < 1$ and all sufficiently large n, then $T(n) = \Theta(f(n))$.
9.12. Show that $\bar{f}(x) = \min\{0,\ L\|x - \hat{x}\|_\infty - \varepsilon\}$ in the proof of Theorem 9.5 is Lipschitz continuous with the constant L in the infinity norm, its global optimal value is $-\varepsilon$, and it differs from zero only inside the box $B = \{x : \|x - \hat{x}\|_\infty \le \varepsilon/L\}$.
Chapter 10
Introduction to Linear Programming
Formulate an LP that would allow Yiming to design the least expensive diet
that satisfies the above requirements.
Yiming has to decide how much of each of the food types should be included
in his diet. Therefore, it is natural to define a decision variable for the amount
of each food type consumed daily:
His objective is to minimize the cost, which can be easily written as a linear
function of decision variables as follows:
calcium:
270x1 + 20x2 + 25x3 + 23x4 + 35x5 ≥ 800;
vitamin C:
x1 + 106x3 ≥ 75;
and calories:
Project
1 2 3 4 5 6
Investment ($) 10,000 25,000 35,000 45,000 50,000 60,000
Payout ($) 12,000 30,000 41,000 55,000 65,000 77,000
Partial investment (i.e., financing only a fraction of the project instead of the
whole project) is allowed for each project, with the payout proportional to the
investment amount. For example, if Jeff decides to invest $5,000 in project 2,
the corresponding payout will be $30,000×($5,000/$25,000)=$6,000. Jeff has
$100,000 available for investment. Formulate an LP to maximize the end-of-
year payout resulting from the investment.
Let
xi = fraction of project i financed, i = 1, . . . , 6.
Then we have the following LP formulation:
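Although the full formulation is simply to maximize the total payout subject to the $100,000 budget and the bounds $0 \le x_i \le 1$, it is worth noting that an LP with a single resource constraint plus box bounds is a fractional knapsack, so the greedy rule (fund projects in decreasing payout-per-dollar order) yields an optimal solution. A sketch (the function name is ours):

```python
def best_payout(investments, payouts, budget):
    """Greedy solution of the single-budget-constraint investment LP."""
    # sort projects by payout per dollar invested, best first
    order = sorted(range(len(investments)),
                   key=lambda i: payouts[i] / investments[i], reverse=True)
    total = 0.0
    for i in order:
        spend = min(investments[i], budget)   # fund fully, or partially if short
        total += payouts[i] * spend / investments[i]
        budget -= spend
        if budget <= 0:
            break
    return total
```

With the table's data and a $100,000 budget, this funds project 5 fully and 5/6 of project 6.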
In addition, it is required that at least half of the day-shift nurses have week-
ends (Saturday and Sunday) off. Formulate this problem as an LP.
Note that a nurse’s schedule can be defined by the first day of the three-day
working cycle. Thus, we can define the decision variables as follows:
z = x 1 + x2 + x3 + x4 + x5 + x6 + x7 .
To ensure the required number of nurses for Monday, the total number of
nurses that have Monday on their working schedule should be at least 16:
x1 + x6 + x7 ≥ 16.
The demand constraints for the remaining 6 days of the week are formulated
in the same fashion:
x1 + x2 + x7 ≥ 12 (Tuesday)
x1 + x2 + x3 ≥ 18 (Wednesday)
x2 + x3 + x4 ≥ 13 (Thursday)
x3 + x4 + x5 ≥ 15 (Friday)
x4 + x5 + x6 ≥9 (Saturday)
x5 + x6 + x7 ≥7 (Sunday).
Note that only the first three schedules do not involve working on week-
ends. Therefore, the requirement that at least half of the nurses have weekends
off can be expressed as
$$\frac{x_1 + x_2 + x_3}{x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7} \ge \frac{1}{2},$$
which is equivalent to the linear constraint
$$x_1 + x_2 + x_3 - x_4 - x_5 - x_6 - x_7 \ge 0.$$
minimize x1 + x2 + x3 + x4 + x5 + x6 + x7
subject to x1 + x6 + x7 ≥ 16
x1 + x2 + x7 ≥ 12
x1 + x2 + x3 ≥ 18
x2 + x3 + x4 ≥ 13
x3 + x4 + x5 ≥ 15
x4 + x5 + x6 ≥ 9
x5 + x6 + x7 ≥ 7
x1 + x2 + x3 − x4 − x5 − x6 − x7 ≥ 0
x 1 , x2 , x3 , x4 , x5 , x6 , x7 ≥ 0.
This problem has multiple optimal solutions with z ∗ = 31. One of them is
given by
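The specific vector is not reproduced here, but any candidate solution can be checked mechanically against the constraints. The sketch below verifies one feasible solution attaining z = 31 that we found by hand (it need not coincide with the one listed in the book):

```python
def check_nurse_schedule(x):
    """Verify feasibility of x = (x1, ..., x7) for the nurse-scheduling LP."""
    demand = [16, 12, 18, 13, 15, 9, 7]          # Mon..Sun requirements
    covers = [(0, 5, 6), (0, 1, 6), (0, 1, 2), (1, 2, 3),
              (2, 3, 4), (3, 4, 5), (4, 5, 6)]   # schedules covering each day
    ok = all(sum(x[i] for i in c) >= d for c, d in zip(covers, demand))
    weekend = sum(x[:3]) >= sum(x[3:])           # weekends-off requirement
    return ok and weekend and all(v >= 0 for v in x)

x = (10, 0, 8, 5, 2, 3, 3)    # one optimal solution, found by hand
assert check_nurse_schedule(x) and sum(x) == 31
```

Since this vector is feasible and attains the stated optimal value z∗ = 31, it is optimal.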
The same shades can be obtained by mixing secondary colors (orange, green,
and purple), each of which is based on mixing two out of three primary col-
ors in equal proportions (red/yellow for orange, yellow/blue for green, and
red/blue for purple). Joe currently has 20 gallons each of red, yellow, and blue
paint, and 10 gallons each of orange, green, and purple paint. If needed, he
can purchase any of the primary color paints for $20 per gallon; however, he would like to save by utilizing the existing paint supplies as much as possible.
Formulate an LP helping Joe to minimize his costs.
We will use index i ∈ {1, . . . , 6} for red, yellow, blue, orange, green, and
purple colors, respectively, and index j ∈ {1, 2} for brown and gray colors,
respectively. Then our decision variables can be defined as
Next we specify the constraints. The total amount of brown and gray paint
made must be at least 50 gallons each:
The amount of each paint used for mixing must not exceed its availability:
x11 + x12 ≤ 20 + x1
x21 + x22 ≤ 20 + x2
x31 + x32 ≤ 20 + x3
x41 + x42 ≤ 10
x51 + x52 ≤ 10
x61 + x62 ≤ 10.
To express the constraints ensuring that the mixing yields the right shade of
brown, note that only three out of six colors used for mixing contain red, and
the total amount of red paint (including that coming from orange and purple
paints) used in the brown mix is
Hence, a constraint for the proportion of red color in the brown mix can be
written as follows:
$$\frac{x_{11} + 0.5x_{41} + 0.5x_{61}}{x_{11} + x_{21} + x_{31} + x_{41} + x_{51} + x_{61}} = 0.4.$$
This equation can be easily expressed as a linear equality constraint:
Similarly, the proportion of yellow and blue colors in the brown mix is given
by:
$$\frac{x_{21} + 0.5x_{41} + 0.5x_{51}}{x_{11} + x_{21} + x_{31} + x_{41} + x_{51} + x_{61}} = 0.3$$
and
$$\frac{x_{31} + 0.5x_{51} + 0.5x_{61}}{x_{11} + x_{21} + x_{31} + x_{41} + x_{51} + x_{61}} = 0.3,$$
which can be equivalently written as
$$-0.3x_{11} + 0.7x_{21} - 0.3x_{31} + 0.2x_{41} + 0.2x_{51} - 0.3x_{61} = 0$$
and
$$-0.3x_{11} - 0.3x_{21} + 0.7x_{31} - 0.3x_{41} + 0.2x_{51} + 0.2x_{61} = 0,$$
Finally, note that the fact that 20(x1 + x2 + x3 ) is minimized will force each
of the variables x1 , x2 , and x3 to be 0 unless additional red, yellow, or blue
paint is required. The resulting formulation is given by
All variables must be nonnegative. The objective is to minimize the total cost
of transportation:
$$z = \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij} x_{ij}.$$
We need to make sure that the number of units shipped out of Wi does not
exceed si :
$$\sum_{j=1}^{n} x_{ij} \le s_i, \quad i = 1, \ldots, m.$$
Similarly, the number of units shipped to each destination must meet its demand:
$$\sum_{i=1}^{m} x_{ij} \ge d_j, \quad j = 1, \ldots, n.$$
We will use index i for the ith resource, i = 1, . . . , m, and index j for the jth product, j = 1, . . . , n. We define the decision variables as
and the objective, the total profit to be maximized, is
$$c_1 x_1 + \ldots + c_n x_n.$$
The resource constraints, which make sure that the corporation does not ex-
ceed the availability of each resource, are given by
ai1 x1 + . . . + ain xn ≤ bi , i = 1, 2, . . . , m.
$$\begin{array}{ll}
\text{maximize} & c_1 x_1 + \ldots + c_n x_n \\
\text{subject to} & a_{11} x_1 + \ldots + a_{1n} x_n \le b_1 \\
& \qquad\vdots \\
& a_{m1} x_1 + \ldots + a_{mn} x_n \le b_m \\
& x_1, \ldots, x_n \ge 0,
\end{array}$$
or, equivalently,
$$\begin{array}{ll}
\text{maximize} & \displaystyle\sum_{j=1}^{n} c_j x_j \\[4pt]
\text{subject to} & \displaystyle\sum_{j=1}^{n} a_{ij} x_j \le b_i, \quad i = 1, 2, \ldots, m \\[4pt]
& x_j \ge 0, \quad j = 1, 2, \ldots, n.
\end{array}$$
Denoting by
$$A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix}, \quad b = \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix}, \quad c = \begin{bmatrix} c_1 \\ \vdots \\ c_n \end{bmatrix},$$
we can write this LP in matrix form as
maximize cT x
subject to Ax ≤ b
x ≥ 0.
rather strong assumption, especially given that most of the real-life processes
we attempt to model are nonlinear in nature and are typically influenced by
some uncertainties. On the other hand, any mathematical model is only an
approximation of reality, and in many situations a linear approximation is
sufficiently reasonable to serve the purpose. Recall that a linear function f (x)
in IRn is given by
f (x) = c1 x1 + . . . + cn xn ,
where c1 , . . . , cn are constant real coefficients. This implies the properties of
additivity,
f (x + y) = f (x) + f (y) for any x, y ∈ IRn ,
and proportionality,
f (αx) = αf (x) for any x ∈ IRn and α ∈ IR.
[FIGURE 10.1: Drawing the feasible region in the Heavenly Pouch, Inc. example. Panels: (a) solid fabric constraint (x1 + x2 ≤ 450); (b) printed fabric constraint (x2 ≤ 300); (c) budget constraint (4x1 + 5x2 ≤ 2,000); (d) demand constraint (x1 ≤ 350); (e) nonnegativity constraints; (f) the feasible region.]

[FIGURE 10.2: The feasible region of the Heavenly Pouch LP, with the line corresponding to each constraint (including the budget constraint) marked.]

First, consider the line representing the points where the solid color fabric constraint is satisfied with equality, x1 + x2 = 450. This line passes through
points (0,450) and (450,0) and splits the plane into two halves, where only
the points in the lower half-plane satisfy the solid color fabric constraint (Fig-
ure 10.1(a)). Similarly, we can plot the half-planes representing the sets of
points satisfying the printed fabric constraint, the budget constraint, and the
demand constraint, respectively (see Figure 10.1). Intersection of all these
half-planes with the nonnegative quadrant of the plane will give us the fea-
sible region of the problem (Figure 10.1(f)), which represents the set of all
points that satisfy all the constraints. Figure 10.2 shows the feasible region of
the Heavenly Pouch LP, with lines corresponding to each constraint marked
accordingly.
To solve the LP graphically, we will use level sets of the objective function,
which in case of a maximization LP are sometimes referred to as iso-profit
lines. Given a target objective function value (profit) z̄, the iso-profit line is
the set of points on the plane where z = z̄, i.e., it is just the level set of
the objective function z at the level z̄. The iso-profit lines corresponding to
different profit values z̄ may or may not overlap with the feasible region. We
typically start by plotting the iso-profit line for a reasonably low value of z̄ to
ensure that it contains feasible points and hence can be conveniently shown
on the same plot as the feasible region. For the Heavenly Pouch LP, it appears
reasonable to first plot the iso-profit line for z = 15x1 + 25x2 = 3, 000, which
passes through the points (200, 0) and (0, 120) (see Figure 10.3). We see that
this profit level is feasible, so we can try a higher value, say z = 6, 000. We
see from the illustration that as we increased the target profit value, the new
iso-profit line is parallel to the previous one (since the slope remained the
same), and can be thought of as the result of movement of the previous iso-
profit line up or to the right. If we keep increasing the target value of z, the
corresponding iso-profit line will keep moving toward the upper right corner
of the figure. It is clear that if we select the profit value that is too optimistic
(say z = 10, 000), the iso-profit line will have no common points with the
feasible region.
However, we do not need to keep guessing which values of z would work.
Instead, we observe that the optimal solution in our example corresponds to
the last point that the iso-profit line will have in common with the feasible
region as we move the line toward the upper right corner. From the figure,
Example 10.8 Consider again the Heavenly Pouch example discussed in Sec-
tion 10.1 and just solved graphically. Suppose that the price of a non-reversible
carrier is raised by $5, so the new objective function is z = 20x1 + 25x2 . Solve
the resulting modified Heavenly Pouch LP graphically.
In this case, the iso-profit line has the same slope as the line defining
the budget constraint, so the two lines are parallel (see Figure 10.4). Thus,
all points on the thick line segment between points $x^* = [125, 300]^T$ and $x' = [250, 200]^T$ in the figure are optimal. The optimal objective function value is $z^* = 10{,}000$. Thus, there are infinitely many solutions, all of which are convex combinations of two extreme points (also known as vertices or
corners) of the feasible region. As we will see later, any LP that has an optimal
solution must have at least one corner optimum.
minimize x1 + x2
subject to 100x1 + 50x2 ≥ 500 (store visitors)
500x1 + 1, 000x2 ≥ 5, 000 (website visitors)
x 1 , x2 ≥ 0. (nonnegativity)
We start solving the problem graphically by plotting the lines describing the
constraints and drawing the feasible region (Figure 10.5). Then we plot two
level sets for the objective function, which in case of minimization problems
are called iso-cost lines, for z = 15 and z = 10. We observe that as the value
of z decreases, the iso-cost line moves down, toward the origin. If we keep
decreasing z, the iso-cost line will keep moving down, and at some point, will
contain no feasible points. It is clear from the figure that the last feasible point
the iso-cost line will pass through as we keep decreasing the value of z will be
the point of intersection of the lines defining the store visitors constraint and
the website visitors constraint.

[FIGURE 10.5: The feasible region of the advertising LP (unbounded from above), the iso-cost lines z = 15 and z = 10, and the optimal iso-cost line z∗ = 20/3 passing through the intersection of the store visitors and website visitors constraint lines at (10/3, 10/3).]

Solving the two corresponding equations simultaneously,
we find the optimal solution $x_1^* = x_2^* = 10/3 \approx 3.333$, $z^* = 20/3 \approx 6.667$. Thus, the store should spend about $6,667 on advertising and split this budget evenly between the magazine and online advertising to reach its goals.
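The corner found graphically can be confirmed by solving the 2×2 system formed by the two binding constraint lines; a sketch using Cramer's rule:

```python
def line_intersection(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1, a2*x + b2*y = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("lines are parallel")
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det

# store visitors: 100 x1 + 50 x2 = 500; website visitors: 500 x1 + 1000 x2 = 5000
x1, x2 = line_intersection(100, 50, 500, 500, 1000, 5000)
```

Both coordinates come out to 10/3, in agreement with the graphical solution.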
Note that even though the LP we just solved has an unbounded feasible
region, it still has an optimal solution. However, if the objective function was
improving along one of the directions in which the feasible region is unbounded
(called direction of unboundedness), an optimal solution would not exist.
Example 10.10 Consider a problem that has the same feasible region as the LP in the previous example, but change the objective to
maximization:
maximize x1 + x2
subject to 100x1 + 50x2 ≥ 500
500x1 + 1, 000x2 ≥ 5, 000
x 1 , x2 ≥ 0.
Clearly the objective function value tends to infinity if one of the variables is
increased toward infinity; thus this LP has no optimal solution.
Example 10.11 Assume that the retail store in Example 10.9 has an ad-
vertising budget limited to $5,000. This condition is reflected in the budget
constraint in the corresponding LP model:
minimize x1 + x2
subject to 100x1 + 50x2 ≥ 500 (store visitors)
500x1 + 1, 000x2 ≥ 5, 000 (website visitors)
x1 + x2 ≤ 5 (budget)
x 1 , x2 ≥ 0. (nonnegativity)
From the illustration in Figure 10.5 it is clear that the set of points such that
x1 + x2 ≤ 5 does not overlap with the set of points satisfying each of the
remaining constraints of the LP. Thus, no feasible point exists for this LP.
Indeed, we previously determined that an advertising campaign that will yield
the target results will cost at least $6,666.
For example, the Heavenly Pouch LP, as well as LPs in Examples 10.8 and 10.9
are optimal (which also implies that they are feasible LPs). In particular, the
LP in Example 10.9 is optimal despite its feasible region being unbounded. The
LP considered in Example 10.10 is unbounded, and the LP in Example 10.11
is infeasible.
Later (Theorem 11.5 at page 266) we will establish the following fact.
If an optimal LP has more than one optimal solution, then it is easy to show
that it has infinitely many optimal solutions. Indeed, consider a maximization
LP
max cT x,
x∈X
where X is a polyhedral set, and assume that the LP has two alternative optimal solutions $x^*$ and $x'$ with the optimal objective value $z^* = c^T x^* = c^T x'$. Then, for an arbitrary $\alpha \in (0, 1)$, consider a convex combination of $x^*$ and $x'$, $y = \alpha x^* + (1 - \alpha) x' \in X$. We have
$$c^T y = \alpha c^T x^* + (1 - \alpha) c^T x' = \alpha z^* + (1 - \alpha) z^* = z^*,$$
so y is also an optimal solution of the LP. Thus, we established that the following property holds.
Exercises
10.1. Romeo Winery produces two types of wines, Bordeaux and Romerlot,
by blending Merlot and Cabernet Sauvignon grapes. Making one bar-
rel of Bordeaux blend requires 250 pounds of Merlot and 250 pounds
of Cabernet Sauvignon, whereas making one barrel of Romerlot re-
quires 450 pounds of Merlot and 50 pounds of Cabernet Sauvignon. The
profit received from selling Bordeaux is $800 per barrel, and from selling
Romerlot, $600 per barrel. Romeo Winery has 9,000 pounds of Merlot
and 5,000 pounds of Cabernet Sauvignon available. Formulate an LP
model aiming to maximize the winery’s profit. Solve the LP graphically.
10.2. O&M Painters produce orange and maroon paints by mixing the so-
called primary red, yellow, and blue paint colors. The proportions of red,
yellow, and blue paints used to get the required shades are 50%, 40%,
and 10%, respectively for orange, and 60%, 10%, and 30%, respectively
for maroon. What is the maximum combined amount of orange and
maroon paints that O&M Painters can produce, given that they have
6 gallons of red paint, 4 gallons of yellow paint, and 1.8 gallons of blue
paint available for mixing? Formulate this problem as an LP and solve
it graphically.
10.3. The Concrete Guys make two types of concrete by mixing cement, sand,
and stone. The regular mix contains 30% of cement, 15% of sand, and
55% of stone (by weight), and sells for 5 cents/lb. The extra-strong mix
must contain at least 50% of cement, at least 5% of sand, and at least
20% of stone, and sells for 8 cents/lb. The Concrete Guys have 100,000 lb
of cement, 50,000 lb of sand, and 100,000 lb of stone in their warehouse.
Formulate an LP to determine the amount of each mix the Concrete
Guys should make in order to maximize their profit.
10.7. A company is looking to hire retail assistants for its new store. The
number of assistants required on different days of the week is as follows:
Monday – 4, Tuesday – 5, Wednesday – 5, Thursday – 6, Friday – 7,
Saturday – 8, and Sunday – 8. Each assistant is expected to work four
consecutive days and then have three days off. Formulate an LP aiming
to meet the requirements while minimizing the number of hires.
10.8. A financial company considers five different investment options for the
next two years, as described in the following table:
Here, year 0 represents the present time, year 1 represents one year from
now, and year 2 represents two years from now. For example, investment
1 requires an $11 million cash investment now and yields $2 million and
$12 million cash in one and two years from now, respectively; investment
2 requires two $8 million cash deposits (now and one year from now),
with a $20-million payday 2 years from now, etc. Any fraction of each
investment alternative can be purchased. For example, the company
could purchase 0.25 of investment 2, which would require two $2-million
cash investments (now and in one year), yielding $5 million in two years.
The company expects to have $20 million to invest now, plus $10 million
to invest at year 1 (in addition to the cash received from the original
investment). Formulate an LP to determine an investment strategy that
would maximize cash on hand after 2 years.
10.9. A farmer maintains a pasture for his 30 cows via watering and fertiliz-
ing. The grass grows uniformly and at a rate that is constant for the
given level of watering and fertilizing. Also, a cow grazing reduces the
amount of grass uniformly and at a constant rate. From the past expe-
rience, the farmer knows that under regular conditions (no watering or
fertilizing) the grass will run out in 20 days with 40 cows grazing, and
in 30 days with 30 cows using the pasture. Spending $1 per week on
watering increases the rate of grass growth by 1%, and spending $1 per
week on fertilizer increases the grass growth rate by 2%. For the fertil-
izer to be effective, the grass must be watered properly, meaning that for
each dollar spent on fertilizing, the farmer must spend at least 50 cents
on watering. Other than that, it can be assumed that the grass growth
rate increase due to watering is independent of that due to fertilizing.
Formulate an LP to minimize the cost of maintaining the pasture while
making sure that the grass never runs out. Solve the LP graphically.
10.10. Assume that in the diet problem (Section 10.2.1, p. 213) Yiming wants
calories from fat not to exceed 30% of the total calories he consumes.
Calories from fat for each considered food type are as follows: almond
butter – 480, brown rice – 15, salmon – 110, orange juice – 6, and
wheat bread – 8. Modify the corresponding LP formulation to include a
constraint limiting the fraction of fat calories.
In this chapter, we discuss one of the first and most popular methods for
solving LPs, the simplex method, originally proposed by George Dantzig in
the 1940s for solving problems arising in military operations. In order to apply
this method, an LP is first converted to its standard form, as discussed in the
following section.
x1 + 2x2 + 3x3 + si = 4,
where
si = 4 − x1 − 2x2 − 3x3 ≥ 0.
Similarly, if the ith constraint is of “≥” type, we introduce a nonnegative
excess variable ei and subtract it from the left-hand side of the constraint to
obtain the corresponding equality constraint. For example, for
x1 + 2x2 + 3x3 ≥ 4,
we have
x1 + 2x2 + 3x3 − ei = 4,
where
ei = −4 + x1 + 2x2 + 3x3 ≥ 0.
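These conversions are mechanical and easy to automate. The following Python sketch (an illustration of the procedure just described, not code from the book; all names are our own) appends one slack or excess variable per constraint:

```python
def to_standard_form(constraints):
    """Convert "<=" and ">=" constraints to equalities.

    Each constraint is a triple (coeffs, sense, rhs) with sense in
    {"<=", ">="}.  For constraint i we append one new variable: a slack
    (coefficient +1) for "<=", or an excess (coefficient -1) for ">=".
    Returns a list of (row, rhs) pairs over the extended variable set."""
    m = len(constraints)
    rows = []
    for i, (coeffs, sense, rhs) in enumerate(constraints):
        extra = [0.0] * m
        extra[i] = 1.0 if sense == "<=" else -1.0   # slack s_i or excess e_i
        rows.append((list(coeffs) + extra, rhs))
    return rows

# x1 + 2x2 + 3x3 <= 4 becomes x1 + 2x2 + 3x3 + s1 = 4, and
# x1 + 2x2 + 3x3 >= 4 becomes x1 + 2x2 + 3x3 - e2 = 4.
rows = to_standard_form([([1, 2, 3], "<=", 4), ([1, 2, 3], ">=", 4)])
print(rows[0])  # ([1, 2, 3, 1.0, 0.0], 4)
print(rows[1])  # ([1, 2, 3, 0.0, -1.0], 4)
```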
We require that all variables are nonnegative in the standard form repre-
sentation, however, in general, some of the LP variables may be unrestricted in
sign or free. We write xj ∈ IR to denote a free variable xj . If a free variable xj
is present in our LP, we can represent it as the difference of two nonnegative
variables as follows:
xj = xj′ − xj″ , where xj′ ≥ 0 and xj″ ≥ 0.
In general, an LP with both inequality and equality constraints is given by
maximize   cT x
subject to A′ x ≤ b′          (11.1)
           A″ x = b″
           x ≥ 0,
The Simplex Method for Linear Programming 237
maximize cT x
subject to Ax = b (11.2)
x ≥ 0,
where Im is the m × m identity matrix and O is the m × m zero matrix;
c is an (n + m)-vector of objective function coefficients (with cn+1 = . . . =
cn+m = 0); and b is an m-vector of right-hand sides.
The reason for this is that finding an initial feasible solution x(0) for an LP
in the form (11.3) is very easy: we can just use x(0) = 0, which obviously
satisfies all the constraints. This is important, since the simplex method needs
a starting feasible solution x(0) , which it will use to generate a finite sequence
of feasible points x(0) , x(1) , x(2) , . . . , x(N ) , such that each next point in the
sequence has an objective value at least as good as that of the previous point,
i.e., z(x(k+1) ) ≥ z(x(k) ), k = 0, . . . , N −1, and x∗ = x(N ) is an optimal solution
of the LP. If we do not require b ≥ 0 in the LP above, finding a starting point
x(0) is more challenging, and we will address this case after discussing how to
solve the LPs with b ≥ 0.
We will introduce the basic idea of the method using the Heavenly Pouch
LP formulated in Section 10.1 and solved graphically in Section 10.4. Consider
We can write it in the equivalent dictionary format, where the slack variables
s1 , s2 , s3 , s4 are expressed through the remaining variables as follows:
z = 15x1 + 25x2
s1 = 450 − x1 − x2
s2 = 300 − x2 (11.5)
s3 = 2, 000 − 4x1 − 5x2
s4 = 350 − x1
In this representation, we will call the variables kept in the left-hand side
the basic variables, and the remaining variables nonbasic variables. Obviously,
the number of basic variables is the same as the number of constraints in
our LP, and the number of nonbasic variables equals the number of variables
in the original LP, before the slack variables were introduced. The sets of
basic and nonbasic variables will be updated step by step, using an operation
called pivot, as we proceed with the iterations of the simplex method. We will
denote the sets of all the basic and nonbasic variables at step k of the simplex
method by BVk and N Vk , respectively. We assume that the initial dictionary
corresponds to step k = 0. Thus, in our example BV0 = {s1 , s2 , s3 , s4 } and
NV0 = {x1 , x2 }.
Note that to get a feasible solution to the considered LP, we can set all the
nonbasic variables to 0, and this will uniquely determine the corresponding
values of the basic variables and the objective function. We have x1 = x2 = 0,
s1 = 450, s2 = 300, s3 = 2,000, s4 = 350, and z = 0.
We call this solution the basic solution corresponding to the basis BV0 . If
all variables have nonnegative values in a basic solution, then the solution is
called a basic feasible solution (bfs) and the corresponding dictionary is called
feasible. Note that the basic solution with the basis BV0 in our example is, in
fact, a basic feasible solution.
Our LP can also be conveniently represented in the tableau format:
z x1 x2 s1 s2 s3 s4 rhs Basis
1 −15 −25 0 0 0 0 0 z
0 1 1 1 0 0 0 450 s1
(11.6)
0 0 1 0 1 0 0 300 s2
0 4 5 0 0 1 0 2, 000 s3
0 1 0 0 0 0 1 350 s4
Here rhs stands for right-hand side. The entries in the tableau are just the
coefficients of LP in the standard form (11.4), where the z-row is modified by
moving all variables to the left-hand side, so instead of z = 15x1 + 25x2 we
write
z − 15x1 − 25x2 = 0.
In this format, z is treated as a variable that is always basic. Since the dictio-
nary format is helpful for visual explanation of the ideas behind the method
and the tableau format is more handy for performing the computations, we
will use both representations as we describe the steps of the simplex method
below. In both dictionary and tableau formats, we number the rows starting
with 0, so the top row is referred to as row 0 or the z-row, and row i corre-
sponds to the ith constraint. The basic feasible solution at step 0 is given in
the following table.
Step 0 basic feasible solution
BV0 : s 1 , s2 , s3 , s 4
N V0 : x 1 , x2
bf s : x1 = x2 = 0
s1 = 450, s2 = 300, s3 = 2, 000, s4 = 350
z=0
We are ready to perform the first iteration of the simplex method.
11.2.1 Step 1
Let us analyze our problem written in the dictionary form as in (11.5)
above, taking into account that the current basic feasible solution has all the
nonbasic variables at 0, x1 = x2 = 0. We have:
z = 15x1 + 25x2
s1 = 450 − x1 − x2
s2 = 300 − x2
s3 = 2, 000 − 4x1 − 5x2
s4 = 350 − x1
Since x2 has the larger positive coefficient in the objective
z = 15x1 + 25x2 ,
we pick variable x2 as the one whose value will be increased. We call this
variable the pivot variable and the corresponding column in the dictionary
is called the pivot column. We want to increase the value of x2 as much as
possible while keeping the other nonbasic variable equal to 0. The amount by
which we can increase x2 is restricted by the nonnegativity constraints for the
basic variables, which must be satisfied to ensure feasibility:
s1 = 450 − x2 ≥ 0
s2 = 300 − x2 ≥ 0
s3 = 2, 000 − 5x2 ≥ 0
s4 = 350 − 0x2 ≥ 0
The most restrictive of these is s2 = 300 − x2 ≥ 0, so x2 can be increased to
at most 300. Hence s2 becomes 0 and leaves the basis, and solving the second
row for x2 gives
x2 = 300 − s2 .
Then we substitute this expression for x2 in the remaining rows of the dictionary, which yields

z = 7,500 + 15x1 − 25s2
s1 = 150 − x1 + s2
x2 = 300 − s2          (11.7)
s3 = 500 − 4x1 + 5s2
s4 = 350 − x1
To complete the same step using the tableau format, we consider the
tableau (11.6). We find the most negative coefficient in the z-row; it corre-
sponds to x2 , thus the corresponding column is the pivot column. We perform
the ratio test by dividing the entries in the rhs column by the correspond-
ing entries in the pivot column that are positive. The minimum such ratio,
300, corresponds to the second row, which is the pivot row. The coefficient at
the intersection of the pivot row and the pivot column in the table is the pivot
element.
↓
z x1 x2 s1 s2 s3 s4 rhs Basis Ratio
1 −15 −25 0 0 0 0 0 z
0 1 1 1 0 0 0 450 s1 450
0 0 1 0 1 0 0 300 s2 300 ←
0 4 5 0 0 1 0 2, 000 s3 400
0 1 0 0 0 0 1 350 s4 −
To perform the pivot, we use elementary row operations involving the pivot
row with the goal of turning all pivot column entries in the non-pivot rows
into 0s and the pivot element into 1. In particular, we multiply the pivot row
by 25, −1, and −5, add the result to rows 0, 1, and 3, respectively, and update
the corresponding rows. Since the pivot element is already 1, the pivot row is
kept unchanged, but the corresponding basic variable is now x2 instead of s2 .
As a result, we obtain the following step 1 tableau:
z x1 x2 s1 s2 s3 s4 rhs Basis
1 −15 0 0 25 0 0 7, 500 z
0 1 0 1 −1 0 0 150 s1
(11.8)
0 0 1 0 1 0 0 300 x2
0 4 0 0 −5 1 0 500 s3
0 1 0 0 0 0 1 350 s4
Compare this tableau to the corresponding dictionary (11.7). Clearly, the dic-
tionary and the tableau describe the same system. The basic feasible solution
at step 1 is x1 = s2 = 0, x2 = 300, s1 = 150, s3 = 500, s4 = 350, with z = 7,500.
We saw that, as a result of the pivot operation, one of the previously nonbasic
variables, x2 , has become basic, whereas s2 , which was basic, has become
nonbasic. We will call the variable that is entering the basis during the current
iteration the entering variable and the variable that is leaving the basis the
leaving variable.
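The pivot operation described above is easy to carry out programmatically. Below is a minimal, hand-rolled sketch (our own illustration, not code from the book) applied to the step 0 tableau (11.6):

```python
def pivot(T, prow, pcol):
    """Pivot tableau T (a list of row lists, row 0 being the z-row) on
    entry (prow, pcol): scale the pivot row so the pivot element becomes
    1, then use elementary row operations to zero out the pivot column
    in every other row."""
    p = T[prow][pcol]
    T[prow] = [v / p for v in T[prow]]
    for r in range(len(T)):
        if r != prow and T[r][pcol] != 0:
            f = T[r][pcol]
            T[r] = [v - f * w for v, w in zip(T[r], T[prow])]

# Step 0 tableau (11.6); columns are z, x1, x2, s1, s2, s3, s4, rhs.
T = [[1, -15, -25, 0, 0, 0, 0, 0],
     [0,   1,   1, 1, 0, 0, 0, 450],
     [0,   0,   1, 0, 1, 0, 0, 300],
     [0,   4,   5, 0, 0, 1, 0, 2000],
     [0,   1,   0, 0, 0, 0, 1, 350]]
pivot(T, prow=2, pcol=2)   # x2 enters, s2 (row 2) leaves
print(T[0])  # [1.0, -15.0, 0.0, 0.0, 25.0, 0.0, 0.0, 7500.0] -- the z-row of (11.8)
```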
11.2.2 Step 2
The next step is performed analogously to the first step. Again, we analyze
the current dictionary (11.7) and try to increase the objective function value
by updating the current basic feasible solution.
Notice that, as before, the basic feasible solution can be obtained by setting
all nonbasic variables equal to 0, so the same considerations as in step 1 apply
when we decide which nonbasic variable should be increased in value and
hence enter the basis. Since only one variable, x1 , has a positive coefficient in
the objective, it is the only entering variable candidate. Row 3 wins the ratio
test, so s3 is the leaving variable and we have
x1 = 125 + (5/4)s2 − (1/4)s3 ,
z = 7,500 + 15x1 − 25s2 = 9,375 − (25/4)s2 − (15/4)s3 ,
s1 = 150 − x1 + s2 = 25 − (1/4)s2 + (1/4)s3 ,
s4 = 350 − x1 = 225 − (5/4)s2 + (1/4)s3 .
z = 9,375 − (25/4)s2 − (15/4)s3
s1 = 25 − (1/4)s2 + (1/4)s3
x2 = 300 − s2          (11.9)
x1 = 125 + (5/4)s2 − (1/4)s3
s4 = 225 − (5/4)s2 + (1/4)s3
Next, we carry out the computations for step 2 in the tableau format. In the
step 1 tableau (11.8), we find the most negative coefficient in the z-row, which
leads to selecting x1 as the entering variable. Row 3 wins the ratio test, so s3
is the leaving variable.
↓
z x1 x2 s1 s2 s3 s4 rhs Basis Ratio
1 −15 0 0 25 0 0 7, 500 z
0 1 0 1 −1 0 0 150 s1 150
0 0 1 0 1 0 0 300 x2 −
0 4 0 0 −5 1 0 500 s3 125 ←
0 1 0 0 0 0 1 350 s4 350
We first divide row 3 by the pivot element value, which is 4, and then use it to
eliminate the remaining nonzero coefficients in the pivot column. We obtain
the following step 2 tableau:
z x1 x2 s1 s2 s3 s4 rhs Basis
1 0 0 0 25/4 15/4 0 9, 375 z
0 0 0 1 1/4 −1/4 0 25 s1
(11.10)
0 0 1 0 1 0 0 300 x2
0 1 0 0 −5/4 1/4 0 125 x1
0 0 0 0 5/4 −1/4 1 225 s4
In the dictionary format, the corresponding representation is
s1 = 25 − (1/4)s2 + (1/4)s3
x2 = 300 − s2
x1 = 125 + (5/4)s2 − (1/4)s3
s4 = 225 − (5/4)s2 + (1/4)s3 .
The objective function is given by
z = 9,375 − (25/4)s2 − (15/4)s3 ,
where both s2 and s3 are nonnegative. Since the coefficients for s2 and s3
are negative, it is clear that the highest possible value z ∗ of z is obtained
by putting s2 = s3 = 0. Thus, the current basic feasible solution is optimal.
When reporting the optimal solution, we can ignore the slack variables as they
were not a part of the original LP we were solving. Thus, the optimal solution
is given by
x∗1 = 125, x∗2 = 300, z ∗ = 9, 375.
Note that this is the same solution as the one we obtained graphically in
Section 10.4. Since a negative nonbasic variable coefficient in the dictionary
format is positive in the tableau format and vice versa, a tableau is deemed
optimal if all nonbasic variables have nonnegative coefficients in row 0.
The leaving variable is the basic variable representing a row that wins the ratio test. However, if all
coefficients in the pivot column of the dictionary are positive, the ratio test
produces no result. For example, consider the following dictionary:
z = 90 − 25x1 + 4x2
s1 = 25 − 14x1 + x2
s2 = 30 − x1
s3 = 12 + 5x1 + 14x2
s4 = 22 − 4x1 + 7x2 .
z x1 x2 s1 s2 s3 s4 rhs Basis
1 25 −4 0 0 0 0 90 z
0 14 −1 1 0 0 0 25 s1
0 1 0 0 1 0 0 30 s2
0 −5 −14 0 0 1 0 12 s3
0 4 −7 0 0 0 1 22 s4
Thus, if we use the tableau format, we conclude that the problem is unbounded
if at some point we obtain a tableau that has a column with a negative coefficient
in row 0 and no positive entries in the constraint rows.
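This unboundedness test is straightforward to express in code. The sketch below (our own illustration, not from the book) scans a tableau for a column certifying unboundedness, and is checked against the example tableau above:

```python
def unbounded_column(T):
    """Return the index of a column certifying unboundedness: a negative
    coefficient in row 0 (so increasing the variable would still improve
    z) but no positive entry in any constraint row (so the ratio test
    fails).  Returns None if no such column exists."""
    ncols = len(T[0])
    for j in range(1, ncols - 1):            # skip the z column and rhs
        if T[0][j] < 0 and all(row[j] <= 0 for row in T[1:]):
            return j
    return None

# Tableau from the example above; columns are z, x1, x2, s1, s2, s3, s4, rhs.
T = [[1,  25,  -4, 0, 0, 0, 0, 90],
     [0,  14,  -1, 1, 0, 0, 0, 25],
     [0,   1,   0, 0, 1, 0, 0, 30],
     [0,  -5, -14, 0, 0, 1, 0, 12],
     [0,   4,  -7, 0, 0, 0, 1, 22]]
print(unbounded_column(T))  # 2 -- the x2 column certifies unboundedness
```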
In the examples considered so far, the objective function value improved at
every iteration of the simplex method. This may not always be the case,
however, as the next example illustrates. Consider the following LP:

maximize   5x1 + 4x2 − 20x3 − 2x4
subject to (1/4)x1 − (1/8)x2 + 12x3 + 10x4 ≤ 0          (11.11)
           (1/10)x1 + (1/20)x2 + (1/20)x3 + (1/5)x4 ≤ 0
           x1 , x2 , x3 , x4 ≥ 0.
We apply the simplex method to this problem. We will use the tableau format.
Let x5 and x6 be the slack variables for the first and second constraints,
respectively. We always select the nonbasic variable with the most negative
coefficient in row 0 as the entering variable. In case there are multiple ratio
test winners, we select the basic variable with the lowest index as the leaving
variable. The step 0 tableau is given by
z  x1    x2    x3    x4   x5  x6  rhs  Basis
1  −5    −4    20    2    0   0   0    z
0  1/4   −1/8  12    10   1   0   0    x5          (11.12)
0  1/10  1/20  1/20  1/5  0   1   0    x6
Note that both basic variables are equal to 0 in the starting basic feasible
solution.
Definition 11.1 Basic solutions with one or more basic variables equal
to 0 are called degenerate.
We carry out the first step. The entering variable is x1 , since it is the variable
with the most negative coefficient in the z-row, and the leaving variable is
x5 , which is the basic variable in the row winning the ratio test. The step 1
tableau is given by:
z  x1  x2     x3     x4     x5    x6  rhs  Basis
1  0   −13/2  260    202    20    0   0    z
0  1   −1/2   48     40     4     0   0    x1          (11.13)
0  0   1/10   −19/4  −19/5  −2/5  1   0    x6
The basic feasible solution is the same as at step 0, even though the basis has
changed.
Step 2 tableau:
z  x1  x2  x3      x4   x5  x6  rhs  Basis
1  0   0   −195/4  −45  −6  65  0    z
0  1   0   97/4    21   2   5   0    x1          (11.14)
0  0   1   −95/2   −38  −4  10  0    x2
Step 3 tableau:
z  x1      x2  x3  x4       x5       x6       rhs  Basis
1  195/97  0   0   −270/97  −192/97  7280/97  0    z
0  190/97  1   0   304/97   −8/97    1920/97  0    x2          (11.15)
0  4/97    0   1   84/97    8/97     20/97    0    x3
Step 4 tableau:
z  x1    x2       x3  x4  x5      x6       rhs  Basis
1  15/4  135/152  0   0   −39/19  1760/19  0    z
0  −1/2  −21/76   1   0   2/19    −100/19  0    x3          (11.16)
0  5/8   97/304   0   1   −1/38   120/19   0    x4
Step 5 tableau:
z  x1     x2     x3    x4  x5  x6   rhs  Basis
1  −6     −9/2   39/2  0   0   −10  0    z
0  1/2    1/4    1/4   1   0   5    0    x4          (11.17)
0  −19/4  −21/8  19/2  0   1   −50  0    x5
Step 6 tableau:
z  x1    x2    x3    x4   x5  x6  rhs  Basis
1  −5    −4    20    2    0   0   0    z
0  1/4   −1/8  12    10   1   0   0    x5          (11.18)
0  1/10  1/20  1/20  1/5  0   1   0    x6
The last tableau is exactly the same as the step 0 tableau (11.12). Thus, if we
continue with the execution of the simplex method, we will keep repeating the
calculations performed in steps 1–6 and will never move away from the same
basic feasible solution. This phenomenon is known as cycling.
Several methods are available that guarantee that cycling is avoided. One
of them is Bland’s rule, which is discussed next. According to this rule, the
variables are ordered in a certain way, for example, in the increasing order of
their indices, i.e., x1 , x2 , . . . , xn+m . Then, whenever there are multiple candi-
dates for the entering or the leaving variable, the preference is given to the
variable that appears earlier in the ordering. All nonbasic variables with a
positive coefficient in the z-row of the dictionary (or a negative coefficient in
row 0 of the tableau) are candidates for the entering variable, and all the basic
variables representing the rows that win the ratio test are candidates for the
leaving variable.
Theorem 11.1 If Bland’s rule is used to select the entering and leaving
variables in the simplex method, then cycling never occurs.
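As an illustration, here is a compact tableau implementation of the simplex method using Bland's rule for both the entering and the leaving variable. This is a sketch of our own (not the book's code); it assumes the form max cᵀx subject to Ax ≤ b, x ≥ 0 with b ≥ 0:

```python
def simplex_bland(c, A, b):
    """Tableau simplex with Bland's rule for
    maximize c^T x  s.t.  A x <= b, x >= 0 (with b >= 0).
    Returns (z*, x*) or raises ValueError if the LP is unbounded."""
    m, n = len(A), len(c)
    # Row 0 is the z-row; the slack columns form an identity block.
    T = [[1.0] + [-float(cj) for cj in c] + [0.0] * m + [0.0]]
    for i in range(m):
        T.append([0.0] + [float(a) for a in A[i]]
                 + [1.0 if k == i else 0.0 for k in range(m)] + [float(b[i])])
    basis = [n + i for i in range(m)]            # slack variables are basic
    while True:
        # Bland's rule: entering = lowest-index variable with negative z-row entry.
        enter = next((j for j in range(n + m) if T[0][j + 1] < -1e-9), None)
        if enter is None:
            break                                # optimal
        # Ratio test; ties broken by the lowest basic-variable index (Bland).
        best, prow = None, None
        for i in range(1, m + 1):
            a = T[i][enter + 1]
            if a > 1e-9:
                r = T[i][-1] / a
                if best is None or r < best or (r == best and basis[i - 1] < basis[prow - 1]):
                    best, prow = r, i
        if prow is None:
            raise ValueError("LP is unbounded")
        p = T[prow][enter + 1]                   # pivot
        T[prow] = [v / p for v in T[prow]]
        for i in range(m + 1):
            if i != prow and T[i][enter + 1] != 0:
                f = T[i][enter + 1]
                T[i] = [v - f * w for v, w in zip(T[i], T[prow])]
        basis[prow - 1] = enter
    x = [0.0] * n
    for i, v in enumerate(basis):
        if v < n:
            x[v] = T[i + 1][-1]
    return T[0][-1], x

# Heavenly Pouch LP: maximize 15x1 + 25x2 subject to its four constraints.
z, x = simplex_bland([15, 25],
                     [[1, 1], [0, 1], [4, 5], [1, 0]],
                     [450, 300, 2000, 350])
print(z, x)  # 9375.0 [125.0, 300.0]
```

Running it on the Heavenly Pouch LP reproduces the optimal solution found earlier, x1* = 125, x2* = 300, z* = 9,375, although Bland's rule may visit the vertices in a different order than the largest-coefficient rule used before.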
To illustrate Bland's rule, we resume the previous example from the step 5
tableau (11.17):
z  x1     x2     x3    x4  x5  x6   rhs  Basis
1  −6     −9/2   39/2  0   0   −10  0    z
0  1/2    1/4    1/4   1   0   5    0    x4          (11.19)
0  −19/4  −21/8  19/2  0   1   −50  0    x5
The candidates for entering the basis are x1 , x2 , and x6 , thus, according to
Bland’s rule, x1 is chosen as the entering variable.
The ratio test is won by the row of x4 , so x4 leaves the basis, and we obtain:
z  x1  x2    x3    x4    x5  x6    rhs  Basis
1  0   −3/2  45/2  12    0   50    0    z
0  1   1/2   1/2   2     0   10    0    x1          (11.20)
0  0   −1/4  95/8  19/2  1   −5/2  0    x5
Now x2 is the only candidate for entering the basis, and x1 leaves:
z  x1   x2  x3    x4    x5  x6   rhs  Basis
1  3    0   24    18    0   80   0    z
0  2    1   1     4     0   20   0    x2          (11.21)
0  1/2  0   97/8  21/2  1   5/2  0    x5
In fact, the basic feasible solution has not changed compared to the one we
had at step 0; however, the last tableau proves its optimality.
maximize   ∑_{j=1}^{n} cj xj
subject to ∑_{j=1}^{n} aij xj ≤ bi , i = 1, . . . , m          (11.22)
           xj ≥ 0, j = 1, . . . , n.
We first show that any two dictionaries with the same basis must be identical.
Indeed, consider two dictionaries corresponding to the same basis. Let B be
the set of indices of the basic variables and let N be the set of indices of
nonbasic variables.
z = z̄ + ∑_{j∈N} c̄j xj                    z = z̃ + ∑_{j∈N} c̃j xj
xi = b̄i − ∑_{j∈N} āij xj , i ∈ B          xi = b̃i − ∑_{j∈N} ãij xj , i ∈ B.
Any two dictionaries of the same LP with the same basis are identical.
The initial dictionary (11.23) is, essentially, a linear system that represents the
original LP written in the standard form. The only transformations we apply
to this linear system at each subsequent iteration of the simplex method are
the elementary row operations used to express the new set of basic variables
through the remaining variables. Since applying an elementary row operation
to a linear system results in an equivalent linear system, we have the following
property.
The ratio test we use to determine the leaving variable at each step is designed
to ensure that the constant term in the right-hand side of each equation is
nonnegative, so that setting all the nonbasic variables to 0 yields nonnegative
values for all the basic variables and thus the corresponding basic solution is
feasible. Thus, if we start with a feasible dictionary, feasibility is preserved
throughout execution of the simplex method.
The entering variable on each step of the simplex method is chosen so that
each next basic feasible solution obtained during the simplex method execu-
tion is at least as good as the previous solution. However, we saw that the
simplex method may go through some consecutive degenerate iterations with
no change in the objective function value, in which case cycling can occur.
It turns out that this is the only case in which the method may fail to terminate.
The LP is solved graphically in Figure 11.1. Also, solving this LP with the
simplex method produced the following basic feasible solutions:
[Figure 11.1: Graphical solution of the Heavenly Pouch LP, showing the feasible region with extreme points A–F, the infeasible basic solutions G–M, the budget constraint, and the objective lines z = 3,000 and z = 6,000.]
Consecutive basic feasible solutions generated by the simplex method, which
have all but one basic variable in common, represent vertices of the polyhedron
that are connected by an edge of the feasible region.
In other words, if the LP has m constraints, then adjacent basic feasible so-
lutions have m − 1 variables in common. Using this terminology, at any given
step the simplex method moves from the current basic feasible solution to
an adjacent basic feasible solution. Next, we will show that any basic feasi-
ble solution represents an extreme point (vertex) of the feasible region, which
is exactly what we observed in our example. We consider the feasible region
X = {x : Ax = b, x ≥ 0} of an LP in the standard form.
x̄ = αx̃ + (1 − α)x̂.
Returning to our example, observe that the feasible region in Figure 11.1
has 6 extreme points, A, B, C, D, E, and F . The correspondence between these
extreme points and basic feasible solutions of the LP is shown in Table 11.1.
Recall that any basic (not necessarily feasible) solution is obtained by set-
ting the nonbasic variables to 0. In our two-dimensional case, we have two
nonbasic variables. If one of the original variables is 0, then the correspond-
ing basic solution lies on the line defining the corresponding coordinate axis,
and if a slack variable is 0 then the constraint it represents is binding for
the corresponding basic solution. Thus, basic solutions correspond to pairs
of lines defining the feasible region (including the two coordinate axes). The
total number of basic solutions that we may potentially have in our example
is (2+4 choose 4) = 15; however, not every set of 4 variables may form a basis. For
example, the basis consisting of variables x1 , s1 , s3 , and s4 is not possible,
since this would imply that x2 = 0 and s2 = 300 − x2 = 0 at the same time,
which is impossible. Geometrically, this corresponds to parallel lines defining
a pair of constraints (in our case, the line for the printed fabric constraint,
x2 = 300, is parallel to the line x2 = 0), meaning that both constraints
cannot be binding at the same time. In our example, x1 and s4 cannot be
nonbasic simultaneously as well, since the demand constraint line is parallel
to the x2 -axis. Excluding these two cases, there are 15 − 2 = 13 potential
basic solutions. As we already discussed, six of them (A, B, C, D, E, F ) are
basic feasible solutions, and we can see that the remaining basic solutions,
which lie on pairwise intersections of lines defining the constraints (including
nonnegativity), are infeasible (points G, H, I, J, K, L, and M in Figure 11.1).
Establishing the correspondence between these points and basic solutions is
left as an exercise (Exercise 11.4 at page 278).
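The counts above can be verified by brute force. The sketch below (our own illustration, not from the book) enumerates all (2+4 choose 4) = 15 candidate bases of the Heavenly Pouch LP in standard form, discards the singular ones, and checks which basic solutions are feasible:

```python
import itertools

def solve4(B, b):
    """Gauss-Jordan elimination with partial pivoting; returns None if
    the matrix is (numerically) singular."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(B)]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        if abs(M[p][k]) < 1e-9:
            return None
        M[k], M[p] = M[p], M[k]
        for r in range(n):
            if r != k:
                f = M[r][k] / M[k][k]
                M[r] = [v - f * w for v, w in zip(M[r], M[k])]
    return [M[i][-1] / M[i][i] for i in range(n)]

# Heavenly Pouch LP in standard form; variables (x1, x2, s1, s2, s3, s4).
A = [[1, 1, 1, 0, 0, 0],
     [0, 1, 0, 1, 0, 0],
     [4, 5, 0, 0, 1, 0],
     [1, 0, 0, 0, 0, 1]]
b = [450, 300, 2000, 350]

n_basic = n_feasible = 0
for cols in itertools.combinations(range(6), 4):   # 15 candidate bases
    B = [[row[j] for j in cols] for row in A]
    xB = solve4(B, b)
    if xB is None:
        continue                                   # singular: not a basis
    n_basic += 1
    if all(v >= -1e-9 for v in xB):
        n_feasible += 1
print(n_basic, n_feasible)  # 13 6
```

The output confirms the text: two of the 15 candidate sets are singular, leaving 13 basic solutions, of which 6 are feasible (the vertices A–F) and 7 infeasible (the points G–M).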
Suppose that the constraint x1 + x2 ≥ 100 is added to the Heavenly Pouch
LP. The corresponding dictionary is
z = 15x1 + 25x2
s1 = 450 − x1 − x2
s2 = 300 − x2
(11.26)
s3 = 2, 000 − 4x1 − 5x2
s4 = 350 − x1
e5 = −100 + x1 + x2 .
Note that setting x1 = x2 = 0 gives e5 = −100 < 0, so this dictionary is not
feasible. To deal with this issue, we introduce an artificial variable a5 for the
new constraint and write it as
x1 + x2 − e5 + a5 = 100.
With this modification, we can use a5 as the starting basic variable for this
constraint, thus obtaining a basic feasible solution for the resulting LP. How-
ever, to get a feasible solution for the original problem (if there is one), we
need to make sure that in the end the artificial variable a5 = 0. The methods
we are about to discuss utilize alternative ways of driving the artificial vari-
ables out of the basis, thus ensuring that they all eventually vanish whenever
the LP is feasible. We first discuss the general setup for both methods and
then proceed to describe each of them in more detail.
Consider a general LP

maximize   ∑_{j=1}^{n} cj xj
subject to ∑_{j=1}^{n} aij xj ≤ bi , i = 1, . . . , m′          (P)
           ∑_{j=1}^{n} aij xj = bi , i = m′ + 1, . . . , m
           xj ≥ 0, j = 1, . . . , n.
After introducing m′ slack variables xn+1 , . . . , xn+m′ and updating the prob-
lem coefficients accordingly, we can write this LP in the standard form as
follows:

maximize   ∑_{j=1}^{n+m′} cj xj
subject to ∑_{j=1}^{n+m′} aij xj = bi , i = 1, . . . , m          (PS)
           xj ≥ 0, j = 1, . . . , n + m′.
Denote by I− the set of indices of the inequality constraints whose slack
variables cannot serve as initial basic variables, and by
Ia = I− ∪ {m′ + 1, . . . , m}
the set of indices of all constraints, both inequality and equality, that require
an artificial variable to initialize the simplex method.
We will associate the following two problems, which share the same feasible
region, with (P). The first is

minimize   ∑_{i∈Ia} ai
subject to ∑_{j=1}^{n+m′} aij xj = bi , i ∈ {1, . . . , m} \ Ia          (A)
           ∑_{j=1}^{n+m′} aij xj + ai = bi , i ∈ Ia
           xj , ai ≥ 0, j = 1, . . . , n + m′ , i ∈ Ia ,

and the second is
maximize   ∑_{j=1}^{n+m′} cj xj − M ∑_{i∈Ia} ai
subject to ∑_{j=1}^{n+m′} aij xj = bi , i ∈ {1, . . . , m} \ Ia          (B)
           ∑_{j=1}^{n+m′} aij xj + ai = bi , i ∈ Ia
           xj , ai ≥ 0, j = 1, . . . , n + m′ , i ∈ Ia ,
where M is some sufficiently large positive constant referred to as the “big
M .” Problem (A) is called the auxiliary problem and problem (B) is the big-
M problem associated with (P).
Example 11.3 Consider the following LP:
maximize x1 − 2x2 + 3x3
subject to −2x1 + 3x2 + 4x3 ≥ 12
3x1 + 2x2 + x3 ≥ 6 (11.27)
x1 + x2 + x3 ≤ 9
x 1 , x2 , x3 ≥ 0.
This LP in the standard form is given by:
maximize x1 − 2x2 + 3x3
subject to −2x1 + 3x2 + 4x3 − x4 = 12
3x1 + 2x2 + x3 − x5 = 6 (11.28)
x1 + x2 + x3 + x6 = 9
x 1 , x2 , x3 , x4 , x5 , x6 ≥ 0.
Clearly, the basic solution with the basis consisting of x4 , x5 , and x6 is infea-
sible since it has negative values for x4 and x5 (x4 = −12 and x5 = −6).
Hence, we introduce artificial variables for the first two constraints. Then the
corresponding auxiliary problem can be written as a maximization problem as
follows:
maximize − a1 − a2
subject to −2x1 + 3x2 + 4x3 − x4 + a1 = 12
3x1 + 2x2 + x3 − x5 + a2 = 6 (11.29)
x1 + x2 + x3 + x6 = 9
x 1 , x 2 , x 3 , x 4 , x 5 , x6 , a 1 , a 2 ≥ 0,
Note that we can easily obtain a basic feasible solution for both (A) and (B)
by selecting the basis consisting of the slack variables xn+i , i ∈ {1, . . . , m′ } \ Ia
for rows where artificial variables were not needed, and the artificial variables
ai , i ∈ I a for the remaining rows. Also, the objective function of the auxil-
iary problem (A) is always nonnegative and thus any feasible solution of this
problem with ai = 0, i ∈ I a is optimal.
In addition, observe that if a feasible solution of (B) with ai > 0 for at least
one i ∈ Ia exists, then its objective function value can be made arbitrarily bad
(i.e., negative with arbitrarily large magnitude) by selecting a sufficiently large
constant M > 0.
Theorem 11.4 The following properties hold for LP (P) and the asso-
ciated auxiliary problem (A) and big-M problem (B):
The last theorem provides foundations for the two-phase simplex and the
big-M methods discussed next.
• Phase I: Solve the auxiliary problem (A), and, as a result, either obtain
a feasible tableau for the original problem (P) (if the optimal objective
value is 0), or conclude that the problem is infeasible (if the optimal
objective value is positive).
• Phase II: If the problem was not judged infeasible in Phase I, solve the
original problem using the optimal tableau of the auxiliary problem (A)
to get the starting tableau for the original LP (P).
To get a feasible tableau for (P) from an optimal tableau for (A) in Phase
II, we need to get rid of the artificial variables. If all artificial variables are non-
basic in the optimal tableau for (A), we just drop the corresponding columns
and express the objective of (P) through nonbasic variables to obtain a feasi-
ble tableau for (P). If some of the artificial variables are basic in the obtained
optimal solution for (A), the solution must be degenerate since the basic artifi-
cial variables are equal to 0, and we attempt to drive them out of the basis by
performing additional degenerate pivots. This is always possible, unless there
is a basic artificial variable in the optimal tableau such that its corresponding
row does not have nonzero coefficients in columns other than artificial variable
columns, in which case we can just remove the corresponding row. This can
be the case only when the original LP in the standard form has linearly de-
pendent constraints, as will be illustrated in Example 11.6. We consider other
examples first.
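The core of Phase I can also be sketched numerically. The function below is our own illustration (not the book's code); for simplicity it places an artificial variable in every row, whereas the text adds them only where needed. It solves the auxiliary problem and reports whether the system Ax = b, x ≥ 0 (with b ≥ 0) is feasible:

```python
def phase_one_feasible(A, b):
    """Phase I sketch: maximize -(a_1 + ... + a_m) with one artificial
    variable per row of A x = b (b >= 0).  The system is feasible iff
    the optimal auxiliary objective value is 0."""
    m, n = len(A), len(A[0])
    rows = [[float(v) for v in A[i]] + [1.0 if k == i else 0.0 for k in range(m)]
            + [float(b[i])] for i in range(m)]
    # z-row for max -(sum of artificials), priced out over the starting basis.
    zrow = [0.0] * n + [1.0] * m + [0.0]
    for r in rows:
        zrow = [v - w for v, w in zip(zrow, r)]
    while True:
        enter = min(range(n + m), key=lambda j: zrow[j])
        if zrow[enter] > -1e-9:
            break                               # optimal
        ratios = [(rows[i][-1] / rows[i][enter], i)
                  for i in range(m) if rows[i][enter] > 1e-9]
        if not ratios:
            break                               # safety net (aux LP is bounded)
        _, p = min(ratios)
        piv = rows[p][enter]
        rows[p] = [v / piv for v in rows[p]]
        for i in range(m):
            if i != p:
                f = rows[i][enter]
                rows[i] = [v - f * w for v, w in zip(rows[i], rows[p])]
        zrow = [v - zrow[enter] * w for v, w in zip(zrow, rows[p])]
    return abs(zrow[-1]) < 1e-7                 # optimal value 0 <=> feasible

# The constraint system of LP (11.28) is feasible ...
print(phase_one_feasible([[-2, 3, 4, -1, 0, 0],
                          [ 3, 2, 1, 0, -1, 0],
                          [ 1, 1, 1, 0, 0, 1]], [12, 6, 9]))   # True
# ... while x1 + x2 = 1 together with x1 + x2 = 3 is not.
print(phase_one_feasible([[1, 1], [1, 1]], [1, 3]))            # False
```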
Example 11.4 Use the two-phase simplex method to solve LP (11.27) from
Example 11.3. The Phase I (auxiliary) problem is
maximize − a1 − a2
subject to −2x1 + 3x2 + 4x3 − e1 + a1 = 12
3x1 + 2x2 + x3 − e2 + a2 = 6
x1 + x2 + x3 + s3 = 9
x 1 , x2 , x3 , e 1 , e 2 , s 3 , a1 , a2 ≥ 0.
We solve the Phase I LP using the simplex method in the tableau format.
Expressing the objective through nonbasic variables,
z = −a1 − a2
= −(12 + 2x1 − 3x2 − 4x3 + e1 ) − (6 − 3x1 − 2x2 − x3 + e2 )
= −18 + x1 + 5x2 + 5x3 − e1 − e2 ,
we obtain the following step 0 tableau:
z x1 x2 x3 e1 e2 s3 a1 a2 rhs Basis
1 −1 −5 −5 1 1 0 0 0 −18 z
0 −2 3 4 −1 0 0 1 0 12 a1
0 3 2 1 0 −1 0 0 1 6 a2
0 1 1 1 0 0 1 0 0 9 s3
Step 1 tableau:
z  x1     x2  x3    e1  e2    s3  a1  a2    rhs  Basis
1  13/2   0   −5/2  1   −3/2  0   0   5/2   −3   z
0  −13/2  0   5/2   −1  3/2   0   1   −3/2  3    a1
0  3/2    1   1/2   0   −1/2  0   0   1/2   3    x2
0  −1/2   0   1/2   0   1/2   1   0   −1/2  6    s3
Step 2 tableau:
z  x1     x2  x3  e1    e2    s3  a1    a2    rhs   Basis
1  0      0   0   0     0     0   1     1     0     z
0  −13/5  0   1   −2/5  3/5   0   2/5   −3/5  6/5   x3
0  14/5   1   0   1/5   −4/5  0   −1/5  4/5   12/5  x2
0  4/5    0   0   1/5   1/5   1   −1/5  −1/5  27/5  s3
This tableau is optimal. Since the optimal objective value is 0, the original
LP is feasible. To obtain a feasible tableau for the original LP, we drop the
columns for a1 and a2 in the tableau above and replace the basic variables x2
and x3 in the objective function of the original LP,
z = x1 − 2x2 + 3x3 ,
with their expressions through the nonbasic variables from the optimal Phase
I tableau,
x2 = 12/5 − (14/5)x1 − (1/5)e1 + (4/5)e2 ;
x3 = 6/5 + (13/5)x1 + (2/5)e1 − (3/5)e2 .
We obtain
z = x1 − 2x2 + 3x3 = −6/5 + (72/5)x1 + (8/5)e1 − (17/5)e2 .
The Phase II initial tableau is then
z  x1     x2  x3  e1    e2    s3  rhs   Basis
1  −72/5  0   0   −8/5  17/5  0   −6/5  z
0  −13/5  0   1   −2/5  3/5   0   6/5   x3
0  14/5   1   0   1/5   −4/5  0   12/5  x2
0  4/5    0   0   1/5   1/5   1   27/5  s3
In the first Phase II iteration, x1 enters and x2 leaves the basis:
z  x1  x2     x3  e1     e2    s3  rhs   Basis
1  0   36/7   0   −4/7   −5/7  0   78/7  z
0  0   13/14  1   −3/14  −1/7  0   24/7  x3
0  1   5/14   0   1/14   −2/7  0   6/7   x1
0  0   −2/7   0   1/7    3/7   1   33/7  s3
Next, e2 enters and s3 leaves the basis:
z  x1  x2    x3  e1    e2  s3   rhs  Basis
1  0   14/3  0   −1/3  0   5/3  19   z
0  0   5/6   1   −1/6  0   1/3  5    x3
0  1   1/6   0   1/6   0   2/3  4    x1
0  0   −2/3  0   1/3   1   7/3  11   e2
Finally, e1 enters and x1 leaves the basis, producing the tableau
z  x1  x2  x3  e1  e2  s3  rhs  Basis
1  2   5   0   0   0   3   27   z
0  1   1   1   0   0   1   9    x3
0  6   1   0   1   0   4   24   e1
0  −2  −1  0   0   1   1   3    e2
Since all coefficients in row 0 are nonnegative, this tableau is optimal, and the
optimal solution of the original LP is x∗1 = x∗2 = 0, x∗3 = 9, z ∗ = 27.
Example 11.5 Use the two-phase simplex method to solve the following LP:
maximize   x1 + x2 + x3
subject to 2x1 + 2x2 + 3x3 = 6
           x1 + 3x2 + 6x3 = 12
           x1 , x2 , x3 ≥ 0.
We introduce artificial variables a1 and a2 and write down the phase I starting
tableau:
z x1 x2 x3 a1 a2 rhs Basis
1 −3 −5 −9 0 0 −18 z
0 2 2 3 1 0 6 a1
0 1 3 6 0 1 12 a2
After one pivot, we obtain an optimal tableau:
z x1 x2 x3 a1 a2 rhs Basis
1 3 1 0 3 0 0 z
2 2 1
0 3 3 1 3 0 2 x3
0 −3 −1 0 −2 1 0 a2
Since the artificial variable a2 is still basic (at value 0), we perform an additional
degenerate pivot to drive it out of the basis, pivoting on the x2 entry of its row:
z  x1    x2  x3  a1  a2   rhs  Basis
1  0     0   0   1   1    0    z
0  −4/3  0   1   −1  2/3  2    x3
0  3     1   0   2   −1   0    x2
Removing the artificial variables and expressing the objective in terms of the
nonbasic variable x1 , we obtain the following phase II initial tableau:
z  x1    x2  x3  rhs  Basis
1  2/3   0   0   2    z
0  −4/3  0   1   2    x3
0  3     1   0   0    x2
This tableau happens to be optimal, with the optimal solution given by
x∗1 = x∗2 = 0, x∗3 = 2, z ∗ = 2.
Example 11.6 Consider the following LP with linearly dependent con-
straints:
maximize x1 + x2 + 2x3
subject to x1 + 2x2 + 3x3 = 6
2x1 + 4x2 + 6x3 = 12.
Solve the problem using the two-phase simplex method.
We introduce artificial variables a1 and a2 and write down the phase I starting
tableau:
z x1 x2 x3 a1 a2 rhs Basis
1 −3 −6 −9 0 0 −18 z
0 1 2 3 1 0 6 a1
0 2 4 6 0 1 12 a2
After one pivot, we obtain an optimal phase I tableau:
z  x1   x2   x3  a1   a2  rhs  Basis
1  0    0    0   3    0   0    z
0  1/3  2/3  1   1/3  0   2    x3
0  0    0    0   −2   1   0    a2
Again, the corresponding basic optimal solution is degenerate, but this time
we cannot make both a1 and a2 nonbasic at the same time like we did in the
previous example. However, we can see that the second row does not involve
the original variables. Hence, if we remove this row along with the column for
a1 and express the objective function through the nonbasic variables,
z = x1 + x2 + 2x3 = x1 + x2 + 4 − (2/3)x1 − (4/3)x2 = 4 + (1/3)x1 − (1/3)x2 ,
we obtain a feasible tableau for the original problem:
z  x1    x2   x3  rhs  Basis
1  −1/3  1/3  0   4    z
0  1/3   2/3  1   2    x3
After one more pivot, with x1 entering and x3 leaving the basis, we obtain:
z  x1  x2  x3  rhs  Basis
1  0   1   1   6    z
0  1   2   3   6    x1
This tableau is optimal, with the corresponding basic optimal solution given
by
x∗1 = 6, x∗2 = x∗3 = 0, z ∗ = 6.
Example 11.7 Use the big-M method to solve LP (11.27) from Example 11.3.
The starting tableau is
z x1 x2 x3 e1 e2 s3 a1 a2 rhs Basis
1 −M − 1 −5M + 2 −5M − 3 M M 0 0 0 −18M z
0 −2 3 4 −1 0 0 1 0 12 a1
0 3 2 1 0 −1 0 0 1 6 a2
0 1 1 1 0 0 1 0 0 9 s3
Step 1 tableau:
z  x1         x2          x3  e1        e2  s3  a1        a2  rhs    Basis
1  −(7M+5)/2  −(5M−17)/4  0   −(M+3)/4  M   0   (5M+3)/4  0   −3M+9  z
0  −1/2       3/4         1   −1/4      0   0   1/4       0   3      x3
0  7/2        5/4         0   1/4       −1  0   −1/4      1   3      a2
0  3/2        1/4         0   1/4       0   1   −1/4      0   6      s3
Step 2 tableau:
z  x1  x2     x3  e1     e2    s3  a1        a2        rhs   Basis
1  0   36/7   0   −4/7   −5/7  0   (7M+4)/7  (7M+5)/7  78/7  z
0  0   13/14  1   −3/14  −1/7  0   3/14      1/7       24/7  x3
0  1   5/14   0   1/14   −2/7  0   −1/14     2/7       6/7   x1
0  0   −2/7   0   1/7    3/7   1   −1/7      −3/7      33/7  s3
Step 3 tableau:
z  x1  x2    x3  e1    e2  s3   a1        a2  rhs  Basis
1  0   14/3  0   −1/3  0   5/3  (3M+1)/3  M   19   z
0  0   5/6   1   −1/6  0   1/3  1/6       0   5    x3
0  1   1/6   0   1/6   0   2/3  −1/6      0   4    x1
0  0   −2/3  0   1/3   1   7/3  −1/3      −1  11   e2
Step 4 tableau:
z x1 x2 x3 e1 e2 s3 a1 a2 rhs Basis
1 2 5 0 0 0 3 M M 27 z
0 1 1 1 0 0 1 0 0 9 x3
0 6 1 0 1 0 4 −1 0 24 e1
0 −2 −1 0 0 1 1 0 −1 3 e2
This tableau is optimal and yields the same optimal solution as the two-phase
method: x∗1 = x∗2 = 0, x∗3 = 9, z ∗ = 27.
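The big-M method can also be run numerically by fixing a concrete large value for M instead of treating it symbolically as above. The sketch below is our own illustration (not the book's code); for simplicity it places an artificial variable in every row of the standard-form system, and it is applied to LP (11.28):

```python
def big_m_simplex(c, A, b, M=1e6):
    """Big-M sketch with a numeric M: maximize c^T x s.t. A x = b,
    x >= 0 (b >= 0).  An artificial variable with objective coefficient
    -M is added to every row, giving an obvious starting basis."""
    m, n = len(A), len(c)
    rows = [[float(v) for v in A[i]] + [1.0 if k == i else 0.0 for k in range(m)]
            + [float(b[i])] for i in range(m)]
    zrow = [-float(cj) for cj in c] + [M] * m + [0.0]
    for r in rows:                        # price out the artificial columns
        zrow = [v - M * w for v, w in zip(zrow, r)]
    basis = list(range(n, n + m))
    while True:
        enter = min(range(n + m), key=lambda j: zrow[j])
        if zrow[enter] > -1e-7:
            break                         # optimal
        ratios = [(rows[i][-1] / rows[i][enter], i)
                  for i in range(m) if rows[i][enter] > 1e-9]
        if not ratios:
            raise ValueError("LP is unbounded")
        _, p = min(ratios)
        piv = rows[p][enter]
        rows[p] = [v / piv for v in rows[p]]
        for i in range(m):
            if i != p:
                f = rows[i][enter]
                rows[i] = [v - f * w for v, w in zip(rows[i], rows[p])]
        zrow = [v - zrow[enter] * w for v, w in zip(zrow, rows[p])]
        basis[p] = enter
    x = [0.0] * n
    for i, j in enumerate(basis):
        if j < n:
            x[j] = rows[i][-1]
    return zrow[-1], x

# LP (11.28), the standard form of (11.27); variables (x1, x2, x3, x4, x5, x6).
z, x = big_m_simplex([1, -2, 3, 0, 0, 0],
                     [[-2, 3, 4, -1, 0, 0],
                      [ 3, 2, 1, 0, -1, 0],
                      [ 1, 1, 1, 0, 0, 1]],
                     [12, 6, 9])
print(round(z, 6), [round(v, 6) for v in x])  # 27.0 [0.0, 0.0, 9.0, 24.0, 3.0, 0.0]
```

The numeric run recovers the same optimum as the symbolic computation above, x3* = 9 with z* = 27; in practice a too-large M can cause rounding trouble, which is one reason the two-phase method is often preferred.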
Proof. The proof follows from the analysis of the two-phase simplex method
above. If an LP has a feasible solution, then Phase I of the two-phase simplex
method will find a basic feasible solution. If the LP is optimal, then Phase II
of the two-phase simplex method will find a basic optimal solution. If the LP
has no optimal solution and was not proved infeasible at Phase I of the two-
phase simplex method, then we start Phase II. Phase II will always terminate
if, e.g., we use Bland’s rule to avoid cycling. If the LP is not optimal, Phase
II will prove that the problem is unbounded, since this is the only remaining
possibility.
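Bland's rule itself is simple to state in code. The sketch below is an illustration, not taken from the book: in the maximization tableau format used in this chapter, a variable is an entering candidate when its row-0 coefficient is negative, and Bland's rule picks the candidate with the smallest index; the leaving row comes from the ratio test. (Strictly, Bland's rule breaks ratio-test ties by the smallest basic-variable index; the sketch approximates this by row order.)

```python
def blands_rule(row0, column, rhs):
    """Pick pivot indices by Bland's smallest-index rule.

    row0[j]   -- row-0 coefficient of variable j (negative => candidate)
    column[i] -- entry of the entering variable's column in row i
    rhs[i]    -- right-hand side of row i
    """
    # entering variable: smallest index with a negative row-0 coefficient
    entering = min(j for j, coeff in enumerate(row0) if coeff < 0)
    # ratio test over rows with a positive entry in the entering column;
    # ties are broken here by row order (an approximation of Bland's rule)
    ratios = [(rhs[i] / column[i], i) for i in range(len(rhs)) if column[i] > 0]
    _, leaving = min(ratios)
    return entering, leaving

print(blands_rule([-1, -2, 0], [2, 1], [4, 1]))  # entering column 0, leaving row 1
```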
in the form
  maximize    Σ_{j=1}^n cj xj

  subject to  Σ_{j=1}^n aij xj ≤ bi ,   i = 1, . . . , m        (11.31)

              xj ≥ 0,   j = 1, . . . , n.
Before solving the problem using the simplex method, we introduce the slack
variables xn+1 , xn+2 , . . . , xn+m and write this problem in the standard form,
maximize cT x
subject to Ax = b (11.32)
x ≥ 0.
Then matrix A has m rows and n + m columns, the last m of which form the
m × m identity matrix Im . Vector x has length n + m and vector b has length
m. Vector c has length n + m, with its last m components being zeros.
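This conversion is mechanical and can be sketched in a few lines (an illustration assuming NumPy, using the data of LP (11.42) below):

```python
import numpy as np

def to_standard_form(c, A, b):
    """Append one slack variable per '<=' row, so that
    max c^T x s.t. Ax <= b, x >= 0  becomes  max c_s^T x s.t. A_s x = b, x >= 0."""
    A = np.asarray(A, dtype=float)
    m, _ = A.shape
    A_s = np.hstack([A, np.eye(m)])          # last m columns form I_m
    c_s = np.concatenate([np.asarray(c, float), np.zeros(m)])  # slacks cost 0
    return c_s, A_s, np.asarray(b, dtype=float)

c_s, A_s, b = to_standard_form([4, 3, 5],
                               [[1, 2, 2], [3, 0, 4], [2, 1, 4]],
                               [4, 6, 8])
```

The returned `A_s` has m rows and n + m columns, with the identity block in the last m columns, exactly as described above.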
Next, we will apply the simplex method to this problem, however, this time
the computations will be done using a matrix representation of the data at
each step. The resulting method is referred to as the revised simplex method.
To facilitate the discussion, we will illustrate the underlying ideas using the
following LP:
and
Ax = BxB + N xN = b, (11.36)
respectively, where
xB = the vector of basic variables listed in the increasing order of their
indices;
xN = the vector of nonbasic variables listed in the increasing order of
their indices;
cB = the vector consisting of the components of c that correspond to
the basic variables, listed in the increasing order of their indices;
cN = the vector consisting of the components of c that correspond to the
nonbasic variables, listed in the increasing order of their indices;
B = the matrix whose columns are the columns of A that correspond to
the basic variables, listed in the increasing order of their indices;
N = the matrix whose columns are the columns of A that correspond
to the nonbasic variables, listed in the increasing order of their
indices.
For example, for LP (11.33) at step 0 of the simplex method we have:
       ⎡ x4 ⎤         ⎡ 0 ⎤        ⎡ 1 0 0 ⎤          ⎡ 4 ⎤
  xB = ⎢ x5 ⎥ ,  cB = ⎢ 0 ⎥ ,  B = ⎢ 0 1 0 ⎥ ,  x̃B = ⎢ 6 ⎥ ,
       ⎣ x6 ⎦         ⎣ 0 ⎦        ⎣ 0 0 1 ⎦          ⎣ 8 ⎦

       ⎡ x1 ⎤         ⎡ 4 ⎤        ⎡ 1 2 2 ⎤
  xN = ⎢ x2 ⎥ ,  cN = ⎢ 3 ⎥ ,  N = ⎢ 3 0 4 ⎥ .
       ⎣ x3 ⎦         ⎣ 5 ⎦        ⎣ 2 1 4 ⎦
Note that B is given by the 3 × 3 identity matrix and is clearly nonsingular.
Next we show that matrix B will remain nonsingular for any basis obtained
during the execution of the simplex method.
Note that the proof above is constructive and allows one to easily extract
B −1 from the simplex tableau, as illustrated in the following example.
z x1 x2 s1 s2 s3 s4 rhs Basis
1 −15 −25 0 0 0 0 0 z
0 1 1 1 0 0 0 450 s1
(11.37)
0 0 1 0 1 0 0 300 s2
0 4 5 0 0 1 0 2,000 s3
0 1 0 0 0 0 1 350 s4
After two steps of the simplex method we had the following tableau:
z x1 x2 s1 s2 s3 s4 rhs Basis
1 0 0 0 25/4 15/4 0 9,375 z
0 0 0 1 1/4 −1/4 0 25 s1
(11.38)
0 0 1 0 1 0 0 300 x2
0 1 0 0 −5/4 1/4 0 125 x1
0 0 0 0 5/4 −1/4 1 225 s4
Rearranging the rows of this tableau so that the matrix comprised of the
columns corresponding to the basic variables x1 , x2 , s1 , and s4 is the 4 × 4
identity matrix, we obtain:
z x1 x2 s1 s2 s3 s4 rhs Basis
1 0 0 0 25/4 15/4 0 9,375 z
0 1 0 0 −5/4 1/4 0 125 x1
(11.39)
0 0 1 0 1 0 0 300 x2
0 0 0 1 1/4 −1/4 0 25 s1
0 0 0 0 5/4 −1/4 1 225 s4
and its inverse can be read from the columns for s1 , s2 , s3 , and s4 in (11.39):
           ⎡ 0  −5/4   1/4  0 ⎤
  B^{-1} = ⎢ 0    1     0   0 ⎥ .        (11.41)
           ⎢ 1   1/4  −1/4  0 ⎥
           ⎣ 0   5/4  −1/4  1 ⎦
This is easy to verify by checking that the product of the matrices in (11.40)
and (11.41) gives the 4 × 4 identity matrix.
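That check is immediate to carry out numerically (a sketch assuming NumPy):

```python
import numpy as np

# B: columns of A in (11.37) for the basic variables x1, x2, s1, s4
B = np.array([[1., 1., 1., 0.],
              [0., 1., 0., 0.],
              [4., 5., 0., 0.],
              [1., 0., 0., 1.]])

# B^{-1} as read off the s1, s2, s3, s4 columns of tableau (11.39)
B_inv = np.array([[0., -5/4,  1/4, 0.],
                  [0.,  1.,   0.,  0.],
                  [1.,  1/4, -1/4, 0.],
                  [0.,  5/4, -1/4, 1.]])

assert np.allclose(B @ B_inv, np.eye(4))  # product is the 4x4 identity
```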
xB = B^{-1} b − B^{-1} N xN .
Writing the tableau for the same basis, we obtain the revised tableau:
z   xB                 xN                       rhs
1    0    −(cN^T − cB^T B^{-1} N)       cB^T B^{-1} b        (RT)
0   Im            B^{-1} N                 B^{-1} b
Note that the formulas in (RD) and (RT) are valid for any LP in the stan-
dard form (11.32), as long as it has a basic solution, which ensures that B is
nonsingular for every basic solution.
We will use the matrix representation of the dictionary (RD) or of the
tableau (RT) to carry out the steps of the revised simplex method. Let us
consider the tableau form (RT) to motivate the method. With the “usual”
simplex method, at each step of the method we update the data in (RT) by
applying certain elementary row operations to the tableau computed at the
previous step. However, note that we do not need to know every single en-
try of the tableau in order to determine the entering and leaving variables,
which is the only information needed to update the basis. The main reason
for computing the whole tableau at each step was due to the fact that parts of
the tableau that may not be used at the current step may become necessary
at further steps. But, as we can see from (RT), the information required to
perform a pivot may be obtained using the input data given by A, b, and c
directly, rather than using the output from the previous step. We will demon-
strate how the corresponding computations can be carried out using the LP
given in (11.33), with c, A, and b as in (11.34):
                                  ⎡ 1 2 2 1 0 0 ⎤        ⎡ 4 ⎤
  c^T = [4, 3, 5, 0, 0, 0],   A = ⎢ 3 0 4 0 1 0 ⎥ ,  b = ⎢ 6 ⎥ .        (11.42)
                                  ⎣ 2 1 4 0 0 1 ⎦        ⎣ 8 ⎦
Here x̃B is the vector storing the values of the basic variables in the current
basic feasible solution.
Step 1. Since the initial values of B and cB are given by the 3 × 3 identity
matrix and the 3-dimensional 0 vector, the formulas in (RT) simplify to the
following:
z   xB      xN     rhs
1    0    −cN^T     0        (RT-1)
0   Im       N      b
Choosing the entering variable. At the first step, selecting the entering
variable is easy; we just take a variable corresponding to the highest positive
coefficient in cN , which is x3 in our case.
Choosing the leaving variable. We select the leaving variable based on the
ratio test, which is performed by considering the components of the vector of
right-hand sides given by x̃B that correspond to positive entries of the column
of N representing the entering variable, x3 . Let us denote this column by Nx3 .
Then we have:
Updating the basic feasible solution. The information used for the ratio
test is also sufficient for determining the values of the new vector x̃B . For the
entering variable, x̃3 is given by the minimum ratio value, i.e., x̃3 = 3/2. As for
the remaining components of x̃B , for each basic variable xj they are computed
based on the following observation. To perform the pivot, we would apply the
elementary row operation which, in order to eliminate x3 from the xj -row,
multiplies the x5 -row (i.e., the row where x5 was basic) by the coefficient for
x3 in the xj -row divided by 4 and then subtracts the result from the xj -row.
When this elementary row operation is applied to the right-hand side column,
we obtain:
x̃3 = 3/2
x̃4 = 4 − 2(6/4) = 1
x̃6 = 8 − 4(6/4) = 2.
In summary, we have the following step 1 output:
       ⎡ x3 ⎤         ⎡ 5 ⎤        ⎡ 2 1 0 ⎤          ⎡ 3/2 ⎤
  xB = ⎢ x4 ⎥ ,  cB = ⎢ 0 ⎥ ,  B = ⎢ 4 0 0 ⎥ ,  x̃B = ⎢  1  ⎥ ,
       ⎣ x6 ⎦         ⎣ 0 ⎦        ⎣ 4 0 1 ⎦          ⎣  2  ⎦

       ⎡ x1 ⎤         ⎡ 4 ⎤        ⎡ 1 2 0 ⎤
  xN = ⎢ x2 ⎥ ,  cN = ⎢ 3 ⎥ ,  N = ⎢ 3 0 1 ⎥ .
       ⎣ x5 ⎦         ⎣ 0 ⎦        ⎣ 2 1 0 ⎦
Step 2. In order to carry out the necessary computations, we will use the
formulas in (RT) written as follows:
z   xB                      xN                               rhs
1    0    −c̃N = −(cN^T − cB^T B^{-1} N)        z̃ = cB^T B^{-1} b        (RT-2)
0   Im              Ñ = B^{-1} N                  x̃B = B^{-1} b
B^T u = cB .
We have
  ⎡ 2 4 4 ⎤ ⎡ u1 ⎤   ⎡ 5 ⎤            ⎡  0  ⎤
  ⎢ 1 0 0 ⎥ ⎢ u2 ⎥ = ⎢ 0 ⎥   ⇔   u = ⎢ 5/4 ⎥ ,
  ⎣ 0 0 1 ⎦ ⎣ u3 ⎦   ⎣ 0 ⎦            ⎣  0  ⎦

so,
                                            ⎡ 1 2 0 ⎤
  c̃N^T = cN^T − u^T N = [4, 3, 0] − [0, 5/4, 0] ⎢ 3 0 1 ⎥ = [1/4, 3, −5/4],
                                            ⎣ 2 1 0 ⎦
Ñx2 = B^{-1} Nx2 .
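Both solves are easy to reproduce numerically with the step 1 output above (a sketch assuming NumPy):

```python
import numpy as np

B  = np.array([[2., 1., 0.], [4., 0., 0.], [4., 0., 1.]])  # basis x3, x4, x6
cB = np.array([5., 0., 0.])
N  = np.array([[1., 2., 0.], [3., 0., 1.], [2., 1., 0.]])  # nonbasic x1, x2, x5
cN = np.array([4., 3., 0.])

u = np.linalg.solve(B.T, cB)   # solves B^T u = cB, giving u = [0, 5/4, 0]
c_tilde = cN - N.T @ u         # reduced costs [1/4, 3, -5/4]
```

The positive component of `c_tilde` identifies x2 as the entering variable, in agreement with the hand computation.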
z   xB                      xN                               rhs
1    0    −c̃N = −(cN^T − cB^T B^{-1} N)        z̃ = cB^T B^{-1} b        (RT-3)
0   Im              Ñ = B^{-1} N                  x̃B = B^{-1} b
and
                                              ⎡ 1 1 0 ⎤
  c̃N^T = cN^T − u^T N = [4, 0, 0] − [3/2, 1/2, 0] ⎢ 3 0 1 ⎥ = [1, −3/2, −1/2]
                                              ⎣ 2 0 0 ⎦
Ñx1 = B^{-1} Nx1 .
Since none of the components of c̃TN is positive, this is a basic optimal solution.
We have
               ⎡ 2 ⎤
  x∗B = x̃B = ⎢ 1 ⎥ ,
               ⎣ 3 ⎦
so the final answer for the original variables is

  x∗ = [2, 1, 0, 0, 0, 3]^T ,   z ∗ = c^T x∗ = 11.
After introducing the slack variables xn+1 , xn+2 , . . . , xn+m we write this prob-
lem in the standard form:
maximize cT x
subject to Ax = b (11.44)
x ≥ 0.
The matrix A has m rows and n + m columns, and we may potentially have

  K = ( m+n choose m ) = (m + n)! / (m! n!)

basic solutions. In particular, if m = n, then

  K = (2n)! / (n!)^2 = ((n + 1)(n + 2) · . . . · 2n) / (1 · 2 · . . . · n) = Π_{i=1}^n (n + i)/i ≥ 2^n .
Thus, the number of basic feasible solutions may be exponential with respect
to the problem input size. If there existed an LP with K basic feasible solu-
tions, and if the simplex method would have to visit each basic feasible solution
before terminating, this would imply that the simplex method requires an ex-
ponential number of steps in the worst case. Unfortunately, such examples
have been constructed for various strategies for selecting leaving and enter-
ing variables in the simplex method. The first such example was constructed
by Klee and Minty in 1972, and is widely known as the Klee-Minty problem,
which can be formulated as follows:
  maximize    Σ_{j=1}^m 10^(m−j) xj

  subject to  xi + 2 Σ_{j=1}^{i−1} 10^(i−j) xj ≤ 100^(i−1) ,   i = 1, . . . , m        (11.45)

              xj ≥ 0,   j = 1, . . . , m.
If the entering variable is always selected to be the nonbasic variable with the
highest coefficient in the objective (written in the dictionary format), then the
simplex method will visit each of the 2m basic feasible solutions of this LP.
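The data of (11.45) can be generated programmatically for any m (an illustrative sketch, not from the book):

```python
def klee_minty(m):
    """Objective, constraint matrix, and right-hand sides of the
    Klee-Minty LP (11.45):
    max sum_j 10^(m-j) x_j  s.t.  x_i + 2*sum_{j<i} 10^(i-j) x_j <= 100^(i-1)."""
    c = [10 ** (m - j) for j in range(1, m + 1)]
    A = [[2 * 10 ** (i - j) if j < i else (1 if j == i else 0)
          for j in range(1, m + 1)]
         for i in range(1, m + 1)]
    b = [100 ** (i - 1) for i in range(1, m + 1)]
    return c, A, b

c, A, b = klee_minty(3)   # the m = 3 instance used in Exercise 11.5
```

For m = 3 this gives c = [100, 10, 1], b = [1, 100, 10000], and a lower-triangular constraint matrix with rows [1,0,0], [20,1,0], [200,20,1].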
It is still unknown whether a variation of the simplex method can be de-
veloped that would be guaranteed to terminate in a number of steps bounded
by a polynomial function of the problem’s input size. It should be noted that
all currently known polynomial time algorithms for linear optimization prob-
lems, such as the ellipsoid method and the interior point methods, are based
on the approaches that treat linear programs as continuous convex optimiza-
tion problems, whereas the simplex method exploits the discrete nature of an
LP’s extreme points along with the fact that any optimal LP has an optimal
solution at an extreme point.
Exercises
11.1. Write the following problems as LPs in the standard form:
(a) maximize 2x1 + 3x2 − 3x3
subject to x1 + x2 + x3 ≤ 7
x2 − x3 ≤ 5
−x1 + x2 + 4x3 ≥ 4
x1 ∈ IR, x2 , x3 ≥ 0
(b) minimize |x1 | + 2|x2 |
subject to 2x1 + 3x2 ≤ 4
5|x1 | − 6x2 ≤ 7
8x1 + 9x2 ≥ 10
x1 , x2 ∈ IR.
11.5. Use the simplex method to solve the Klee-Minty LP (11.45) at page 277
for m = 3. Always select the nonbasic variable with the highest co-
efficient in the objective as the entering variable. Illustrate your steps
geometrically.
The Simplex Method for Linear Programming 279
11.6. Explain how the simplex algorithm can be used for finding the second-
best basic feasible solution to an LP. Then find the second best bfs for
problems (a) and (b) in Exercise 11.3.
11.9. Use the revised simplex method to solve LPs (a) and (b) in Exercise 11.3.
Chapter 12
Duality and Sensitivity Analysis in
Linear Programming
In addition, the annual demand for Dynamo and Spartacus shoes is limited
to 1,000 pairs each. The LP that SuperCleats Inc. solves in order to decide
on the quantity of each shoe model to manufacture so that its overall profit
is maximized is given by
that is,
15y1 + 35y2 + 5y3 + y4 ≥ 100
10y1 + 22y2 + 5y3 + y5 ≥ 80
8y1 + 18y2 + 4y3 ≥ 60
5y1 + 10y2 + 4y3 ≥ 50,
then we have
To get the lowest possible upper bound on z this way, we need to solve the
following LP:
maximize c 1 x1 + ... + c n xn
subject to a11 x1 + ... + a1n xn ≤ b1
.. .. .. ..
. . . .
am1 x1 + ... + amn xn ≤ bm
x1 , . . . , x n ≥ 0
or
  maximize    Σ_{j=1}^n cj xj

  subject to  Σ_{j=1}^n aij xj ≤ bi ,   i = 1, . . . , m

              x1 , . . . , xn ≥ 0,
then its dual LP is
minimize b1 y1 + ... + bm ym
subject to a11 y1 + ... + am1 ym ≥ c1
.. .. .. ..
. . . .
a1n y1 + ... + amn ym ≥ cn
y1 , . . . , y m ≥ 0
or
  minimize    Σ_{i=1}^m bi yi

  subject to  Σ_{i=1}^m aij yi ≥ cj ,   j = 1, . . . , n

              y1 , . . . , ym ≥ 0.
Primal LP Dual LP
maximize cT x minimize bT y
subject to Ax ≤ b subject to AT y ≥ c
x ≥ 0 y ≥ 0
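In matrix form the construction is a pure data transformation: swap b and c and transpose A (a small sketch assuming NumPy, using the data of LP (11.42) as a sample):

```python
import numpy as np

def dual_of(c, A, b):
    """Dual of  max c^T x s.t. Ax <= b, x >= 0  is
    min b^T y s.t. A^T y >= c, y >= 0."""
    return np.asarray(b), np.asarray(A).T, np.asarray(c)

# objective, constraint matrix, and right-hand side of the dual LP
b_dual, A_dual, c_dual = dual_of([4, 3, 5],
                                 [[1, 2, 2], [3, 0, 4], [2, 1, 4]],
                                 [4, 6, 8])
```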
In order to use the rules for forming the dual that we already know, we equiv-
alently represent this equality by two inequality constraints,
Let y′ and y″ be the dual variables for the first and the second constraint, respectively.
or, equivalently,

  minimize    10(y′ − y″)
  subject to  7(y′ − y″) ≥ −1
             −8(y′ − y″) ≥ −2
              9(y′ − y″) ≥ −3
              y′ , y″ ≥ 0.

Since y′ and y″ always appear together, with the same absolute value but
opposite sign coefficients, we can make the following change of variables:
y = y′ − y″ . The resulting LP is

  minimize    10y
  subject to  7y ≥ −1
             −8y ≥ −2
              9y ≥ −3
              y ∈ IR.
  maximize    Σ_{i=1}^m (−bi ) yi

  subject to  Σ_{i=1}^m (−aij ) yi ≤ −cj ,   j = 1, . . . , n        (12.5)

              y1 , . . . , ym ≥ 0
  minimize    Σ_{j=1}^n (−cj ) xj

  subject to  Σ_{j=1}^n (−aij ) xj ≥ −bi ,   i = 1, . . . , m        (12.6)

              x1 , . . . , xn ≥ 0.
  maximize    Σ_{j=1}^n cj xj

  subject to  Σ_{j=1}^n aij xj ≤ bi ,   i = 1, . . . , m        (12.7)

              x1 , . . . , xn ≥ 0,
Taking into account that the dual of the dual gives the primal LP, we
have the following summary of the correspondence between the constraints
and variables of the primal and dual LPs, which can be used as rules for
forming the dual for a given primal LP. We assume that the primal LP is a
maximization problem and the dual LP is a minimization problem.
Primal LP Dual LP
Equality constraint ↔ Free variable
Inequality constraint (≤) ↔ Nonnegative variable
Free variable ↔ Equality constraint
Nonnegative variable ↔ Inequality constraint (≥)
maximize 2x1 + x2
subject to x1 + x2 = 2
2x1 − x2 ≥ 3
x1 − x2 ≤ 1
x1 ≥ 0, x2 ∈ IR
is given by
  Σ_{j=1}^n cj xj ≤ Σ_{j=1}^n ( Σ_{i=1}^m aij yi ) xj = Σ_{i=1}^m ( Σ_{j=1}^n aij xj ) yi ≤ Σ_{i=1}^m bi yi .        (12.8)
Hence, for any feasible solution x to the primal LP and any feasible solution
y to the dual LP we have the property of weak duality:
  z = Σ_{j=1}^n cj xj ≤ Σ_{i=1}^m bi yi = w.
  Σ_{j=1}^n cj x∗j = Σ_{i=1}^m bi yi∗ ,
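For LP (11.42), whose primal optimum x∗ = [2, 1, 0, 0, 0, 3]^T was found in Section 11.6, both weak and strong duality are easy to verify numerically (a sketch assuming NumPy; the dual vector below is cB^T B^{-1} at the optimal basis):

```python
import numpy as np

c = np.array([4., 3., 5.])
b = np.array([4., 6., 8.])
A = np.array([[1., 2., 2.], [3., 0., 4.], [2., 1., 4.]])

x = np.array([2., 1., 0.])     # primal optimum (original variables)
y = np.array([3/2, 5/6, 0.])   # dual vector c_B^T B^{-1} at the optimal basis

assert np.all(A @ x <= b + 1e-9) and np.all(x >= 0)      # primal feasible
assert np.all(A.T @ y >= c - 1e-9) and np.all(y >= 0)    # dual feasible
assert np.isclose(c @ x, b @ y)   # z = w = 11: equality certifies optimality
```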
constraint is a “≤” constraint, and let ai be the artificial variable for the ith
constraint if the ith constraint is a “=” constraint used in the big-M method.
Then the following table represents the correspondence between the optimal
value yi∗ of the dual variable corresponding to the ith constraint and the co-
efficients of si or ai in row 0 of the optimal tableau.
  ith constraint type           yi∗ (found in row 0 of the
  in the primal LP              optimal primal LP tableau)
  ≤                      →      coefficient of si
  =                      →      (coefficient of ai ) − M
(Note that this correspondence is for the tableau format; sign changes would
have to be made in the dictionary format).
Let x∗ and y ∗ be optimal solutions to the primal problem (P) and the dual
problem (D), respectively. Then the strong duality holds,
  Σ_{j=1}^n cj x∗j = Σ_{i=1}^m bi yi∗ ,
Note that the equality in the inequality (a) above is possible if and only if, for
every j = 1, . . . , n, either x∗j = 0 or cj = Σ_{i=1}^m aij yi∗ . Similarly, the equality in
the inequality (b) above is possible if and only if, for every i = 1, . . . , m, either
yi∗ = 0 or bi = Σ_{j=1}^n aij x∗j . Thus we obtain the following result.
and

  yi∗ = 0  or  Σ_{j=1}^n aij x∗j = bi   for all i = 1, . . . , m.        (12.14)
  • if Σ_{j=1}^n aij x∗j < bi then yi∗ = 0,   i = 1, . . . , m,

  • Σ_{i=1}^m aij yi∗ ≥ cj ,   j = 1, . . . , n, and

  • yi∗ ≥ 0,   i = 1, . . . , m.
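These conditions can be bundled into a small numerical check (an illustrative sketch, not from the book; the sample data are LP (11.42) with its optimal primal/dual pair):

```python
import numpy as np

def complementary_slackness_holds(A, b, c, x, y, tol=1e-9):
    """Check complementary slackness: x_j > 0 forces the j-th dual
    constraint to be tight, and y_i > 0 forces the i-th primal
    constraint to be tight."""
    A = np.asarray(A, dtype=float)
    primal_slack = np.asarray(b, float) - A @ x   # >= 0 for feasible x
    dual_slack = A.T @ y - np.asarray(c, float)   # >= 0 for feasible y
    return bool(np.all(np.abs(x * dual_slack) <= tol)
                and np.all(np.abs(y * primal_slack) <= tol))

A = [[1, 2, 2], [3, 0, 4], [2, 1, 4]]
b, c = [4, 6, 8], [4, 3, 5]
x = np.array([2.0, 1.0, 0.0])
y = np.array([3/2, 5/6, 0.0])
```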
We see that the 3rd , 4th , 5th , and 8th constraints are not binding, thus, due to
the complementary slackness, if x̃ is optimal for the primal problem, we must
have
ỹ3 = ỹ4 = ỹ5 = ỹ8 = 0
in an optimal solution ỹ of the dual problem.
Also observe that x̃1 , x̃3 , x̃4 , x̃6 , x̃7 are all positive, thus the 1st , 3rd , 4th ,
6th , and 7th constraints of the dual LP must be binding at an optimal point
ỹ. However, this is impossible since taking ỹ3 = ỹ4 = ỹ5 = ỹ8 = 0 in the
in an optimal solution y ∗ of the dual problem. As with x̃, since the same
components of x∗ are positive, we must have the 1st , 3rd , 4th , 6th , and 7th
constraints of the dual LP binding at an optimal point y ∗ . Taking into account
that y3∗ = y4∗ = y8∗ = 0, this gives the following system:
y1∗ +y2∗ = 1
y5∗ = 1
y5∗ +y6∗ = 1
y1∗ +y6∗ +y7∗ = 1
y1∗ +y2∗ +y7∗ = 1.
y ∗ = [1, 0, 0, 0, 1, 0, 0, 0]T .
It remains to check that y ∗ satisfies the 2nd and 5th constraints of the dual
problem, and that the dual objective function value at y ∗ is 31, which is the
same as the primal objective function value at x∗ . We conclude that x∗ is
optimal for the primal problem and y ∗ is optimal for the dual problem.
Finally, note that the problem we considered in this example is the schedul-
ing LP formulated in Section 10.2.3 (page 215).
The optimal solution of the primal LP (12.15) is x∗1 = 18, x∗2 = 10. The optimal
solution of the dual LP (12.16) is y1∗ = 1.1, y2∗ = 2.1. The optimal objective
value of both the primal and dual LP is 20,400.
We start the analysis of the economic meaning of the dual LP by deter-
mining the units of measure for the dual variables y1 and y2 . Consider the
first constraint of the dual LP:
The coefficient for both y1 and y2 is 250 pounds/barrel, and the right-hand
side is 800 dollars/barrel. Thus, for the constraint to make physical sense,
the units for both dual variables must be dollars/pound, meaning that y1 and
y2 express the cost of grapes.
Note that multiplying the first and the second constraint of the primal
LP (12.15) by y1∗ = 1.1 and y2∗ = 2.1, respectively, and then summing up the
left-hand sides of the resulting inequalities gives the primal objective function:
on the objective, in which the optimal value of the first dual variable y1∗ = 1.1
is the coefficient for the variable representing the extra Merlot grapes. The
extra profit added to the currently optimal profit of $20,400 will never exceed
1.1p1 , thus $1.1/pound is the maximum extra amount the winery should be
willing to pay for additional Merlot grapes.
Similarly, if the winery looks to purchase p2 more pounds of Cabernet
Sauvignon grapes, we obtain the upper bound
i.e.,
z ≤ 20, 400 + 2.1p2 ,
implying that the company should not pay more than $2.1/pound in addition
to what they already pay for Cabernet Sauvignon grapes. This quantity is
sometimes referred to as the shadow price.
Definition 12.1 The shadow price for the ith resource constraint of an
LP is defined as the amount by which the optimal objective function value
is improved if the right-hand side of this constraint is increased by 1.
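For an LP of the form max c^T x, Ax ≤ b, x ≥ 0 solved by the simplex method, the shadow prices of the resource constraints are the optimal dual values cB^T B^{-1} (assuming a nondegenerate optimal basis). A sketch for LP (12.20), whose optimal basis is x1, x2, x6 (assuming NumPy):

```python
import numpy as np

# data of LP (12.20)/(11.42)
A = np.array([[1., 2., 2.], [3., 0., 4.], [2., 1., 4.]])
c = np.array([4., 3., 5.])
b = np.array([4., 6., 8.])

A_full = np.hstack([A, np.eye(3)])           # append slack columns
c_full = np.concatenate([c, np.zeros(3)])
basis = [0, 1, 5]                            # x1, x2, x6 among columns of [A | I]

B = A_full[:, basis]
y = np.linalg.solve(B.T, c_full[basis])      # shadow prices c_B^T B^{-1}
```

Here y = [3/2, 5/6, 0]: an extra unit of the first resource is worth 3/2, of the second 5/6, and the third constraint is not binding, so its shadow price is 0.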
[Figure 12.1: graphical solution of the Heavenly Pouch LP, showing the feasible
region with corner points A, D, E, F, K, L, M, the budget constraint line, and
the isoprofit lines z = 3,000 and z = 6,000.]
(see also Sections 11.2 and 11.3). The Heavenly Pouch LP is given by:
and its graphical solution is shown in Figure 12.1. The optimal solution is
x∗ = [125, 300]T , z ∗ = 9,375. What happens to this solution if some of the
coefficients of the LP change? What changes to the problem parameters are
allowed if we require that the currently optimal basis must still remain optimal
after the changes take effect? These are some of the questions addressed using
sensitivity analysis.
First, we consider a situation when there is a change in one of the objective
function coefficients, say c1 , which corresponds to changing the profit obtained
from each non-reversible carrier. Currently, c1 = 15, and we consider changing
[Figure 12.2: graphical sensitivity analysis of the budget constraint change with
Δ = −100, showing the shifted budget constraint line and the isoprofit lines
z = 3,000 and z = 6,000.]
into E (see Figure 12.2). At this point, C is still the optimal solution, with
x∗1 = 100, x∗2 = 300, z ∗ = 9,000. When Δ = −500, both C and G converge
to B (see Figure 12.3), which becomes the optimal point with x∗1 = 0, x∗2 =
300, z ∗ = 7,500. If Δ keeps increasing in absolute value after that, the
optimum will relocate to point G, which is the point on the intersection of the
budget constraint line and x2 axis. Thus, Δ = −500 gives the highest decrease
in the budget constraint’s right-hand side that does not alter the optimality
of the basis.
In summary, based on the graphical sensitivity analysis we conclude that
−500 ≤ Δ ≤ 100 is the allowable range of change for the budget constraint’s
right-hand side.
Next, we will discuss how the sensitivity analysis can be performed in
algebraic form. Unlike the graphical approach, these techniques can be applied
to LPs with an arbitrary number of variables. Therefore, we consider a general
[Figure 12.3: graphical sensitivity analysis of the budget constraint change with
Δ = −500, showing the points A, E, F, H, L, M and the isoprofit line z = 3,000.]
z   xB                 xN                       rhs
1    0    −(cN^T − cB^T B^{-1} N)       cB^T B^{-1} b        (12.19)
0   Im            B^{-1} N                 B^{-1} b
More specifically, we will use the formulas in (12.19) to answer the following
important questions:
(1) Will the current basis still remain feasible after a given change in the
problem input data?
(2) Will the current basis still remain optimal after a given change in the
problem input data?
(3) How can we find a new optimal solution if the feasibility or optimality
of the optimal basis for the original LP is altered by the change in the
input data?
The first question can be addressed by checking whether the vector of the
right-hand sides of the constraints given by B −1 b in (12.19) has only nonneg-
ative entries. To answer the second question, we need to verify the signs of the
entries of nonbasic variable coefficients in the objective given by cTN −cTB B −1 N .
To address the third question, note that the types of changes we analyze can
be grouped into two categories, the first one being the changes that impact
optimality (determined based on cTN − cTB B −1 N ) of the current solution but
do not impact its feasibility (i.e., the sign of B −1 b entries), and the second
one representing the changes that may alter feasibility, but do not impact the
vector of nonbasic variable coefficients in row 0 of the optimal tableau. The
first group includes changing a coefficient in the objective function, introduc-
ing a new variable, changing the column of a nonbasic variable, and the second
group includes changing the right-hand side and introducing a new constraint.
Based on this classification, we will have two different recipes for dealing with
the third question.
We will illustrate the underlying ideas using the following LP, which was
solved using the revised simplex method in Section 11.6:
the coefficient aij for variable xj in the constraint i standing for the amount
of resource i used in the production of 1 unit of product j, and the right-hand
side bi in the constraint i representing the amount of resource i available;
i, j = 1, 2, 3. As before, let x4 , x5 , and x6 be the slack variables for (12.20).
The optimal basis for this problem consists of variables x1 , x2 , and x6 , and
the corresponding basic optimal solution is given by
We will refer to this solution and the corresponding basis as current. The
corresponding tableau is given by
z   xB                      xN                           rhs
1    0    −c̃N = −(cN^T − cB^T B^{-1} N)       cB^T B^{-1} b        (12.21)
0   Im              B^{-1} N                   x̃B = B^{-1} b
Assume that (12.21) represents the optimal tableau for LP (12.20), with
xTB = [x1 , x2 , x6 ],
x̃TB = [2, 1, 3], (12.22)
c̃TN = [−4/3, −3/2, −5/6],
z   xB                       xN                           rhs
1    0    −c̄N = −(cN^T − c′B^T B^{-1} N)       c′B^T B^{-1} b        (12.24)
0   Im               B^{-1} N                     B^{-1} b

where c′B denotes the modified vector of objective coefficients for the basic
variables.
z   xB                       xN                           rhs
1    0    −c̄N = −(c′N^T − cB^T B^{-1} N)       cB^T B^{-1} b        (12.26)
0   Im               B^{-1} N                     B^{-1} b
Hence, the objective function coefficients for nonbasic variables in the tableau
of the modified problem with the same basis as current are given by

  c̄N^T = c′N^T − cB^T B^{-1} N,
where c′N^T = cN^T + [Δ, 0, 0]. Then we have:
  c̄N^T = c′N^T − cB^T B^{-1} N
        = cN^T + [Δ, 0, 0] − cB^T B^{-1} N
        = cN^T − cB^T B^{-1} N + [Δ, 0, 0]
        = [−4/3, −3/2, −5/6] + [Δ, 0, 0]
        = [−4/3 + Δ, −3/2, −5/6].
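The current basis thus remains optimal as long as every entry of c̄N stays nonpositive, which bounds Δ (a small numerical check, not from the book):

```python
import numpy as np

c_tilde = np.array([-4/3, -3/2, -5/6])  # row-0 coefficients from (12.22)
direction = np.array([1.0, 0.0, 0.0])   # only the first nonbasic entry moves

# optimality requires c_tilde + Delta*direction <= 0 in every component
delta_max = min(-c_tilde[j] / direction[j]
                for j in range(len(c_tilde)) if direction[j] > 0)
print(delta_max)  # 4/3: the basis stays optimal for Delta <= 4/3
```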
z   xB                        xN                           rhs
1    0    −c̄N = −(c′N^T − cB^T B^{-1} N′)       cB^T B^{-1} b        (12.27)
0   Im               B^{-1} N′                     B^{-1} b
Recall that (12.21) represents the optimal tableau for the original LP (12.20).
Then the tableau with the same basis for problem (12.28) is given by:
z   xB                 xN                         rhs
1    0    −(cN^T − cB^T B^{-1} N)        cB^T B^{-1} b′        (12.29)
0   Im            B^{-1} N               x̄ = B^{-1} b′
The corresponding values for basic variables are given by x̄ = B^{-1} b′. Thus, the
basic solution corresponding to the tableau (12.29) will be feasible if and only
if B^{-1} b′ ≥ 0. Note that if this is the case, then this tableau is optimal (since
the coefficients for nonbasic variables are all nonpositive due to the optimality
of tableau (12.21) for problem (12.20)).
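This feasibility test is a single linear solve (a sketch assuming NumPy, with Δ added to the first right-hand side as in the example that follows):

```python
import numpy as np

B = np.array([[1., 2., 0.], [3., 0., 0.], [2., 1., 1.]])  # basis x1, x2, x6
b = np.array([4., 6., 8.])

def basis_still_feasible(delta):
    """Feasibility of the current basis after b1 -> b1 + delta:
    requires B^{-1} b' >= 0."""
    b_new = b + np.array([delta, 0., 0.])
    return bool(np.all(np.linalg.solve(B, b_new) >= -1e-9))
```

For this data the basis stays feasible for −2 ≤ Δ ≤ 6; at Δ = 8 the basic variable x6 drops to −1, as the tableau below shows.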
z   x1   x2   x3     x4     x5    x6   rhs   Basis
1    0    0   4/3    3/2    5/6    0    23     z
0    1    0   4/3     0     1/3    0     2     x1        (12.31)
0    0    1   1/3    1/2   −1/6    0     5     x2
0    0    0    1    −1/2   −1/2    1    −1     x6
The corresponding dual tableau is
−w   y1   y2   y3     y4     y5    y6   rhs   Basis
 1    0    0   −1      2      5     0   −23    −w
 0    1    0   1/2     0    −1/2    0   3/2    y1        (12.32)
 0    0    1   1/2   −1/3    1/6    0   5/6    y2
 0    0    0   −1    −4/3   −1/3    1   4/3    y6
We will apply the simplex method to find the optimal dual tableau. The
entering variable is y3 , the leaving variable is y2 , and the tableau after the
pivot is given by
−w   y1   y2   y3     y4     y5    y6    rhs    Basis
 1    0    2    0    4/3    16/3    0   −64/3    −w
 0    1   −1    0    1/3    −2/3    0    2/3     y1        (12.33)
 0    0    2    1   −2/3     1/3    0    5/3     y3
 0    0    2    0    −2       0     1     3      y6
This tableau is optimal with the solution y1∗ = 2/3; y2∗ = 0; y3∗ = 5/3. The
optimal solution for the primal problem (12.28) with Δ = 8 can be read from
row 0 of the optimal dual tableau:
z   xB                  xN                         rhs
1    0    −(c′N^T − cB^T B^{-1} N′)       cB^T B^{-1} b        (12.35)
0   Im            B^{-1} N′                  B^{-1} b
This means that the basis x1 , x2 , x7 remains optimal, and the optimal solution
is the same as for the original problem. Thus, x4 = 0, meaning that we should
not produce the new product.
In general, even if c̄4 was positive, the tableau with the same basis would
still be feasible, even though it would not be optimal anymore. We could apply
the simplex method starting with this feasible tableau to obtain a solution to
the modified problem.
z   xB                    xN                           rhs
1    0    −(cN^T − c′B^T B′^{-1} N′)       c′B^T B′^{-1} b′        (12.37)
0   Im            B′^{-1} N′                  B′^{-1} b′
From (12.37) it appears that introducing the new constraint causes changes to
both the vector of nonbasic variable coefficients and the vector of right-hand
sides. However, next we show that, in fact, cN^T − c′B^T B′^{-1} N′ = cN^T − cB^T B^{-1} N, so
only the feasibility can be impacted by adding a new constraint. To compute
c′B^T B′^{-1} N′, we first compute u^T = c′B^T B′^{-1} by solving the system B′^T u = c′B and
then computing u^T N′.
           ⎡ 1 3 2 1 ⎤ ⎡ u1 ⎤   ⎡ 4 ⎤        ⎡ u1 ⎤   ⎡ 3/2 ⎤
  B′^T u = ⎢ 2 0 1 2 ⎥ ⎢ u2 ⎥ = ⎢ 3 ⎥   ⇔   ⎢ u2 ⎥ = ⎢ 5/6 ⎥ ,
           ⎢ 0 0 1 0 ⎥ ⎢ u3 ⎥   ⎢ 0 ⎥        ⎢ u3 ⎥   ⎢  0  ⎥
           ⎣ 0 0 0 1 ⎦ ⎣ u4 ⎦   ⎣ 0 ⎦        ⎣ u4 ⎦   ⎣  0  ⎦

and
                               ⎡ 2 1 0 ⎤
  u^T N′ = [3/2, 5/6, 0, 0] ⎢ 4 0 1 ⎥ = [19/3, 3/2, 5/6].
                               ⎢ 4 0 0 ⎥
                               ⎣ 4 0 0 ⎦
Recall that cN^T = [5, 0, 0], so

  cN^T − c′B^T B′^{-1} N′ = [5, 0, 0] − [19/3, 3/2, 5/6] = [−4/3, −3/2, −5/6],

which is the same as the vector of coefficients c̃N = cN^T − cB^T B^{-1} N for nonbasic
variables in the optimal tableau of the original problem. Note that this will
always be the case if we add a new constraint, i.e., the vector of the coefficients
for nonbasic variables in the optimal tableau of the original problem and the
vector of the coefficients for nonbasic variables in the corresponding tableau
of the modified problem will be the same:

  cN^T − c′B^T B′^{-1} N′ = cN^T − cB^T B^{-1} N.
Hence, (12.37) is equivalent to
z   xB                 xN                            rhs
1    0    −(cN^T − cB^T B^{-1} N)        cB^T B^{-1} b        (12.38)
0   Im            B′^{-1} N′             x̄ = B′^{-1} b′
and since row 0 is not affected by the change, the corresponding dual tableau
is feasible and we can use it to solve the dual of the modified problem to
optimality.
12.7.7 Summary
While performing the sensitivity analysis above, we saw that with the
following modifications:
the optimal basis still remains feasible. Then, if the basis is not optimal, we
can apply the simplex method starting with this basis and find a new optimal
solution.
With the following modifications:
the dual optimal basis still remains feasible. Then, if the dual tableau is not
optimal, we can apply the simplex method starting with this basis to solve the
dual problem. We can then extract the optimal solution of the primal problem
from the optimal dual tableau.
Exercises
12.1. Write the dual for each of the following LPs:
(a) maximize x1 + 2x2 − 3x3
subject to 4x1 + 5x2 + x3 ≤ 6
7x1 + x2 ≤ 8
9x2 + x3 = 10
x1 ≥ 0, x2 , x3 ∈ IR
(b) minimize 8x1 + 4x2 − 2x3
subject to x1 + 3x2 ≤ 5
6x1 − 7x2 + 8x3 = 7
x1 ∈ IR, x2 , x3 ≥ 0
(c) maximize 9x1 − 7x2 + 5x3
subject to 3x1 + x2 − x3 ≤ 2
4x1 − 6x2 + 3x3 ≥ 8
x1 + 2x2 + x3 = 11
x1 ∈ IR, x2 , x3 ≥ 0.
12.2. Write down the dual of the LP formulated for the diet problem in Ex-
ample 10.2 (page 213). Provide an economic interpretation of the dual
LP.
12.3. Prove or disprove each of the following statements concerning a primal
LP,
maximize z = cT x subject to Ax ≤ b, x ≥ 0
and the corresponding dual LP,
minimize w = bT y subject to AT y ≥ c, y ≥ 0.
z x1 x2 x3 s1 s2 rhs
1 8/5 0 ? 7/5 3/5 ?
z x1 x2 x3 s1 s2 rhs
1 ? 1 0 2 ? 14
where
maximize x1 + 2x2
subject to x1 + 5x2 ≤ 5
3x1 + 2x2 ≤ 6
5x1 + x2 ≤ 5
x 1 , x2 ≥ 0.
z x1 x2 x3 s1 s2 rhs
1 0 0 33/7 2/7 10/7 8
0 0 1 5/7 2/7 3/7 3
0 1 0 1/7 −1/7 2/7 1
(a) Write the dual to this LP and use the above tableau to find the
dual optimal solution.
(b) Find the range of values of the objective function coefficient of x2
for which the current basis remains optimal.
(c) Find the range of values of the objective function coefficients of x3
for which the current basis remains optimal.
(d) Find the range of values of the right-hand side of the first constraint
for which the current basis remains optimal.
(e) What will be the optimal solution to this LP if the right-hand side
of the first constraint is changed from 3 to 8?
12.10. Use the LP and its optimal tableau given in Exercise 12.9 to solve the
following LPs:
(a) maximize 2x1 + 2x2 − 2x3
subject to −3x1 + 2x2 + 2x3 ≤ 3
2x1 + x2 + 2x3 ≤ 5
x 1 , x2 , x3 ≥ 0
(b) maximize 5x1 + 2x2 − 3x3
subject to −3x1 + 2x2 + x3 ≤ 3
2x1 + x2 + x3 ≤ 5
x 1 , x2 , x3 ≥ 0
(c) maximize 2x1 + 2x2 − 3x3 + 3x4
subject to −3x1 + 2x2 + x3 + 3x4 ≤ 3
2x1 + x2 + x3 + 2x4 ≤ 5
x 1 , x2 , x3 , x4 ≥ 0.
12.11. Use the LP and its optimal tableau given in Exercise 12.9 to solve the
following LPs:
(a) maximize 2x1 + 2x2 − 3x3
subject to −3x1 + 2x2 + x3 ≤ 3
2x1 + x2 + x3 ≤ 1
x 1 , x2 , x3 ≥ 0,
(b) maximize 2x1 + 2x2 − 3x3
subject to −3x1 + 2x2 + x3 ≤ 3
2x1 + x2 + x3 ≤ 5
−x1 + 2x2 + x3 ≤ 4
x 1 , x2 , x3 ≥ 0.
z x1 x2 x3 s1 s2 rhs
1 11 0 0 9/4 15/8 6
0 2 1 0 1/2 1/4 1
0 2 0 1 1/4 3/8 1
(a) Write the dual to this LP and use the above tableau to find the
dual optimal solution.
(b) Find the range of values of the objective function coefficient of x1
for which the current basis remains optimal.
(c) Find the range of values of the objective function coefficients of x2
for which the current basis remains optimal.
(d) Find the range of values of the right-hand side of the second con-
straint for which the current basis remains optimal.
(e) Extract B −1 from the optimal tableau.
Product
Resource 1 2 3
I (units) 3 2 4
II (units) 4 2 3
Price ($) 10 7 12
The optimal solution to this LP is x∗1 = 0, x∗2 = 20, x∗3 = 20, r∗ = 120
with the optimal objective function value z ∗ = 260. Sensitivity analysis
yielded results summarized in the tables below.
(a) Write down the dual of this LP. What is its optimal solution?
(b) What is the most the company should be willing to pay for an
additional unit of resource I?
(c) What is the most the company should be willing to pay for another
unit of resource II?
(d) What selling price of product 1 would make it reasonable for the
company to manufacture it?
(e) What would the company’s profit be if 110 units of resource I was
available?
(f) What would the company’s profit be if 130 units of resource II
could be purchased?
(g) Find the new optimal solution if product 3 sold for $13.
Chapter 13
Unconstrained Optimization
minimize f (x),
minimize f (x)
(13.1)
subject to x ∈ X,
∇f(x∗)ᵀd ≥ 0.
Indeed, assume that ∇f(x∗)ᵀd < 0. Select ε > 0 such that f(x∗) ≤ f(x) for any x ∈ B(x∗, ε) and such that the error term o(α)/α in (13.3) is less than |∇f(x∗)ᵀd| (such an ε exists since x∗ is a local minimizer and by the definition of o(α)). This would result in f(x) − f(x∗) < 0 for some x ∈ B(x∗, ε), a contradiction with the assumption that f(x∗) ≤ f(x) for any x ∈ B(x∗, ε).
As a corollary of Theorem 13.1, we get the following result.
The following example shows that the FONC is not sufficient for a local
minimizer.
f′(x) = 3x² = 0 ⇔ x = 0.
But, obviously, x = 0 is not a local minimizer of f(x) = x³. In fact, for any given
point x and any small ε > 0, there always exist x̄, x̲ ∈ B(x, ε) such that
f(x̲) < f(x) < f(x̄), so f(x) does not have any local or global minimum or
maximum.
Next, we prove that a point satisfying the FONC is, in fact, a global minimizer if a problem is convex. The proof is based on the first-order characterization of a convex function. Consider a convex problem min_{x∈X} f(x). For
any x, y ∈ X, a differentiable convex function satisfies
f(y) ≥ f(x) + ∇f(x)ᵀ(y − x).
In particular, if x∗ ∈ X satisfies ∇f(x∗) = 0, then for any y ∈ X
f(y) ≥ f(x∗),
so x∗ is a global minimizer.
Proof. We assume that x∗ satisfies the FONC and SONC. Then, for any
d ∈ IRⁿ with ‖d‖ = 1 and any α > 0 we have

(f(x∗ + αd) − f(x∗))/α² = (1/2) dᵀ∇²f(x∗)d + o(α²)/α².
Example 13.3 Consider the function f(x) = x1³ − x2³ + 3x1x2. We apply the
optimality conditions above to find its local optima. The FONC system for this
problem is given by

∇f(x) = [3x1² + 3x2, −3x2² + 3x1]ᵀ = [0, 0]ᵀ.

From the second equation of this system we have x1 = x2². Substituting for x1
in the first equation gives x2⁴ + x2 = 0, which yields x2 = 0 or x2 = −1. The
corresponding stationary points are x̂ = [0, 0]ᵀ and x̃ = [1, −1]ᵀ, respectively.
The Hessian of f(x) is

∇²f(x) = [ 6x1  3 ; 3  −6x2 ]  ⇒  ∇²f(x̂) = [ 0  3 ; 3  0 ];  ∇²f(x̃) = [ 6  3 ; 3  6 ].
Since the determinant of ∇2 f (x̂) equals −9, the Hessian is indefinite at x̂,
and from the SONC, x̂ cannot be a local optimum. On the other hand, ∇2 f (x̃)
is positive definite, hence x̃ is a strict local minimum by the SOSC. Note that
if we fix x2 = 0, the function f is then given by x31 , which is unbounded from
below or above. Thus, f has no global minima and no local or global maxima.
Its only local minimum is x̃ = [1, −1]T .
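The classification in Example 13.3 can be checked numerically. A minimal sketch (the gradient and Hessian are hard-coded from the example; the sign pattern of the Hessian eigenvalues decides the label):

```python
import numpy as np

# Gradient and Hessian of f(x) = x1^3 - x2^3 + 3*x1*x2 from Example 13.3
def grad(x):
    return np.array([3 * x[0] ** 2 + 3 * x[1], -3 * x[1] ** 2 + 3 * x[0]])

def hessian(x):
    return np.array([[6 * x[0], 3.0], [3.0, -6 * x[1]]])

def classify(x):
    """Label a stationary point by the eigenvalues of the Hessian."""
    assert np.allclose(grad(x), 0), "not a stationary point"
    eig = np.linalg.eigvalsh(hessian(x))
    if np.all(eig > 0):
        return "strict local minimizer (SOSC holds)"
    if np.all(eig < 0):
        return "strict local maximizer"
    return "saddle point (indefinite Hessian, SONC fails)"

label_hat = classify(np.array([0.0, 0.0]))     # eigenvalues -3, 3: indefinite
label_tilde = classify(np.array([1.0, -1.0]))  # eigenvalues 3, 9: positive definite
print(label_hat, "|", label_tilde)
```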
a1 − a0 = b0 − b1 = ρ(b0 − a0 ),
where ρ < 1/2. If f (a1 ) < f (b1 ), then the minimizer is in the interval [a0 , b1 ].
Otherwise, if f (a1 ) > f (b1 ), then the minimizer is in [a1 , b0 ] (see Figure 13.1).
Thus, the range of uncertainty will be reduced by the factor of (1 − ρ), and
we can continue the search using the same method over a smaller interval.
In the golden section search method, we want to reduce the number of
function evaluations by using previously computed intermediate points. Con-
sider the example shown in Figure 13.1. In this example, f (a1 ) < f (b1 ), so
the range of uncertainty reduces to the interval [a0 , b1 ]. To continue the pro-
cess, we need to choose two points in [a0 , b1 ] and evaluate f in these points.
However, we know that a1 ∈ [a0 , b1 ]. So, a1 can be chosen as one of the two
[Figure 13.1: after one reduction, the points a0 < a2 < x∗ < a1 = b2 < b1 < b0.]
points: b2 = a1 , and it suffices to find only one new point a2 , which would be
as far from a0 , as b2 is from b1 . The advantage of such a choice of intermediate
points is that in the next iteration we would have to evaluate f only in one
new point, a2 . Now we need to compute the value of ρ which would result in
having a1 chosen as one of the intermediate points. Figure 13.2 illustrates this
situation: If we assume that the length of [a0 , b0 ] is l, and the length of [a0 , b1 ]
is d, then we have
d = (1 − ρ)l,
so
l = d/(1 − ρ).
On the other hand, if we consider the interval [a0, a1] = [a0, b2], then its length
can be expressed in two different ways:
ρl = (1 − ρ)d.
Substituting l = d/(1 − ρ) gives ρ = (1 − ρ)², that is,
ρ² − 3ρ + 1 = 0.
[Figure 13.2: the interval [a0, b0] of length l = b0 − a0; the subinterval [a0, b1] has length d = (1 − ρ)l, and b2 = a1.]
Taking into account that we are looking for ρ < 1/2, we obtain the solution

ρ = (3 − √5)/2 ≈ 0.382.

Note that the uncertainty interval is reduced by the factor of 1 − ρ ≈ 0.618 at
each step. So, in N steps the reduction factor would be

(1 − ρ)^N ≈ 0.618^N.
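The resulting method is short to implement. A minimal sketch of golden section search for a unimodal function on [a, b]; note how one of the two interior points is reused, so only one new function evaluation is needed per iteration:

```python
def golden_section_search(f, a, b, tol=1e-8):
    """Minimize a unimodal f on [a, b]; uncertainty shrinks by ~0.618 per step."""
    rho = (3 - 5 ** 0.5) / 2           # ~0.382, so 1 - rho ~ 0.618
    x1 = a + rho * (b - a)
    x2 = b - rho * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:                    # minimizer lies in [a, x2]
            b, x2, f2 = x2, x1, f1     # reuse x1 as the new right interior point
            x1 = a + rho * (b - a)
            f1 = f(x1)
        else:                          # minimizer lies in [x1, b]
            a, x1, f1 = x1, x2, f2     # reuse x2 as the new left interior point
            x2 = b - rho * (b - a)
            f2 = f(x2)
    return (a + b) / 2

result = golden_section_search(lambda x: (x - 2) ** 2, 0, 5)
print(result)  # close to 2.0
```

The reuse is exact precisely because ρ = (1 − ρ)², which is the equation ρ² − 3ρ + 1 = 0 derived above.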
Proof. Note that using (13.10) we can recursively express all variables ρk, k =
1, . . . , N in the objective function of (13.9) through one of the variables, say
ρN. If we denote the resulting univariate function by fN(ρN), then

fN(ρN) = (1 − ρN)/(FN − FN−2 ρN), N ≥ 2. (13.12)

Indeed, the recursion gives

fK(ρK) = (1 − ρK)/(2FK−1 − FK−3 − (FK−1 − FK−3)ρK) = (1 − ρK)/(FK − FK−2 ρK),

since FK = 2FK−1 − FK−3 and FK−2 = FK−1 − FK−3. Differentiating (13.12),

f′N(ρN) = (−FN + FN−2)/(FN − FN−2 ρN)² = −FN−1/(FN − FN−2 ρN)² < 0, ∀ρN ≤ 1/2.

Therefore,

min_{ρN∈[0,1/2]} fN(ρN) = fN(1/2) = (1 − 1/2)/(FN − FN−2/2) = 1/FN+1,

which means that the reduction factor after N steps of the Fibonacci search
is

1/FN+1.
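A sketch of Fibonacci search follows. To sidestep the coincident evaluation points of the degenerate final step (where ρN = 1/2, discussed in Exercise 13.10), this version stops one step early, so its reduction factor is 2/FN+1 rather than 1/FN+1:

```python
def fibonacci_search(f, a, b, N=20):
    """Fibonacci search for a unimodal f on [a, b].

    Stops one step before the degenerate last step (rho_N = 1/2 would make
    the two evaluation points coincide); the final uncertainty range is
    (b - a) * 2 / F_{N+1}.
    """
    F = [1, 1]
    while len(F) < N + 2:
        F.append(F[-1] + F[-2])
    # initial interior points at Fibonacci fractions of [a, b]
    x1 = a + (1 - F[N] / F[N + 1]) * (b - a)
    x2 = a + (F[N] / F[N + 1]) * (b - a)
    f1, f2 = f(x1), f(x2)
    for _ in range(N - 1):
        if f1 < f2:                 # minimizer is in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = a + b - x2         # reflect the kept point
            f1 = f(x1)
        else:                       # minimizer is in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + b - x1
            f2 = f(x2)
    return (a + b) / 2

fib_result = fibonacci_search(lambda x: (x - 2) ** 2, 0, 5, N=20)
print(fib_result)  # close to 2
```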
The direction d(k) in a line search iteration is typically selected based on the
gradient ∇f (x(k) ), leading to gradient methods, such as the method of steepest
descent discussed in Section 13.4.
Let d(k) be a solution of this problem. If the decrease in the value of f going
from x(k) to x(k+1) = x(k) + d(k) is not sufficient, we conclude that the ap-
proximation is not good enough, perhaps because the region T is too large.
Therefore, we shrink the trust region and resolve the problem. We stop if
we are not able to get a meaningful decrease in the objective after a certain
number of attempts.
In the remainder of this chapter we discuss line search methods for unconstrained optimization.
∇f(x(0))ᵀd ≥ −‖∇f(x(0))‖ ‖d‖,
where equality is possible if and only if d = −α∇f(x(0)) with α ≥ 0. So, the
negative gradient −∇f(x(0)) gives the direction of steepest descent.
Proof. Consider φk(α) = f(x(k) − α∇k). Since in the steepest descent method
αk minimizes φk(α) for α > 0, by the FONC
φ′k(αk) = 0.
On the other hand, using the chain rule:

φ′k(αk) = df(x(k) − α∇k)/dα |_{α=αk} = −∇k+1ᵀ∇k,

so ∇k+1ᵀ∇k = 0.
In the proof, we have shown that if we apply the method of steepest de-
scent, then
∇Tk ∇k+1 = 0, k ≥ 0.
Thus, the gradients of f in two consecutive points generated by the steepest
descent are orthogonal to each other. Since the negative of the gradient rep-
resents the direction we move along at each iteration of the steepest descent,
this means that the directions in the two consecutive steps are orthogonal as
well. Indeed,
∇k = ∇f (x(k) ) = Qx(k) + c,
x(k+1) = x(k) − αk ∇k ,
with
αk = arg min_{α≥0} φk(α),
where
φk (α) = f (x(k) − α∇k ); ∇k = Qx(k) + c.
We have

φk(α) = (1/2)(x(k) − α∇k)ᵀQ(x(k) − α∇k) + cᵀ(x(k) − α∇k)
      = (1/2)α²∇kᵀQ∇k − α∇kᵀ∇k + f(x(k)).

Minimizing this quadratic in α gives

αk = ∇kᵀ∇k / (∇kᵀQ∇k).

Therefore, an iteration of the steepest descent method for the convex quadratic
function f(x) = (1/2)xᵀQx + cᵀx is

x(k+1) = x(k) − (∇kᵀ∇k / (∇kᵀQ∇k)) ∇k, k ≥ 0.
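This iteration can be sketched directly; a minimal version for a convex quadratic (the matrix Q, vector c, and starting point below are illustrative):

```python
import numpy as np

def steepest_descent_quadratic(Q, c, x0, tol=1e-10, max_iter=10_000):
    """Minimize f(x) = 0.5 x^T Q x + c^T x (Q symmetric positive definite)
    with the exact line-search step alpha_k = (g^T g) / (g^T Q g)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = Q @ x + c                       # gradient at the current point
        if np.linalg.norm(g) < tol:
            break
        alpha = (g @ g) / (g @ (Q @ g))     # exact minimizer of phi_k(alpha)
        x = x - alpha * g
    return x

Q = np.array([[2.0, 0.0], [0.0, 10.0]])
c = np.array([-2.0, -10.0])
x_sd = steepest_descent_quadratic(Q, c, [0.0, 0.0])
print(x_sd)  # close to [1., 1.], the solution of Qx = -c
```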
Example 13.4 For f(x) = ∑_{i=1}^{n} xi² = xᵀx = ‖x‖², we have Q = 2In, where
In is the n × n identity matrix, c = 0, and for any x(0) ∈ IRⁿ,
1
q(x) = (x − x∗ )T Q(x − x∗ ) (13.13)
2
in our analysis. It is easy to check that q(x) = f (x) + 12 x∗ T Qx∗ , so the two
functions differ only by a constant. We denote by ∇k the gradient of q(x) at
point x(k) :
∇k = ∇q(x(k) ) = Q(x(k) − x∗ ). (13.14)
The steepest descent iteration for this function is

x(k+1) = x(k) − αk∇k, (13.15)

where

αk = ∇kᵀ∇k / (∇kᵀQ∇k). (13.16)
Next, we prove that
q(x(k+1)) ≤ q(x(k)) (1 − λmin(Q)/λmax(Q)),
where λmin (Q) and λmax (Q) are the smallest and the largest eigenvalues of
Q, respectively. To show this, we first observe that
q(x(k+1)) = (1/2)(x(k) − αk∇k − x∗)ᵀQ(x(k) − αk∇k − x∗)
= q(x(k)) − αk∇kᵀQ(x(k) − x∗) + (1/2)αk²∇kᵀQ∇k
= q(x(k)) − αk∇kᵀ∇k + (1/2)αk²∇kᵀQ∇k
= q(x(k)) − (1/2)(∇kᵀ∇k)²/(∇kᵀQ∇k)
= q(x(k)) [1 − ‖∇k‖⁴ / ((∇kᵀQ∇k)(∇kᵀQ⁻¹∇k))],

where the third equality uses ∇k = Q(x(k) − x∗), the fourth substitutes αk from
(13.16), and the last uses q(x(k)) = (1/2)(x(k) − x∗)ᵀQ(x(k) − x∗) = (1/2)∇kᵀQ⁻¹∇k.
Since Q is positive definite,
∇kᵀQ∇k ≤ λmax(Q)‖∇k‖²
and
∇kᵀQ⁻¹∇k ≤ λmax(Q⁻¹)‖∇k‖² = (λmin(Q))⁻¹‖∇k‖².
Therefore,
q(x(k+1)) = q(x(k)) [1 − ‖∇k‖⁴ / ((∇kᵀQ∇k)(∇kᵀQ⁻¹∇k))] ≤ q(x(k)) (1 − λmin(Q)/λmax(Q)).
In summary, we have
q(x(k+1)) ≤ q(x(0)) (1 − λmin(Q)/λmax(Q))^{k+1}, (13.17)
and since 0 < λmin (Q) ≤ λmax (Q),
q(x(k) ) → 0, k → ∞.
Note that q(x) = 0 ⇐⇒ x = x∗ , so x(k) → x∗ , k → ∞. Thus, the steepest
descent method is globally convergent for a convex quadratic function. Note
that the rate of convergence is linear.
From the above inequality (13.17), we also see that if λmin (Q) = λmax (Q),
then we will have convergence in one step (as for f (x) = x21 + x22 in Exam-
ple 13.4). On the other hand, if λmax (Q) is much larger than λmin (Q), then
1 − λmin(Q)/λmax(Q) ≈ 1 and the convergence may be extremely slow in this case.
Recall that the ratio κ(Q) = λmax(Q)/λmin(Q) = ‖Q‖ ‖Q⁻¹‖ is called the condition
number of matrix Q. When a matrix is poorly conditioned (i.e., it has a large
condition number), we have “long, narrow” level sets, and the steepest descent
may move back and forth (“zigzag”) in search of the minimizer.
f(x) = (1/2) xᵀQx + cᵀx
with positive definite Q, we obtain
Then,
So, if we start close enough from the stationary point x∗ , then Newton’s
method converges to x∗ with a quadratic rate of convergence under the as-
sumptions specified above.
The inequality above holds since (∇²k)⁻¹ is positive definite and ∇k ≠ 0. Thus,
there exists ᾱ such that φ′k(α) < 0 for α ≤ ᾱ, i.e., φk(α) is decreasing on [0, ᾱ],
and for all α ∈ (0, ᾱ): f(x(k) + αd(k)) < f(x(k)).
Therefore, we can modify Newton’s method to enforce the descent property
by introducing a step size as follows:
λi + μk , i = 1, . . . , n.
Hence, if we choose μk > |λmin (∇2k )|, where λmin (∇2k ) is the minimum eigen-
value of ∇2k , then all eigenvalues of Mk are positive, so Mk is a positive definite
matrix.
To make sure that the descent property holds, we can use the direction
−Mk⁻¹∇k instead of the direction −(∇²k)⁻¹∇k used in Newton's method. Including the step size, we obtain the following iteration:

x(k+1) = x(k) − αkMk⁻¹∇k.
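A sketch of one such modified Newton direction; the shift μk is chosen from the smallest Hessian eigenvalue, with a margin delta (an illustrative choice, not from the text) so that Mk is safely positive definite:

```python
import numpy as np

def modified_newton_direction(g, H, delta=0.1):
    """Return d = -M^{-1} g with M = H + mu*I, where mu exceeds |lambda_min(H)|
    whenever H is not (sufficiently) positive definite.
    delta is an illustrative safety margin."""
    lam_min = np.linalg.eigvalsh(H)[0]          # smallest eigenvalue of H
    mu = 0.0 if lam_min > delta else abs(lam_min) + delta
    M = H + mu * np.eye(H.shape[0])             # all eigenvalues shifted positive
    return -np.linalg.solve(M, g)

# An indefinite Hessian, as may occur far from a minimizer:
H = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
d = modified_newton_direction(g, H)
is_descent = g @ d < 0
print(is_descent)  # True: d is a descent direction
```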
c0 (d(i) )T Qd(0) +c1 (d(i) )T Qd(1) +. . .+ci (d(i) )T Qd(i) +. . .+ck (d(i) )T Qd(k) = 0.
But the directions d(0), d(1), . . . , d(k) are Q-conjugate, so (d(i))ᵀQd(j) = 0 for
all j ≠ i, and hence we have
ci(d(i))ᵀQd(i) = 0.
This means that ci = 0, since Q is positive definite and d(i) ≠ 0. Note that
the index i was chosen arbitrarily, hence ci = 0 for all i = 0, 1, . . . , k.
The linear independence of the conjugate directions implies that one can
choose at most n Q-conjugate directions in IRn .
where
f(x) = (1/2) xᵀQx + cᵀx
and Q is a positive definite matrix. Consider an iteration

x(k+1) = x(k) + αk d(k),

where
αk = arg min_{α∈IR} f(x(k) + αd(k)).
In the conjugate direction method the directions d(k) are chosen to be Q-conjugate, and the exact line search step is

αk = −∇kᵀd(k) / ((d(k))ᵀQd(k)),
∇Tk+1 d(i) = 0, i = 0, . . . , k.
Proof. We will use induction for the proof. We first show that for any k:
∇k+1ᵀd(k) = 0. We know that αk satisfies the FONC for φk(α) = f(x(k) + αd(k)), hence
φ′k(αk) = 0.
On the other hand, using the chain rule
φ′k(αk) = ∇f(x(k) + αkd(k))ᵀd(k) = ∇k+1ᵀd(k),
so ∇k+1ᵀd(k) = 0.
Now, assume that the statement is correct for all k = 1, . . . , K for some
K, i.e.,
∇TK d(i) = 0, i = 0, . . . , K − 1.
We need to show that it is correct for k = K +1. For i = 0, . . . , K −1, consider
∇K+1ᵀd(i) = (Qx(K+1) + c)ᵀd(i)
= (Q(x(K) + αKd(K)) + c)ᵀd(i)
= (Qx(K) + c + αKQd(K))ᵀd(i)
= (∇K + αKQd(K))ᵀd(i)
= ∇Kᵀd(i) + αK(d(K))ᵀQd(i)
= 0.
We have already shown that the statement is correct for i = K. Thus, by
induction, the statement is correct for any k and any i = 0, . . . , k.
where
Φk(a(k)) = f(x(0) + Dk a(k)), Dk = [d(0) d(1) · · · d(k)].
Then we have
α(k) = arg min_{a(k)∈IR^{k+1}} Φk(a(k)).
Proof. We have

Φk(a(k)) = (1/2)(x(0) + Dka(k))ᵀQ(x(0) + Dka(k)) + cᵀ(x(0) + Dka(k))
= (1/2)(a(k))ᵀDkᵀQDka(k) + ((x(0))ᵀQDk + cᵀDk)a(k) + f(x(0)).
Since Q is a positive definite matrix and Dk is a matrix of full rank (due to
linear independence of Q-conjugate directions), Dk T QDk is positive definite,
so Φk (a(k) ) is a convex quadratic function. The gradient of Φk (a(k) ) is
∇Φk (a(k) )T = ∇f (x(0) + Dk a(k) )T Dk .
Then for a(k) = α(k) we have
∇Φk (α(k) )T = ∇f (x(0) + Dk α(k) )T Dk
= ∇Tk+1 Dk
= ∇Tk+1 [d(0) d(1) · · · d(k) ]
= [∇Tk+1 d(0) ∇Tk+1 d(1) · · · ∇Tk+1 d(k) ]
= 0T .
Since Φk (a(k) ) is a convex quadratic function, a(k) = α(k) is the global mini-
mizer of Φk (a(k) ).
The last property has an important implication concerning the convergence
of the method. We state it in the following theorem.
Thus,

f(x(n)) = f(x(0) + Dn−1α(n−1)) = min_{a(n−1)∈IRⁿ} f(x(0) + Dn−1a(n−1)) = min_{x∈IRⁿ} f(x).
Figure 13.4 illustrates how the steepest descent method, conjugate gradient
method, and Newton’s method compare for a convex quadratic function.
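For a convex quadratic, the conjugate gradient method can be sketched as follows; βk is computed in the form ∇k+1ᵀQd(k)/((d(k))ᵀQd(k)), which is exact for quadratics, and the test problem is illustrative:

```python
import numpy as np

def conjugate_gradient(Q, c, x0, tol=1e-10):
    """Minimize f(x) = 0.5 x^T Q x + c^T x by conjugate gradients;
    terminates in at most n steps in exact arithmetic."""
    x = np.asarray(x0, dtype=float)
    g = Q @ x + c
    d = -g                                  # first direction: steepest descent
    for _ in range(len(x)):
        if np.linalg.norm(g) < tol:
            break
        Qd = Q @ d
        alpha = -(g @ d) / (d @ Qd)         # exact step along d
        x = x + alpha * d
        g_new = Q @ x + c
        beta = (g_new @ Qd) / (d @ Qd)      # keeps d's Q-conjugate
        d = -g_new + beta * d
        g = g_new
    return x

Q = np.array([[4.0, 1.0], [1.0, 3.0]])
c = np.array([-1.0, -2.0])
x_cg = conjugate_gradient(Q, c, np.zeros(2))
print(x_cg)  # solves Qx = -c in at most 2 steps
```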
[Figure 13.4: trajectories from x(0) to x∗ for (a) the steepest descent method, (b) the conjugate gradient method, and (c) Newton's method.]
f(x) ≈ f(x(k)) + ∇f(x(k))ᵀ(x − x(k)) + (1/2)(x − x(k))ᵀ∇²f(x(k))(x − x(k)).
We could replace f (x) with this quadratic approximation and apply the conju-
gate gradient algorithm to find the minimizer of the quadratic function, which
we would use as x(k+1) . However, computing and evaluating the Hessian at
each iteration is a computationally expensive procedure that we would like to
avoid. In the conjugate gradient algorithm above, two operations involve the
matrix Q, which would correspond to the Hessian. These operations are com-
puting αk and βk . While αk could be approximated using line search, we need
to find a way to approximate βk . Next we discuss several formulas designed
for this purpose.
The Hestenes–Stiefel formula. We use the fact that for a quadratic function

∇k − ∇k−1 = Q(x(k) − x(k−1)) = αk−1Qd(k−1),

so

Qd(k−1) = (∇k − ∇k−1)/αk−1.
Therefore,
(d(k−1) )T ∇k = 0
and
(d(k−1))ᵀ∇k−1 = (−∇k−1 + βk−1d(k−2))ᵀ∇k−1
= −∇k−1ᵀ∇k−1 + βk−1(d(k−2))ᵀ∇k−1
= −∇k−1ᵀ∇k−1.
In the above derivations, each formula was obtained from the previous one
using equalities that are exact for quadratic functions, but only approximate
in general. Therefore, in the process of simplifying the formula for βk , the
quality of approximation of the original problem gradually decreases.
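The three resulting β formulas can be written side by side; a sketch (g and g_new stand for ∇k−1 and ∇k, d for d(k−1); the function names are illustrative):

```python
import numpy as np

def beta_hs(g_new, g, d):
    """Hestenes-Stiefel: grad_k^T(grad_k - grad_{k-1}) / (d^{(k-1)T}(grad_k - grad_{k-1}))."""
    y = g_new - g
    return (g_new @ y) / (d @ y)

def beta_pr(g_new, g):
    """Polak-Ribiere: replaces the denominator using (d^{k-1})^T grad_{k-1} = -||grad_{k-1}||^2."""
    return (g_new @ (g_new - g)) / (g @ g)

def beta_fr(g_new, g):
    """Fletcher-Reeves: additionally drops the grad_k^T grad_{k-1} term."""
    return (g_new @ g_new) / (g @ g)

# With exactly orthogonal consecutive gradients all three agree:
g, g_new, d = np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([-1.0, 0.0])
betas = (beta_hs(g_new, g, d), beta_pr(g_new, g), beta_fr(g_new, g))
print(betas)  # (4.0, 4.0, 4.0)
```

The equalities used to simplify each formula hold exactly only for quadratics, which is why the variants can behave differently on general functions.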
x(k+1) = x(k) − αk Hk ∇k , k ≥ 0.
p(i) = Hk g (i) , i = 0, . . . , k − 1,
Hn g (i) = p(i) , i = 0, . . . , n − 1.
Hn = PnGn⁻¹.
On the other hand,
Q⁻¹Gn = Pn
and
Q⁻¹ = PnGn⁻¹.
So, for a quadratic function, Hn = Q−1 . This means that after n + 1 steps
of a quasi-Newton method we get the same answer as we get after one step
of Newton’s method, which is the global minimizer of the convex quadratic
function. Next, we show that the global minimizer of the convex quadratic
function is, in fact, obtained in no more than n steps of a quasi-Newton
{d(k) = −Hk ∇k : k = 0, . . . , n − 1}
For i = k, we have
Hk+1 g (k) = p(k) ,
so,
(Hk + αkz(k)(z(k))ᵀ)g(k) = p(k), (13.25)
which is equivalent to
αkz(k)((z(k))ᵀg(k)) = p(k) − Hkg(k);
thus,
z(k) = (p(k) − Hkg(k)) / (αk(z(k))ᵀg(k))
and

αkz(k)(z(k))ᵀ = αk (p(k) − Hkg(k))(p(k) − Hkg(k))ᵀ / (αk(z(k))ᵀg(k))²
= (p(k) − Hkg(k))(p(k) − Hkg(k))ᵀ / (αk((z(k))ᵀg(k))²). (13.26)
Note that Eq. (13.25) is equivalent to
αk z (k) (z (k) )T g (k) = p(k) − Hk g (k) .
Premultiplying both sides of this equation by (g (k) )T , we obtain
αk ((z (k) )T g (k) )2 = (g (k) )T (p(k) − Hk g (k) ).
Substituting this expression in the denominator of Eq. (13.26) we get

αkz(k)(z(k))ᵀ = (p(k) − Hkg(k))(p(k) − Hkg(k))ᵀ / ((g(k))ᵀ(p(k) − Hkg(k))).

Thus, from (13.24), we obtain the following rank-one correction formula:

Hk+1 = Hk + (p(k) − Hkg(k))(p(k) − Hkg(k))ᵀ / ((g(k))ᵀ(p(k) − Hkg(k))). (13.27)
One of the drawbacks of this formula is that given a positive definite Hk ,
the resulting matrix Hk+1 is not guaranteed to be positive definite. Some
other quasi-Newton methods, such as those described next, do guarantee such
a property.
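A sketch of the rank-one update (13.27), with the standard safeguard of skipping the update when the denominator is near zero (the safeguard threshold is an assumption, not from the text). For a quadratic, n updates with independent steps recover Q⁻¹ exactly, as claimed above:

```python
import numpy as np

def rank_one_update(H, p, g):
    """Rank-one correction (13.27): p = x_{k+1} - x_k, g = grad_{k+1} - grad_k."""
    v = p - H @ g
    denom = g @ v
    if abs(denom) < 1e-12 * np.linalg.norm(g) * np.linalg.norm(v):
        return H          # skip the update when the denominator is tiny
    return H + np.outer(v, v) / denom

# For f(x) = 0.5 x^T Q x, gradient differences are g = Q p:
Q = np.array([[2.0, 0.0], [0.0, 4.0]])
H = np.eye(2)
x_old = np.array([1.0, 1.0])
for x_new in (np.array([0.5, 1.0]), np.array([0.5, 0.25])):
    p = x_new - x_old
    g = Q @ x_new - Q @ x_old
    H = rank_one_update(H, p, g)
    x_old = x_new
print(H)  # equals Q^{-1} = diag(0.5, 0.25) after two independent steps
```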
where
αk = arg min_{α≥0} f(x(k) + αd(k)).
The strong Wolfe conditions make sure that the derivative φ′k(α) is not "too
positive" by requiring α to satisfy the following inequalities:
Exercises
13.1. A company manufactures two similar products. The manufacturing cost
is $40 for a unit of product 1 and $42 for a unit of product 2. Assume
that the company can sell q1 = 150 − 2p1 + p2 units of product 1 and
q2 = 120 + p1 − 3p2 units of product 2, where p1 and p2 are prices
charged for product 1 and product 2, respectively. The company’s goal
is to maximize the total profit. What price should be charged for each
product? How many units of each product should be produced? What
is the optimal profit?
13.2. Solve the problem min_{x∈IRⁿ} f(x) for the following functions:
(a) f(x) = 1/(−x² + 3x − 7), n = 1;
(b) f(x) = (x1 − x2 − 1)² + (x1 − x2 + 1)⁴, n = 2;
(c) f(x) = x1⁴ + x2⁴ − 4x1x2, n = 2;
(d) f(x) = 2x1² − x1x2 + x2² − 7x2, n = 2;
(e) f(x) = 1/((x1 − x2 − 2)² + (x1 − x2 + 1)⁴), n = 2.
min_{x∈X} f(x).

min_{x∈X} xᵀ∇f(x∗).
13.6. Given the points [x1 , y1 ]T , . . . , [xn , yn ]T ∈ IR2 , use the optimality condi-
tions to prove that the solution of the problem
min_{[a,b]ᵀ∈IR²} f(a, b) = ∑_{i=1}^{n} (axi + b − yi)²
ρn = 1/2,
one, which is the mid-point of the interval [an−1, bn−1]. But we need
two evaluation points in order to determine the final interval of uncertainty. To overcome this problem, we can add a new evaluation point
an = bn − ε(bn−1 − an−1), where ε is a small number. Show that with
this modification, the reduction factor in the uncertainty range for the
Fibonacci method is no worse than

(1 + 2ε)/Fn+1

(therefore this drawback of Fibonacci search is of no significant practical
consequence). Hint: Note that bn−1 − an−1 = (2/Fn+1)(b0 − a0).
13.11. For the function of a single variable f (x) = x4/3 , show that
13.12. The total annual cost C of operating a certain electric motor can be
expressed as a function of its horsepower, x, as follows
C(x) = $120 + $1.5x + ($0.2/x)(1,000).
Use Newton’s method to find the motor horsepower that minimizes the
total annual cost. Select an appropriate starting point and apply three
iterations of the method.
minimize f (x)
subject to h(x) = 0,
where f (x) : IRn → IR, h(x) = [h1 (x), h2 (x), . . . , hm (x)]T : IRn → IRm . We
assume that m < n. Denote by
       ⎡ ∂h1(x)/∂x1  ∂h1(x)/∂x2  · · ·  ∂h1(x)/∂xn ⎤   ⎡ ∇h1(x)ᵀ ⎤
       ⎢ ∂h2(x)/∂x1  ∂h2(x)/∂x2  · · ·  ∂h2(x)/∂xn ⎥   ⎢ ∇h2(x)ᵀ ⎥
Jh(x) = ⎢     ..          ..       ..       ..     ⎥ = ⎢    ..   ⎥
       ⎣ ∂hm(x)/∂x1  ∂hm(x)/∂x2  · · ·  ∂hm(x)/∂xn ⎦   ⎣ ∇hm(x)ᵀ ⎦

the Jacobian matrix of h.
minimize f (x1 , x2 )
subject to h1 (x1 , x2 ) = 0.
min f (y(t)).
t∈IR
df(y(t∗))/dt = 0.
By the chain rule,
df(y(t∗))/dt = ∇f(y(t∗))ᵀy′(t∗),
so
∇f(y(t∗))ᵀy′(t∗) = 0.
A similar property holds for the general case and is formulated in the following
theorem.
Constrained Optimization 353
minimize f (x)
(14.1)
subject to h(x) = 0,
where f (x) : IRn → IR, h(x) = [h1 (x), . . . , hm (x)]T : IRn → IRm , then
there exists λ = [λ1, . . . , λm]ᵀ ∈ IRᵐ such that

∇f(x∗) + ∑_{i=1}^{m} λi∇hi(x∗) = 0. (14.2)

The Lagrangian of the problem is

L(x, λ) = f(x) + ∑_{i=1}^{m} λihi(x),

and setting its partial derivatives to zero gives the system

∇f(x∗) + ∑_{i=1}^{m} λi∇hi(x∗) = 0
h(x∗) = 0,
which coincides with the FONC stated in the Lagrange theorem (the second
equation just guarantees the feasibility). This system has n + m variables
and n + m equations. Its solutions are candidate points for a local minimizer
(maximizer). The system is not easy to solve in general. Moreover, as in
the unconstrained case, even if we solve it, a solution may not be a local
minimizer; it can be a saddle point or a local maximizer. Figure 14.1 illustrates
the FONC.
Note that all feasible points are regular for this problem, so any local
minimizer has to satisfy the Lagrange conditions. We have
[Figure 14.1: at x∗ on the curve h(x) = 0, the gradient ∇f(x∗) is parallel to ∇h(x∗); level sets f(x) = c1 and f(x) = c2 are shown, with f(x∗) = c1 < c2.]
2x1 (1 + 4λ) = 0
2x2 (1 + λ) = 0
4x21 + x22 = 1.
The Lagrangian is

L(x, λ) = x1² + x2² + λ(4x1² + x2² − 1).

[Figure 14.2: the ellipse 4x1² + x2² = 1 with the candidate points x(1), . . . , x(4) at [0, ±1]ᵀ and [±1/2, 0]ᵀ.]
(P −1 Q − λIn )x = 0.
x∗ T Qx∗ − λ∗ x∗ T P x∗ = 0,
x∗ T Qx∗ = λ∗ .
Convex case
Consider a convex problem with equality constraints,
minimize f (x)
subject to h(x) = 0,
where f (x) is a convex function and X = {x ∈ IRn : h(x) = 0} is a convex set.
We will show that the Lagrange theorem provides sufficient conditions for a
global minimizer in this case.
L(x, λ) = (1/2) xᵀQx + λᵀ(Ax − b),
where λ ∈ IRᵐ. The FONC can be expressed by the system
Qx + Aᵀλ = 0
Ax − b = 0.
From the first equation,
x = −Q⁻¹Aᵀλ, (14.7)
hence
Ax = −AQ⁻¹Aᵀλ,
so
b = −AQ−1 AT λ
and
λ = −(AQ−1 AT )−1 b.
Substituting this value of λ into (14.7), we obtain the global minimizer of the
considered problem:
x∗ = Q−1 AT (AQ−1 AT )−1 b.
In an important special case, when Q = In , the discussed problem becomes
min_{Ax=b} (1/2)‖x‖².
Its solution,
x∗ = AT (AAT )−1 b,
gives the solution of the system Ax = b with minimum norm.
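A quick numerical illustration of the minimum-norm formula x∗ = Aᵀ(AAᵀ)⁻¹b (the system below is illustrative):

```python
import numpy as np

# Underdetermined system: x1 + x2 + x3 = 3, x1 - x2 = 0
A = np.array([[1.0, 1.0, 1.0],
              [1.0, -1.0, 0.0]])
b = np.array([3.0, 0.0])

x_star = A.T @ np.linalg.solve(A @ A.T, b)   # x* = A^T (A A^T)^{-1} b
print(x_star)                                # [1. 1. 1.]
print(np.allclose(A @ x_star, b))            # True: x* is feasible
```

Among all solutions x1 = x2 = t, x3 = 3 − 2t of this system, the norm is indeed minimized at t = 1, i.e., at [1, 1, 1]ᵀ.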
minimize f (x)
subject to h(x) = 0
g(x) ≤ 0,
where f (x) : IRn → IR, h(x) : IRn → IRm (m < n), and g(x) : IRn → IRp .
This problem involves two types of constraints, equality and inequality
constraints. Recall that an inequality constraint gj (x) ≤ 0 is called active at
x∗ if gj (x∗ ) = 0. We denote by I(x∗ ) = {j : gj (x∗ ) = 0} the set of indices cor-
responding to the active constraints for x∗ . A point x∗ is called a regular point
for the considered problem if ∇hi (x∗ ), i = 1, . . . , m and ∇gj (x∗ ), j ∈ I(x∗ )
form a set of linearly independent vectors. The Lagrangian of this problem is
defined as
L(x, λ, μ) = f (x) + λT h(x) + μT g(x),
where λ = [λ1 , . . . , λm ]T ∈ IRm and μ = [μ1 , . . . , μp ]T ∈ IRp , μ ≥ 0. As before,
the multipliers λi , i = 1, . . . , m corresponding to the equality constraints are
called the Lagrange multipliers. The multipliers μj , j = 1, . . . , p correspond-
ing to the inequality constraints are called the Karush-Kuhn-Tucker (KKT)
multipliers.
The first-order necessary conditions for the problems with inequality con-
straints are referred to as Karush-Kuhn-Tucker (KKT) conditions.
minimize f (x)
subject to h(x) = 0
g(x) ≤ 0,
∇f(x∗) + ∑_{i=1}^{m} λi∇hi(x∗) + ∑_{j=1}^{p} μj∇gj(x∗) = 0
and
μjgj(x∗) = 0, j = 1, . . . , p.
1. λ ∈ IRᵐ, μ ∈ IRᵖ, μ ≥ 0;
2. ∇f(x∗) + ∑_{i=1}^{m} λi∇hi(x∗) + ∑_{j=1}^{p} μj∇gj(x∗) = 0;
3. μj gj (x∗ ) = 0, j = 1, . . . , p;
4. h(x∗ ) = 0;
5. g(x∗ ) ≤ 0.
The Lagrangian is

L(x, μ) = x2 − x1 + μ1(x1² + x2² − 4) + μ2((x1 + 1)² + x2² − 4).

Assuming μ1 = 0, the KKT conditions give

2μ2(x1 + 1) = 1
2μ2x2 = −1
μ2((x1 + 1)² + x2² − 4) = 0
x1² + x2² ≤ 4
(x1 + 1)² + x2² ≤ 4
μ2 ≥ 0.
Note that μ2 ≠ 0 (from the first equation), so μ2 > 0. Adding the first
two equations we get
x1 = −x2 − 1,
which, using the third equation, gives
x2 = ±√2.
Since μ2 > 0, from the second equation x2 = −√2 and μ2 = 1/(2√2), so
x1 = √2 − 1. These values of x1 and x2 satisfy the inequality constraints,
thus x∗ = [√2 − 1, −√2]ᵀ satisfies the KKT conditions.
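The KKT point just found can be verified numerically; here the objective gradient [−1, 1]ᵀ is read off from the first two KKT equations, and the second constraint g2(x) = (x1 + 1)² + x2² − 4 is the active one:

```python
import numpy as np

x = np.array([np.sqrt(2) - 1, -np.sqrt(2)])
mu2 = 1 / (2 * np.sqrt(2))

grad_f = np.array([-1.0, 1.0])                  # gradient of the objective
grad_g2 = np.array([2 * (x[0] + 1), 2 * x[1]])  # gradient of the active constraint

stationary = np.allclose(grad_f + mu2 * grad_g2, 0)
g2_active = np.isclose((x[0] + 1) ** 2 + x[1] ** 2 - 4, 0)
g1_feasible = x[0] ** 2 + x[1] ** 2 <= 4

print(stationary, g2_active, g1_feasible)  # True True True
```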
Assuming instead μ2 = 0, the KKT conditions give

2μ1x1 = 1
2μ1x2 = −1
μ1(x1² + x2² − 4) = 0
x1² + x2² ≤ 4
(x1 + 1)² + x2² ≤ 4
μ1 ≥ 0.
Note that μ1 ≠ 0 (from the first equation), so μ1 > 0. Adding the first
two equations we get
x1 = −x2,
which, using the third equation, gives
x2 = ±√2.
Since μ1 > 0, from the second equation x2 = −√2 and μ1 = 1/(2√2), so
x1 = √2. However, these values of x1 and x2 do not satisfy the last inequality constraint, hence the point is infeasible and the KKT conditions
are not satisfied.
From Figure 14.3 it is clear that the KKT point x∗ is the global minimizer.
The level set of the objective corresponding to the optimal value (1 − 2√2) is
shown by the dashed line (a tangent to the feasible region).
[Figure 14.3: the feasible region (intersection of the two disks) in the (x1, x2)-plane, with the KKT point x∗ and the dashed level line of the objective.]
Convex case
Assume that the feasible set X = {x : h(x) = 0, g(x) ≤ 0} is a convex set
and f (x) is a convex function over X. We will show that in this case the KKT
conditions are sufficient conditions for a global minimizer.
minimize f (x)
subject to h(x) = 0
g(x) ≤ 0,
∇f(x∗) + ∑_{i=1}^{m} λi∇hi(x∗) + ∑_{j=1}^{p} μj∇gj(x∗) = 0
and
μj gj (x∗ ) = 0, j = 1, . . . , p.
Proof. The necessity follows from Theorem 14.3. To establish the sufficiency,
assume that x∗ is a regular point satisfying the KKT conditions. Using the
first-order characterization of a convex function, for any x ∈ X:
∇f(x∗) = − ∑_{i=1}^{m} λi∇hi(x∗) − ∑_{j=1}^{p} μj∇gj(x∗).

Hence,

f(x) − f(x∗) ≥ − ∑_{i=1}^{m} λi∇hi(x∗)ᵀ(x − x∗) − ∑_{j=1}^{p} μj∇gj(x∗)ᵀ(x − x∗).
Example 14.5 The problem in Example 14.4 is convex, hence, we can con-
clude that the only KKT point x∗ is the global minimizer for this problem.
Here we assume that f (x) and h(x) are twice continuously differentiable. De-
note by ∇2xx L(x, λ) the Hessian of L(x, λ) as a function of x:
∇²xxL(x, λ) = ∇²f(x) + ∑_{i=1}^{m} λi∇²hi(x).
Since the problem max_{h(x)=0} f(x) is equivalent to the problem min_{h(x)=0} (−f(x)),
it is easy to check that by reversing the sign of the inequalities in (14.9)
Example 14.6 Consider the problem from Example 14.1 (page 353): f (x) =
x21 + x22 ; h(x) = 4x21 + x22 − 1. There are four candidates for local optimizers:
(1) λ1 = −1, x(1) = [0, 1]T ;
(2) λ2 = −1, x(2) = [0, −1]T ;
(3) λ3 = −1/4, x(3) = [1/2, 0]T ;
(4) λ4 = −1/4, x(4) = [−1/2, 0]T .
We apply the second-order conditions to these points. For x(1), with
T(x(1)) = {y = [c, 0]ᵀ : c ∈ IR}, we have
yᵀ∇²xxL(x(1), λ1)y = −6c² < 0, ∀c ≠ 0,
and x(1) is a local maximizer. It is easy to check that T (x(2) ) = T (x(1) ) and
∇2xx L(x(2) , λ2 ) = ∇2xx L(x(1) , λ1 ), so x(2) is also a local maximizer.
Similarly, T (x(3) ) = T (x(4) ) = {y : y = [0, c]T , c ∈ IR} and
yᵀ∇²xxL(x(3), λ3)y = yᵀ∇²xxL(x(4), λ4)y = (3/2)c² > 0, ∀c ≠ 0.
Thus, x(3) and x(4) are local minimizers.
minimize f (x)
subject to h(x) = 0 (P)
g(x) ≤ 0,
where f : IRn → IR, h : IRn → IRm , g : IRn → IRp . For x∗ ∈ IRn , I(x∗ ) denotes
the set of indices corresponding to active constraints at x∗ .
L(x, λ, μ) = f(x) + ∑_{i=1}^{m} λihi(x) + ∑_{j=1}^{p} μjgj(x),

we denote by

∇²xxL(x, λ, μ) = ∇²f(x) + ∑_{i=1}^{m} λi∇²hi(x) + ∑_{j=1}^{p} μj∇²gj(x).

Let

T(x∗) = { y ∈ IRⁿ : ∇hi(x∗)ᵀy = 0, i = 1, . . . , m; ∇gj(x∗)ᵀy = 0, j ∈ I(x∗) }.
The Lagrangian is

L(x, λ, μ) = x2 − x1 + λ(x1² + x2² − 4) + μ((x1 + 1)² + x2² − 4),

and the KKT conditions give

−1 + 2λx1 + 2μ(x1 + 1) = 0
1 + 2λx2 + 2μx2 = 0
μ((x1 + 1)² + x2² − 4) = 0
x1² + x2² = 4
(x1 + 1)² + x2² ≤ 4
μ ≥ 0.
To solve this system, we consider two cases: μ = 0 and (x1 + 1)2 + x22 − 4 = 0.
1. In the first case, μ = 0, the KKT system becomes
−1 + 2λx1 = 0
1 + 2λx2 = 0
x1² + x2² = 4
(x1 + 1)² + x2² ≤ 4.
From the first two equations, noting that λ cannot be zero, we obtain
x1 = −x2, and considering the third equation we have x1 = −x2 = ±√2.
Taking into account the inequality constraint, we obtain a unique
solution to the above system,

x1(1) = −√2, x2(1) = √2, λ(1) = −1/(2√2), μ(1) = 0.
2. In the second case, (x1 + 1)2 + x22 − 4 = 0, the KKT system becomes
−1 + 2λx1 + 2μ(x1 + 1) = 0
1 + 2λx2 + 2μx2 = 0
2
(x1 + 1) + x22 = 4
x21 + x22 = 4
μ ≥ 0.
From the last two equalities we obtain x1 = −1/2, x2 = ±√15/2, and
using the first two equations we obtain the following two solutions:

x1(2) = −1/2, x2(2) = √15/2, λ(2) = −(1 + 1/√15)/2, μ(2) = (1 − 1/√15)/2;

x1(3) = −1/2, x2(3) = −√15/2, λ(3) = −(1 − 1/√15)/2, μ(3) = (1 + 1/√15)/2.
Thus, there are three points satisfying the KKT conditions, x(1) , x(2) , and x(3) .
Next, we apply the second-order optimality conditions to each of these points.
Let h(x) = x21 +x22 −4, g(x) = (x1 +1)2 +x22 −4. The Hessian of the Lagrangian
as the function of x is
∇²xxL(x, λ, μ) = [ 2(λ + μ)  0 ; 0  2(λ + μ) ] = 2(λ + μ)I2.
Since the inequality constraint is inactive at x(1),
T(x(1)) = {y ∈ IR² : ∇h(x(1))ᵀy = 0} = {y ∈ IR² : y1 = y2} = {y = [c, c]ᵀ : c ∈ IR}.
In this case,

∇²xxL(x(1), λ(1), μ(1)) = [ −1/√2  0 ; 0  −1/√2 ],
√
and for any y ∈ T(x(1)), we have yᵀ∇²xxL(x(1), λ(1), μ(1))y = −√2c² < 0 (if
c ≠ 0). Therefore, x(1) is a strict local maximizer. It should be noted that
we would arrive at the same conclusion for any tangent space T (x(1) ) since
∇2xx L(x(1) , λ(1) , μ(1) ) is clearly negative definite and the inequality constraint
is inactive at x(1) .
For the second KKT point,

x1(2) = −1/2, x2(2) = √15/2, λ(2) = −(1 + 1/√15)/2, μ(2) = (1 − 1/√15)/2,

we have
T(x(2)) = T(x(2), μ(2)) = { y ∈ IR² : ∇h(x(2))ᵀy = 0, ∇g(x(2))ᵀy = 0 }
= { y ∈ IR² : 2x1(2)y1 + 2x2(2)y2 = 0, 2(x1(2) + 1)y1 + 2x2(2)y2 = 0 }
= { y ∈ IR² : −y1 + √15y2 = 0, y1 + √15y2 = 0 }
= {[0, 0]ᵀ}.
Hence, the tangent space has no nonzero elements, the SONC and SOSC are
automatically satisfied, and x(2) is a strict local minimizer.
For the third KKT point,

x1(3) = −1/2, x2(3) = −√15/2, λ(3) = −(1 − 1/√15)/2, μ(3) = (1 + 1/√15)/2,

we also have T(x(3)) = T(x(3), μ(3)) = {[0, 0]ᵀ}, implying that x(3) is a strict
local minimizer.
Note that the feasible region of this problem is a compact set, therefore, a
global minimizer exists. Since there are two local minimizers, one of them must
be global. Comparing the objective function f(x) = x2 − x1 at the points of
local minimum, f(x(2)) = (√15 + 1)/2 and f(x(3)) = (−√15 + 1)/2, we conclude
that the global minimum is achieved at x(3).
14.2 Duality
Consider the functions f : X → IR, g : Y → IR, and F : X × Y → IR,
where X ⊆ IRm and Y ⊆ IRn . We assume that the global minima and maxima
do exist in all cases discussed below in this section. Suppose that f (x) ≤ g(y)
for all (x, y) ∈ X × Y . Then it is clear that
A result of this kind is called a duality theorem. It is easy to prove that the
following inequality holds:
hence
implying that all the above inequalities must hold with equality.
Next assume that
Then we have
Note that
sup_{μ≥0} L(x, μ) = sup_{μ≥0} {f(x) + μᵀg(x)} = { f(x), if g(x) ≤ 0; +∞, otherwise,
Proof. By definition of the dual, d(μ) = min_{x∈X} L(x, μ), hence for any μ ≥ 0
and x∗ ∈ X, we have d(μ) ≤ f(x∗) + μᵀg(x∗). This implies that d(μ) ≤
max_{μ≥0} d(μ) = d(μ∗). Since μ ≥ 0 and g(x∗) ≤ 0, it follows that μᵀg(x∗) ≤ 0 for
any μ; hence, d(μ∗) ≤ f(x∗) + μ∗ᵀg(x∗) ≤ f(x∗).
The difference f (x∗ ) − d(μ∗ ) is called the duality gap.
The following result follows directly from Theorem 14.10.
min_{x∈X} f(x).
where ΠX (y) denotes the projection of y onto set X. The projection of y onto
X can be defined as
ΠX(y) = arg min_{z∈X} ‖z − y‖.
This definition may not be valid since such a minimizer may not exist or
may not be unique in general. Even if it does exist, it may be as difficult to
find as to solve the original optimization problem. However, in some cases the
projection can be easily computed.
min_{Az=b} ‖z − y‖.

Substituting ξ = z − y, this is equivalent to

min_{Aξ=b−Ay} ‖ξ‖.
Thus,

ΠX(y) = Py + Aᵀ(AAᵀ)⁻¹b,

where P = In − Aᵀ(AAᵀ)⁻¹A is the orthogonal projector onto the null space
{x : Ax = 0} of A. Note that for any x ∈ IRⁿ we have A(Px) = (A −
AAᵀ(AAᵀ)⁻¹A)x = 0.
Thus, we can write our iteration for the constrained problem in the form

x(k+1) = x(k) + αkPd(k).

In other words, instead of direction d(k) used for the unconstrained problem,
we will use the direction Pd(k) in the constrained problem. This direction is
the projection of d(k) onto the null space of A.
Recall that in the gradient methods for unconstrained problems we used
a step
x(k+1) = x(k) + αk d(k) ,
where
d(k) = −∇k .
Consider a nonlinear programming problem
min f (x).
Ax=b
In the projected steepest descent method,

d(k) = P(−∇k),
x(k+1) = x(k) − αkP∇k, k ≥ 0,
P = In − Aᵀ(AAᵀ)⁻¹A,
αk = arg min_{α≥0} f(x(k) − αP∇k).
that is,
∇f(x∗) − Aᵀ((AAᵀ)⁻¹A∇f(x∗)) = 0,
so the Lagrange conditions are satisfied with λ = −(AAᵀ)⁻¹A∇f(x∗).
On the other hand, assuming that there exists λ ∈ IRm such that
∇f (x∗ ) + AT λ = 0,
we obtain P∇f(x∗) = −PAᵀλ = 0, since PAᵀ = Aᵀ − Aᵀ(AAᵀ)⁻¹AAᵀ = 0.
Descent property
Consider the k th step of the projected steepest descent method and denote
by
φk (α) = f (x(k) − αP ∇k ).
The derivative of this function is
so,
φk (0) = −∇Tk P ∇k .
Using the following properties of the projector P ,
P T = P, P 2 = P,
we obtain
φk (0) = −∇Tk P ∇k = −∇Tk P T P ∇k = −P ∇k 2 .
Thus, if P∇_k ≠ 0, then φ′_k(0) < 0, so there exists ᾱ > 0 such that for any
α ∈ (0, ᾱ): φ_k(α) < φ_k(0). Hence, f(x^{(k+1)}) < f(x^{(k)}) and we have the descent
property.
Next we show that if x(k) → x∗ , where {x(k) : k ≥ 0} is the sequence
generated by the projected steepest descent method, then P ∇f (x∗ ) = 0.
Consider φ_k(α) = f(x^{(k)} − αP∇_k). Since in the steepest descent method α_k
minimizes φ_k(α), by the FONC,
φ′_k(α_k) = 0.
Note that if f(x) is a convex function, then the problem min_{Ax=b} f(x) is a convex
problem. In this case, if the projected steepest descent converges, then it
converges to a global minimizer.
minimize cT x
subject to Ax = b (14.13)
x ≥ 0.
We assume that all points generated by the method are interior: x^{(k)} > 0,
k ≥ 0. In the affine scaling method, the original LP is transformed to an
equivalent LP, so that the current point is “better” positioned for the projected
steepest descent method. It is based on the observation that if the current point is
close to the “center” of the feasible region, then there is more room to move in
a descent direction; therefore, a larger step toward the minimum can be made.
An appropriate choice for the “center” is the point e = [1, 1, . . . , 1]T ∈ IRn ,
which has equal distance to all bounds given by xi = 0, i = 1, . . . , n.
Denote by D_k = diag(x^{(k)}) the n × n diagonal matrix whose diagonal entries
are the components of x^{(k)}:

D_k = diag(x_1^{(k)}, x_2^{(k)}, . . . , x_n^{(k)}).
The change of variables y = D_k^{−1} x ⇔ x = D_k y transforms the LP (14.13)
into the equivalent problem
minimize c_k^T y
subject to A_k y = b    (14.14)
y ≥ 0,
where
c_k = D_k c and A_k = AD_k.
We have y^{(k)} = D_k^{−1} x^{(k)} = e. Next, we make an iteration in the direction
of the projected steepest descent for problem (14.14). Since ∇(c_k^T y) = c_k, we
obtain
d^{(k)} = P_k c_k,
where P_k = I_n − A_k^T(A_k A_k^T)^{−1} A_k, and set
y^{(k+1)} = e − α_k d^{(k)},  α_k = α / max_i d_i^{(k)},
where 0 < α < 1 (typically α = 0.9 or 0.99 is chosen). Note that if d^{(k)} ≠ 0
and d^{(k)} ≤ 0, then the problem is unbounded (there is no minimizer). Finally,
having computed y^{(k+1)}, we need to find the corresponding feasible point of
the original problem (14.13) by applying the inverse scaling operation,
x^{(k+1)} = D_k y^{(k+1)} = D_k(e − α_k d^{(k)}).
The affine scaling method is summarized in Algorithm 14.1.
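A single iteration of the scheme — scale, project, step, and scale back — can be sketched directly in NumPy. This is a hedged illustration that assumes the step rule α_k = α / max_i d_i^{(k)}; the small LP in the example is our own, not the LP of Example 14.11.

```python
import numpy as np

def affine_scaling_step(A, c, x, alpha=0.9):
    """One affine scaling iteration for min c^T x s.t. Ax = b, x > 0.
    x must be a strictly positive interior feasible point."""
    D = np.diag(x)                                         # D_k = diag(x^(k))
    Ak, ck = A @ D, D @ c                                  # scaled problem data
    Pk = np.eye(len(x)) - Ak.T @ np.linalg.solve(Ak @ Ak.T, Ak)
    d = Pk @ ck                                            # projected descent direction
    if np.max(d) <= 0:
        raise ValueError("d = 0 (optimal) or d <= 0 (unbounded)")
    ak = alpha / np.max(d)                 # keeps y = e - ak*d strictly positive
    return D @ (np.ones(len(x)) - ak * d)  # inverse scaling: x^(k+1) = D_k y^(k+1)

# Example: min x1 subject to x1 + x2 = 2, x > 0, starting from [1, 1]
x_next = affine_scaling_step(np.array([[1.0, 1.0]]), np.array([1.0, 0.0]),
                             np.array([1.0, 1.0]))
# x_next = [0.1, 1.9]: still feasible, and the objective drops from 1 to 0.1
```

With α = 0.9 the new iterate lands 90% of the way to the boundary along the chosen direction, mirroring the geometric decrease visible in Table 14.1.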
TABLE 14.1: Affine scaling iterations for the LP in Example 14.11. All the
values are rounded to 7 decimal places.

k    x^{(k)T}                              c^T x^{(k)}    ‖x^{(k)} − x∗‖
1    [0.1000000, 1.6400000, 1.5400000]    9.7400000     0.1232883
2    [0.0100000, 1.6040000, 1.5940000]    9.6140000     0.0123288
3    [0.0010000, 1.6004000, 1.5994000]    9.6014000     0.0012329
4    [0.0001000, 1.6000400, 1.5999400]    9.6001400     0.0001233
5    [0.0000100, 1.6000040, 1.5999940]    9.6000140     0.0000123
6    [0.0000010, 1.6000004, 1.5999994]    9.6000014     0.0000012
7    [0.0000001, 1.6000000, 1.5999999]    9.6000001     0.0000001
8    [0.0000000, 1.6000000, 1.6000000]    9.6000000     0.0000000
X = {x ∈ IR^n : g_j(x) ≤ 0, j = 1, . . . , m}

Φ(x) = ∑_{j=1}^{m} ((g_j(x))_+)^2.
It is easy to see that the penalty functions have the following property: If
Φ1 (x) is a penalty for X1 and Φ2 (x) is a penalty for X2 , then Φ1 (x) + Φ2 (x)
is a penalty for the intersection X1 ∩ X2 .
The general scheme of penalty function methods is described in Algo-
rithm 14.2. The following theorem establishes its convergence.
Algorithm 14.2 A general scheme of the penalty function method for solving
min_{x∈X} f(x).
1: choose a penalty Φ(x) of X
2: choose x^{(0)} ∈ IR^n
3: choose a sequence of penalty coefficients 0 < t_k < t_{k+1} such that t_k → ∞
4: for k = 1, 2, . . . do
5:   find a solution x^{(k)} to min_{x∈IR^n} {f(x) + t_{k−1} Φ(x)} starting with x^{(k−1)}
6:   if a stopping criterion is satisfied then
7:     return x^{(k)}
8:   end if
9: end for
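The scheme can be sketched in a few lines of Python. The one-dimensional example problem, the gradient-descent inner solver, and its step-size rule are all our own illustrative assumptions, not from the text.

```python
def penalty_method(grad_f, grad_Phi, x0, ts, inner_iters=500):
    """Sketch of Algorithm 14.2 in one dimension: for an increasing
    sequence of coefficients t_k, minimize f(x) + t_k * Phi(x) by
    gradient descent, warm-starting each solve at the previous solution."""
    x = x0
    for t in ts:
        lr = 0.4 / (1.0 + t)   # smaller steps as the penalized problem stiffens
        for _ in range(inner_iters):
            x -= lr * (grad_f(x) + t * grad_Phi(x))
    return x

# Example: min x^2 s.t. x >= 1, i.e. g(x) = 1 - x <= 0,
# with quadratic penalty Phi(x) = ((1 - x)_+)^2
grad_f = lambda x: 2.0 * x
grad_Phi = lambda x: -2.0 * max(0.0, 1.0 - x)
x = penalty_method(grad_f, grad_Phi, x0=0.0, ts=[1.0, 10.0, 100.0, 1000.0])
# the inner minimizers t/(1+t) approach the constrained minimizer x* = 1
# from outside the feasible region as t grows
```

Note the characteristic behavior of penalty methods: every iterate is infeasible (x < 1), and feasibility is only attained in the limit t_k → ∞.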
• Logarithmic barrier: F(x) = − ∑_{j=1}^{m} ln(−g_j(x)).
• Exponential barrier: F(x) = ∑_{j=1}^{m} exp(−1/g_j(x)).
Barriers have the following property. If F1 (x) is a barrier for X1 and F2 (x)
is a barrier for X2 , then F1 (x)+F2 (x) is a barrier for the intersection X1 ∩X2 .
The general scheme of barrier methods is outlined in Algorithm 14.3.
Algorithm 14.3 A general scheme of the barrier function method for solving
min_{x∈X} f(x).
1: choose a barrier F(x) of X
2: choose x^{(0)} in the interior of X
3: choose a sequence of coefficients t_k > t_{k+1} > 0 such that t_k → 0, k → ∞
4: for k = 1, 2, . . . do
5:   find a solution x^{(k)} to the barrier problem min_{x∈X} {f(x) + t_{k−1} F(x)}
6:   if a stopping criterion is satisfied then
7:     return x^{(k)}
8:   end if
9: end for
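The scheme can be sketched analogously to the penalty method. The one-dimensional example problem, the logarithmic barrier, and the bisection inner solver are our own illustrative choices, not from the text.

```python
def minimize_1d(dfdx, lo, hi, tol=1e-12):
    """Bisection on the derivative of a smooth strictly convex function;
    assumes dfdx(lo) < 0 < dfdx(hi)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dfdx(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def barrier_method(ts, x0=2.0):
    """Sketch of Algorithm 14.3 for min x s.t. x >= 1 with the logarithmic
    barrier F(x) = -ln(x - 1): minimize x - t*ln(x - 1) for decreasing t.
    The k-th minimizer is 1 + t_k, which tends to x* = 1 as t_k -> 0."""
    x = x0
    for t in ts:
        dfdx = lambda z, t=t: 1.0 - t / (z - 1.0)   # d/dx [x - t ln(x - 1)]
        # the minimizer lies strictly inside the feasible region, in (1, hi]
        x = minimize_1d(dfdx, 1.0 + 1e-15, max(x, 1.0 + 2.0 * t))
    return x

x = barrier_method(ts=[1.0, 0.1, 0.01, 0.001])   # x is close to 1.001
```

In contrast to the penalty method, every iterate here is strictly feasible; the barrier keeps the iterates in the interior of X while t_k → 0 drives them toward the boundary optimum.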
Proof: Let F(x) ≥ F∗ for all x ∈ X. For any x̄ in the interior of X we have
Ψ∗_k ≤ f(x̄) + t_{k−1} F(x̄) → f(x̄), k → ∞,
thus lim sup_{k→∞} Ψ∗_k ≤ f∗ (since the opposite would yield lim sup_{k→∞} Ψ∗_k > f(x̄) for
some x̄ in the interior of X).
On the other hand,
Recall that to derive the (k + 1)st iteration of Newton's method for solving
this system, we consider the linear approximation
F_t(y) ≈ F_t(y^{(k)}) + J_{F_t}(y^{(k)})(y − y^{(k)}),
where y^{(k)} is the solution after k steps. Solve the system J_{F_t}(y^{(k)})z =
−F_t(y^{(k)}) for z, and set y^{(k+1)} = y^{(k)} + z. To ensure some desired convergence
and numerical stability properties, the primal-dual interior point methods use
step-length coefficients that may be different for different components of New-
ton’s direction z. More specifically, the system JFt (y (k) )z = −Ft (y (k) ) is given
by
⎡ ∇²_{xx}L(y)   0    −J_h(x)^T   −J_g(x)^T ⎤ ⎡ z_x ⎤      ⎡ ∇_x L(y) ⎤
⎢      0        M        0           S     ⎥ ⎢ z_s ⎥ = −  ⎢ Sμ + te  ⎥
⎢   J_h(x)      0        0           0     ⎥ ⎢ z_λ ⎥      ⎢   h(x)   ⎥
⎣   J_g(x)     I_p       0           0     ⎦ ⎣ z_μ ⎦      ⎣ g(x) + s ⎦ ,
Exercises
14.1. For a twice continuously differentiable function f : IR^n → IR, let
x^{(k)} ∈ IR^n be such that ∇f(x^{(k)}) ≠ 0 and ∇²f(x^{(k)}) is positive
definite. Consider the following constrained problem:
Show that the solution d∗ of this problem is the Newton direction for
f(x).
14.2. Given the equality constraints h(x) = 0, x ∈ IRn , where
[Figure: level sets f(x) = c_1, c_2, c_3, c_4 of the objective, with the gradients
∇f and ∇g shown at two points x_1 and x_2.]
(a) c1 ≥ c2 ≥ c3 ≥ c4 ;
(b) x1 is a local minimizer;
(c) x2 is a local minimizer;
(d) x1 is a global minimizer;
(e) x2 is a global minimizer.
where f, h : IR3 → IR, ∇f (x) = [10x1 + 2x2 , 2x1 + 8x2 − 5, −1]T , and
∇h(x∗ ) = [−12, −5, 1]T . Find ∇f (x∗ ).
14.6. Find all points satisfying FONC for the following problem:
minimize x_2 − x_1
subject to x_1^2 − x_2^3 = 0.
(a) Find the set of all regular points for this problem.
(b) Draw the feasible set and level sets of the objective function corre-
sponding to local minima.
(c) Does x∗ = [0, 0]T satisfy the FONC? Is x∗ a point of local mini-
mum? Explain.
14.9. Formulate the FONC (KKT conditions), SONC and SOSC for the max-
imization problem with inequality constraints:
maximize f (x)
subject to h(x) = 0
g(x) ≤ 0,
where f : IRn → IR, h : IRn → IRm , g : IRn → IRp are twice continuously
differentiable functions.
14.10. Use KKT conditions to find all stationary points for the following prob-
lems:
(a) minimize 2x − x^2
    subject to 0 ≤ x ≤ 3
(b) minimize −(x_1^2 + x_2^2)
    subject to x_1 ≤ 1
(c) minimize x_1 − (x_2 − 2)^3 + 3
    subject to x_1 ≥ 1.
Which of these points, if any, are points of local or global minimum?
Explain.
14.11. Check that the following problem is convex and use the KKT conditions
to find all its solutions.
minimize exp{−(x_1 + x_2)}
subject to exp(x_1) + exp(x_2) ≤ 10
x_2 ≥ 0.
14.12. Use the SOSC to find all local minimizers of the problem
minimize x_2
subject to x_1^2 + (x_2 − 4)^2 ≤ 16
(x_1 − 3)^2 + (x_2 − 3)^2 = 18.
14.14. Find all points satisfying the KKT conditions for the following quadratic
programming problem
minimize (1/2) x^T Qx
subject to Ax ≤ b,
14.15. Show that the dual function d(μ) defined in Eq. (14.12) on page 369 is
concave.
14.16. Apply two iterations of the affine scaling method to the following prob-
lem:
Use x(0) = [1/4, 1/2, 1/4]T as the initial point. At each iteration, use the
step-length coefficient α = 0.9.
Notes and References
There are many excellent books for further reading on the topics so concisely
introduced in this text. Here we provide several pointers based on our personal
preferences, which, we hope, will help the reader to avoid getting overwhelmed
by the variety of available choices.
The following texts are good choices for learning more about numerical
methods for problems discussed in Part II of this text, as well as their com-
puter implementations: [9, 14, 18, 24]. The classical book by Cormen et al. [13]
provides a great, in-depth introduction to algorithms. See [16, 28] for more
information on computational complexity.
The discussion in Sections 9.5 and 9.6 is based on [20] and [25], respectively.
The simplex-based approach to establishing the fundamental theorem of LP
presented in Chapter 10 is inspired by Chvátal [11]. Textbooks [19,31] contain
an extensive collection of LP exercises and case studies. Other recommended
texts on various aspects of linear and nonlinear optimization include [1–6, 10,
15, 17, 23, 26, 27, 29].
Bibliography
[16] M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide
to the Theory of NP-Completeness. W.H. Freeman and Company, New
York, 1979.
[17] I. Griva, S. Nash, and A. Sofer. Linear and Nonlinear Optimization. SIAM,
Philadelphia, 2nd edition, 2009.
[18] R. W. Hamming. Numerical Methods for Scientists and Engineers. Dover
Publications, 1987.