Convex Functions
When in doubt about the accuracy of these notes, please cross-check with the instructor's notes
at aaa.princeton.edu/orf523. Any typos should be emailed to gh4@princeton.edu.
1 Outline
• Convexity-preserving operations
• is f1 · f2 convex?
• is f1 /f2 convex?
• Convex envelopes
• Support vector machines
2.2 Composition with an affine mapping
Rule 2. Suppose f : Rn → R, A ∈ Rn×m , and b ∈ Rn . Define g : Rm → R as
g(x) = f (Ax + b).
Then, if f is convex, g is convex.
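To see Rule 2 in action, here is a small Python sketch (an illustration added to these notes, not from the original) that numerically checks midpoint convexity of g(x) = f (Ax + b) along random pairs of points; the particular convex f (log-sum-exp), A, and b below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)

# An arbitrary convex f : R^n -> R (log-sum-exp), chosen for illustration.
def f(z):
    return np.log(np.sum(np.exp(z)))

n, m = 4, 6
A = rng.standard_normal((n, m))    # A in R^{n x m}
b = rng.standard_normal(n)         # b in R^n

def g(x):                          # g : R^m -> R, g(x) = f(Ax + b)
    return f(A @ x + b)

# Midpoint convexity check: g((x + y)/2) <= (g(x) + g(y))/2 for random x, y.
for _ in range(1000):
    x, y = rng.standard_normal(m), rng.standard_normal(m)
    assert g((x + y) / 2) <= (g(x) + g(y)) / 2 + 1e-12

print("no midpoint convexity violations found")

Of course, a finite random check of this kind is only a sanity check, not a proof.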
• It is also easy to prove this result (that the pointwise maximum f = max_{i=1,...,m} f_i of convex functions f_1 , . . . , f_m is convex) using epigraphs. Recall that f is convex ⇔ epi(f ) is
convex. But epi(f ) = ∩_{i=1}^m epi(f_i ), and we know that the intersection of convex sets is
convex.
• One can similarly show that the pointwise minimum of two concave functions is con-
cave.
• But the pointwise minimum of two convex functions may not be convex.
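As a quick sanity check on the last bullet, the following Python snippet (an illustration added to these notes) gives a concrete counterexample: for the convex functions f1(x) = (x − 1)^2 and f2(x) = (x + 1)^2, their pointwise minimum violates the convexity inequality at the midpoint of −1 and 1.

import numpy as np

f1 = lambda x: (x - 1.0) ** 2
f2 = lambda x: (x + 1.0) ** 2
h = lambda x: np.minimum(f1(x), f2(x))   # pointwise minimum of two convex functions

x, y = -1.0, 1.0
lhs = h((x + y) / 2)         # h(0) = 1
rhs = (h(x) + h(y)) / 2      # (0 + 0)/2 = 0
print(lhs, rhs, lhs <= rhs)  # 1.0 0.0 False -> the convexity inequality fails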
Many algorithms for unconstrained convex optimization (e.g., steepest descent with exact
line search) work by iteratively minimizing a function over lines. It’s useful to remember
that the restriction of a convex function to a line remains convex. This tells us that in each
subproblem we are faced with a univariate convex minimization problem, and hence we can
simply find a global minimum, e.g., by finding a zero of the first derivative.
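To make this concrete, here is a minimal Python sketch (added to these notes as an illustration): it restricts a convex function f to the line {x0 + t d}, which gives a univariate convex function g(t), and locates a minimizer by bisecting on the sign of g' (approximated by finite differences). The specific f , x0 , and d are arbitrary, and the bracket [lo, hi] is assumed to contain a sign change of g'.

import numpy as np

# A convex f : R^n -> R (here a convex quadratic with positive definite Q), for illustration.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
c = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ Q @ x + c @ x

x0 = np.array([5.0, -4.0])     # a point on the line
d = np.array([1.0, 1.0])       # direction of the line

g = lambda t: f(x0 + t * d)    # restriction of f to the line: univariate and convex

def gprime(t, h=1e-6):         # finite-difference approximation of g'(t)
    return (g(t + h) - g(t - h)) / (2 * h)

# Bisection on the sign of g'; valid because g' is nondecreasing when g is convex.
lo, hi = -100.0, 100.0
for _ in range(100):
    mid = (lo + hi) / 2
    if gprime(mid) > 0:
        hi = mid
    else:
        lo = mid
t_star = (lo + hi) / 2
print("minimizer along the line:", x0 + t_star * d, "value:", g(t_star))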
Proof: We prove this in the case where f is twice differentiable. Let g = f^k. Then
3 Convex envelopes
Definition 1. The convex envelope (or convex hull) conv_D f of a function f : Rn → R
over a convex set D ⊆ Rn is “the largest convex underestimator of f on D”; i.e., conv_D f is a
convex function satisfying conv_D f (x) ≤ f (x) for all x ∈ D, and any convex function g with
g(x) ≤ f (x) for all x ∈ D satisfies g(x) ≤ conv_D f (x) for all x ∈ D.
• Equivalently, conv_D f (x) is the pointwise maximum of all convex functions that lie
below f (on D).
• As the pictures suggest, the epigraph of conv f is the convex hull of the epigraph of f .
Theorem 1 ([5]). Consider the problem min_{x∈S} f (x), where S is a convex set, and let f ∗ denote its optimal value. Then,

min_{x∈S} conv_S f (x) = f ∗ ,    (1)

and

{x ∈ S | f (x) = f ∗ } ⊆ {x ∈ S | conv_S f (x) = f ∗ }.    (2)

Proof: Since conv_S f (x) ≤ f (x) for all x ∈ S, we have min_{x∈S} conv_S f (x) ≤ f ∗ .
To see the converse, note that the constant function g(x) = f ∗ is a convex underestimator
of f on S. Hence, we must have conv_S f (x) ≥ f ∗ , ∀x ∈ S. This proves (1).
To prove (2), let y ∈ S be such that f (y) = f ∗ . Suppose for the sake of contradiction that
conv_S f (y) < f ∗ . But this means that the function

max{f ∗ , conv_S f }

is convex (as the pointwise maximum of two convex functions), underestimates f on S (since
f ≥ f ∗ and f ≥ conv_S f on S), and is strictly larger than conv_S f at y. This contradicts
the fact that conv_S f is the largest convex underestimator of f on S. Hence conv_S f (y) = f ∗ ,
which together with (1) shows that y also minimizes conv_S f over S. □
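Theorem 1 is easy to visualize in one dimension. The Python sketch below (added to these notes as an illustration) computes the convex envelope of a sampled univariate function as the piecewise-linear function through the lower convex hull of its graph, and checks that the minimum value is unchanged, as (1) asserts; the sampled function is an arbitrary nonconvex choice.

import numpy as np

def lower_convex_envelope(xs, ys):
    # Convex envelope of the sampled points (xs[i], ys[i]); xs must be sorted increasingly.
    hull = []                              # indices of the vertices of the lower convex hull
    for i in range(len(xs)):
        while len(hull) >= 2:
            i1, i2 = hull[-2], hull[-1]
            # pop hull[-1] if it lies on or above the segment from hull[-2] to point i
            cross = (xs[i2] - xs[i1]) * (ys[i] - ys[i1]) - (ys[i2] - ys[i1]) * (xs[i] - xs[i1])
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    # piecewise-linear interpolation through the hull vertices = sampled convex envelope
    return np.interp(xs, xs[hull], ys[hull])

xs = np.linspace(-2.0, 2.0, 2001)
ys = (xs ** 2 - 1) ** 2 + 0.3 * xs          # an arbitrary nonconvex function with two local minima
env = lower_convex_envelope(xs, ys)

print(ys.min(), env.min())                  # the two minima coincide, as (1) asserts
print(bool(np.all(env <= ys + 1e-12)))      # the envelope underestimates the function -> True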
Theorem 2. The convex envelope of the l0 pseudonorm over the set {x | ||x||∞ ≤ 1} is the
l1 norm.
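To get some intuition for Theorem 2, here is the one-dimensional case worked out (added here as an illustration; the general case takes more work). On [−1, 1], the function ||x||0 equals 0 at x = 0 and 1 elsewhere. For x ≠ 0, write x = (1 − |x|) · 0 + |x| · sign(x); if g is convex with g ≤ ||·||0 on [−1, 1], then g(0) ≤ 0 and g(±1) ≤ 1, so

\[
g(x) \;\le\; (1-|x|)\, g(0) + |x|\, g(\mathrm{sign}(x)) \;\le\; (1-|x|)\cdot 0 + |x|\cdot 1 \;=\; |x|.
\]

Since |x| is itself convex and satisfies |x| ≤ ||x||0 on [−1, 1], it is the largest convex underestimator of ||x||0 there, i.e., its convex envelope.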
This simple observation is the motivation (or one motivation) behind many heuristics for l0
optimization like compressed sensing, LASSO, etc.
Here’s the idea behind why this could be useful. Consider a very simple scenario where you
are given m data points in Rn and want to fit a function f to the data that minimizes the
sum of the squares of the deviations. The problem is, however, that you don’t have a good
idea of what function class exactly f belongs to. So you decide to throw in a lot of functions
in your basis: maybe you include a term for every monomial up to a certain degree, you
add trigonometric functions, exponential functions, etc. After this, you try to write f as a
linear combination of this massive set of basis functions by solving an optimization problem
that finds the coefficients of the linear combination. Well, if you use all the basis functions
(nonzero coefficients everywhere), then you will have very small least squares error, but you
would be overfitting the data like crazy. The LASSO [7] adds a penalty λ||w||1 on the vector w
of coefficients to the least squares objective. What it tries to do, as you increase λ, is to
set many of these coefficients equal to zero and tell you (somehow magically) which of
the basis functions were actually important for fitting the data and which weren’t.
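Here is a minimal Python sketch of this sparsifying effect (added to these notes as an illustration). It solves the LASSO problem min_w ||Aw − y||_2^2 + λ||w||_1 by iterative soft-thresholding (a proximal gradient method), on synthetic data generated from a sparse ground-truth coefficient vector; the dimensions, the noise level, and the values of λ are arbitrary choices for the demo.

import numpy as np

rng = np.random.default_rng(1)

m, n, k = 50, 200, 5                      # 50 data points, 200 basis functions, 5 truly useful
A = rng.standard_normal((m, n))           # column j = basis function j evaluated at the data points
w_true = np.zeros(n)
w_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ w_true + 0.01 * rng.standard_normal(m)

def lasso_ista(A, y, lam, iters=5000):
    # Minimize ||A w - y||_2^2 + lam * ||w||_1 by iterative soft-thresholding.
    L = 2 * np.linalg.norm(A, 2) ** 2     # Lipschitz constant of the gradient of the LS term
    w = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = 2 * A.T @ (A @ w - y)
        z = w - grad / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-thresholding (prox of l1)
    return w

for lam in [0.001, 0.1, 1.0]:
    w = lasso_ista(A, y, lam)
    print(lam, int(np.sum(np.abs(w) > 1e-6)))   # number of nonzeros; typically shrinks as lam grows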
4 Support vector machines
• Support vector machines (SVM) constitute a prime example of supervised learning. In
such a setting, we would like to learn a classifier from a labeled data set (called the
training set). The classifier is then used to label future data points.
– Given a large number of emails with correct labels “spam” or “not spam”, we
would like an algorithm for classifying future emails as spam or not spam.
– The emails for which we already have the labels constitute the “training set”.
• A basic approach is to associate a pair (xi , yi ) with each email: yi is the label, which is
either 1 (spam) or −1 (not spam). The vector xi ∈ Rn is called a feature vector ; it
collects some relevant information about email i. For example, the entries of xi could
record how often various keywords appear in the email, its length, and so on.
• If we have m emails, we end up with m vectors in Rn , each with a label ±1. Here is a
toy example in R2 :
• The goal is now to find a classifier f : Rn → R, which takes a positive value on spam
emails and a negative value on non-spam emails.
• We can search for many classes of classifier functions using convex optimization. The
simplest choice is a linear classifier f (x) = aT x − b, where a ∈ Rn and b ∈ R are to be
found. We would like
aT xi − b > 0 if yi = 1
aT xi − b < 0 if yi = −1.
Since these inequalities are strict, we can rescale a and b so that the requirement becomes
yi (aT xi − b) ≥ 1, i = 1, . . . , m.
• This is a convex feasibility problem (in fact a set of linear inequalities). It may or may
not be feasible (compare examples above and below). Can you identify the geometric
condition for feasibility of linear classification? (Hint: think of convex hulls.)
• When linear separation is possible, there could be many (in fact infinitely many) linear
classifiers to choose from. Which one should we pick?
min ||a||2
a,b
s.t. yi (aT xi − b) ≥ 1, i = 1, . . . , m.    (3)
Claim 1. The optimization problem above is equivalent to
max t
a,b,t
s.t. yi (aT xi − b) ≥ t, i = 1, . . . , m,
||a||2 ≤ 1.
• Let’s believe these three claims for the moment. What optimization problem (3) is
then doing is finding a hyperplane that maximizes the minimum distance between the
hyperplane (our classifier) and any of our data points. Do you see why?
• We are trying to end up with as wide a margin as possible. Formally, the margin is
defined to be the distance between the two gray hyperplanes in the figure above. What
is the length of this margin in terms of a∗ (and possibly b∗ )?
• Having a wide margin helps us be robust to noise, in case the feature vector of our
future data points happens to be slightly misspecified.
The proofs of the three claims are given as homework. Here are a few hints:
• Claim 1: how would you get feasible solutions to one problem from the other?
• What if the data is not linearly separable? We can introduce slack variables ηi and ask
that as few of the constraints as possible be relaxed:
min ||η||0
a,b,η
s.t. yi (aT xi − b) ≥ 1 − ηi , i = 1, . . . , m
ηi ≥ 0, i = 1, . . . , m.
• The optimization problem above is trying to set as many entries of η to zero as possible.
But the l0 objective is nonconvex and hard to optimize directly; motivated by Theorem 2,
we replace it with the l1 norm:
min ||η||1
a,b,η
s.t. yi (aT xi − b) ≥ 1 − ηi , i = 1, . . . , m
ηi ≥ 0, i = 1, . . . , m.
• This is a convex program (why?). We can solve it efficiently.
• The solution with minimum l1 norm tends to be sparse; i.e., has many entries that are
zero.
• Note that when 0 < ηi < 1, data point i is still correctly classified, but it falls within our
margin; hence it is not “robustly classified”.
• We can solve a modified optimization problem to balance the tradeoff between the
number of misclassified points and the width of our margin (a numerical sketch is given
after this list):
min ||a||2 + λ||η||1
a,b,η
s.t. yi (aT xi − b) ≥ 1 − ηi , i = 1, . . . , m
ηi ≥ 0, i = 1, . . . , m,
where λ > 0 controls the tradeoff.
• On your homework, you will run some numerical experiments on this problem.
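For concreteness, here is a minimal Python/CVXPY sketch of the soft-margin problem above (an illustration on made-up two-dimensional data; it is not necessarily the exact setup used on the homework, and the variable names and the value of λ are arbitrary):

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(2)

# Toy 2-D data: two slightly overlapping point clouds, labeled +1 and -1.
m = 40
X = np.vstack([rng.standard_normal((m // 2, 2)) + [2.0, 2.0],
               rng.standard_normal((m // 2, 2)) - [2.0, 2.0]])
y = np.hstack([np.ones(m // 2), -np.ones(m // 2)])

a = cp.Variable(2)
b = cp.Variable()
eta = cp.Variable(m)
lam = 1.0                                   # tradeoff parameter (arbitrary choice)

objective = cp.Minimize(cp.norm(a, 2) + lam * cp.sum(eta))
constraints = [cp.multiply(y, X @ a - b) >= 1 - eta,
               eta >= 0]
cp.Problem(objective, constraints).solve()

print("a* =", a.value, " b* =", b.value)
print("points violating the margin:", int(np.sum(eta.value > 1e-6)))

Dropping the cp.norm(a, 2) term from the objective recovers the earlier l1 slack-minimization problem.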
Notes
Further reading for this lecture can include Chapter 3 of [2]. You can read more about SVMs
in Section 8.6 of [2].
References
[1] M. Grant and S. Boyd. Graph implementations for nonsmooth convex programs. In Recent
Advances in Learning and Control. Springer-Verlag, 2008.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] Y. Crama. Recognition problems for special classes of polynomials in 0-1 variables.
Mathematical Programming, 44, 1989.
[4] CVX Research, Inc. CVX: MATLAB software for disciplined convex programming, version
2.0. Available online at http://cvxr.com/cvx, 2011.
[5] D.-Z. Du, P.M. Pardalos, and W. Wu. Mathematical Theory of Optimization. Kluwer
Academic Publishers, 2001.
[6] J.R. Shewchuk. An introduction to the conjugate gradient method without the agonizing
pain. Carnegie-Mellon University. Department of Computer Science, 1994.
[7] R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal
Statistical Society. Series B (Methodological), pages 267–288, 1996.