This appears in Algorithmica 66 (2013), pp. 93–112.
Isotonic Regression via Partitioning
Quentin F. Stout
Computer Science and Engineering
University of Michigan
Ann Arbor, MI 48109–2121 USA
Abstract
Algorithms are given for determining weighted isotonic regressions satisfying order constraints specified
via a directed acyclic graph (DAG). For the L1 metric a partitioning approach is used which exploits the fact
that L1 regression values can always be chosen to be data values. Extending this approach, algorithms for
binary-valued L1 isotonic regression are used to find Lp isotonic regressions for 1 < p < ∞. Algorithms
are given for trees, 2-dimensional and multidimensional orderings, and arbitrary DAGs. Algorithms are also
given for Lp isotonic regression with constrained data and weight values, L1 regression with unweighted
data, and L1 regression for DAGs where there are multiple data values at the vertices.
Keywords: isotonic regression, median regression, monotonic, nonparametric, tree, DAG, multidimensional
1 Introduction
A directed acyclic graph (DAG) G with n vertices V = {v1 , v2 , . . . , vn } and m edges E defines a partial
order over the vertices, where vi ≺ vj if and only if there is a path from vi to vj . It is assumed that the
DAG is connected, and hence m ≥ n − 1. If it isn’t connected then the regression of each component is
independent of the others. For a real-valued function f over V , fi denotes f (vi ). By weighted data on G
we mean a pair of real-valued functions (y, w) on V, where y gives the data values and w the non-negative
weights. A function f on G is isotonic iff whenever vi ≺ vj , then fi ≤ fj . An Lp isotonic regression of the
data is a real-valued isotonic function z over V that minimizes
\[
\begin{cases} \bigl(\sum_{i=1}^{n} w_i\, |y_i - z_i|^p \bigr)^{1/p} & 1 \le p < \infty\\[2pt] \max_{i=1}^{n} w_i\, |y_i - z_i| & p = \infty \end{cases}
\]
among all isotonic functions. The regression error is the value of this expression. For a vertex vi ∈ V and
isotonic regression z, the regression value of vi is zi . Isotonic regressions partition the vertices into level
sets which are connected regions where the regression value is a constant. The regression value of a level
set is the Lp weighted mean of the data values.
Isotonic regression is an important alternative to standard parametric regression when there is no confidence that the relationship between the independent variables and the dependent one is parametric, or when an
independent variable is discrete. For example, one may believe that average weight is an increasing function
of height and of S < M < L < XL shirt size. That is, if the shirt size is held constant the average weight
increases with height, and similarly the weight increases if the height is fixed and the shirt size increases. However, there are no assumptions of what happens if height increases but shirt size decreases. Barlow et al. [3] and Robertson et al. [20] review work on isotonic regression dating back to the 1940’s. This includes use of the L1, L2, and L∞ metrics and orderings including linear, tree, multidimensional, and arbitrary DAGs. These books also give examples showing uses of isotonic regression for optimization problems such as nonmetric multidimensional scaling.

ordered set          L1                      L2                 L∞
linear               Θ(n log n) [1, 22]      Θ(n) PAV           Θ(n log n) *
tree                 Θ(n^2) [5]              Θ(n log n) [15]    Θ(n log n) *
2-dim grid           Θ(n^2 log n) *          Θ(n^2) [21]        Θ(n log n) *
2-dim arbitrary      Θ(n^3) *                Θ(n^3) [21]        Θ(n log^2 n) [24]
d ≥ 3 dim grid       Θ(n^2 log n) *          Θ(n^4) *           Θ(n log n) *
d ≥ 3 dim arbitrary  Θ(n^3) *                Θ(n^4) *           Θ(n log^d n) [24]
arbitrary            Θ(nm + n^2 log n) [2]   Θ(n^4) [14]        Θ(m log n) [24]

* indicates that it is implied by the result for arbitrary ordered sets

Table 1: Times of fastest previously known isotonic regression algorithms
Table 1 shows the fastest known isotonic regression algorithms for numerous combinations of metric
and ordered set. “Dim” ordered sets have vertices that are points in d-dimensional space with the natural ordering, i.e., (a_1, . . . , a_d) ≺ (b_1, . . . , b_d) iff a_i ≤ b_i for 1 ≤ i ≤ d. For d = 2 this is also known as “matrix”, “row and column”, or “bimonotonic” ordering, and for general d it is sometimes called “domination”. Here it will be called multidimensional ordering. “Grid” sets are arranged in a d-dimensional grid of extent n_i in the ith dimension, where $n = \prod_{i=1}^{d} n_i$, and “arbitrary” sets have points in arbitrary locations.
For unweighted data the L∞ isotonic regression can be found in Θ(m) time using the well-known fact that the regression value at v ∈ V can be chosen to be $(\min_{v \preceq v_j} y_j + \max_{v_i \preceq v} y_i)/2$. For 1 < p < ∞,
restricting to unweighted data does not appear to help, but Section 5.4 shows that restricting both data and
weight values can reduce the time required. For L1 , restricting to unweighted data can be useful for dense
DAGs, reducing the time to Θ(n^2.5 log n). This is discussed in Section 7.
For linear orders, pair adjacent violators, PAV, has been repeatedly rediscovered. Starting with the initial
data as n level sets, whenever two consecutive level sets violate the isotonic condition they are merged into a
single level set. No matter what order the sets are merged in, for any p the result is optimal. Simple left-right
parsing yields the L2 isotonic regression in Θ(n) time, L1 in Θ(n log n) time [1, 22], and L∞ in Θ(n log n)
time [24]. Thompson’s extension of PAV to trees [27] is discussed in Section 3.
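To make the linear-order case concrete, here is a minimal Python sketch of PAV for the weighted L2 metric (not from the paper; the function name and the stack-of-blocks representation are choices of this illustration), where each level set is kept as a (weighted sum, weight, count) triple and its value is the weighted mean:

```python
def pav_l2(y, w):
    """Weighted L2 isotonic regression on a linear order via pair adjacent violators.
    Each block is (weighted sum, total weight, count) for one level set."""
    blocks = []                                   # level sets, left to right
    for yi, wi in zip(y, w):
        s, t, c = yi * wi, wi, 1
        # merge while the previous level set's mean exceeds the new one's
        while blocks and blocks[-1][0] / blocks[-1][1] > s / t:
            ps, pt, pc = blocks.pop()
            s, t, c = s + ps, t + pt, c + pc
        blocks.append((s, t, c))
    z = []
    for s, t, c in blocks:                        # expand level sets back to vertices
        z.extend([s / t] * c)
    return z

print(pav_l2([1, 3, 2, 4], [1, 1, 1, 1]))         # [1.0, 2.5, 2.5, 4.0]
```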
For more general orderings PAV can give incorrect results. For arbitrary DAGs, Maxwell and Muckstadt [14], with a small correction by Spouge et al. [21], gave a Θ(n^4) time algorithm for L2 isotonic regression; Angelov et al. [2] gave a Θ(nm + n^2 log n) time algorithm for L1; and Stout [24] gave a Θ(m log n)
time algorithm for L∞ , based on a small modification of an algorithm of Kaufman and Tamir [12].
For 1 < p < ∞ the Lp mean of a set is unique (see the Appendix), as is the isotonic regression, but
this is not always true for L1. The L1 mean of a set is a weighted median, and L1 regression is often called median regression. A weighted median of the set of data (y, w), with total weight $W = \sum_{i=1}^{n} w_i$, is any number z such that $\sum_{y_i \le z} w_i \ge W/2$ and $\sum_{y_j \ge z} w_j \ge W/2$. Since the weighted median of a set can
always be chosen to be one of the values in the set, there is always an L1 isotonic regression where all
regression values are original data values. In Section 2 a partitioning approach is introduced, one which
repeatedly halves the number of possible regression values at each vertex. Since there are at most n possible
regression values being considered, only a logarithmic number of iterations are needed. This approach is
applied to tree (Section 3) and 2-dimensional (Section 4) orderings.
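For concreteness, a minimal Python sketch of the weighted median just defined (the function name and the choice of returning the smallest qualifying data value are assumptions of this illustration):

```python
def weighted_median(values, weights):
    """Return a weighted median: a data value z with
    sum(w_i : y_i <= z) >= W/2 and sum(w_i : y_i >= z) >= W/2."""
    pairs = sorted(zip(values, weights))          # sort by data value
    total = sum(w for _, w in pairs)
    below = 0.0                                   # weight of earlier (smaller) values
    for y, w in pairs:
        above = total - below - w                 # weight of later (larger) values
        if below + w >= total / 2 and above + w >= total / 2:
            return y
        below += w
    return pairs[-1][0]                           # fallback for degenerate weights

print(weighted_median([4, 3, 1, 2], [1, 10, 1, 1]))   # 2
```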
Partitioning is extended to general Lp isotonic regression in Section 5, where each stage involves a
binary-valued L1 regression. Algorithms are developed for “semi-exact” regressions where the answers are
as accurate as the ability to compute the Lp mean, for approximations to within a specified accuracy, and for
exact regressions when the data values and weights are constrained.
Section 6 considers L1 isotonic regression when there are multiple values at each vertex, and Section 7
contains final remarks, including an observation concerning L1 isotonic regression on unweighted data.
For data (y, w), let err(v_i, z) = w_i |y_i − z|, i.e., the error of using z as the regression value at v_i.
2 Partitioning
For a closed set S of real numbers, an S-valued isotonic regression is an isotonic regression where the
regression values are elements of S, having minimal regression error among all isotonic S-valued functions.
Rounding x to {a, b}, where a < b and x ∉ (a, b), results in a if x ≤ a and b if x ≥ b.
Lemma 2.1 For a DAG G and data (y, w), let S = {a, b}, a < b, be such that there is an L1 isotonic regression with no values in (a, b). Let z^s be an S-valued L1 isotonic regression. Then there is an unrestricted L1 isotonic regression z such that z has no values in (a, b) and z^s is z rounded to S.
Proof: Let z̃ be an optimal regression with no values in (a, b). If there is no vertex v_i where z̃_i ≤ a and z^s_i = b, nor any vertex v_j where z̃_j ≥ b and z^s_j = a, then z^s is z̃ rounded to S. Otherwise the two cases are similar, so assume the former holds. Let y be the largest value such that there is a vertex v_i where z̃_i = y ≤ a and z^s_i = b, and let U be the subset of V where z^s_i = b and z̃_i = y.
Let [c, d] be the range of weighted medians of the data on U. If c > a then the function which equals z̃ on V \ U and min{c, b} on U is isotonic and has L1 error strictly less than that of z̃, contradicting the assumption that it is optimal. If c ≤ a and d < b, then the function which is z^s on V \ U and a on U is an isotonic S-valued function with error strictly less than that of z^s, contradicting the assumption that it is optimal. Therefore [a, b] ⊆ [c, d], so the isotonic function which is z̃ on V \ U and b on U has the same error as z̃, i.e., is optimal, and has more vertices where it rounds to z^s than z̃ has. In a finite number of iterations one obtains an unrestricted regression which rounds to z^s everywhere.
The L1 algorithms herein have an initial sort of the data values. Given data (y, w), let ý_1 < · · · < ý_{n′} be the distinct data values in sorted order, where n′ is the number of distinct values, and let ý[k, ℓ] denote {ý_k, . . . , ý_ℓ}. L1 partitioning proceeds by identifying, for each vertex, subintervals of ý[1, n′] in which the final regression value lies. Let V[a, b] denote the vertices with interval ý[a, b] and let Int(v_i) denote the interval that v_i is currently assigned to. Initially Int(v_i) = ý[1, n′] for all v_i ∈ V.
Lemma 2.1 shows that by finding an optimal {ý_ℓ, ý_{ℓ+1}}-valued isotonic regression z^s, the vertices where z^s = ý_ℓ can be assigned to V[1, ℓ], and those where z^s = ý_{ℓ+1} can be assigned to V[ℓ+1, n′]. The level
sets of V [1, ℓ] are independent of those of V [ℓ + 1, n′ ], and hence an optimal regression can be found by
recursively finding an optimal regression on V [1, ℓ] using only regression values in ý[1, ℓ], and an optimal
regression on V [ℓ+ 1, n′ ] using only regression values in ý[ℓ+ 1, n′ ]. Keeping track of a vertex’s interval
merely requires keeping track of k and ℓ (in fact, ℓ does not need to be stored with the vertex since at each
stage, vertices with the same lower bound have the same upper bound). When determining the new interval
for a vertex one only uses vertices with the same interval. At each stage the assignment is isotonic in that if
vi ≺ vj then either Int(vi ) = Int(vj ) or the intervals have no overlap and the upper limit of Int(vi ) is less than
the lower limit of Int(vj ). Each stage halves the size of each interval, so eventually they are a single value,
i.e., the regression value has been determined. Thus there are ⌈lg n′ ⌉ stages. Figure 1 shows the partitioning
intervals that are produced by the algorithm in Theorem 3.1.
An aspect that one needs to be careful about is that after partitioning has identified the vertices with
regression values in a given interval, then the next partitioning step on that subgraph does not necessarily
use the median of the data values in the subgraph. For example, suppose the values and weights on a linear
ordering are (4,1), (3,10), (1,1), (2,1). The first partitioning involves values 2 and 3, and results in all vertices
being assigned to the interval [3,4]. Since the subgraph is the original graph, using the median of the data
values in the subgraph results in an infinite loop. Instead, the next stage uses partitioning values 3 and 4,
quartiles from the original set of values. One can use the median of the subgraph provided that, when the subgraph corresponds to the range of values [a, b], only data values in [a, b] are used to determine the median.
Also note that it is the median of the values, not their weighted median.
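The overall interval-halving scheme can be summarized by the following driver, a Python sketch that assumes a routine solve_binary(active, lo, hi) returning, for each active vertex, 0 or 1 according to an optimal {lo, hi}-valued regression on those vertices; note that, as discussed above, the splitting values are quantiles of the original sorted values rather than medians of the subgraphs:

```python
def l1_partition(vertices, y, solve_binary):
    """Interval-halving driver for L1 isotonic regression (sketch).
    vertices: iterable of vertex ids; y: dict vertex -> data value;
    solve_binary(active, lo, hi): dict vertex -> 0 (toward lo) or 1 (toward hi)."""
    vals = sorted(set(y[v] for v in vertices))    # distinct data values ý_1 < ... < ý_n'
    groups = [(list(vertices), 0, len(vals) - 1)] # each group shares an index interval [k, l]
    result = {}
    while groups:
        next_groups = []
        for verts, k, l in groups:
            if k == l:                            # interval is a single value: done
                for v in verts:
                    result[v] = vals[k]
                continue
            j = (k + l) // 2
            side = solve_binary(verts, vals[j], vals[j + 1])
            lower = [v for v in verts if side[v] == 0]
            upper = [v for v in verts if side[v] == 1]
            if lower:
                next_groups.append((lower, k, j))
            if upper:
                next_groups.append((upper, j + 1, l))
        groups = next_groups
    return result
```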
While having some similarities, Ahuja and Orlin’s scaling for L1 regression [1] is quite different from
partitioning. In their approach, all vertices are initially in their own level set, and whenever two vertices are
placed in the same level set then they remain in the same level set. In partitioning, all vertices are initially
in the same level set, and whenever two are placed in different level sets they stay in different level sets. An
important partitioning approach is “minimum lower sets”, first used by Brunk in the 1950’s [4].
3 Trees
Throughout, all trees are rooted, with each vertex except the root pointing to its parent. Star ordering, where
all nodes except the root have the root as their parent, is often called tree ordering in statistical literature
concerning tests of hypotheses, where several hypotheses are compared to the null hypothesis.
Thompson [27] showed that PAV can be extended to tree orderings by using a bottom-up process. A
level set S is merged with a violating level set T containing a child of a vertex in S iff T ’s mean is the largest
among all level sets containing a child of a vertex in S. At each step, every level set is a subtree. Using this,
for an arbitrary tree the regression can be found in Θ(n log n) time for the L2 [15] and L∞ [24] metrics.
These times can be reduced to Θ(n) when the tree is a star [15, 24].
For L1 , Chakravarti [5] used linear programming in an algorithm taking Θ(n2 ) time, while Qian [18]
used PAV but no details were given. While PAV can be used to determine L1 isotonic regression in
Θ(n log n) time, a partitioning approach is simpler and is used in the extensions to Lp isotonic regression in
Section 5. Figure 1 illustrates the stages of the following theorem for a linear ordering.
Figure 1: Partitioning in Theorem 3.1: size is weight, bars are the interval the regression value will be in.
Theorem 3.1 For a tree of n vertices and data (y, w), an L1 isotonic regression can be determined in
Θ(n log n) time.
Proof: Initially each vertex has interval ý[1, n′ ]. At each partitioning stage, suppose vertex vi has interval
ý[k, ℓ]. If k = ℓ then the regression value at vi is ýk . Otherwise, let j = ⌊(k+ℓ)/2⌋. Define Z(vi , 0) to be
the minimal possible error in the subtree rooted at vi with vertices in V [k, ℓ], given that the regression at vi
is ýj , and similarly for Z(vi , 1) and ýj+1 . Then Z satisfies the recursive equations:
\[
\begin{aligned}
Z(v_i, 0) &= \mathrm{err}(v_i, \acute{y}_j) + \sum \{\, Z(u, 0) : u \text{ child of } v_i \text{ and } \mathrm{Int}(v_i) = \mathrm{Int}(u) \,\}\\
Z(v_i, 1) &= \mathrm{err}(v_i, \acute{y}_{j+1}) + \sum \{\, \min\{Z(u, 0),\, Z(u, 1)\} : u \text{ child of } v_i \text{ and } \mathrm{Int}(v_i) = \mathrm{Int}(u) \,\}
\end{aligned}
\]
which can be computed using a bottom-up postorder traversal.
To recover the optimal solution for this stage and to determine the subintervals for each vertex, a top-down preorder traversal is used. For the root r, its new interval is ý[k, j] or ý[j+1, ℓ] depending on which of Z(r, 0) and Z(r, 1) is smaller, respectively. Ties can be broken arbitrarily. For a node p, if its parent initially had the same interval and the parent’s new interval is ý[k, j], then that is p’s new interval as well. If the parent’s new interval is ý[j+1, ℓ], or the parent had an initial interval different from p’s, then p’s new interval is determined using the same procedure as was used for the root. The traversals for computing Z and updating the intervals can all be done in Θ(n) time per stage. □
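A Python sketch of one such stage on a tree, following the recurrences in the proof; it assumes all vertices currently share the same interval (as in the first stage), later stages differing only in that the sums are restricted to children with the same interval, and the children-dictionary representation is a choice of this sketch:

```python
def tree_binary_stage(children, root, y, w, lo, hi):
    """Choose lo or hi at each vertex of a rooted tree (child value <= parent value),
    minimizing weighted L1 error; returns a dict vertex -> chosen value."""
    err = lambda v, z: w[v] * abs(y[v] - z)

    # bottom-up computation of Z(v,0) and Z(v,1) in reverse preorder
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    Z0, Z1 = {}, {}
    for v in reversed(order):
        kids = children.get(v, [])
        Z0[v] = err(v, lo) + sum(Z0[u] for u in kids)
        Z1[v] = err(v, hi) + sum(min(Z0[u], Z1[u]) for u in kids)

    # top-down recovery: if a parent chose lo then all of its children must choose lo
    choice = {root: lo if Z0[root] <= Z1[root] else hi}
    for v in order:
        for u in children.get(v, []):
            choice[u] = lo if (choice[v] == lo or Z0[u] <= Z1[u]) else hi
    return choice

# star with root 0: leaves with large data take hi, and so does the root
print(tree_binary_stage({0: [1, 2, 3]}, 0,
                        {0: 1, 1: 5, 2: 6, 3: 0}, {0: 1, 1: 1, 2: 1, 3: 1},
                        2, 5))                    # {0: 5, 1: 5, 2: 5, 3: 2}
```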
A faster algorithm, using neither partitioning nor sorting, is possible when the tree is star and, more
generally, when the graph is a complete directed bipartite DAG, i.e., the vertices are partitioned into V1 and
V2 , there is a directed edge from every vertex in V1 to every vertex in V2 , and no additional edges exist. A
star corresponds to |V2 | = 1.
Figure 2: The complete directed bipartite DAG K4,5 given via 2-dimensional ordering
Proposition 3.2 For a complete directed bipartite DAG of n vertices, given data (y, w), an L1 isotonic
regression can be determined in Θ(n) time.
Proof: Let V1 and V2 be the bipartite partition of the vertices. Note that there is an optimal regression z of
the form zi = min{yi , c} for vi ∈ V1 and zj = max{yj , c} for vj ∈ V2 , for some data value c. Let a be the
largest data value such that
\[
\sum_{v_i \in V_1,\, y_i \ge a} w_i \;\ge\; \sum_{v_j \in V_2,\, y_j \le a} w_j
\]
and let b be the smallest data value > a. Then there is an optimal L1 isotonic regression where either c = a or c = b. In linear time one can determine a, b, and c. □
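A small Python sketch of the structural fact used in this proof, that some optimal regression clips V1 down to a single data value c and V2 up to it; this naive version simply searches all data values for c in quadratic time, rather than using the linear-time weight comparison of the proposition:

```python
def bipartite_l1(y1, w1, y2, w2):
    """L1 isotonic regression on a complete directed bipartite DAG (every vertex
    of V1 precedes every vertex of V2).  Some optimal solution has
    z_i = min(y_i, c) on V1 and z_j = max(y_j, c) on V2 for a data value c."""
    def error(c):
        # error of clipping V1 down to c and V2 up to c
        return (sum(w * (y - c) for y, w in zip(y1, w1) if y > c) +
                sum(w * (c - y) for y, w in zip(y2, w2) if y < c))
    c = min(set(y1) | set(y2), key=error)         # naive search over data values
    return [min(y, c) for y in y1], [max(y, c) for y in y2]

print(bipartite_l1([6.1, 2.1, 1.5], [1, 1, 1], [5.1, 2.0, 8.0], [1, 1, 1]))
```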
Punera and Ghosh [17] consider regression on a tree with the monotonicity constraint that ẑ_i = max{ẑ_j : v_j a child of v_i} whenever v_i is not a leaf node. This can easily be computed by modifying the definition
of Z(vi , 1) to require that for at least one child Z(u, 1) is used (even if it is not smaller than Z(u, 0)), and
recording this child so that the proper intervals can be determined. The time remains Θ(n log n), which
improves upon their Θ(n2 log n) time algorithm.
4 Multidimensional Orderings
Isotonic regression involving more than one independent variable has often been studied (see the numerous
references in [3, 20]). A problem with d independent ordered variables naturally maps to a DAG where
the vertices are points in a d-dimensional space and the edges are implied by domination ordering, i.e.,
(a1 , . . . , ad ) ≺ (b1 , . . . , bd ) iff ai ≤ bi for 1 ≤ i ≤ d. In statistics, when the observations are over a grid it
is sometimes called a complete layout, and when they are arbitrarily distributed it is an incomplete layout.
A d-dimensional grid has Θ(dn) grid edges, and thus the general algorithm in [2] shows that the L1
isotonic regression can be found in Θ(n2 log n) time. However for vertices in arbitrary positions Θ(n2 )
edges may be needed, as is shown in Figure 2, which would result in Θ(n3 ) time. The following theorem
shows that for points in general position the time can be reduced to Θ̃(n) for d = 2 and Θ̃(n2 ) for d ≥ 3.
Theorem 4.1 For a set of n vertices in d-dimensional space, given data (y, w), an L1 isotonic regression
can be determined in
a) Θ(n log n) time if the vertices form a grid, d = 2,
b) Θ(n log2 n) time if the vertices are in arbitrary locations, d = 2,
c) Θ(n2 log n) time if the vertices form a grid, d ≥ 3,
d) Θ(n2 logd n) time if the vertices are in arbitrary locations, d ≥ 3,
where the implied constants depend on d.
The proof will be broken into parts, with part a) in Section 4.1 and part b) in Section 4.2. Part c) was
discussed above.
For d), let V be the set of vertices. In [25] it is shown that in Θ(n log^{d−1} n) time one can construct a DAG G′ = (V′, E′) where V ⊆ V′, |V′| = |E′| = Θ(n log^{d−1} n), and for all v_i, v_j ∈ V, v_i ≺ v_j in domination ordering iff v_i ≺ v_j in G′. One obtains d) by placing arbitrary values with weight 0 on V′ \ V and applying the general algorithm in [2], noting that the number of stages is linear in the number of vertices.
4.1 2-Dimensional Grids
Our algorithm for Theorem 4.1 a) is similar to that of Spouge, Wan and Wilbur [21] for L2 regression.
Suppose the vertices are a 2-dimensional grid with rows 1 . . . r and columns 1 . . . c, where n = r · c. Let
err((g, h), z) denote the error of using z as the regression value at grid location (g, h). For the initial round
of partitioning, let ℓ = ⌈n′ /2⌉ and S = {ýℓ , ýℓ+1 }. To determine an optimal S-valued regression, let Z(g, h)
denote the optimal S-valued regression error on the subgrid [1 . . . g] × [1 . . . c] subject to the constraint that
the regression value at (g, h) is ýℓ+1 (and hence the regression values in row g, columns [h+1 . . . c], are also
ýℓ+1 ), and that this is the leftmost ýℓ+1 value in this row (hence all regression values in [1 . . . g] × [1 . . . h−1]
are ýℓ ). Define Z(g, c+1) to be the optimal regression error when all of the values on row g are ýℓ .
For g ∈ [1 . . . r] and h ∈ [1 . . . c+1] let
\[
R(g, h) = \sum_{k=1}^{h-1} \mathrm{err}((g, k), \acute{y}_\ell) \;+\; \sum_{k=h}^{c} \mathrm{err}((g, k), \acute{y}_{\ell+1})
\]
Z satisfies a simple recurrence which can be computed in increasing row order:
\[
Z(g, h) = \begin{cases} R(g, h) + \min\{Z(g-1, k) : k \in [h \ldots c{+}1]\} & g > 1\\ R(g, h) & g = 1 \end{cases} \tag{1}
\]
i.e., if the leftmost ýℓ+1 value in row g is at column h, then the optimal S-valued regression on [1 . . . g] ×
[1 . . . c] given this constraint will contain the optimal regression on [1 . . . g−1] × [1 . . . c] that on row g − 1
has its leftmost ýℓ+1 value no further left than h. For any row, the R and Z values can be easily computed in
Θ(c) time, and thus for all entries they can be computed in Θ(n) time. The minimal regression error is the
minimum of Z(r, h) for h ∈ [1 . . . c+1]. As usual, by storing the value of k that minimizes (1), the optimal
regression values and new intervals can be recovered in linear time.
For subsequent partitions the calculation of R(g, h) is modified to include only values corresponding to grid points in the same level set, i.e., those with the same interval as (g, h). To determine Z(g, h), if Int((g, h)) = Int((g−1, h)) then the minimum is taken over values at grid points with the same interval (these grid points are consecutive), while if Int((g, h)) ≠ Int((g−1, h)) then Z(g, h) = R(g, h). This completes the proof of Theorem 4.1 a).
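A Python sketch of the first-stage computation on an r × c grid, implementing R and recurrence (1) with backtracking (0-based indexing and the matrix representation are choices of this sketch; later stages would restrict the sums and minima to grid points with the same interval):

```python
def grid_binary_stage(y, w, lo, hi):
    """Choose lo or hi at each point of an r x c grid, nondecreasing along rows and
    columns, minimizing weighted L1 error (first partitioning stage).
    In every row the hi region is a suffix whose start moves left (or stays) as the
    row index increases; Z[g][h] is the best error on rows 0..g with row g's
    leftmost hi at column h (h == c means the row is all lo)."""
    r, c = len(y), len(y[0])
    err = lambda g, k, z: w[g][k] * abs(y[g][k] - z)

    Z = [[0.0] * (c + 1) for _ in range(r)]
    arg = [[c] * (c + 1) for _ in range(r)]       # minimizing column of the previous row
    for g in range(r):
        # R[h] = row error with columns < h set to lo and columns >= h set to hi
        suffix_hi = [0.0] * (c + 1)
        for k in range(c - 1, -1, -1):
            suffix_hi[k] = suffix_hi[k + 1] + err(g, k, hi)
        R, prefix_lo = [0.0] * (c + 1), 0.0
        for h in range(c + 1):
            R[h] = prefix_lo + suffix_hi[h]
            if h < c:
                prefix_lo += err(g, h, lo)
        if g == 0:
            Z[g] = R
        else:
            best, bestk = float("inf"), c
            for h in range(c, -1, -1):            # min over k >= h of Z[g-1][k]
                if Z[g - 1][h] < best:
                    best, bestk = Z[g - 1][h], h
                Z[g][h], arg[g][h] = R[h] + best, bestk

    h = min(range(c + 1), key=lambda k: Z[r - 1][k])
    z = [[lo] * c for _ in range(r)]
    for g in range(r - 1, -1, -1):                # recover each row's hi boundary
        for k in range(h, c):
            z[g][k] = hi
        h = arg[g][h]
    return z

print(grid_binary_stage([[5, 1], [2, 6]], [[1, 1], [1, 1]], 2, 5))   # [[2, 2], [2, 5]]
```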
4.2 2-Dimension Data, Arbitrary Position
To prove Theorem 4.1 b), since the coordinates in each dimension are only used for ordering, their specific
values are not important. Assume the first coordinates have values 1 . . . r, and second coordinates have
values 1 . . . c. One can view the vertices as being in an r × c grid where some of the grid points have no
data. It is possible that r · c = Θ(n2 ) and hence a naive use of Theorem 4.1 a) would require Θ(n2 log n)
time. However, one can exploit the fact that most of the grid points have no data.
Suppose that the points have been sorted in increasing row order, breaking ties by sorting in increasing
column order. Thus if u ≻ v then u occurs after v in the ordering. Relabel the vertices so that vi is the
ith point in this ordering, with coordinates (r_i, c_i). For g ∈ [0 . . . n] and h ∈ [1 . . . c+1], when solving the S = {ý_ℓ, ý_{ℓ+1}} partitioning problem let Z(g, h) be the minimal regression error for an S-valued regression on the first g points such that the leftmost ý_{ℓ+1} value is in column h. (If none of the first g points are in column h then Z(g, h) = Z(g, h+1).) Define Z(0, h) to be 0 for h ∈ [1 . . . c+1].
For g ∈ [1 . . . n] and h ∈ [1 . . . c+1], one has
\[
Z(g, h) = \begin{cases} \mathrm{err}(v_g, \acute{y}_\ell) + Z(g-1, h) & \text{if } h > c_g\\ \mathrm{err}(v_g, \acute{y}_{\ell+1}) + \min\{Z(g-1, k) : k \in [h \ldots c{+}1]\} & \text{if } h = c_g\\ \mathrm{err}(v_g, \acute{y}_{\ell+1}) + Z(g-1, h) & \text{if } h < c_g \end{cases}
\]
This can be efficiently computed in a bottom-up manner by means of operations on a balanced binary search
tree with nodes 1 . . . c+1, where at the gth step the information about the gth point is added.
Ignoring the case of equality, the other cases add err(vg , ýℓ ) to the interval of columns cg + 1 . . . c+ 1
and err(vg , ýℓ+1 ) to the interval of columns 1 . . . cg −1, where the value of Z(g, h) is the sum of the values
added to column h by vertices v1 through vg . It is well-known how to create a balanced tree where the
addition of a value to an interval of keys, and determining the sum of the intervals covering an index, can be
computed in Θ(log n) time. Further, it can be augmented to also perform the minimum calculation needed
for the equality case in the same time bound. The final modification needed is in the equality case, to ensure that for points above the gth one, when the intervals covering c_g are summed they give the correct value. To do this, to the degenerate index interval [c_g, c_g] add the value which is the correct value of Z(g, c_g) minus Z(g−1, c_g). This makes the sum of the intervals covering column c_g equal to the correct value.
The optimal S-valued regression error is min{Z(n, h) : h ∈ [1 . . . c+1]}, and the S-valued recurrence
values can be determined in reverse order, as for trees. For subsequent rounds one can partition the vertices into two subsets containing the two new level sets. Each round of partitioning can be completed in
Θ(n log n) time, completing the proof of Theorem 4.1 b). Thus Theorem 4.1 is proven.
5 Lp Isotonic Regression, 1 < p < ∞
In this section we consider Lp isotonic regression for 1 < p < ∞. While it has been mentioned in the literature, few concrete algorithms have appeared for p ≠ 2. The only ones we are aware of involve linear orders [1, 7, 26], with the fastest being Ahuja and Orlin’s algorithm [1], which takes Θ(n log U) time, where U is the maximum difference of any two data values and the results are restricted to integers.
The approach used here builds on the results for L1 . Section 5.1 introduces “ǫ-partitioning”, which plays
the same role as partitioning did for L1 . In Section 5.2 we find “semi-exact” isotonic regressions which are
as exact as the capability to compute the Lp mean. E.g., they are exact for L2 . In Section 5.3 we find
approximate solutions to within a given error, and in Section 5.4 it is shown that these can be made to be
exact, not just semi-exact, in restricted circumstances.
Runtime of the Lp algorithms: Lp isotonic regression will be determined via partitioning, using L1 binary-valued isotonic regression at each stage. The running time of an Lp algorithm will be the number of stages multiplied by the time to determine the regression values used for the binary regression and then solve it.
For a given DAG the time to solve the binary regression at each stage is:
• tree: Θ(n), from Section 3,
• 2-dimensional grid: Θ(n), from Section 4.1,
• 2-dimensional arbitrary: Θ(n log n), from Section 4.2,
• d-dimensional grid, d ≥ 3: Θ(n^2 log n), from Theorem 4.1 c),
• d-dimensional arbitrary, d ≥ 3: Θ(n^2 log^d n), from Theorem 4.1 d),
• arbitrary: Θ(nm + n^2 log n), from Angelov et al. [2].
The number of stages will not necessarily be Θ(log n) since the regression values might not be data values,
and determining the values used for the binary regression may take a nonnegligible amount of time.
5.1 ǫ-partitioning
We introduce a partitioning method which generalizes the partitioning step used for L1 regression.
Minimizing the weighted L1 error in equation (2) below is the same as minimizing the weighted Lp
error for {C, D}-partitioning since the new weights reflect the increase in the Lp error between rounding to
the closer of C, D and rounding to the further. The proof directly follows that of Lemma 2.1, with the added
simplification that the Lp mean of a set is unique when p > 1 (i.e., in the proof there, c = d). That proof
shows that g is unique as well, from which one infers that it has no values in (0, 1).
Lemma 5.1 Given 1 < p < ∞, a DAG G, and data (y, w), let f be the Lp isotonic regression. Let C < D be such that no data value nor level set value of f is in (C, D), and let g be an L1 isotonic regression on G with data (y′, w′), where
\[
(y'_i, w'_i) = \begin{cases} \bigl(0,\; w_i \cdot [(D - y_i)^p - (C - y_i)^p]\bigr) & \text{if } y_i \le C\\ \bigl(1,\; w_i \cdot [(y_i - C)^p - (y_i - D)^p]\bigr) & \text{otherwise} \end{cases} \tag{2}
\]
Then g is unique, {0, 1}-valued, and g_i = 0 iff f_i ≤ C.
Let C and ǫ > 0 be arbitrary. To apply the above lemma with D = C + ǫ, since there are only finitely
many regression and data values there is an ǫ sufficiently small so that none are in (C, C + ǫ). However,
since we don’t know the regression, we don’t a priori know how small ǫ needs to be, so we treat it as an
infinitesimal. For yi ≤ C the weight for the binary regression is the amount the weighted Lp distance will
increase from C to C +ǫ. Since ǫ is an infinitesimal, this is merely ǫ times the derivative wi p(C − yi )p−1 .
Similarly, for yi > C the weight is ǫ · wi p(yi − C)p−1 . Since all weights have factors of ǫ and p they are
removed, i.e., the binary L1 regression problem being solved is on data of the form
\[
(y'_i, w'_i) = \begin{cases} \bigl(0,\; w_i (C - y_i)^{p-1}\bigr) & \text{if } y_i \le C\\ \bigl(1,\; w_i (y_i - C)^{p-1}\bigr) & \text{otherwise} \end{cases} \tag{3}
\]
We call this ǫ-partitioning.
Let f be the Lp isotonic regression of the original data. For a level set with mean < C the sum of the regression errors is a strictly increasing function of its distance from C, as it is if the mean is > C. Thus, if g is an isotonic regression for the ǫ-partitioning, to minimize L1 error it must be 0 on all level sets of f with mean < C and 1 on all level sets with mean > C. For level sets of f with mean C, g might be any value in [0, 1]. We partition G by putting all vertices with g ∈ [0, 1) into one subgraph, and those with g = 1 into the other. Regression values in the former will be in (−∞, C], and those in the latter will be in [C, ∞).
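As a small illustration (a sketch; the function name and calling convention are assumed), the binary data of equation (3) can be built directly from (y, w), the splitting value C, and p; solving a binary-valued L1 isotonic regression on this data, e.g., with the tree or grid routines sketched earlier, then splits the vertices at C:

```python
def eps_partition_data(y, w, C, p):
    """Binary L1 data (y', w') of equation (3) for eps-partitioning at C:
    vertices with y_i <= C get target 0, the rest get target 1, with weight
    w_i * |y_i - C|**(p-1)."""
    y_prime, w_prime = [], []
    for yi, wi in zip(y, w):
        if yi <= C:
            y_prime.append(0)
            w_prime.append(wi * (C - yi) ** (p - 1))
        else:
            y_prime.append(1)
            w_prime.append(wi * (yi - C) ** (p - 1))
    return y_prime, w_prime

print(eps_partition_data([1.0, 3.0, 2.5], [1.0, 1.0, 2.0], C=2.0, p=3))
# ([0, 1, 1], [1.0, 1.0, 0.5])
```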
5.2 Semi-Exact Isotonic Regression
We call the regressions below semi-exact since they are as exact as one’s ability to compute the Lp mean of a set.
The values of C used for ǫ-partitioning are based on the “minimum lower sets” approach [4]. At the
first step C is the Lp mean of the entire DAG. If there is more than one level set in the isotonic regression
there is at least one with mean < C and at least one with mean > C and hence ǫ-partitioning produces
a non-trivial partitioning. Thus if it produces a trivial partition the Lp regression is the constant function
C and the algorithm is finished, while otherwise the two pieces of the partition are further refined. There
is the possibility that the Lp regression is constant but ǫ-partitioning produces a non-trivial partitioning,
where each piece again has Lp mean C. For analyzing worst-case time complexity this is no worse than the
partitioning when the isotonic regression is not constant. An alternative approach is to check if one piece
has Lp mean C, in which case both do and the regression value is C.
Lp isotonic regression may require Θ(n) stages. For example, if partitioning is used for a linear order,
the L2 metric, and unweighted data 1, 4, 27, . . . , n^n, then at each stage only the largest remaining value is
determined to be a level set. Another complication is that one cannot, in general, determine the Lp mean
exactly, and generating a sufficiently accurate approximation may entail significant time. For a DAG of
n vertices with data (y, w), the time to determine all of the means utilized throughout the algorithm will
involve a part that is independent of the data, taking Θ(n2 ) time, and a data-dependent part, Tp (n, y, w), that
depends upon the accuracy desired. This assumes the bracketing data values have been determined, where,
if the mean is C, then the bracketing data values are ýi , ýi+1 such that ýi ≤ C ≤ ýi+1 . The significance of
the bracketing values is discussed in the Appendix. They can be determined by first using O(log n) stages
of ǫ-partitioning on the data values and then continuing with ǫ-partitioning based on the means.
Combining the time to determine the Lp means with the fact that there may be Θ(n) stages, each using
an ǫ-partitioning taking time given at the beginning of Section 5, results in the following:
Theorem 5.2 Let 1 < p < ∞. Given a DAG G and data (y, w), one can determine the semi-exact Lp
isotonic regression in the following time:
• tree: Θ(n^2 + T_p(n, y, w)),
• 2-dim grid: Θ(n^2 + T_p(n, y, w)),
• 2-dim arb: Θ(n^2 log n + T_p(n, y, w)),
• d-dim grid, d ≥ 3: Θ(n^3 log n + T_p(n, y, w)),
• d-dim arb, d ≥ 3: Θ(n^3 log^d n + T_p(n, y, w)),
• arbitrary: Θ(n^2 m + n^3 log n + T_p(n, y, w)),
where the implied constants for d-dimensional DAGs are functions of d. Further, for p = 2, 3, 4, and 5 the
regressions are exact and Tp (n, y, w) = 0.
Proof: Beyond the timing analysis discussed above, the only additional proof needed is the exactness claim.
That follows from the fact that polynomials of degree ≤ 4 can be solved exactly (see the Appendix).
The results for L2 improve upon Spouge, Wan, and Wilbur’s Θ(n3 ) time algorithm [21] for arbitrary
points in 2-dimensional space; for m = o(n2 ) improve upon the Θ(n4 ) algorithm of Maxwell and Muckstadt [14] for arbitrary DAGs; and improve upon using their algorithm for d-dimensional data, both grids
and arbitrary, for d ≥ 3.
By using PAV one can improve the result for trees when p is an integer:
Proposition 5.3 Let p be an integer ≥ 2. Given a tree G and data (y, w), one can determine the semi-exact
Lp isotonic regression in the following time:
• G linear: Θ(n + Tp (n, y, w)) if p is even and Θ(n log n + Tp (n, y, w)) if p is odd,
• G arbitrary tree: Θ(n log n + Tp (n, y, w)).
Further, for p = 2, 3, 4, and 5 the regressions are exact and Tp (n, y, w) = 0.
Proof: For PAV algorithms there are two components to time: the time to keep track of which level sets to
merge, and the time to merge the sets and compute their mean. For a linear order the time to keep track of the
level sets is Θ(n), while for a tree order it can be done in Θ(n log n) time [15]. In the Appendix it is shown
that for even positive integers the time to merge the sets and compute their means is Θ(n + Tp (n, y, w)),
and for p = 2 and 4 each root can be found exactly in constant time. For odd positive integers, merging
the sets and finding their means takes Θ(n log n + Tp (n, y, w)) time, and for p = 3 and 5 each root can be
found exactly in constant time.
5.3 Approximation to Within δ
To find an approximation where the value at each vertex is at most δ from the optimal regression value one need only use regression values of the form $k\delta + \min_{i=1}^{n} y_i$, for $0 \le k \le \lfloor(\max_{i=1}^{n} y_i - \min_{i=1}^{n} y_i)/\delta\rfloor$. Just
as for L1 , one can do binary search on the possible regression values. Using the remarks at the beginning of
Section 5 concerning running time gives:
Theorem 5.4 Let 1 ≤ p < ∞. Given a DAG G, data (y, w), and δ > 0, the time to determine the Lp
isotonic regression to within δ is:
• tree: Θ(n log K),
• 2-dim: Θ(n log K) for a grid, and Θ(n log n log K) for arbitrary points,
• d-dim, d ≥ 3: Θ(n^2 log n log K) for a grid, and Θ(n^2 log^d n log K) for arbitrary points,
• arbitrary: Θ((nm + n^2 log n) log K),
where $K = (\max_{i=1}^{n} y_i - \min_{i=1}^{n} y_i)/\delta$ and the implied constants for d-dimensional DAGs depend on d.
There has been extensive work on approximations, e.g., [3, 9, 10, 20]. A problem related to the one
solved here is to be given a constant C > 0 and find an approximation with regression error at most C times
that of the optimal error. However, to date there is none with a running time which depends on C but is
independent of the data.
5.4 Constrained Data Values
In certain situations one can use approximate solutions to achieve exact solutions. We first give a bound on
how close different level set values can be for unweighted binary data.
Lemma 5.5 For 1 < p < ∞ there are constants α(p), β(p) > 0 such that if S1 and S2 are sets of no more
than n binary values then either their Lp means are the same or they differ by at least α(p)/nβ(p) .
Proof: Let M(i, j) denote the Lp mean of i 0’s and j 1’s. Simple calculus shows that
\[
M(i, j) = \frac{j^q}{i^q + j^q} \tag{4}
\]
where q = 1/(p − 1). The difference between means, M(i_1, j_1) − M(i_2, j_2), is
\[
\frac{J_1 I_2 - J_2 I_1}{(I_1 + J_1)\cdot(I_2 + J_2)} \tag{5}
\]
where I_1 = i_1^q, J_1 = j_1^q, I_2 = i_2^q, and J_2 = j_2^q. We will obtain a lower bound for this.
When i_1 + j_1 ≤ n and i_2 + j_2 ≤ n, the denominator is no more than 4n^{2q}. Letting A = j_1 i_2 and B = j_2 i_1, the numerator is A^q − B^q, where A and B are integers with 0 ≤ A, B ≤ n^2. Either A = B, in which case the level sets have the same value, or else they differ by at least 1. If q ≤ 1, by concavity and evaluating the derivative of x^q at n^2, |A^q − B^q| ≥ qn^{2q−2}. Thus the absolute value of (5) is either 0 or at least 2qn^{2q−2}/4n^{2q} = α(p)/n^{β(p)} for constants α(p) and β(p). A similar result holds when q > 1, using the fact that x^q is convex and the numerator is at least 1^q − 0^q = 1.
To convert an approximate isotonic regression into an exact one, suppose level set values of the exact
isotonic regression f are in the range a to b and differ by at least δ. Let S = {s1 < s2 . . . < sk } be such
that s1 ≤ a, sk ≥ b, and si+1 ≤ si + δ/2 for 1 ≤ i < k. Let h be an S-valued isotonic regression of the
data. Since consecutive values of S are sufficiently close, no two level sets of f of different mean will round
to the same value of S, even if one rounds up and the other rounds down. Therefore for each vertex v of G,
f (v) is the Lp -mean of the level set of h containing v.
Theorem 5.6 For 1 < p < ∞, if data values are in {0, 1} and weights are in {1, . . . , W }, then for a DAG
G of n vertices the exact Lp isotonic regression can be found in the time given in Theorem 5.4, where log K
is replaced by log nW . The implied constants depend on p.
Further, if data values are in {0, . . . , D} and weights are in {1, . . . , W }, then the exact L2 isotonic
regression can be found in the same time, where log K is replaced by log nDW .
Proof: Sets with ≤ n binary values with weights in {1, . . . , W} have an Lp mean contained in the Lp means of sets with ≤ nW binary values. Therefore means of such sets differ by at least α(p)/(nW)^{β(p)}. If S is multiples of half of this, from 0 to 1, an optimal S-valued isotonic regression will be a sufficiently close approximation so that the true Lp mean of its level sets is the desired regression. The number of elements in S is Θ((nW)^{β(p)}), so only Θ(log nW) rounds of ǫ-scaling are needed to identify the level sets of the exact regression. Replacing their approximate value with the exact mean (Equation (4)) finishes the calculation.
Extending the range of data values for Lp is complicated by the non-analytic form of the Lp mean.
However, for L2 , if data values are in {0, . . . , D} and weights are in {1, . . . , W }, then, by the same analysis
as above, the means of unequal level sets differ by at least 1/(nW )2 , independent of D, and their means are
in the interval [0, D]. Thus at most Θ(log nDW ) iterations are required.
Figure 3: Converting a DAG with multiple values per vertex into one with a single value per vertex
6 Multiple Values per Vertex
Several authors have considered the extension where there are multiple values at each vertex, i.e., at vertex vi
there is a set of weighted values {(yi,j , wi,j ) : 1 ≤ j ≤ ni } for some ni ≥ 1, and the Lp isotonic regression
problem, 1 ≤ p < ∞, is to find the isotonic function z that minimizes
\[
\Bigl( \sum_{i=1}^{n} \sum_{j=1}^{n_i} w_{i,j}\, |y_{i,j} - z_i|^p \Bigr)^{1/p} \tag{6}
\]
among all isotonic functions. For L2 this simplifies to single-valued isotonic regression where at each vertex the value is the weighted mean of the values and the weight is the sum of the weights, but this is no longer true when p ≠ 2. Let $N = \sum_{i=1}^{n} n_i$.
Robertson and Wright [19] studied the statistical properties of L1 isotonic regression with multiple
values and two independent variables, but no algorithm for finding it was given. Chakravarti [5] gave
a Θ(N n) time algorithm for L1 isotonic regression for a tree, and for a linear order Pardalos, Xue and Yong [16] gave one taking Θ(N log^2 N) time, and asked if Θ(N log N) was possible.
A simple approach to this problem is to transform it into one where there is a single value per vertex. Given a DAG G with multiple values per vertex, create a new DAG G′ with one value per vertex as follows: if vertex v_i ∈ G has k weighted values (y_{i,j}, w_{i,j}), 1 ≤ j ≤ k, then in G′ represent v_i with vertices v_{i,j}, 1 ≤ j ≤ k; represent edge (v_i, v_ℓ) in G with edge (v_{i,k}, v_{ℓ,1}) in G′; and add edges (v_{i,j}, v_{i,j+1}) to G′ for 1 ≤ j < k. Vertex v_{i,j} has the weighted value (y′_{i,j}, w′_{i,j}), where y′_{i,j} is the jth data value at v_i in decreasing order and w′_{i,j} is its corresponding weight. See Figure 3. G′ has N vertices and m + N − n edges. An optimal isotonic regression on G′ will have the same value for v_{i,1}, . . . , v_{i,k}, and this is the value for v_i ∈ G in an optimal solution to (6). The initial sorting at the vertices takes Θ(N log N) time.
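A Python sketch of this transformation (the vertex and edge representations are choices of this sketch): each original vertex becomes a chain carrying its values in decreasing order, and each original edge connects the end of one chain to the start of the next:

```python
def expand_multivalue_dag(edges, values):
    """Convert a DAG with multiple weighted values per vertex into one with a
    single weighted value per vertex.  edges: list of (u, v) pairs of original
    vertex names; values[v]: list of (y, w) pairs at v.  Vertex v is replaced by
    the chain (v, 1), ..., (v, k) carrying v's values in decreasing order of y."""
    new_values, new_edges, last = {}, [], {}
    for v, vals in values.items():
        ordered = sorted(vals, key=lambda yw: -yw[0])   # decreasing data values
        for j, (yv, wv) in enumerate(ordered, start=1):
            new_values[(v, j)] = (yv, wv)
            if j > 1:
                new_edges.append(((v, j - 1), (v, j)))  # chain edge within v
        last[v] = (v, len(ordered))
    for u, v in edges:                                  # edge (u, v) becomes (u_k, v_1)
        new_edges.append((last[u], (v, 1)))
    return new_edges, new_values

edges, vals = expand_multivalue_dag(
    [("a", "b")],
    {"a": [(6.1, 0.2), (2.1, 0.7), (1.5, 0.9)], "b": [(5.1, 0.6), (2.0, 1.1)]})
print(edges)
```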
An advantage of this approach is that one can invoke whatever algorithm is desired on the resulting DAG.
When the initial DAG was a tree the new DAG is as well, and thus the L1 regression can be determined in
Θ(N log N ) time. This answers the question of Pardalos et al. and improves upon Chakravarti’s result.
For L1 partitioning algorithms another approach is to merely use all N values for the partitioning,
resulting in running times that are the same as in Theorem 5.4 where log K is replaced with log N and an
N log N term is added, representing the initial sorting of the values and the evaluation of regression error
throughout the algorithm. For large N , to reduce the time yet further this approach will be refined.
Theorem 6.1 Given a DAG G with multiple weighted values per vertex, with N total values, an L1 isotonic
regression can be found in the same running time as in Theorem 5.4, where log K is replaced by log N and
an (N + n log N ) log log N term is added.
Further, if G is fixed and only N varies then the regression can be found in Θ(N ) time.
Proof: It will be shown that O(log N ) stages are needed and that the total time to initialize data structures,
find the values used for the binary regressions, and evaluate the regression error, is O((N +n log N ) log log N ).
This, plus the time to solve the binary regressions on the DAG, yields the results claimed.
For V [a, b], a and b now refer to regression values a and b, rather than their ranks as in Section 2, since
their global rank will not be known. The values in (a, b) at the vertices in V [a, b] will be called active values.
Rather than finding a median z of all values in (a, b), it suffices to choose an approximate median among
the active values. This can be achieved by determining at each vertex in V [a, b] the unweighted median of
its active values and weighting this by the number of active values it has. Let z be the weighted median of
these medians, breaking ties by choosing the smaller value, and let z ′ be the next largest value at vertices in
V [a, b] (if z is b then set z ′ to be b and z the next smallest value). No more than 3/4 of the active values in
V [a, b] can be in (a, z) and no more than 3/4 are in (z ′ , b). Hence, no matter how the vertices are partitioned
into V [a, z] and V [z ′ , b], in each of them the number of active values is no more than 3/4 of those in V [a, b],
and thus there are at most Θ(log N ) iterations until there are no active values. A final partitioning step
determines which vertices should have regression value a and which should be b.
Initially the values at vertex vi are partially sorted into lg N “bags” of size ni / lg N , where the values in
each bag are smaller than the values in the next. For each bag the smallest value, number of values, and sum
of weights in the bag are determined. The bags are organized into a binary tree using their lowest value as
the key. At each node of the tree is kept the sum of the weights and the number of values in the bags below.
This can all be done in Θ(ni log log N ) time at vi , and Θ(N log log N ) time overall.
Finding the median of the vertex’s active values, the next largest value above z, and the relative error of
using z vs. z ′ , can each be determined by a traversal through the tree and a linear time operation on at most
two bags (e.g., to find the relative error of using z vs. z ′ one needs to determine it for values inside the bag
where z would go, and then use the tree to determine it for values outside the bag). Thus the time at vi at
each stage is Θ(ni / log N + log log N ), and Θ(N/ log N + n log log N ) over all vertices. Since the number
of stages is Θ(log N ), this proves the statement at the beginning of the proof.
To prove the claim concerning a fixed DAG and varying N no bags are used. Each stage involves
multiple iterations, n per stage, where for each vertex the unweighted median of its active values is used for
one of the iterations. After each stage the number of active values at a node is no more than half of what
it was to begin with, and thus stage k takes Θ(n(T(G) + N/2^k)) time, where T(G) is the time to solve a
binary-valued L1 isotonic regression on G. Thus the total time is Θ(n(T (G) log N + N )), which is Θ(N )
since G and n are fixed.
7 Final Comments
Table 2 is an update of Table 1, incorporating the results in this paper. Results for Lp were not included in
Table 1 since polynomial time algorithms had only appeared for linear orderings [1, 7, 26]. The semi-exact
algorithms, and the exact L2 algorithms in [14, 21], have a worst-case behavior where each partitioning
stage results in one piece having only a single vertex. Partitioning using the mean probably works well in
practice, but a natural question is whether the worst-case time can be reduced by a factor of Θ̃(n).
Table 2: Times of fastest known isotonic regression algorithms
(+: shown here; *: implied by the result for arbitrary ordered sets;
K = (max_{i=1}^{n} y_i − min_{i=1}^{n} y_i)/δ; T_p: total data-dependent time to determine Lp means;
“semi-exact” is exact for p = 2, 3, 4, 5, and T_2 = T_3 = T_4 = T_5 = 0)

linear:
   L1: Θ(n log n) [1, 22]
   Lp semi-exact, 1 < p < ∞: Θ(n + T_p) p even, Θ(n log n + T_p) p odd, Θ(n^2 + T_p) p arbitrary (PAV, +)
   Lp δ approx, 1 < p < ∞: Θ(n log K) [1]
   L∞: Θ(n log n) *
tree:
   L1: Θ(n log n) +
   Lp semi-exact: Θ(n log n + T_p) p integer, Θ(n^2 + T_p) p arbitrary ([15] for p = 2, +)
   Lp δ approx: Θ(n log K) +
   L∞: Θ(n log n) *
2-dim grid:
   L1: Θ(n log n) +
   Lp semi-exact: Θ(n^2 + T_p) ([21] for p = 2, +)
   Lp δ approx: Θ(n log K) +
   L∞: Θ(n log n) *
2-dim arbitrary:
   L1: Θ(n log^2 n) +
   Lp semi-exact: Θ(n^2 log n + T_p) +
   Lp δ approx: Θ(n log n log K) +
   L∞: Θ(n log^2 n) [24]
d ≥ 3 dim grid:
   L1: Θ(n^2 log n) *
   Lp semi-exact: Θ(n^3 log n + T_p) *
   Lp δ approx: Θ(n^2 log n log K) *
   L∞: Θ(n log n) *
d ≥ 3 dim arbitrary:
   L1: Θ(n^2 log^d n) +
   Lp semi-exact: Θ(n^3 log^d n + T_p) +
   Lp δ approx: Θ(n^2 log^d n log K) +
   L∞: Θ(n log^d n) [24]
arbitrary:
   L1: Θ(nm + n^2 log n) [2]
   Lp semi-exact: Θ(n^2 m + n^3 log n + T_p) +
   Lp δ approx: Θ((nm + n^2 log n) log K) +
   L∞: Θ(m log n) [24]
Partitioning can also improve the time for L1 isotonic regression for unweighted data on an arbitrary
DAG G = (V, E). Partitioning on values a < b is equivalent to finding a minimum vertex cover on the
directed bipartite graph H = (V, E ′ ) with vertex partition V1 , V2 , where V1 is those vertices with data ≤ a
and V2 is the vertices with data ≥ b. There is an edge (v1 , v2 ) ∈ E ′ , with v1 ∈ V1 and v2 ∈ V2 , iff v2 ≺ v1
in G, i.e., iff they violate the isotonic constraint. Vertices in the minimum cover correspond to those where
the value of the {a, b}-valued regression is the further of a or b, instead of the closer. A simple analysis
using the fact that the cover is minimal shows that no new violating pairs are introduced.
For unweighted bipartite graphs the maximum matching algorithm of Hopcroft and Karp takes Θ(E√V) time, from which
a minimum vertex cover can be constructed in linear time. H can be constructed in linear time from the
transitive closure of G, and since the transitive closure can be found in O(n^2.376) time (albeit with infeasible constants), the total time over all Θ(log n) partitioning stages is Θ(n^2.5 log n). For multidimensional DAGs this can be reduced to Θ̃(n^1.5) [25]. However, this approach does not help for Lp regression since the binary
L1 regression problems generated are weighted even when the data is unweighted.
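One partitioning step for unweighted data can be sketched with off-the-shelf routines; the following Python sketch uses networkx's transitive closure, Hopcroft–Karp matching, and König-theorem vertex cover, so it illustrates the construction described above rather than the asymptotically fast implementation:

```python
import networkx as nx

def unweighted_binary_step(dag, data, a, b):
    """One partitioning step for unweighted L1 isotonic regression on a DAG.
    dag: nx.DiGraph; data: dict vertex -> value; a < b consecutive candidate values.
    Returns dict vertex -> a or b for the vertices whose data is not in (a, b)."""
    closure = nx.transitive_closure(dag)
    V1 = [v for v in dag if data[v] <= a]          # closer value is a
    V2 = [v for v in dag if data[v] >= b]          # closer value is b
    H = nx.Graph()
    H.add_nodes_from(V1, bipartite=0)
    H.add_nodes_from(V2, bipartite=1)
    # one edge for each violating pair: v2 precedes v1 in the DAG
    H.add_edges_from((v1, v2) for v1 in V1 for v2 in V2 if closure.has_edge(v2, v1))
    matching = nx.bipartite.hopcroft_karp_matching(H, top_nodes=V1)
    cover = nx.bipartite.to_vertex_cover(H, matching, top_nodes=V1)
    assign = {v: (b if v in cover else a) for v in V1}   # cover takes the further value
    assign.update({v: (a if v in cover else b) for v in V2})
    return assign
```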
The L1 median, and hence L1 isotonic regression, is not always unique, and a natural question is whether
there is a “best” one. A similar situation occurs for L∞ isotonic regression. Algorithms have been developed
for strict L∞ isotonic regression [23, 25], which is the limit, as p → ∞, of Lp isotonic regression. This is
also known as “best best” L∞ regression [13]. Analogously one can define strict L1 isotonic regression to be
the limit, as p → 1+ , of Lp isotonic regression. It can be shown to be well defined, and it is straightforward
to derive the appropriate median for a level set [11]. However, like Lp means, this median cannot always
be computed exactly. Apparently no algorithms have yet been developed for strict L1 isotonic regression,
though PAV is still applicable for linear orders and trees.
One can view Proposition 3.2, concerning complete bipartite graphs, and Section 6, concerning multiple
values per vertex, as being related. The complete bipartite graph partitions the vertices into sets P1 and
P2 . Viewing P1 and P2 as two nodes in a DAG with an edge from P1 to P2 , then the isotonic requirement
in Proposition 3.2 is that no element in P1 has a regression value greater than the regression value of any
element of P2 , but there are no constraints among elements of P1 nor among elements of P2 . In contrast,
the model in Section 6 corresponds to requiring that every element in P1 has the same regression value,
and similarly for P2 . The interpretation of multiple values per vertex as partitioning elements, with no
constraints among elements in the same partition, can be extended to arbitrary DAGs, resulting in each node
of the DAG having an interval of regression values, rather than a single one. A technique similar to that in
Section 6 can be used to convert this problem into a standard isotonic regression with one value per vertex,
but the special structure can likely be exploited to produce faster algorithms.
Finally, the classification problem called isotonic separation [6] or isotonic minimal reassignment [8] is
an isotonic regression problem. In the simplest case, suppose there are two types of objects, red and blue,
and at each vertex there is an observation of a red or blue object. There is a penalty for misclassifying a
red object as blue, and a perhaps different penalty for the opposite error. The goal is to minimize the sum
of the misclassification penalties given that red < blue and the classification must be isotonic. This is a
straightforward binary-valued L1 isotonic regression problem. In the general case with k ordered categories
there is a penalty for misclassifying an object of type i as being of type j. Unfortunately, penalties as
simple as 0-1 correct/incorrect do not satisfy the partitioning property in Lemma 2.1. However, for trees, the
dynamic programming approach used in Section 3 can be modified to find an optimal isotonic separation in
a single pass taking Θ(kn) time. The time required for arbitrary DAGs is unknown.
Acknowledgements
The author thanks the referees for their helpful suggestions.
References
[1] Ahuja, RK and Orlin, JB (2001), “A fast scaling algorithm for minimizing separable convex functions
subject to chain constraints”, Operations Research 49, pp. 784–789.
[2] Angelov, S, Harb, B, Kannan, S, and Wang, L-S (2006), “Weighted isotonic regression under the L1
norm”, Symposium on Discrete Algorithms (SODA), pp. 783–791.
[3] Barlow, RE, Bartholomew, DJ, Bremner, JM, and Brunk, HD (1972), Statistical Inference Under Order
Restrictions, John Wiley.
[4] Brunk, HD (1955), “Maximum likelihood estimates of monotone parameters”, Annals of Mathematical
Statistics 26, pp. 607–616.
[5] Chakravarti, N (1992), “Isotonic median regression for orders represented by rooted trees”, Naval
Research Logistics 39, pp. 599–611.
[6] Chandrasekaran, R, Rhy, YU, Jacob, VS, and Hong, S (2005), “Isotonic separation”, INFORMS Journal on Computing 17, pp. 462–474.
[7] Chepoi, V, Cogneau, D, and Fichet, B (1997), “Polynomial algorithms for isotonic regression”, L1
Statistical Procedures and Related Topics, Lecture Notes-Monograph Series vol. 31, Institute of Mathematical Statistics, pp. 147–160.
[8] Dembczynski, K, Greco, S, Kotlowski, W, and Slowinski, R (2007), “Statistical model for rough set
approach to multicriteria classification”, PKDD 2007: 11th European Conference on Principles and
Practice of Knowledge Discovery in Databases, Springer Lecture Notes in Computer Science 4702,
pp. 164–175.
[9] Dykstra, RL and Robertson, T (1982), “An algorithm for isotonic regression of two or more independent variables”, Annals of Statistics 10, pp. 708–716.
[10] Gebhardt, F (1970), “An algorithm for monotone regression with one or more independent variables”,
Biometrika 57, pp. 263–271.
[11] Jackson, D (1921), “Note on the median of a set of numbers”, Bulletin of the American Mathematical
Society 27, pp. 160–164.
[12] Kaufman, Y and Tamir, A (1993), “Locating service centers with precedence constraints”, Discrete
Applied Mathematics 47, pp. 251–261.
[13] Legg, D., Townsend, D. (1984), “Best monotone approximation in L∞ [0,1]”, Journal of Approximation Theory 42, pp. 30–35.
[14] Maxwell, WL and Muckstadt, JA (1985), “Establishing consistent and realistic reorder intervals in
production-distribution systems”, Operations Research 33, pp. 1316–1341.
[15] Pardalos, PM and Xue, G (1999), “Algorithms for a class of isotonic regression problems”, Algorithmica 23, pp. 211–222.
[16] Pardalos, PM, Xue, G-L, and Yong, L (1995), “Efficient computation of an isotonic median regression”, Applied Mathematics Letters 8, pp. 67–70.
[17] Punera, K and Ghosh, J (2008), “Enhanced hierarchical classification via isotonic smoothing”, Proceedings of the International Conference on the World Wide Web 2008, pp. 151–160.
[18] Qian, S (1996), “An algorithm for tree-ordered isotonic median regression”, Statistics and Probability
Letters 27, pp. 195–199.
[19] Robertson, T and Wright, FT (1973), “Multiple isotonic median regression”, Annals of Statistics 1,
pp. 422–432.
[20] Robertson, T, Wright, FT, and Dykstra, RL (1988), Order Restricted Statistical Inference, Wiley.
[21] Spouge J, Wan H, and Wilbur WJ (2003), “Least squares isotonic regression in two dimensions”,
Journal of Optimization Theory and Applications 117, pp. 585–605.
[22] Stout, QF (2008), “Unimodal regression via prefix isotonic regression”, Computational Statistics and
Data Analysis 53, pp. 289–297. A preliminary version appeared in “Optimal algorithms for unimodal
regression”, Computing and Statistics 32, 2000.
[23] Stout, QF (2012), “Strict L∞ isotonic regression”, Journal of Optimization Theory and Applications
152, pp. 121–135.
[24] Stout, QF (2012), “Weighted L∞ isotonic regression”, submitted.
www.eecs.umich.edu/˜qstout/abs/LinfinityIsoReg.html
[25] Stout, QF (2012), “Isotonic regression for multiple independent variables”, submitted.
www.eecs.umich.edu/˜qstout/abs/MultidimIsoReg.html
[26] Strömberg, U (1991), “An algorithm for isotonic regression with arbitrary convex distance function”,
Computational Statistics and Data Analysis 11, pp. 205-219.
[27] Thompson, WA Jr. (1962), “The problem of negative estimates of variance components”, Annals of
Mathematical Statistics 33, pp. 273–289.
A Computing the Lp Mean
The computation of the Lp -mean of a level set for general p is more complicated than for p = 1, 2, or ∞.
We are interested in identifying those aspects which might be more than linear in the time of the remainder
of the algorithm. We assume that the data values have been presorted.
Simple calculus [11] shows that the Lp mean of the data (y, w) is the unique C such that
\[
\sum_{y_i > C} w_i (y_i - C)^{p-1} \;=\; \sum_{y_j < C} w_j (C - y_j)^{p-1} \tag{7}
\]
which does not, in general, have an analytic solution. An important aspect is to determine the data values
that bracket the Lp mean, i.e., to determine which are in the RHS and which are on the LHS of (7).
Given the bracketing values, the time to find the mean to desired accuracy may depend upon the data
and we make no assumptions concerning it. Given p, n, and data (y, w), the total time to find all Lp
means used in the algorithm is decomposed into a part independent of the data and the remaining portion,
denoted Tp (n, y, w), that is data dependent. For general p the data independent portion requires evaluating
(7) for each set at least once per partitioning stage, and thus Θ(n) time per stage. Since there can be Θ(n)
stages in each of the algorithms in Section 5, the data independent time for finding means is Θ(n2 ) once
the bracketing values are known. By first partitioning on data values, all of the partitioning algorithms can
determine the bracketing values in time no more than the remaining time of the algorithm.
If p is an even integer, (7) reduces to finding the unique solution of $\sum_{i=1}^{n} w_i (y_i - C)^{p-1} = 0$, i.e., no
bracketing values nor presorting are required. This is a sum of (p−1)st degree polynomials, and hence can
be determined by summing up, over the data values, the coefficients of each term and then evaluating the
result. For p an odd integer once again the terms are polynomials so, given the bracketing values, one can
use sums of coefficients to evaluate the summations in (7). Since roots of polynomials of degree ≤ 4 can be
found exactly in constant time, T2 = T3 = T4 = T5 = 0.
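A Python sketch of the even-p case just described (numpy's polynomial root finder stands in for whatever root-finding accuracy is desired, and is an assumption of this sketch): the coefficients of the summed polynomial are accumulated and its unique real root is returned:

```python
from math import comb
import numpy as np

def lp_mean_even_p(y, w, p):
    """Lp mean of weighted data for even integer p: the unique real root of
    sum_i w_i (y_i - C)**(p-1) = 0, found by summing polynomial coefficients."""
    assert p >= 2 and p % 2 == 0
    d = p - 1                                     # odd degree, so exactly one real root
    total = np.zeros(d + 1)                       # coefficients, highest power of C first
    for yi, wi in zip(y, w):
        # w_i (y_i - C)^d = w_i * sum_k comb(d, k) * y_i^(d-k) * (-C)^k
        total += np.array([wi * (-1) ** k * comb(d, k) * yi ** (d - k)
                           for k in range(d, -1, -1)])
    roots = np.roots(total)
    return float(roots[np.abs(roots.imag) < 1e-9].real[0])

print(lp_mean_even_p([0.0, 1.0, 3.0], [1.0, 1.0, 1.0], p=2))   # 4/3, the weighted mean
```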
For PAV sets are merged rather than partitioned. This does not change the data independent time for
general p, but for p an even integer each set is reduced to p + 1 coefficients and merging sets merely requires
adding coefficients. Hence the data independent time is Θ(n). For p odd the situation is a bit more complex
since the bracketing values are needed. One can construct a balanced search tree where data values are keys
and at each node, for each term there is the sum of the coefficients beneath it. The only change from being
on the LHS vs. the RHS of (7) is whether the term has a plus or minus sign, and hence, given this tree,
one can descend through the tree and determine the bracketing data values in logarithmic time. Merging
sets corresponds to merging their trees, including maintaining the information about sums of coefficients in
subtrees. With careful merging this can be accomplished in Θ(n log n) total time.