A Bilinear Algorithm for Sparse
Representations
Pando Georgiev¹, Panos Pardalos², Fabian Theis³

¹ University of Cincinnati, ECECS Department, Cincinnati ML 0030, Ohio 45221, USA. E-mail: pgeorgie@ececs.uc.edu
² University of Florida, Center for Applied Optimization. E-mail: pardalos@ufl.edu
³ Institute of Biophysics, University of Regensburg, D-93040 Regensburg, Germany. E-mail: fabian@theis.name
Abstract

We consider the following sparse representation problem: represent a given matrix X ∈ IR^{m×N} as a product X = AS of two matrices A ∈ IR^{m×n} (m ≤ n < N) and S ∈ IR^{n×N}, under the requirements that all m × m submatrices of A are nonsingular and S is sparse in the sense that each column of S has at least n − m + 1 zero elements. It is known that, under some mild additional assumptions, such a representation is unique up to scaling and permutation of the rows of S. We show that finding A (the most difficult part of such a representation) can be reduced to a hyperplane clustering problem. We present a bilinear algorithm for this clustering which is robust to outliers. A computer simulation example is presented showing the robustness of our algorithm.

Keywords: Sparse Component Analysis, Blind Source Separation, underdetermined mixtures
1 Introduction

Representation of data has become a fundamental question in data analysis, signal processing, data mining, neuroscience, etc., since the volume of collected data grows exponentially. In the present paper we consider a linear representation of a given matrix exploiting sparsity. Such a linear representation of a data set X (given in the form of an (m × N)-matrix) is
X = AS,    A ∈ IR^{m×n}, S ∈ IR^{n×N}.    (1)
Under different assumptions on the unknown matrices A (dictionary) and S
(source signals) this linear representation is known as:
1) the Independent Component Analysis (ICA) problem, if the rows of S are (discrete) random variables which are statistically as independent as possible;
2) Nonnegative Matrix Factorization (NMF) (see [14]), if the elements of X, A and S are nonnegative;
3) the Sparse Representation or Sparse Component Analysis (SCA) problem, if S contains as many zeros as possible.
There is a large number of papers devoted to ICA problems (see for instance [7], [12] and references therein), but mostly for the case m ≥ n. We refer to [13, 16–18] and references therein for some recent papers on SCA and underdetermined ICA (m < n).
A related problem is the so called Blind Source Separation (BSS) problem,
in which we know a priori that a representation such as in equation (1) exists
and the task is to recover the sources (and the mixing matrix) as accurately
as possible. A fundamental property of the complete BSS problem is that
such a recovery (under assumptions in 1) and non-Gaussianity of the sources)
is unique up to permutation and scaling of the sources, which makes the BSS
problem so attractive. Similar uniqueness of the recovery is valid if instead
of independence we assume some sparsity of the source matrix (see the next
section for a review of this method).
The main contribution of this paper is a new algorithm for such recovery under sparsity assumptions, which is robust to outliers. We have to note the fundamental difference between recovery of sparse signals by our method and by l1-norm minimization. The l1-norm minimization gives solutions which generally have at most m non-zeros [8], [9] and in some special cases gives the sparsest solution [9]. We utilize the fact that for almost all data vectors x (in the sense of measure) such that the system x = As has a sparse solution with fewer than m nonzero elements, this solution is unique, while in [9] the authors proved that for all data vectors x such that the system x = As has a sparse solution with fewer than Spark(A)/2 nonzero elements, this solution is unique. Note that Spark(A) ≤ m + 1, where Spark(A) is the smallest number of linearly dependent columns of A. So our assumption for uniqueness of the representation (up to permutation and scaling of the sources) is less restrictive, since we allow more nonzero elements in the solution, namely up to m − 1.
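The notion of Spark(A) can be made concrete with a short brute-force computation. This is our own illustrative sketch in Python, not part of the paper; the exhaustive search over column subsets is only practical for small n.

```python
import numpy as np
from itertools import combinations

def spark(A, tol=1e-10):
    """Smallest number of linearly dependent columns of A (brute force).

    Checks all column subsets of increasing size and returns the size of
    the first rank-deficient one; returns n + 1 if every set of columns
    is linearly independent (only possible when n <= m).
    """
    m, n = A.shape
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            if np.linalg.matrix_rank(A[:, idx], tol=tol) < k:
                return k
    return n + 1

# For this 2 x 3 matrix every pair of columns is independent,
# so Spark(A) = m + 1 = 3, the maximal possible value.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
print(spark(A))  # 3
```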
2 Blind Source Separation based on sparsity assumptions
In this section we present a method (see [11]) for solving the BSS problem if
the following assumptions are satisfied:
A1) the mixing matrix A ∈ IRm×n has the property that any square m × m
submatrix of it is nonsingular;
A2) each column of the source matrix S has at most m − 1 nonzero elements;
A3) the sources are sufficiently richly represented in the following sense: for any index set of n − m + 1 elements I = {i1, ..., i_{n−m+1}} ⊂ {1, ..., n} there exist at least m column vectors of the matrix S such that each of them has zero elements in the places with indexes in I, and each m − 1 of them are linearly independent.
2.1 Matrix identification
We describe conditions in the sparse BSS problem under which we can identify the mixing matrix uniquely up to permutation and scaling of the columns.
2.1.1 General case – full identifiability
Theorem 1. (Identifiability conditions – general case) Assume that in
the representation X = AS the matrix A satisfies condition A1), the matrix
S satisfies conditions A2) and A3) and only the matrix X is known. Then
the mixing matrix A is identifiable uniquely up to permutation and scaling
of the columns.
2.2 Identification of sources
Theorem 2. (Uniqueness of sparse representation) Let H be the set
of all x ∈ IRm such that the linear system As = x has a solution with at
least n − m + 1 zero components. If A fulfills A1), then there exists a subset
H0 ⊂ H with measure zero with respect to H, such that for every x ∈ H \ H0
this system has no other solution with this property.
From Theorem 2 it follows that the sources are identifiable generically, i.e. up to a set with measure zero, if they have a level of sparseness greater than or equal to n − m + 1 (i.e., each column of S has at least n − m + 1 zero elements) and the mixing matrix is known.
Algorithm 1: SCA matrix identification algorithm
Data: samples x(1), ..., x(T) of X
Result: estimated mixing matrix Â
1. Hyperplane identification. Cluster the columns of X in (n choose m−1) groups H_k, k = 1, ..., (n choose m−1), such that the span of the elements of each group H_k produces one hyperplane and these hyperplanes are different.
2. Matrix identification. Cluster the normal vectors to these hyperplanes in the smallest number of groups G_j, j = 1, ..., n (which gives the number of sources n), such that the normal vectors to the hyperplanes in each group G_j lie in a new hyperplane Ĥ_j.
3. Calculate the normal vectors â_j to each hyperplane Ĥ_j, j = 1, ..., n.
4. The matrix Â with columns â_j is an estimate of the mixing matrix (up to permutation and scaling of the columns).
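The basic primitive in both steps of Algorithm 1 is estimating the normal vector of a hyperplane through the origin from points that (approximately) span it. A minimal sketch, assuming numpy and an SVD-based estimate (the helper name is ours, not from the paper):

```python
import numpy as np

def hyperplane_normal(points):
    """Unit normal of the hyperplane through the origin that the columns
    of `points` (an m x k array, k >= m - 1) approximately span.

    The normal is the left singular vector belonging to the smallest
    singular value: the direction least represented in the data.
    """
    u, _, _ = np.linalg.svd(points)
    return u[:, -1]

# Hypothetical toy group H_k: columns satisfying x + y - z = 0 in R^3
H_k = np.array([[1.0, 0.0, 2.0],
                [0.0, 1.0, 1.0],
                [1.0, 1.0, 3.0]])
w = hyperplane_normal(H_k)
print(w / w[0])  # proportional to (1, 1, -1), the plane's normal
```

With noisy data the same call returns the total-least-squares normal, which is what the clustering steps need in practice.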
Algorithm 2: Source Recovery Algorithm
Data: samples x(1), ..., x(N) (column vectors) of the data matrix X, and the mixing matrix A
Result: estimated source matrix S
1. Identify the set of hyperplanes H produced by taking the linear hull of every subset of m − 1 columns of A;
2. Repeat for i = 1 to N:
3. Identify the subspace H ∈ H containing x_i := X(:, i), or, in the practical situation with noise present, identify the one to which the distance from x_i is minimal and project x_i onto H to obtain x̃_i;
4. If H is produced by the linear hull of the column vectors a_{i_1}, ..., a_{i_{m−1}}, then find coefficients λ_{i,j} such that

x̃_i = Σ_{j=1}^{m−1} λ_{i,j} a_{i_j}.

These coefficients are uniquely determined if x̃_i does not belong to the set H0 of measure zero with respect to H (see Theorem 2);
5. Construct the solution S(:, i): it contains λ_{i,j} in place i_j for j = 1, ..., m − 1; the rest of its components are zero.
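Algorithm 2 can be rendered compactly in code. The following is our own Python sketch (function name and interface are assumptions): it scans all (m−1)-subsets of columns of A, which is feasible only for small n, and solves for the coefficients by least squares, which doubles as the projection onto the nearest hyperplane in the noisy case.

```python
import numpy as np
from itertools import combinations

def recover_sources(X, A):
    """Sketch of Algorithm 2: for each column of X, find the nearest
    hyperplane spanned by m - 1 columns of A and solve for the m - 1
    coefficients by least squares; all other entries stay zero."""
    m, n = A.shape
    S = np.zeros((n, X.shape[1]))
    subsets = list(combinations(range(n), m - 1))
    for i in range(X.shape[1]):
        x = X[:, i]
        best_idx, best_res, best_lam = None, np.inf, None
        for idx in subsets:
            B = A[:, idx]                       # basis of one hyperplane
            lam, *_ = np.linalg.lstsq(B, x, rcond=None)
            r = np.linalg.norm(B @ lam - x)     # distance of x to it
            if r < best_res:
                best_idx, best_res, best_lam = idx, r, lam
        S[list(best_idx), i] = best_lam         # m - 1 coefficients
    return S
```

For data generated exactly as X = AS with (m−1)-sparse columns of S and a generic A, this recovers S exactly, in line with Theorem 2.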
3 Sparse Component Analysis
In this section we review a method from [11] for solving the SCA problem.
Now the conditions are formulated only in terms of the data matrix X.
Theorem 3. (SCA conditions) Assume that m ≤ n ≤ N and the matrix X ∈ IR^{m×N} satisfies the following conditions:
(i) the columns of X lie in the union H of (n choose m−1) different hyperplanes, each column lies in only one such hyperplane, and each hyperplane contains at least m columns of X such that each m − 1 of them are linearly independent;
(ii) for each i ∈ {1, ..., n} there exist p = (n−1 choose m−2) different hyperplanes {H_{i,j}}_{j=1}^p in H such that their intersection L_i = ∩_{j=1}^p H_{i,j} is a one-dimensional subspace;
(iii) any m different L_i span the whole IR^m.
Then the matrix X is representable uniquely (up to permutation and scaling of the columns of A and rows of S) in the form X = AS, where the matrices A ∈ IR^{m×n} and S ∈ IR^{n×N} satisfy condition A1) and conditions A2), A3) respectively.
4 Skeletons of a finite set of points
Let X be a finite set of points represented by the columns {x_j}_{j=1}^N of the matrix X ∈ R^{m×N}. The solution {(w_i^0, b_i^0)}_{i=1}^n of the following minimization problem:

minimize Σ_{j=1}^N min_{1≤i≤n} |w_i^T x_j − b_i| subject to ||w_i|| = 1, b_i ∈ R, i = 1, ..., n,    (2)

defines the w^(1)-skeleton of X (introduced in [15]). It consists of a union of n hyperplanes H_i = {x ∈ R^m : w_i^T x = b_i}, i = 1, ..., n, such that the sum of the minimum distances of every point x_j to them is minimal.
Analogously, the solution of the following minimization problem:

minimize Σ_{j=1}^N min_{1≤i≤n} |w_i^T x_j − b_i|^2 subject to ||w_i|| = 1, b_i ∈ R, i = 1, ..., n,    (3)

defines the w^(2)-skeleton of X (introduced in [3]). It consists of a union of n hyperplanes {H_i}_{i=1}^n (defined as above) such that the sum of the squared minimum distances of every point x_j to them is minimal.
It is clear that if the representation X = AS is sparse (in the sense that each column of S contains at most m − 1 non-zero elements, i.e. it satisfies condition A2)), then the two skeletons defined above coincide and the data points (columns of X) lie on them.
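For concreteness, both skeleton costs can be evaluated for a candidate set of hyperplanes with a few lines of numpy (a sketch; the helper name and interface are ours):

```python
import numpy as np

def skeleton_costs(X, W, b):
    """Costs of problems (2) and (3) for the candidate hyperplanes
    {x : w_i^T x = b_i}, with unit normals in the columns of W."""
    R = np.abs(W.T @ X - b[:, None])   # R[i, j] = distance of x_j to plane i
    d = R.min(axis=0)                  # nearest-plane distance per point
    return d.sum(), (d ** 2).sum()     # w^(1) cost and w^(2) cost

# Columns lying on the two coordinate axes of R^2, whose normals are
# the standard basis vectors: both costs vanish, as for sparse X = AS.
X = np.array([[0.0, 2.0, 0.0],
              [3.0, 0.0, -1.0]])
W = np.eye(2)
print(skeleton_costs(X, W, np.zeros(2)))  # both costs are zero
```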
Conversely, if the two skeletons defined above coincide, the columns of X lie on them and the conditions of Theorem 3 are satisfied, then finding these skeletons leads to finding the mixing matrix and, subsequently, a sparse representation of X in the form (1), which is unique (up to scaling and permutation of the rows of S).
5 Reduction of the clustering problem to a bilinear optimization problem
Below we present a bilinear algorithm for finding a modified skeleton, replacing the condition ||w_i|| = 1 in (2) with the condition w_i^T e = 1, where the vector e has norm one.
Proposition 4. Assume that the w^(1)-skeleton of X is unique and any column of X lies in only one hyperplane of this skeleton. If the minimum value of the cost function in the clustering problem (2) is zero and b_i = 0, then finding the w^(1)-skeleton is equivalent to solving the following bilinear minimization problem (with variables t_{ij} ∈ R, u_{ij} ∈ R, w_i ∈ R^m):

minimize Σ_{j=1}^N Σ_{i=1}^n t_{ij} u_{ij}    (4)

under the constraints

−u_{ij} ≤ w_i^T x_j ≤ u_{ij},  ∀i, j,    (5)

Σ_{i=1}^n t_{ij} = 1,  t_{ij} ≥ 0,  ∀i, j,    (6)

w_i^T e = 1,  ∀i.    (7)
The solution of (4)–(7) is unique, and for any j exactly one of the components t_{ij}, i = 1, ..., n, is one, the rest being zero; i.e. the matrix T = {t_{ij}} gives the cluster assignment.
Remark. A necessary condition for the minimum value of the cost function in the minimization problem (4) to be zero is that x_j is not collinear to the vector e for any j.
Proof of Proposition 4. Let {w_i}_{i=1}^n be the solution of (2). Put t_{ij} = 1 if i = argmin_i |w_i^T x_j| and t_{ij} = 0 otherwise. Such an argmin is uniquely defined by the assumption that any column of X lies in only one hyperplane of the skeleton. Putting also w_i' = w_i / w_i^T e and u_{ij} = |w_i'^T x_j|, a solution of (4)–(7) is given by w_i', t_{ij}, u_{ij}, i = 1, ..., n, j = 1, ..., N.
Conversely, let w_i', t_{ij}, u_{ij}, i = 1, ..., n, j = 1, ..., N, be a solution of (4)–(7) (with minimum value zero). By (6), for any j there is at least one index i_j such that t_{i_j j} ≠ 0, which implies that u_{i_j j} = 0, i.e. x_j lies in a hyperplane with normal vector w_{i_j}'. Since any column of X lies in only one hyperplane of the skeleton, we have w_i'^T x_j ≠ 0 for every i ≠ i_j, which implies t_{ij} = 0. Putting w_i = w_i' / ||w_i'|| we see that {w_i}_{i=1}^n is a solution of (2).
In the case when the assumptions of Proposition 4 are satisfied except for the assumption that the minimum value of the cost function in the clustering problem (2) is zero, the minimization problem (4)–(7) and the following problem

minimize Σ_{j=1}^N min_{1≤i≤n} |w_i^T x_j| subject to w_i^T e = 1, i = 1, ..., n,    (8)
are equivalent. The proof is similar to that of Proposition 4, having in mind that

Σ_{j=1}^N Σ_{i=1}^n t_{ij} u_{ij} ≥ Σ_{j=1}^N min_{1≤i≤n} u_{ij},

as equality is achieved only if t_{i_j j} = 1, where i_j = argmin_{1≤i≤n} u_{ij}.
If f_min and g_min are the minimum values of the minimization problems (2) and (8) respectively, we have f_min ≤ g_min, so if g_min is near zero we can expect that their solutions coincide (this is the case in the example presented below). Problem (8) has the advantage that it can be solved by linear programming (reformulated as problem (4)–(7)), which is robust to outliers. This is stated in the following proposition.
Proposition 5. Assume that the vector e with norm 1 is chosen in such a way that the following condition is satisfied: for any disjoint subsets J1 and J2 of {1, ..., N}, the vector e is not collinear to the vector Σ_{j∈J1} x_j − Σ_{j∈J2} x_j. If {w_i}_{i=1}^n is a solution of (8), then for every i ∈ {1, ..., n} there is at least one j ∈ {1, ..., N} such that w_i^T x_j = 0, i.e. each hyperplane with normal vector w_i contains at least one column vector of X.
Remark. Almost all vectors e (in measure sense) in the unit sphere in Rm
satisfy the assumption in Proposition 5, so in practice e is randomly chosen.
Proof of Proposition 5. Denote

J_i = {j : |w_i^T x_j| = min_{1≤k≤n} |w_k^T x_j|}.

Then w_i is a solution of the following minimization problem:

minimize f(w) = Σ_{j∈J_i} |w^T x_j| subject to w^T e = 1.    (9)
Let i be fixed and denote for simplicity w = w_i. Define:

I+ = {j ∈ J_i : w^T x_j > 0},
I− = {j ∈ J_i : w^T x_j < 0},
I0 = {j ∈ J_i : w^T x_j = 0}.

We have to prove that I0 ≠ ∅.
By the Kuhn–Tucker theorem in its subdifferential form and by the chain rule for subdifferentials (see for instance [1]), there exist ε_j ∈ [−1, 1], j ∈ I0, and λ ∈ R such that

Σ_{j∈I+} x_j − Σ_{j∈I−} x_j + Σ_{j∈I0} ε_j x_j = λe.    (10)

Here we used the fact that the subdifferential of the function |·| at zero is equal to [−1, 1].
Multiplying (10) by w we obtain λ = f_min. If f_min = 0, the proposition is proved. Otherwise λ ≠ 0. Now if we assume that I0 = ∅, by (10) we obtain a contradiction with the assumption on e.
The above proposition shows that our method for finding skeletons is robust to outliers in the following sense: if most of the column vectors of X lie on hyperplanes, finding these hyperplanes can be performed by solving the linear programming problem (4)–(7). Note that such robustness to outliers is not unexpected: it appears, for instance, in linear regression and clustering problems when the l2 (Euclidean) norm is replaced by the l1 norm (see for instance [4]).
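To illustrate this robustness, the single-cluster matrix-update LP can be written down directly. This is a sketch under our own assumptions (scipy's linprog as solver, a hand-picked e, and the helper name `fit_hyperplane_l1`): three points on the line y = x plus one gross outlier, and the l1 fit still recovers the line exactly.

```python
import numpy as np
from scipy.optimize import linprog

def fit_hyperplane_l1(Xc, e):
    """l1 hyperplane fit of problem (8) for one cluster: minimize
    sum_t u(t) subject to -u(t) <= w^T x(t) <= u(t) and w^T e = 1,
    with LP variables z = (w, u). Returns the unit normal."""
    m, T = Xc.shape
    c = np.concatenate([np.zeros(m), np.ones(T)])       # objective: sum of u
    A_ub = np.block([[Xc.T, -np.eye(T)],                # w^T x - u <= 0
                     [-Xc.T, -np.eye(T)]])              # -w^T x - u <= 0
    A_eq = np.concatenate([e, np.zeros(T)])[None, :]    # w^T e = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * T),
                  A_eq=A_eq, b_eq=[1.0],
                  bounds=[(None, None)] * m + [(0, None)] * T)
    w = res.x[:m]
    return w / np.linalg.norm(w)

# Three points on the line y = x plus one outlier at (0, 5):
Xc = np.array([[1.0, 2.0, 3.0, 0.0],
               [1.0, 2.0, 3.0, 5.0]])
w = fit_hyperplane_l1(Xc, e=np.array([1.0, 0.0]))
print(w)  # proportional to (1, -1): the outlier does not tilt the fit
```

A least-squares fit of the same four points would tilt the plane toward the outlier; the l1 objective leaves it exactly on the three collinear points, which is the behavior Proposition 5 formalizes.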
The linear programming problems with respect to {t_{ij}} and with respect to ({w_i}_{i=1}^n, {u_{ij}}) can be solved separately, which leads to the following analogue (see Algorithm 3) of the k-median clustering algorithm of Bradley, Mangasarian and Street [4]. Note that the linear programming problem with respect to {t_{ij}} is the cluster assignment step, which is realized simply as an assignment. The finite termination of the algorithm can be proved analogously to the proof of Theorem 3.7 in [3].
6 Computer simulation example
We present an example created in Matlab showing the robustness of our algorithm to outliers. We created a matrix S0 ∈ R^{5×120} from 5 random variables with normal distribution, using the command S0 = randn(5, 120) (in Matlab notation). We set to zero two randomly chosen elements in each column of S0 and obtained a 2-sparse source matrix S, plotted in Figure 1.
Algorithm 3: n-planes clustering algorithm via linear programming
Data: samples x(1), ..., x(T) of X
Result: estimated matrix W containing the normal vectors of the hyperplanes
1. Initialize W = (w_1, ..., w_n) randomly.
2. for j ← 1, ..., j0 do
3. Cluster assignment. for t ← 1, ..., T do: add x(t) to cluster Y(i), where i is chosen to minimize |w_i^T x(t)| (the distance to the hyperplane given by the i-th column of W). end
Exit if the mean distance to the hyperplanes is smaller than some preset value.
4. Matrix update. Put T_k = {t : x(t) ∈ Y(k)}. for k ← 1, ..., n do: solve the following linear programming problem (with respect to w and u(t), t ∈ T_k):

minimize Σ_{t∈T_k} u(t)

under the constraints −u(t) ≤ w^T x(t) ≤ u(t), t ∈ T_k, and Σ_i w_{ik} = 1.

Let w̃_k be the solution in place of w and put w_k = w̃_k / ||w̃_k||. end
end
The constant j0 is chosen sufficiently large in practice.
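A compact Python rendering of Algorithm 3 might look as follows. This is a sketch under our assumptions: scipy's linprog for the matrix update, the normalization Σ_i w_{ik} = 1 (i.e. e = (1, ..., 1)), and a fixed iteration count in place of the exit test.

```python
import numpy as np
from scipy.optimize import linprog

def n_planes_cluster(X, n, iters=20, seed=0):
    """Sketch of Algorithm 3: alternate the cluster-assignment step
    (nearest hyperplane) with an LP matrix update per cluster, then
    rescale each updated normal to unit norm."""
    m, T = X.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, n))
    W /= np.linalg.norm(W, axis=0)                   # unit normals
    for _ in range(iters):
        labels = np.argmin(np.abs(W.T @ X), axis=0)  # cluster assignment
        for k in range(n):
            Xk = X[:, labels == k]
            Tk = Xk.shape[1]
            if Tk == 0:
                continue
            # LP variables z = (w, u): minimize sum(u)
            # s.t. -u(t) <= w^T x(t) <= u(t) and sum_i w_i = 1
            c = np.concatenate([np.zeros(m), np.ones(Tk)])
            A_ub = np.block([[Xk.T, -np.eye(Tk)],
                             [-Xk.T, -np.eye(Tk)]])
            A_eq = np.concatenate([np.ones(m), np.zeros(Tk)])[None, :]
            res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * Tk),
                          A_eq=A_eq, b_eq=[1.0],
                          bounds=[(None, None)] * m + [(0, None)] * Tk)
            if res.status == 0:
                w = res.x[:m]
                W[:, k] = w / np.linalg.norm(w)      # matrix update
    return W
```

As with k-means-type schemes, the result depends on the initialization, so in practice one restarts from several random W and keeps the run with the smallest total distance.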
We created randomly a normalized mixing matrix (the norm of each column is one)

H = [  0.5688   0.1983  −0.4287  −0.3894  −0.7692
       0.4337  −0.4531  −0.2374  −0.0730   0.2047
      −0.6317  −0.2109  −0.1637   0.8516  −0.6037
      −0.2988  −0.8432  −0.8562  −0.3432  −0.0443 ]

and created the mixed signal matrix as X = HS.
We added to the first 20 columns of X a noise matrix N created as N = 0.1 ∗ randn(4, 20) (in Matlab notation).
We ran our n-planes clustering algorithm several times: after obtaining a hyperplane such that at least 5 columns of X are at distance at most 10^{-4} from
Figure 1: Example. Artificially created 2-sparse source signals.
it, we remove these columns and run the algorithm again in the same way, until
we obtained 10 hyperplanes. They produced the following estimate of the
mixing matrix (after running the Matrix identification part of Algorithm 1, which reconstructs the matrix from the planes):

A = [  0.7692   0.1983   0.3894   0.4287   0.5688
      −0.2047  −0.4531   0.0730   0.2374   0.4337
       0.6037  −0.2109  −0.8516   0.1637  −0.6317
       0.0443  −0.8432   0.3432   0.8562  −0.2988 ]
Such perfect reconstruction was not possible when running the algorithm
with precision less than 10−4 . After running the source recovery algorithm
(see Algorithm 2) we obtained the recovered signals plotted in Figure 2.
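The data-generation steps of this example translate directly from Matlab; here is an equivalent sketch in Python/numpy (the random generator and seed are our choices, so the actual numbers differ from the matrices printed above):

```python
import numpy as np

rng = np.random.default_rng(1)

S0 = rng.standard_normal((5, 120))            # analogue of randn(5, 120)
for i in range(120):                          # zero two random entries per column
    S0[rng.choice(5, size=2, replace=False), i] = 0.0

H = rng.standard_normal((4, 5))
H /= np.linalg.norm(H, axis=0)                # normalize columns to unit norm

X = H @ S0                                    # mixed signals, 4 x 120
X[:, :20] += 0.1 * rng.standard_normal((4, 20))   # noise on the first 20 columns
```

Each column of S0 then has 3 = m − 1 nonzeros, matching assumption A2) for m = 4, and the first 20 noisy columns play the role of the outliers the algorithm must tolerate.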
7 Conclusion

We developed a new algorithm for hyperplane clustering which, in combination with other algorithms, leads to a sparse representation of signals, in the sense that
Figure 2: Example. Recovered source signals using the recovered mixing
matrix A. The samples from 20 to 120 recover perfectly (after permutation
and scaling) the original source signals.
we can reconstruct uniquely (up to permutation and scaling) a given linear mixture as the product of a mixing matrix and a sparse source matrix.
Our algorithm has the advantage that it is robust to outliers and uses linear programming. An example is presented showing the robustness of our algorithm.
References
[1] V.M. Alekseev, V.M. Tihomirov and S.V. Fomin, Optimal Control,
Nauka, 1979 (in Russian).
[2] P. Bofill and M. Zibulevsky, “ Underdetermined Blind Source Separation
using Sparse Representation”, Signal Processing, vol. 81, no. 11, pp.
2353-2362, 2001.
[3] P.S. Bradley and O. L. Mangasarian, “k-Plane Clustering”, J. Global Optim., 16, (2000), no. 1, 23–32.
[4] P. S. Bradley, O. L. Mangasarian and W. N. Street, “Clustering via
Concave Minimization”, Mathematical Programming Technical Report
96-03, May 1996. Advances in Neural Information Processing Systems
9, MIT Press, Cambridge, MA 1997, 368-374.
[5] P.S. Bradley, U. M. Fayyad and O. L. Mangasarian, “Mathematical programming for data mining: formulations and challenges”, INFORMS J. Computing, 11 (1999), no. 3, 217–238.
[6] A.M. Bronstein, M.M. Bronstein, M. Zibulevsky, Y. Y. Zeevi, “
Blind Separation of Reflections using Sparse ICA”, in Proc. Int. Conf.
ICA2003, Nara, Japan, pp.227-232.
[7] A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing.
John Wiley, Chichester, 2002.
[8] S. Chen, D. Donoho and M. Saunders, “ Atomic decomposition by basis
pursuit”, SIAM J. Sci. Comput., Vol. 20, no. 1, pp. 33–61, 1998.
[9] D. Donoho and M. Elad, “Optimally sparse representation in general
(nonorthogonal) dictionaries via l1 minimization”, Proc. Nat. Acad. Sci.,
vol.100, no.5, pp. 2197–2202, 2003.
[10] D. Donoho and V. Stodden, “When Does Non-Negative Matrix Factorization Give a Correct Decomposition into Parts?”, Neural Information
Processing Systems (NIPS) 2003 Conference, http://books.nips.cc
[11] P. Georgiev, F. Theis and A. Cichocki, “Sparse Component Analysis and Blind Source Separation of Underdetermined Mixtures”, IEEE Transactions on Neural Networks, to appear.
[12] A. Hyvärinen, J. Karhunen and E. Oja, Independent Component Analysis, John Wiley & Sons, 2001.
[13] T.-W. Lee, M.S. Lewicki, M. Girolami, T.J. Sejnowski, “Blind source separation of more sources than mixtures using overcomplete representations”, IEEE Signal Process. Lett., Vol. 6, no. 4, pp. 87–90, 1999.
[14] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization”, Nature, Vol. 401, pp. 788–791, 1999.
[15] A. M. Rubinov and J. Ugon, “Skeletons of finite sets of points”, Research working paper 03/06, 2003, School of Information Technology &
Mathematical Sciences, Univ. of Ballarat.
[16] F.J. Theis, E.W. Lang, and C.G. Puntonet, A geometric algorithm for
overcomplete linear ICA. Neurocomputing, in print, 2003.
[17] K. Waheed, F. Salem, “Algebraic Overcomplete Independent Component Analysis”, in Proc. Int. Conf. ICA2003, Nara, Japan, pp. 1077–
1082.
[18] M. Zibulevsky, and B. A. Pearlmutter, “Blind source separation by
sparse decomposition in a signal dictionary”, Neural Comput., Vol. 13,
no. 4, pp. 863–882, 2001.