
1 Introduction

Principal Components Analysis (PCA) is a data analysis technique that finds orthonormal vectors explaining the variance structure of the data [12]. These are the directions along which most of the variance lies, and they are called principal components (PCs). Since the number of PCs needed is generally low compared to the total dimension of the dataset, PCA is often used for dimensionality reduction or data analysis. Its applications include quality control [2], image reconstruction [20], wave reconstruction [25], and outlier detection [11].

PCs can be computed in a forward or a backward manner. Classical PCA uses the forward approach: each iteration aims to find the direction that yields the maximum information. The backward approach aims to find the direction with the least information, so that when it is eliminated the projection of the data onto the remaining subspace retains the maximum amount of variation. If only the few PCs with the most variation are needed then the forward approach is more suitable, while if the aim is to eliminate only the few least useful components then the backward approach is appropriate [1]. The two approaches are equivalent under the \(L_2\)-norm (Euclidean space), but under other norms they can give different results.

The quality of PCA algorithms can be evaluated in two ways: by the reconstruction error or by the variance of the projected points. Under the \(L_2\)-norm these two measures are equivalent. However, \(L_2\)-norm based methods are rather sensitive to outliers, so they are not well suited to noisy datasets. To improve robustness, different methods have been applied to PCA. The algorithms in [5, 7] are based on the projection pursuit method, for which the \(L_1\)-norm is widely used. [16] describes a greedy method that aims to maximize the \(L_1\)-norm distance of the projected points. In [15] a heuristic estimate for the general \(L_1\) reconstruction error is presented, which assumes that the projected and lifted data are the product of two matrices. [3] proposes a backward \(L_1\)-norm method. [22] introduces two algorithms that compute PCs by minimizing different objectives: an iterative weighted approach that minimizes the reconstruction error, and one that uses linear programming (LP) to maximize the \(L_1\)-norm of the projected points. The LP-based method aims to compute the orthonormal matrix of the PCs directly. It uses LP to perturb a projection matrix and, if this is an improvement, forces the new matrix to be orthonormal. Strictly this cannot be done in LP because maximizing a sum of absolute values is NP-hard (see [13, 14, 23, 24, 26] for a discussion of this issue). It can be done by adding binary decision variables to obtain a mixed integer programming (MIP) model, but this is far less scalable than a pure LP approach. To resolve this problem, [22] relaxes the MIP model to an LP and restricts moves to small distances. They show that their search algorithm is locally convergent.

Our approach is also based on an LP model, but instead of computing the PCs simultaneously it computes them iteratively in backward order. This enables us to use a direct LP model, so we do not need to restrict it to small steps. The rest of the paper is organized as follows. In Sect. 2 the problem is formulated. In Sect. 3 we introduce a new algorithm for \(L_1\)-PCA. In Sect. 4 the algorithm is compared with others for robustness and competitiveness. Section 5 concludes the paper.

2 Problem Formulation

Let \(A \in \mathfrak {R}^{n \times m}\) be the data set, where m denotes the number of attributes (the dimension of the original input space) and n the number of instances. The goal of PCA is to find a projection matrix \(X \in \mathfrak {R}^{m \times k}\) whose columns form an orthonormal basis of a k-dimensional linear subspace (so \(X^TX= I_k\)). This subspace will be referred to as the feature space. The matrix must minimize the reconstruction error:

$$\begin{aligned} E_p(X,Y)= \left\| A - YX^T\right\| _p^p \end{aligned}$$
(1)

where \(\left\| \cdot \right\| _p^p \) denotes the \(L_p\)-norm raised to the p-th power, and \(Y \in \mathfrak {R}^{n \times k}\) is the coefficient matrix whose rows are the coordinates of each instance of A projected into the feature space spanned by X. If X is fixed then the Y minimizing the error function is uniquely determined by \(Y = AX\), by the projection theorem [19]. So the error function becomes:

$$\begin{aligned} E_p(X)= \left\| A - AXX^T\right\| _p^p \end{aligned}$$
(2)

Another possible approach is to maximize the variance of the projected points in the feature space:

$$\begin{aligned} V_p(X)= \left\| AX\right\| _p^p \end{aligned}$$
(3)

Classical PCA is based on the \(L_2\)-norm (\(p=2\)), in which case minimizing (1) and maximizing (3) are equivalent, as follows from the singular value decomposition.
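As a small illustration of the two quality criteria (a toy example in R, not taken from the paper), the snippet below computes the squared \(L_2\) reconstruction error (2) and the squared \(L_2\)-norm of the projected points (3) for a classical PCA projection, and checks that they sum to \(\left\| A\right\| _2^2\); this identity is why minimizing one is equivalent to maximizing the other.

set.seed(1)
A <- scale(matrix(rnorm(200 * 5), 200, 5))   # toy centered and normalized data
X <- svd(A)$v[, 1:2]                         # first two L2 principal components
recon_err <- sum((A - A %*% X %*% t(X))^2)   # squared L2 reconstruction error, cf. (2)
proj_var  <- sum((A %*% X)^2)                # squared L2-norm of the projected points, cf. (3)
all.equal(sum(A^2), recon_err + proj_var)    # TRUE up to floating point error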

Consider the case \(p=1\) so that we use the \(L_1\)-norm. We minimize the error as in Eq. (2) by solving the problem

$$\begin{aligned}&\min _{X \in \mathfrak {R}^{m \times k}} \sum _{i \in I} \sum _{j \in J} \left| e_{ij} \right| \end{aligned}$$
(4a)
$$\begin{aligned} \text{ s.t. }&\quad X^TX= I_k,\; E=A - AXX^T \end{aligned}$$
(4b)

where \(I =\left\{ 1, \ldots , n \right\} \) and \(J =\left\{ 1, \ldots , m \right\} \). Although (4) is more robust than the \(L_2\)-norm version, it is not invariant under rotation of the data, and the shape of its equidistance surfaces becomes very skewed [9].

Similarly, setting \(p=1\) in (3):

$$\begin{aligned}&\max _{X \in \mathfrak {R}^{m \times k}} \sum _{i \in I } \sum _{h \in K } \left| y_{ih} \right| \end{aligned}$$
(5a)
$$\begin{aligned} \text{ s.t. }&\quad X^TX= I_k,\; Y = AX \end{aligned}$$
(5b)

where \(K =\left\{ 1, \ldots , k \right\} \). The solution of this problem, by contrast, is invariant under data rotation because the maximization is done in the feature space [9].

3 Rotation-Invariant \(L_1\)-PCA by LP

In this section we present an iterative LP-based algorithm that aims to solve problem (5). In a forward approach we would iteratively look for the projection that maximizes variance in the feature space. Denoting by \(\varvec{t}\) the projection we are looking for, the problem would be:

$$\begin{aligned}&\max \sum _{i \in I} \left| y_{i} \right|&\end{aligned}$$
(6a)
$$\begin{aligned} \text{ s.t. }&\quad y_{i} = \sum _{j \in J } a_{ij}t_{j}&i \in I \end{aligned}$$
(6b)
$$\begin{aligned}&\quad \sum _{j \in J} t_j^2 = 1&\end{aligned}$$
(6c)

Having found a solution \(\varvec{proj}_0\) we would then look for the next most interesting projection \(\varvec{proj}_1\) under the added constraint \(\varvec{proj}_1\cdot \varvec{proj}_0=0\); and so on, adding a new orthogonality constraint for each new projection.

Unfortunately (6) cannot be solved by LP: as pointed out above, maximizing a sum of absolute values is NP-hard. Even under the \(L_2\)-norm we cannot solve it: quadratic objectives are allowed in quadratically constrained quadratic programs, but only if the resulting matrix is positive semidefinite, which here it is not because of constraint (6c). Instead we use a backward approach that does not require the solution of any NP-hard problem.

Like many other methods we start by centering and normalizing the dataset, subtracting the centroid and dividing by the standard deviations. Then for each dimension we choose a starting vector \(\varvec{p}\) and iteratively improve it via an LP-based heuristic. To find an improvement we solve an LP that minimizes the sum of absolute projections (the backward counterpart of (6)), dropping the quadratic constraint (6c) and instead forcing the vector \(\varvec{t}\) to lie in the hyperplane tangent to the unit hypersphere at the point defined by \(\varvec{p}\). This idea is illustrated in Fig. 1. Solving the LP yields a vector that is likely to be quite close to the direction we want, but it is not a unit vector because it lies in the hyperplane and not on the unit hypersphere. We therefore project the solution onto the hypersphere to obtain a unit vector, which becomes the new starting vector \(\varvec{p}\). We iterate this procedure until the improvement in the objective function of the LP is smaller than a fixed threshold, which is the only parameter needed by this method. In practice we found that the procedure converges in a small number of iterations, and that a value such as 0.01 is reasonable.

The LP to solve at every iteration is:

$$\begin{aligned}&\min \sum _{i \in I} l_i&\end{aligned}$$
(7a)
$$\begin{aligned} \text{ s.t. }&\quad l_i \ge - \sum _{j \in J } a_{ij}t_{j}&i \in I \end{aligned}$$
(7b)
$$\begin{aligned}&\quad l_i \ge \sum _{j \in J } a_{ij}t_{j}&i \in I \end{aligned}$$
(7c)
$$\begin{aligned}&\quad \sum _{j \in J} t_j proj_{qj} = 0&proj_q \in X \end{aligned}$$
(7d)
$$\begin{aligned}&\quad \sum _{j \in J} t_j p_j = 1&\end{aligned}$$
(7e)

where X is the matrix containing all the PCs already computed: constraint (7d) guarantees that \(\varvec{t}\) is orthogonal to them, and (7e) ensures that \(\varvec{t}\) satisfies the hyperplane equation. \(\varvec{p}\) is the starting vector (or the solution of the previous iteration), i.e. the coordinates of the point at which the hyperplane is tangent to the unit hypersphere. We introduce real auxiliary variables \(l_i\) (\(i \in I\)): the inequality constraints enforce \(l_i \ge |\varvec{t} \cdot A_i|\), and because we minimize the sum of the \(l_i\), at the optimum the objective equals the sum of the absolute values of the projected points. This approach exploits the fact that minimizing a sum of absolute values is easy in LP.
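A minimal sketch of one LP iteration is given below, assuming the lpSolve R package as a stand-in for the CPLEX solver used in our implementation; the function name solve_hyperplane_lp and the variable layout are ours. Since lpSolve restricts all decision variables to be non-negative, the free vector \(\varvec{t}\) is split as \(\varvec{t} = \varvec{t}^+ - \varvec{t}^-\).

library(lpSolve)

# One iteration of the hyperplane heuristic: solve problem (7) for the data
# matrix A (n x m, centered and normalized), the tangent point p (unit vector)
# and the m x q matrix Xprev of components already found (NULL if none).
solve_hyperplane_lp <- function(A, p, Xprev = NULL) {
  n <- nrow(A); m <- ncol(A)
  obj <- c(rep(0, 2 * m), rep(1, n))            # variables: [t+ (m), t- (m), l (n)]

  c1 <- cbind( A, -A, diag(n))                  # (7b): l_i + (A t)_i >= 0
  c2 <- cbind(-A,  A, diag(n))                  # (7c): l_i - (A t)_i >= 0
  if (is.null(Xprev)) {                         # (7d): t orthogonal to previous PCs
    c3 <- NULL
  } else {
    c3 <- cbind(t(Xprev), -t(Xprev), matrix(0, ncol(Xprev), n))
  }
  c4 <- matrix(c(p, -p, rep(0, n)), nrow = 1)   # (7e): t on the tangent hyperplane

  con <- rbind(c1, c2, c3, c4)
  dir <- c(rep(">=", 2 * n), rep("=", nrow(con) - 2 * n))
  rhs <- c(rep(0, nrow(con) - 1), 1)

  sol <- lp("min", obj, con, dir, rhs)
  t <- sol$solution[1:m] - sol$solution[(m + 1):(2 * m)]
  list(t = t / sqrt(sum(t^2)),                  # project the LP solution back onto the unit sphere
       objective = sol$objval)
}

For clarity the dense identity matrix diag(n) is used for the auxiliary variables; a production implementation would use a sparse constraint matrix and, as discussed in Sect. 4.5, update the model between iterations instead of rebuilding it.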

Fig. 1. Hyperplane iterative heuristic. (a) The starting vector \(\varvec{p}\). (b) \(\varvec{p}\) is projected onto the unit hypersphere. (c) The hyperplane tangent to the hypersphere at \(\varvec{p}\). (d) The objective function of (7) projected onto the hyperplane. (e) \(\varvec{t}\), the solution of (7). (f) \(\varvec{t}\) becomes the starting vector of the next iteration.

Algorithm 1. \(L_1\)-PCAhp

We call this algorithm \(L_1\)-PCAhp, where hp stands for hyperplane; it is summarised in Algorithm 1. \(L_1\)-PCAhp is not globally convergent, as it can become trapped in local minima, but it is locally convergent. We can improve the probability of finding a global optimum by restarting from different starting vectors, but empirically we found that using the classical \(L_2\)-PCA solution as the starting vector is satisfactory.
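For concreteness, a sketch of the outer loop (mirroring Algorithm 1, although details such as the ordering of the starting vectors are our assumptions) could look as follows, reusing solve_hyperplane_lp from the previous sketch.

# Backward L1-PCAhp sketch: find components one at a time, each orthogonal to
# the previous ones, stopping each inner loop when the LP objective improves
# by less than eps (the single runtime parameter).
l1_pca_hp <- function(A, eps = 0.01) {
  A <- scale(A)                        # center and normalize, as in Sect. 3
  m <- ncol(A)
  init <- svd(A)$v                     # classical L2-PCA solution as starting vectors
  X <- NULL
  for (comp in seq_len(m)) {
    p <- init[, m - comp + 1]          # start from the least important L2 direction
    obj_old <- Inf
    repeat {
      res <- solve_hyperplane_lp(A, p, X)
      p <- res$t                       # projected LP solution: new tangent point
      if (obj_old - res$objective < eps) break
      obj_old <- res$objective
    }
    X <- cbind(X, p)                   # components accumulate in backward order
  }
  X[, m:1, drop = FALSE]               # reorder so the first column is the main PC
}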

4 Computational Results

We implemented \(L_1\)-PCAhp in R, using IBM ILOG CPLEX 12.6.3 to solve the LP instances and the cplexAPI package to invoke the solver. We evaluate it against several other PCA variants: \(L_2\)-PCA, PCAgrid, PCAproj, PCA-\(L_1\), \(L_1\)-PCA and \(L_1\)-PCA*. \(L_2\)-PCA is the classical PCA method implemented in the stats package (function princomp). PCAproj and PCAgrid are based on projection pursuit: PCAproj uses the algorithms of [7], and PCAgrid the robust non-sparse version of [6]; both implementations are available in the pcaPP package for R. PCA-\(L_1\), \(L_1\)-PCA and \(L_1\)-PCA* are the algorithms introduced in [16], [15] and [3] respectively, implemented in the pcaL1 R package. In each case we used the default algorithm configuration, while for \(L_1\)-PCAhp we set the threshold to 0.01. We believe that using the default configurations gives a fair comparison, since each algorithm's configuration is held constant across all tests.

4.1 Simulated Data

Tests on simulated data provide a controlled comparison of data analysis algorithms, and we use a test from [3]. The goal of this test is to evaluate the impact of outliers while varying their magnitude. Each dataset contains a "true" subspace: a subset of attributes carries the information, while the other attributes contain only noise. A fixed percentage of the instances are outliers with high norm. If a method is robust to these outliers the reconstruction of the true subspace should be accurate. We also take into account the variance of the projected points.

Each dataset has \(n=1000\) instances and \(m=10\) attributes. The first q attributes define the true subspace, and 10% of the observations are outliers. In these observations the first p of the \(m-q\) noise attributes are contaminated, where p is the number of outlier-contaminated attributes. The true subspace attributes, the noise attributes and the outliers are sampled from Laplace(0,10), Laplace(0,1) and Laplace(\(\mu \),0.01) distributions respectively, where \(\mu \) is the outlier magnitude. We take averages over 100 runs for every combination of \( q \in \{ 2,5 \} \), \(p \in \{ 1,2,3 \} \) and \(\mu \in \{ 25,50,75 \} \). We repeat the experiment with Gaussian distributions, maintaining the same parameters.
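The following R sketch reproduces our reading of this generator (the exact setup of [3] may differ; we assume the second Laplace parameter is the scale, and since base R has no Laplace sampler one is built by inverse-CDF sampling).

rlaplace <- function(n, mu = 0, b = 1) {       # Laplace(mu, b) via inverse CDF
  u <- runif(n, -0.5, 0.5)
  mu - b * sign(u) * log(1 - 2 * abs(u))
}

make_dataset <- function(n = 1000, m = 10, q = 2, p = 1, mu = 25) {
  A <- cbind(matrix(rlaplace(n * q, 0, 10), n, q),          # "true" subspace attributes
             matrix(rlaplace(n * (m - q), 0, 1), n, m - q)) # noise attributes
  n_out <- round(0.1 * n)                                   # 10% outlier observations
  A[1:n_out, (q + 1):(q + p)] <- rlaplace(n_out * p, mu, 0.01)
  A
}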

Figure 2 contains the average performance (variance in feature space and reconstruction error) versus outlier magnitude (\(\mu \)). Plots in which the x-axis is the number of outlier-contaminated attributes (p) are similar, but we omit them for space reasons. \(L_1\)-PCA, PCA-\(L_1\) and \(L_1\)-PCAhp perform better than the other algorithms with respect to variance in feature space. Regarding reconstruction error \(L_1\)-PCA* is the best method, and its breakdown point is higher: the breakdown point is reached where the methods begin to fit the outlier observations better than the non-contaminated data. This confirms the robustness of \(L_1\)-PCAhp as it ranks joint second with \(L_1\)-PCA.

Fig. 2. Performance on the simulated datasets. Variance of the projected points against magnitude of the outliers (left); reconstruction error of the "true" subspace against magnitude of the outliers (right).

4.2 Datasets with Known Outliers

We also applied the various PCA methods to datasets with known outliers. This experiment has been used before to demonstrate the robustness of PCA algorithms to outliers [3]. The artificial dataset generated by [10] has 75 instances, of which the first 14 are outliers. The other datasets are real-world data: the milk and pollution datasets were introduced by [8] and [21] respectively. In the milk dataset the outliers are instances 70, 45, 31, 12, 14 and 15 according to [5]; in the pollution dataset instances 29 and 48 are identified as outliers by [7]. We centered and normalized the datasets as in Sect. 3.

To evaluate the robustness of the solutions, the PCs are computed as usual on the full datasets. We then measure the \(L_1\)-norm reconstruction error using the objective of (4), and the variance of the projected points; both are calculated only on the non-outlier instances. If a solution is good and robust to the outliers then the total reconstruction error should be small and the variance should be large.
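Concretely, for a dataset A, a computed component matrix X and the known outlier row indices, the two measures can be obtained as in the short sketch below (variable names are ours).

clean <- A[-outliers, , drop = FALSE]                 # keep only the non-outlier instances
recon_err <- sum(abs(clean - clean %*% X %*% t(X)))   # L1 reconstruction error, cf. (4)
proj_var  <- sum(abs(clean %*% X))                    # L1 variance of the projected points, cf. (5)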

Table 1 shows the results. Regarding the variance of the projected points, the best method depends on the dataset. Overall \(L_1\)-PCA has the best average rank, but \(L_1\)-PCAhp has the best average variance (slightly better than \(L_1\)-PCA*). Regarding the reconstruction error, \(L_1\)-PCAhp clearly outperforms the others, even \(L_1\)-PCA*, which was the previous winner [3]; this is quite surprising because \(L_1\)-PCAhp is based on maximization of the projected variance rather than on the reconstruction error.

Generally the \(L_1\)-norm methods that do not use projection pursuit outperform the others. As expected, the worst performer is \(L_2\)-PCA, which is well known to be sensitive to outliers. Scaling the dataset by the standard deviations mitigates the effect of the high-norm outliers, but it is not enough to impart robustness. We removed it from the table to make the results more readable.

Table 1. Performance ranks for the two benchmarks. Average results are in parentheses.

4.3 UCI Datasets

This test is used in several PCA papers [9, 16, 17]. For data analysis problems with many input variables, dimensionality reduction is typically used to simplify the task without degrading performance. The test is based on the idea that if PCA preserves the relevant information, then the performance of a classifier should not degrade on the reduced dataset. Moreover, the projection into the feature space can eliminate information that is not useful to the classifier, possibly improving its performance.

The algorithms were applied to several datasets from the UCI machine learning repository [18]. All datasets were centered and normalized as in Sect. 3. For each dataset we extracted k features for every \(k \in \left\{ 1, \ldots , \min \left( 10,m \right) \right\} \), where m is the total number of attributes. We limit the maximum number of extracted features to 10 because all the datasets can be reduced to 10 or fewer dimensions without affecting classifier performance; adding more PCs only makes the performance fluctuate almost randomly, as shown in Fig. 3. So, as in other work that uses this test [9, 16, 17], we limit the number of extracted features.

Fig. 3. Average Correct Classification Rate versus dimension of the projected space for 4 UCI datasets. The performance of each method becomes almost stable after a few features are extracted; testing the remaining dimensions would only dilute the results relevant to dimensionality reduction. Error bars are omitted since the variances are quite high and make the graphs hard to read.

We used the one-nearest neighbour (1-NN) classifier implemented in the class package, trained and tested with 10-fold cross-validation. As the results depend on the order of the instances, the tests were conducted 10 times with different shuffles, and we report the average results.
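A sketch of this evaluation protocol for one dataset, one PCA method and one feature count k is shown below (assuming class::knn; the fold construction and names are ours).

library(class)

# Project the data onto the first k components of X, then estimate the Correct
# Classification Rate of 1-NN with 10-fold cross validation.
ccr_after_pca <- function(A, labels, X, k, folds = 10) {
  Z <- A %*% X[, 1:k, drop = FALSE]
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(Z)))
  correct <- 0
  for (f in seq_len(folds)) {
    test <- which(fold_id == f)
    pred <- knn(train = Z[-test, , drop = FALSE],
                test  = Z[test, , drop = FALSE],
                cl    = labels[-test], k = 1)
    correct <- correct + sum(pred == labels[test])
  }
  correct / nrow(Z)
}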

Table 2. Dataset information and performance ranks. Average Correct Classification Rates are in parentheses; bold values are the best classification rates for each dataset.

Table 2 shows the dataset information and solution ranks, with the average Correct Classification Rate (CCR) in parentheses. Algorithm performance varies significantly across datasets: even \(L_2\)-PCA, which usually performs poorly in this test, wins in one case. This is caused by the heterogeneity of the datasets. The algorithms with the best average CCRs are PCA-\(L_1\) and PCAgrid. The method that wins on the most datasets is \(L_1\)-PCA, but its average ranking and CCR are worse than those of the other algorithms. We believe that the poor performance of \(L_1\)-PCA* is explained by the fact that it is designed to minimize reconstruction error: classification is applied to the projected points, so algorithms that preserve a high variance among the projected points have an advantage in this test. \(L_1\)-PCAhp exhibits good CCR and ranking, and excels on some datasets. This shows the robustness of the algorithm to different types of dataset and different kinds of noise or outliers, and that it can compete with existing methods on real-world problems.

This test shows that no method dominates the others, because they exploit different structures in different datasets. So we cannot select a single best method for robust PCA: in real-world applications the performance varies greatly with the data. We suggest applying several methods during the test phase and keeping the one that performs best.

4.4 Initialization

We analyzed the effect of different initializations on our method. We repeated the test of Sect. 4.3 using both random and \(L_2\)-PCA initialization. The Correct Classification Rates obtained are almost identical, with very small fluctuations. The only difference is computational time: \(L_2\)-PCA initialization is 5% faster on average.

We think the performance does not vary significantly because the algorithm generally converges to similar projections under different initializations, so it rarely gets stuck in a completely different local minimum. The improvement in computational time is due to the smaller number of hyperplane iterations required for convergence. We believe this is because the local minimum to which our method converges is closer to the classical PCA solution than to a random projection.

4.5 Computational Times

A fair time comparison is not easy here: the methods are implemented in different programming languages and use different solvers; some can exploit parallelization while others cannot; backward methods always compute all components, while the others compute only the required ones; and some depend more on the number of original dimensions than on the number of instances.

For this comparison we coded our method in the same language and with the same LP solver as the pcaL1 R package, to allow a direct comparison with the other LP-based methods implemented there: an R wrapper function calls C code that uses COIN-OR Clp to solve the LP instances.

We used the dataset generator of Sect. 4.1 because it lets us control the problem size directly. The algorithm configurations are kept the same as before. All computations are done on a single core, so no parallelization is involved.

We kept the number of attributes constant and varied the number of instances from 100 to 50000, each time computing a number of components equal to half the number of attributes.
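The structure of this timing loop is sketched below for our method (our code, with an assumed grid of instance counts), reusing the generator from Sect. 4.1 and the l1_pca_hp sketch from Sect. 3.

m <- 10
for (n in c(100, 500, 1000, 5000, 10000, 50000)) {    # assumed log-spaced grid of instance counts
  A <- make_dataset(n = n, m = m)
  elapsed <- system.time(l1_pca_hp(A))[["elapsed"]]   # backward sketch: always computes all components
  cat(n, elapsed, "\n")
}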

Fig. 4. Computational time comparison. The number of attributes is 10.

Figure 4 shows the comparison for \(m = 10\); both the computational time and the number of instances are on logarithmic scales. As expected the fastest method is always \(L_2\)-PCA. The only method with comparable execution time on large instances is PCA-\(L_1\), which is approximately 10 times slower on average. The two methods that use projection pursuit, PCAproj and PCAgrid, are approximately 100 times slower than \(L_2\)-PCA. We omitted the error bars since both the standard deviation and the mean error were too small to be plotted.

All the LP-based methods are generally slower than the others. The computational times of \(L_1\)-PCA* and \(L_1\)-PCAhp do not depend on the number of components required, because they always compute all components. Our method is always faster than the other LP-based methods, and the gap increases with the size of the dataset. This may seem odd, since our method must solve numerous LP models for every component, even if our LPs are smaller. The main reason is that we do not build a new model every time: we simply update the hyperplane constraint (7e) at every iteration and add a new orthogonality constraint (7d) whenever a component is found.

The execution time of our method does not depend on the number of components sought. In contrast, the performance of \(L_1\)-PCA varies strongly with the number of required components: if only a few components are needed then it performs well.

The performance of our method could be improved by using a faster LP solver, and parallelization would also yield a substantial speed-up.

5 Conclusion

In this paper we introduced a new algorithm for PCA called \(L_1\)-PCAhp, which is intuitively straightforward, easy to implement and has only one runtime parameter to tune. Its novelty lies in the way it uses LP and in its hyperplane-based iterative improvement approach. Moreover, it is the only LP-based \(L_1\)-PCA method with a backward approach that maximizes the variance of the projected points in the feature space. In experiments its performance was consistently good compared to other algorithms in the literature. We showed its versatility and good performance both in variance maximization and in reconstruction error minimization, where it excels on real datasets with outliers. In the classification tests the different algorithms exploit different information in each dataset, so we suggest that in a real-world setting different PCA methods should be tested and the best one used. Although some methods outperform ours in computational time, ours is the fastest of the LP-based methods, and LP-based methods have found plenty of real-world applications due to their ability to extract different relationships in the data [4, 27].

We believe that our method is a valuable addition to the collection of known methods, due to its robustness and its ease of implementation and tuning.