In this article, we propose a computationally efficient iterative algorithm for proper orthogonal decomposition (POD) using random sampling based techniques. In this algorithm, additional rows and columns are sampled and a merging technique is used to update the dominant POD modes in each iteration. We derive bounds for the spectral norm of the error introduced by a series of merging operations. We use an existing theorem to get an approximate measure of the quality of subspaces obtained on convergence of the iteration. Results on various datasets indicate that the POD modes and/or the subspaces are approximated with excellent accuracy with a significant runtime improvement over computing the truncated SVD. We also propose a method to compute the POD modes of large matrices that do not fit in the RAM using this iterative sampling and merging algorithm.
1 Introduction
Proper orthogonal decomposition (POD), also known as Principal component analysis (PCA) or Karhunen–Loeve transform is a dimensionality reduction technique used in a variety of applications including information compression, pattern mining, clustering and classification, facial recognition, missing data estimation, data-driven modeling, and Galerkin projections. PCA aims at finding an optimal set of basis vectors that captures energetically dominant features of the dataset. These basis vectors are also referred to as eigenfeatures or POD modes or principal directions/components (PCs). The assumption is that a few of these modes are sufficient to capture all significant features of the dataset i.e., the data is highly correlated and inherently low rank. In many of these cases, the dataset is also large, often distributed over several computers and dense. The two methods commonly used to compute POD modes are via (a) eigendecomposition of \(A^TA\) and (b) SVD of A. In both cases, a full decomposition is expensive and approximate methods are used to obtain the dominant modes. In this article, we explore the use of random sampling based techniques to obtain the POD modes accurately and in a computationally efficient manner. The way our data is arranged, these modes are the left singular vectors of the data matrix.
We use the following notation throughout the article, unless otherwise specified. Lower case letters are used to denote scalars. Lower case bold-face letters are used to denote a vector. Upper case letters are used to denote a matrix. The \(i\hbox{th}\) element of vector \(\mathbf {a}\) (bold font) is denoted as \(a_i\) (non-bold font). The element in the \(i\hbox{th}\) row and \(j\hbox{th}\) column of the matrix A is denoted by \(a_{ij}\). \(\mathbf {a}^i\) denotes the \(i\hbox{th}\) row vector and \(\mathbf {a}_j\) denotes the \(j\hbox{th}\) column vector of matrix A. \(A_k\) denotes the best rank-k approximation of a matrix A and the approximation to \(A_k\) is denoted using tilde, as \(\tilde{A}_k\). Let \(A_k = U_k\Sigma _k V_k^T\) and \(\tilde{A}_k=\tilde{U}_k\tilde{\Sigma }_k\tilde{V}_k^T\) be the SVDs of \(A_k\) and \(\tilde{A}_k\). If A is obtained by mean centering rows of the data matrix, then \(U_k\) and \(\tilde{U}_k\) are the accurate and the approximate principal components (PCs or POD modes), respectively. \(||A||_p\) denotes the p norm of matrix A. A is assumed to be an \(m \times n\) matrix.
Algorithms that approximate \(A_k\) can be classified as random projection, random sampling based algorithms or a combination of both. Random projection based algorithms have a multiplicative error bound [Halko et al. 2011a; Sarlos 2006]. They have been used for PCA in Erichson et al. [2017], Halko et al. [2011b], Nguyen et al. [2009], and Rokhlin et al. [2010] with algorithmic optimizations to improve accuracy and numerical stability [Erichson et al. 2019; Gu 2015; Li et al. 2017; Rokhlin et al. 2010] and with out-of-core modifications in Bose et al. [2019] and Halko et al. [2011b] for large matrices. Yamazaki et al. [2017] use random projections to compute the SVD and random sampling (of rows) of additional data to incrementally update the SVD. Drineas and Mahoney [2016] also use a combination of projection and sampling techniques, where projection is used to precondition the matrix so that subsequent uniform sampling works well.
Random sampling methods are of two types: those that sample elements of the matrix and those that sample columns/rows of the matrix. The motivation behind sampling elements [Achlioptas and McSherry 2007] from the matrix is to make the matrix sparser, thereby reducing the time taken to compute the SVD of the matrix. Sampling columns/rows is usually done using (a) uniform probability [Drineas et al. 2003; Yamazaki et al. 2017], (b) leverage scores [Drineas et al. 2008; Li et al. 2013; Cohen et al. 2015; Yamazaki et al. 2017], or (c) column and/or row norms [Drineas et al. 2003, 2004; Frieze et al. 2004; Drineas et al. 2006]. In all three cases, error bounds are derived assuming sampling is done with replacement. Uniform and row/column norm based sampling have additive error bounds, while leverage score based sampling has a multiplicative error bound. However, leverage score based sampling is computationally more involved since it requires an estimate of the singular vectors. While row/column norm based sampling requires a maximum of two passes over the data to get the sampling probability distribution, uniform sampling requires none. But it can be inefficient, as it is equally likely to sample unimportant columns or rows, for example, zero columns/rows.
The results of the experimental evaluation done by Menon and Elkan [2011] indicate that the column norm sampling method has low runtimes, with errors well below the theoretical bounds for standard datasets. Additionally, random sampling of columns/rows has several advantages—it retains the sparsity of the matrix and is easily amenable to incremental improvement. Multiplicative error bounds using row/column norm based sampling can be obtained using adaptive sampling, which has several rounds of sampling based on row/column norms [Deshpande et al. 2006; Deshpande and Vempala 2006].
In this article, our focus is to obtain POD modes of large datasets accurately and in a computationally efficient manner using random sampling based techniques. We evaluated the performance of sampling algorithms proposed in Drineas et al. [2006]. Even with tight error parameters, the top k POD modes computed using these sampling algorithms as well as the subspace spanned by these vectors have significant errors. To improve accuracy, we propose an iterative technique in which additional rows and columns are sampled and the dominant POD modes are updated using a merge-and-truncate (MAT) operation [Vasudevan and Ramakrishna 2017] in each iteration. The iteration converges when the angle between the current and updated modes is less than a threshold. We also derive bounds for the spectral norm of the error introduced by a series of MAT operations. We use an existing theorem to get an approximate measure of the quality of subspaces obtained after convergence. Results on various datasets indicate that the POD modes and/or the subspaces are approximated with excellent accuracy with a significant runtime improvement over computing the truncated SVD. Using this iterative algorithm, we also demonstrate a method to compute POD modes of large matrices that do not fit in the RAM of the system.
The article is organized as follows. In Section 2, we discuss a few relevant sampling algorithms found in literature and their drawbacks. This motivates the need for an algorithm with more efficiency and better accuracy. Section 3 contains the proposed iterative algorithm, the error bound for MAT operations and the quality measure. In Section 4, we discuss the results of these algorithms as applied to different datasets. Section 5 concludes the article. Appendix A contains the experimental results for the datasets obtained using row/column norm based sampling algorithms (LTSVD, CTSVD) proposed in Drineas et al. [2006].
The following datasets were used for testing all the algorithms.
(1)
Faces data from Cambridge [2002], a collection of 400 images from 40 subjects, each of size \(112\times 92\) giving 400 images of dimension 10,304 each. The matrix is of size \(10,\!304\times 400\). This dataset will be referred to as Faces dataset in the article.
(2)
Velocity vector data of a flow generated using a CFD simulation on a grid of size \(257\times 257\). The snapshot vectors are of dimension 132,098 and there are 1,024 of them, giving a matrix of size \(132,\!098\times 1,\!024\). It will be referred to as \(V_{2D}\).
(3)
2,414 images of 38 subjects cropped to show only the face of the subject in different illumination conditions from Lee et al. [2005]. The images are of size \(192\times 168\), giving 2,414 images of dimension 32,256 each. Therefore, the matrix size is \(32,\!256 \times 2,\!414\). This dataset is referred to as cropped Yale faces (CYF).
(4)
Faces data from Georghiades et al. [2001], a collection of 16,128 images from 28 subjects in different poses and illumination conditions. 18 of the 16,128 images in the database were discarded since they were corrupted. Each image is of size \(480\times 640\). We have 16,110 images of dimension 307,200 each organized as a \(307,\!200\times 16,\!110\) matrix. This dataset is referred to as Yale faces (YF).
All the datasets are mean-centered row-wise. The principal components are the left singular vectors. It is seen that all the datasets have more rows than columns (“tall and thin”). The first three datasets are small and used for validation of the algorithms. The last dataset is large and the mean-centered data requires 40GB of RAM.
2 Evaluation of Existing Sampling Algorithms and Motivation
Algorithms to compute low rank approximations of a matrix using sampled columns (LTSVD) or sampled rows and columns (CTSVD) have been proposed in Drineas et al. [2006]. The sampling is done with replacement and the sampled rows and columns are scaled to preserve the Frobenius norm of the matrix. For both algorithms, the authors show that the bound for the error in the rank-k approximation, given in Equation (1), holds with a probability of at least \({(1-\delta)}\). The sampling probability of the \(i\hbox{th}\) column, \(p_i\), and the number of columns sampled, c, in LTSVD are given in Equations (2) and (3), respectively. \(\epsilon\) and \(\delta\) are the error parameter and the failure probability that need to be set. An experimental evaluation of the algorithm by Menon and Elkan [2011] concluded that, in practice, the error is much lower than the bound given in Equation (1).
In this article, our focus is the accurate computation of the principal components, which are the left singular vectors. Therefore, we implemented the two algorithms in order to evaluate their performance with respect to the error in the singular vectors. In order to evaluate the accuracy of the modes, we used the following metrics:
(1)
The angle between the accurate and approximate left singular vectors (\(\theta _i = \cos ^{-1}(u_i^T\tilde{u}_i)\)), denoted by mode angle.
(2)
The principal angles between the rank-k subspaces (\(\phi _i\)). This is computed as the inverse cosine of the singular values of the matrix \(U_k^T\tilde{U}_k\) [Björck and Golub 1973].
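Both metrics reduce to small dense linear algebra on the mode matrices. The sketch below is a minimal illustration (not the code used in this article), assuming \(U_k\) and \(\tilde{U}_k\) are available as NumPy arrays with orthonormal columns.

```python
import numpy as np

# Minimal sketch of the two accuracy metrics above; the sign ambiguity of
# singular vectors is handled by taking absolute values of the cosines.
def mode_angles(U_k, U_tilde_k):
    # Angle between corresponding accurate and approximate modes (degrees).
    cosines = np.abs(np.sum(U_k * U_tilde_k, axis=0))        # |u_i^T u~_i|
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))

def principal_angles(U_k, U_tilde_k):
    # Principal angles between the two rank-k subspaces: inverse cosines of
    # the singular values of U_k^T U~_k (Bjorck and Golub [1973]).
    s = np.linalg.svd(U_k.T @ U_tilde_k, compute_uv=False)
    return np.degrees(np.arccos(np.clip(s, -1.0, 1.0)))
```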
The pseudo-codes for our implementation and the detailed results are included in Appendix A. A summary of the results is as follows.
(1)
When k (and hence the number of columns sampled) is small, the error in the singular values can be as high as 20% for some datasets. For larger values of k, this error is of the order of 5% or less (see Figure 8 in Appendix A).
(2)
There is a large error in the computed POD modes. As seen in Figure 9, if the mode angle is greater than about \(10^{\circ }\), the POD modes have erroneous features. The mode angles obtained using these algorithms are as large as \(80^{\circ }\) in some cases (see Figure 10). As expected, the accuracy of the dominant subspaces is better, with principal angles between subspaces less than \(40^{\circ }\) in most cases (see Figure 11). This large error occurs because each column is scaled by a factor that depends on the column norm. While this preserves the Frobenius norm, it tends to distort the subspace spanned by the columns. There is no significant improvement in accuracy, even when the error parameters are tightened.
One major problem with these algorithms is that it is hard to fix the values of the error parameters \(\epsilon\) and \(\delta\) a priori (see Equations (3) and (5)). Some of these issues are resolved by the adaptive sampling algorithm proposed in Deshpande and Vempala [2006]. This algorithm also has a tighter multiplicative error bound. It is based on approximate volume sampling, which essentially involves multiple rounds of row sampling. In each round, the probability of picking a row is proportional to its squared distance from the span of previously sampled rows.
The steps involved in the algorithm are as follows:
(1)
Sample k rows from A based on the approximate volume sampling (see [Deshpande and Vempala 2006] for details) and compute the basis, Q, spanning these rows.
(2)
Find the component of A orthogonal to the space spanned by these basis vectors, \({E = A-AQQ^T}\).
(3)
Sample s rows from A based on row norms of E and compute Q, the basis spanning all sampled rows.
In each round, 2k rows are sampled, except for the last round, where \(16k/\epsilon\) rows are sampled. For \(k=10\) and \(\epsilon = 0.5\), we would need 38 rounds of sampling and 1,060 rows would be sampled to find the basis. Once the iteration is over, Q is an approximation to the basis for the row space. To obtain the approximate left singular vectors, we need to additionally find the SVD of AQ. For the \(V_{2D}\) dataset, we see that AQ is approximately the same size as A, and it is larger for the Faces dataset (for which, in any case, one cannot obtain more than 400 basis vectors for the column space).
The advantages of this method are that (a) the error bound is multiplicative and (b) it directly approximates the dominant left and right subspaces. The sampled rows are not scaled, thus giving a better approximation to the dominant right singular vectors.
The algorithm can be made more efficient if (a) there is some pruning of the basis vectors in Q in each iteration, depending on whether they belong to the dominant subspace or not, and (b) the computation of \(E=A-AQQ^T\) is avoided. Since the memory required by \(AQQ^T\) is the same as that of A, this step becomes inefficient when the matrix sizes are large.
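For concreteness, the sketch below illustrates the sampling rounds described above. It is our own simplified rendering (the parameter names n_rounds and s_last are ours), not the algorithm of Deshpande and Vempala [2006] verbatim, and it recomputes the residual explicitly, which is exactly the expensive step mentioned in point (b).

```python
import numpy as np

# Illustrative sketch of adaptive row sampling.  Each round samples rows
# with probability proportional to the squared distance from the span of
# the rows sampled so far.
def adaptive_row_sampling(A, k, n_rounds, s_last, rng=np.random.default_rng(0)):
    m, n = A.shape
    Q = np.zeros((n, 0))                       # basis for the span of sampled rows
    for rnd in range(n_rounds):
        E = A - (A @ Q) @ Q.T                  # residual of A w.r.t. current span
        p = np.sum(E * E, axis=1)
        p /= p.sum()                           # probability proportional to squared residual row norm
        s = s_last if rnd == n_rounds - 1 else 2 * k
        rows = np.unique(rng.choice(m, size=s, replace=True, p=p))
        Q, _ = np.linalg.qr(np.hstack([Q, A[rows].T]))   # extend and orthonormalize
    return Q                                   # approximate basis for the row space
```

After the final round, the approximate left singular vectors are obtained from the SVD of \(AQ\), as noted above.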
3 Proposed Iterative Sampling and Merging Algorithm (ISMA)
In our algorithm, for improved computational efficiency and accuracy, we would like to have several rounds of column or column and row sampling rather than just row sampling. More importantly, in each round, we would also like to prune the basis vectors that are not in the dominant subspace. We would like to stop the iteration when additional sampling does not improve the quality of either the modes or the subspace spanned by the modes. All sampling is done with replacement and duplicates are removed.
We use the following steps in our iteration.
(1)
Sample a set of columns from A based on column norms. Optionally, also sample rows. We use Equations (3) and (5) to estimate the number of columns and rows to be sampled, respectively. Compute SVD of the sampled matrix. Truncate to get the dominant POD modes.
(2)
From the remaining columns, sample an additional set of columns or columns and rows. For this, we explore various sampling strategies. These are discussed in Section 3.1. Find distinct columns (optionally, if rows are sampled, remove duplicates and scale remaining rows), compute POD modes of the newly sampled matrix and use it to update the dominant modes.
(3)
Find the cosines between the previously computed and newly updated dominant modes or the principal cosines between the subspaces. If any one of the cosines is less than the desired value, \(\tau\), continue the iteration.
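The following is a minimal, self-contained sketch of these three steps for the column-only variant (ICS) with uniform sampling after the first round (UNF). The function and parameter names (c0, c_add) are ours, and it assumes \(c_0 \ge 3k\); for clarity, the update is done with a plain SVD of the current rank-r factors stacked with the newly sampled columns, whereas Algorithm 1 uses the cheaper QR-based MAT operation described in Section 3.2.

```python
import numpy as np

# Minimal sketch of the iteration for the column-only variant (ICS) with
# uniform sampling after the first round (UNF).
def isma_columns(A, k, c0, c_add, tau=0.99, rng=np.random.default_rng(0)):
    m, n = A.shape
    r = 3 * k                                            # keep r = 3k modes (Section 3.2.1)
    p = np.sum(A * A, axis=0) / np.sum(A * A)            # column-norm probabilities
    sampled = np.unique(rng.choice(n, size=c0, replace=True, p=p))
    U, s, _ = np.linalg.svd(A[:, sampled], full_matrices=False)
    U, s = U[:, :r], s[:r]                               # step 1: initial dominant modes
    remaining = np.setdiff1d(np.arange(n), sampled)
    while remaining.size > 0:
        # step 2: sample additional columns uniformly and update the modes
        new = np.unique(rng.choice(remaining, size=min(c_add, remaining.size),
                                   replace=True))
        U_new, s_new, _ = np.linalg.svd(np.hstack([U * s, A[:, new]]),
                                        full_matrices=False)
        U_new, s_new = U_new[:, :r], s_new[:r]
        remaining = np.setdiff1d(remaining, new)
        # step 3: check cosines between previous and updated top-k modes
        cosines = np.abs(np.sum(U[:, :k] * U_new[:, :k], axis=0))
        U, s = U_new, s_new
        if np.all(cosines >= tau):
            break
    return U[:, :k], s[:k]
```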
Algorithm 1 details the steps involved in the iteration. The main iteration is the procedure Iterative Sampling and Merging Algorithm (ISMA). It uses three other procedures: GET UPDATE, SAMPLE AND SCALE UNIQUE ROWS, and BLOCK MERGE. The procedure GET UPDATE returns the POD modes of the newly sampled columns. In case rows are also sampled (the parameter \(rows=1\)), GET UPDATE in turn calls the procedure SAMPLE AND SCALE UNIQUE ROWS, which performs row sampling, removes duplicates, and scales the remaining rows. BLOCK MERGE is used to update the dominant subspace. It is described further in Section 3.2.
The matrix D in GET UPDATE contains the (distinct) sampled columns and the matrix W in SAMPLE AND SCALE UNIQUE ROWS is obtained after additionally sampling rows. If there is no row sampling, GET UPDATE returns the r dominant POD modes, \(\tilde{U}_r\) and singular values which are computed using a SVD of \(D^TD\). Otherwise, the approximate left singular vectors are computed using the right singular vectors of W (line 31).
When using right singular vectors of W as an approximation to those of D, the modes obtained may not be orthonormal. Therefore, for the first approximation of the modes, we orthonormalize using QR and SVD of a smaller matrix (lines 4–6 of Algorithm 1). In the subsequent iterations, orthonormalization is implicit in the BLOCK MERGE algorithm.
The iteration stops when the POD modes computed in two successive iterations are similar. Similarity is measured using cosines between the modes or principal cosines between subspaces spanned by the modes. After the iterations converge, the quality of the approximated subspace can be measured using Wedin’s theorem. It requires estimation of the singular values and right singular vectors (lines 15–18 in Algorithm 1). We describe this measure in Section 3.3.
3.1 Sampling Strategies
3.1.1 Column Sampling.
In the first round, we do a column norm based sampling, with probabilities as given in Equation (2), to get an approximation of the dominant modes. Additional columns are sampled in subsequent iterations in an attempt to correct the modes. The sampling strategy for the additional columns is a compromise between the number of iterations and the computational effort required in each iteration. We explored three sampling strategies for adding more columns.
(1)
Uniform sampling (UNF) is easy to implement, as no extra computation of probabilities is necessary. A larger number of iterations is needed, as there is no way to avoid sampling unimportant or highly correlated columns.
(2)
\(L_2\) norm sampling (L2N) requires only simple arithmetic manipulation of the probabilities from the second iteration, but may repeatedly sample columns that are similar or highly correlated.
(3)
ORT requires us to project the matrix onto the current estimate of the dominant subspace (\(\tilde{U}\)) and sample based on the norm of the component orthogonal to this subspace. The probability of sampling the \(j\hbox{th}\) column is computed as
\(p_j = \frac{||\mathbf {a}_j - \tilde{U}\tilde{U}^T\mathbf {a}_j||_2^2}{\sum _{l} ||\mathbf {a}_l - \tilde{U}\tilde{U}^T\mathbf {a}_l||_2^2}.\)
Note that it avoids sampling highly correlated components and can lead to fewer iterations. However, the computational and memory requirements in each iteration are large, making each iteration slow and sometimes impractical for large datasets.
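A short sketch of this computation (our own naming; in practice the probabilities would be restricted to the columns that have not yet been sampled):

```python
import numpy as np

# Sketch of ORT sampling probabilities: squared norm of each column's
# component orthogonal to the current estimate U_tilde of the dominant
# subspace, normalized to sum to one.
def ort_probabilities(A, U_tilde):
    E = A - U_tilde @ (U_tilde.T @ A)     # orthogonal components of all columns
    w = np.sum(E * E, axis=0)             # squared column norms of the residual
    return w / w.sum()
```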
3.1.2 Row Sampling.
We have tried two strategies for row sampling:
(a)
\(L_2\) norm of rows and
(b)
leverage scores (LS).
Sampling based on \(L_2\) norm of the rows has been used in Drineas et al. [2006] as well as in the adaptive sampling algorithm. Both LS based sampling and uniform sampling have been used by Yamazaki et al. [2017], who propose an incremental algorithm for SVD. Their algorithm reduces the size of the matrix by projecting columns and sampling rows. Computing leverage scores requires an estimate of the left singular vectors. Therefore, it is not done in the first round of sampling.
When rows are also sampled, we use the right singular vectors and the singular values of W as an approximation to those of D. These are used to compute the POD modes (see line 31 of Algorithm (1)). For this approximation to be accurate, the primary requirement is that the singular values computed using W be accurate. In order to maintain the accuracy of the singular values, we scale the sampled rows as specified in the procedure SAMPLE AND SCALE UNIQUE ROWS. We remove the duplicate rows that were sampled and scale the remaining rows by \(\sqrt {\hat{t}_i/(wq_i)}\), where w is the number of rows sampled, \(\hat{t}_i\) is the number of occurrences of each sampled row, and \(q_i\) denotes its sampling probability. This scale factor makes the computed SVD equal to the SVD obtained without removing duplicates. This is similar to what is done in Sun et al. [2007] for sampling columns.
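A minimal sketch of this step, under our own naming (it is not the pseudo-code of Algorithm 1):

```python
import numpy as np

# Sample w rows of D with replacement using probabilities q, keep each
# distinct row once, and scale it by sqrt(t_i / (w * q_i)), where t_i is
# the number of times the row was drawn.
def sample_and_scale_unique_rows(D, q, w, rng=np.random.default_rng(0)):
    idx = rng.choice(D.shape[0], size=w, replace=True, p=q)
    rows, counts = np.unique(idx, return_counts=True)   # distinct rows, multiplicities t_i
    scale = np.sqrt(counts / (w * q[rows]))
    return D[rows] * scale[:, None]                      # W: scaled sampled rows
```

The right singular vectors and singular values of the returned matrix W are then used in place of those of D, as in line 31 of Algorithm 1.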
3.2 Merge and Truncate (MAT) Operation
Our algorithm requires an efficient technique for estimating the POD vectors in each iteration. Algorithms to estimate these vectors have been proposed in Bai et al. [2005], Liang et al. [2014], and Qu et al. [2002]. Also, several techniques have been proposed for incremental SVD [Brand 2002; Gu and Eisenstat 1993; Brand 2006; Baker et al. 2012], which can be used to get the principal components. We decided to use the algorithm in Vasudevan and Ramakrishna [2017] (which is a generalization of the algorithm in Iwen and Ong [2016]) for the following reasons.
(a)
We directly obtain the left singular vectors as the output of the operation.
(b)
These left singular vectors are obtained using SVD and QR of smaller matrices. Therefore, they are orthonormal by construction.
Let A be a matrix of size \(m\times n\) that consists of sub-matrices X and Y containing \(n_1\) and \(n_2\) columns, respectively, where \(n = n_1+n_2\). Assume that the rank-r approximations of X and Y are \(U_{1_r}\Sigma _{1_r}V_{1_r}^T\) and \(U_{2_r}\Sigma _{2_r}V_{2_r}^T\). The rank-r approximation of A, \(\tilde{A}_r\), can be computed from the individual SVDs by first merging the two SVDs and then truncating it to a rank-r approximation as follows.
The component of \(U_{2_r}\) orthogonal to \(U_{1_r}\) is \({U_t =U_{2_r}-U_{1_r}(U_{1_r}^TU_{2_r})}\). If \({U_t = U_oR}\) is the corresponding QR decomposition, we have
\([X_r \quad Y_r] = [U_{1_r} \quad U_o] \left[\begin{array}{cc} \Sigma _{1_r} & (U_{1_r}^TU_{2_r})\Sigma _{2_r} \\ \mathbf {0} & R\Sigma _{2_r} \end{array}\right] \left[\begin{array}{cc} V_{1_r} & \mathbf {0} \\ \mathbf {0} & V_{2_r} \end{array}\right]^T = U\Sigma V^T,\)
where \(U_E\Sigma _EV_E^T\) is the SVD of the middle matrix, \(U = [ U_{1_r} \quad U_o ] U_E\), \(\Sigma = \Sigma _E\), and \(V = [\begin{array}{@{}c@{\quad }c@{}} \scriptstyle V_{1_r} & \scriptstyle \mathbf {0} \\ \scriptstyle \mathbf {0} & \scriptstyle V_{2_r} \end{array}] V_E\). These singular values and vectors are once again truncated to get \(\tilde{U}_r\) and \(\tilde{\Sigma }_r\). Algorithm (2) details the steps involved. In general, if there are P partitions, a series of MAT operations can be used to obtain the r left singular vectors. Since we are only interested in the left singular vectors, we do not compute V.
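The construction above translates directly into a few lines of dense linear algebra. The sketch below is a simplified illustration of one MAT step (our own naming, not Algorithm 2 verbatim); it returns only the left singular vectors and singular values.

```python
import numpy as np

# One merge-and-truncate (MAT) step: merge (U1, s1) and (U2, s2), the
# truncated left factors of two column blocks, and truncate to rank r.
def merge_and_truncate(U1, s1, U2, s2, r):
    Z = U1.T @ U2
    Ut = U2 - U1 @ Z                         # component of U2 orthogonal to U1
    Uo, R = np.linalg.qr(Ut)
    # middle matrix of the merged factorization
    top = np.hstack([np.diag(s1), Z * s2])
    bot = np.hstack([np.zeros((R.shape[0], s1.size)), R * s2])
    Ue, se, _ = np.linalg.svd(np.vstack([top, bot]), full_matrices=False)
    U = np.hstack([U1, Uo]) @ Ue             # merged left singular vectors
    return U[:, :r], se[:r]                  # truncate to rank r
```

With P partitions, repeatedly folding the truncated SVD of each new block into the running estimate with this operation yields the r dominant left singular vectors.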
The penalty we pay for getting a more accurate estimate of the dominant subspace is finding the SVD of the newly sampled columns to do a MAT operation in each iteration. However, the number of additional sampled columns in each iteration is much lower than the total number of columns in the matrix. Therefore, a speedup is obtained since the complexity of SVD computation depends quadratically on the number of columns for “tall and thin” matrices.1 Assuming there are P partitions and all n columns are sampled, with the same number of columns sampled in each iteration, the total number of flops required for computing a rank-r approximation of A using a series of MAT operations is \(\approx \frac{14mn^2}{P} + \frac{192 n^3}{P^2}\), where \(r \ll n\) [Vasudevan and Ramakrishna 2017].
3.2.1 Error Analysis for Series of MAT Operations.
We derive an error bound in terms of the spectral norm of the error in the rank-r matrix obtained using a series of MAT operations. For this analysis, we use the following well known results from literature [Li 2014; Mathias 2014]. Let X be an \(m\times n\) matrix and \(q=\min (m,n)\).
(1)
If \(X = U\Sigma V^T\), and \(X_r=U_r\Sigma _rV_r^T\) is a rank-r approximation to X obtained by retaining the top r singular values, then, \(||X-X_r||_2 = \sigma _{r+1}(X)\) and \(||X-X_r||_2\le ||X-B_r||_2\), where \(B_r\) is any rank-r matrix.
(2)
If Y is an \(m\times t\) sub-matrix of X, then \(\sigma _{i+n-t}(X)\le \sigma _i(Y)\le \sigma _i(X)\).
(3)
If \(\hat{X}=X+\Delta X\) is a perturbation of X, then \(|\sigma _i-\hat{\sigma }_i|\le ||\Delta X||_2\), \(i = 1, 2, \dots , q\).
We use the following notation. Let the matrix \(X = [X_1 \quad X_2\quad \cdots \quad X_P]\) be divided into P partitions. Define \(Y_0 = [X_1]\), \(Y_1 = [Y_0 \quad X_2]\), and so on. In general, \(Y_j = [Y_{j-1} \quad X_{j+1}]\), so that \(Y_{P-1} = X\). Let \(X_r\) denote the optimal rank-r approximation of X, and \({X_i}_r\) that of \(X_i\). Define \(Z_1 = [{X_1}_r \quad {X_2}_r]\) and \(Z_j = [{Z_{j-1}}_r \quad {X_{j+1}}_r]\). Let \(Z = Z_{P-1}\). Therefore, the error is \(E = ||X - Z_r||_2\), where \(Z_r\) is the optimal rank-r approximation of Z.
The bound for the error can be derived as follows. Let \(e_j = ||Y_j - Z_j||_2\). Using the results listed above,
\(\begin{array}{rl} e_j = ||[Y_{j-1}-{Z_{j-1}}_r \quad X_{j+1}-{X_{j+1}}_r]||_2 & \le ||Y_{j-1}-{Z_{j-1}}_r||_2 + ||X_{j+1}-{X_{j+1}}_r||_2 \\ & \le ||Y_{j-1}-Z_{j-1}||_2 + \sigma _{r+1}(Z_{j-1}) + \sigma _{r+1}(X_{j+1}) \\ & \le 2e_{j-1} + \sigma _{r+1}(Y_{j-1}) + \sigma _{r+1}(X_{j+1}) \\ & \le 2e_{j-1} + 2\sigma _{r+1}(X). \end{array} \qquad (12)\)
The last inequality follows since \(Y_{j-1}\) and \(X_{j+1}\) are submatrices of X. Since \(||X - Z||_2 = ||Y_{P-1} - Z_{P-1}||_2\), it can be computed iteratively using Equation (12), starting from \(e_1 \le \sigma _{r+1}(X_1) + \sigma _{r+1}(X_2) \le 2\sigma _{r+1}(X)\), which gives \(||X-Z||_2 \le (2^{P}-2)\sigma _{r+1}(X)\). Using \(||X-Z_r||_2 \le ||X-Z||_2 + \sigma _{r+1}(Z) \le 2||X-Z||_2 + \sigma _{r+1}(X)\), the bound for the error, \(||X-Z_r||_2\), is \((2^{P+1}-3)\sigma _{r+1}(X)\).2 We note the following:
(1)
In the worst case, \(\sigma _{r+1}(X_i)\) is close to \(\sigma _{r+1}(X)\) and the error is close to \(({2^{P+1}-3}) \sigma _{r+1}(X)\). But this typically happens when the size of the submatrix is large, which means P is small.
(2)
If there are many partitions, the number of MAT operations is large, but each \(\sigma _{r+1}(X_i)\) is likely to be small. If Y is an \(m \times t\) submatrix of X, then \(\sigma _{r+1+n-t} (X) \le \sigma _{r+1}(Y) \le \sigma _{r+1}(X)\). Since each submatrix is thin and t is small, the lower bound is quite small. Hence, the error in the average case is likely to be much smaller than in the worst case. In practice, we have found this to be true.
Overall, if we need k POD vectors to be approximated well, then either the number of partitions should be small or we can set r to be larger than k. We set r to 3k to get good accuracies.
3.3 Quality of Converged Solution
In order to estimate the accuracy of the computed POD modes, we need a measure to quantify the error in the approximation. After convergence, we can get a bound on the error in the subspaces using Wedin’s theorem [Wedin 1972]. Assume that at the end of the iteration, we get \(\tilde{A}_r = \tilde{U}_r\tilde{\Sigma }_r \tilde{V}_r^T\), out of which we take the top k singular values and vectors. Define \(R = A\tilde{V}_k - \tilde{U}_k \tilde{\Sigma }_k\) and \(S = A^T\tilde{U}_k - \tilde{V}_k\tilde{\Sigma }_k\). Assume \(\sin (\Theta)\) and \(\sin (\Phi)\) are diagonal matrices containing the sine of principal angles between subspaces spanned by \(U_k\) and \(\tilde{U}_k\) and \(V_k\) and \(\tilde{V}_k\) respectively. The sines of the principal angles are a measure of the distance between the accurate and approximate dominant left and right subspaces.
Wedin’s theorem [Wedin 1972; Li 1996; Stewart and Sun 1990] gives a bound on the norms of the sines of the principal angles in terms of the norms of R and S. The statement of the theorem is as follows:
\(\sqrt {||\sin (\Theta)||_F^2 + ||\sin (\Phi)||_F^2} \le \frac{\sqrt {||R||_F^2 + ||S||_F^2}}{\omega }, \qquad (13)\)
where \(\omega \triangleq \min \lbrace \min \nolimits _{j=1, \ldots n-k} |\tilde{\sigma }_k - \sigma _{k+j}|, \tilde{\sigma }_k \rbrace \gt 0\) and \(\sigma _i\) are the singular values of A. In order to compute \(\omega\), we need all the singular values of A, which are not available. However, at the end of the iteration, we have an approximation of \(r(=3k)\) singular values. Assuming that the \(k\hbox{th}\) and \((k+1)\hbox{th}\) singular values are approximated well (which was the case in all the datasets we have looked at), we estimate \(\omega\) as \(\min \lbrace |\tilde{\sigma }_k - \tilde{\sigma }_{k+1}|, \tilde{\sigma }_k\rbrace\). If our iteration converges to the correct subspaces, the right hand side should be close to zero. If it approaches \(\sqrt {2k}\), either (a) additional columns need to be sampled or (b) the \(k\hbox{th}\) and \((k+1)\hbox{th}\) singular values are very close to each other and small errors in the singular values result in large values for the bound. In the limit when they are identical, any vector in the subspace is a POD mode and the bound can become arbitrarily large. In this case, a better measure of the quality of the solutions can be obtained after increasing k. Usually, increasing k by one or two is sufficient.
Our main focus is the POD modes. However, in order to compute the bound, the singular values and right singular vectors are also required. Since we do not scale columns and use only a subset of the columns to compute the modes, the error in the singular values (not the vectors) could be significant. This is especially true if the total number of columns sampled is much less than the number of columns in the matrix. The left singular vectors will be quite accurate if the columns corresponding to the dominant subspace are captured, but the singular values will not be. If this is the case, we use the technique followed in the adaptive sampling method, i.e., after the modes have converged, the singular values and the right singular vectors are obtained by computing the SVD of \(\tilde{U}_k^TA\), as detailed in lines 15–18 in Algorithm (1).
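The measure is cheap to evaluate once the factors are available. The sketch below is our own minimal version, assuming the top k approximate singular values and vectors, plus an approximation of the \((k+1)\hbox{th}\) singular value, are at hand.

```python
import numpy as np

# Right-hand side of Wedin's bound (Equation (13)), with omega estimated
# from the approximate singular values.  U_k (m x k), V_k (n x k), and
# sigma (length k) are the approximate factors; sigma_next approximates
# the (k+1)th singular value.
def wedin_measure(A, U_k, V_k, sigma, sigma_next):
    R = A @ V_k - U_k * sigma              # R = A V_k - U_k Sigma_k
    S = A.T @ U_k - V_k * sigma            # S = A^T U_k - V_k Sigma_k
    omega = min(abs(sigma[-1] - sigma_next), sigma[-1])
    rhs = np.sqrt(np.linalg.norm(R, 'fro')**2 + np.linalg.norm(S, 'fro')**2) / omega
    return rhs                             # compare against the limit sqrt(2k)
```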
4 Results
All algorithms are run on a 4-core Intel Core i7-6700K processor with a maximum clock frequency of 4 GHz and hyper-threading on. The system has 32 GB of RAM. For the YF dataset, the mean centered data requires 40 GB of memory. Therefore, for this dataset, the runs for the iterative algorithm were done on a different system, an 8-core Intel Xeon E5-2650 v2 processor with a maximum clock frequency of 2.6 GHz, hyper-threading on, and 64 GB of RAM.
The algorithms are written in Python and run in Python version 3.5.3, using numpy version 1.12.1-3, scipy version 0.18.1-2, and openBLAS version 0.2.19-3, which is multi-threaded. Double precision was used for all computations. We used the linear algebra routines in Scipy, which are based on LAPACK, for SVD computation. The random number generator in numpy is used. Runtime and speedup values are an average of five runs. We report the speedup obtained when openBLAS is run with a single thread (to give a measure of the operation count in each algorithm) and with four threads (since our system is a four-core system), which gives an indication of how the runtime scales with the number of threads. We have not explicitly parallelized our algorithm and any improvement with multiple threads is entirely due to openBLAS.
The accurate POD modes of A can be computed using SVD(\(A^TA\)) as follows: (a) find the right singular vectors and singular values of \(A^TA\) using either SVD or eigendecomposition, (b) compute the k left singular vectors as \(\mathbf {u}_i=A\mathbf {v}_i/\sigma _i\). We denote this algorithm as SVD-\(A^TA\).
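A minimal sketch of this procedure (our own version, not the benchmarked implementation):

```python
import numpy as np

# SVD-A^TA: compute the k dominant POD modes of a tall and thin matrix A
# from the SVD of the small n x n Gram matrix.
def svd_ata(A, k):
    G = A.T @ A                                  # n x n Gram matrix
    _, s2, Vt = np.linalg.svd(G)                 # eigenvalues of G = squared singular values of A
    sigma = np.sqrt(s2[:k])                      # top-k singular values of A
    U = (A @ Vt[:k].T) / sigma                   # u_i = A v_i / sigma_i
    return U, sigma, Vt[:k]
```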
If A is a tall and skinny matrix, computing SVD(A) using SVD-\(A^TA\) is faster. This is because SVD(A) takes approximately \(\beta mn^2\) floating point operations (FLOPs). SVD-\(A^TA\) requires \(mn^2\) FLOPs to compute \(A^TA\), \((\beta + 16)n^3\) to compute SVD(\(A^TA\)), and another \(mn^2\) operations to compute \(\mathbf {u}_i\) from \(A\mathbf {v}_i/\sigma _i\). Assuming \(\beta =6\) [Golub and Van Loan 2007; Björck 2015], this gives a speedup of \(m/(m/3 + (1+8/3)n)\). The results in Table 1 are close to what is expected. Since SVD-\(A^TA\) is faster, we benchmark the speedup of the algorithms with respect to the time taken to compute k POD modes using SVD-\(A^TA\).
Table 1. Average Time Taken (in Seconds) by SVD(A) and SVD-\(A^TA\) for Different Datasets

| Dataset | SVD(A), single thread | SVD(A), 4 threads | SVD-\(A^TA\), single thread | SVD-\(A^TA\), 4 threads |
| \(V_{2D}\) | 25.28 | 13.94 | 8.39 | 2.75 |
| Faces | 0.35 | 0.17 | 0.13 | 0.069 |
| Cropped Yale faces (CYF) | 34.04 | 16.43 | 15.23 | 6.86 |
| Yale faces (YF) | - | - | - | - |
Note that SVD of YF could not be computed in our test system (32 GB) since it does not fit in the system’s memory.
4.1 Accuracy and Runtime of the Iterative Algorithms
We evaluated two versions of the ISMA algorithm (Algorithm (1)): an iterative column sampling (ICS) algorithm and an iterative column and row sampling (ICRS) algorithm. We tested the performance of the iterative algorithms with two different convergence criteria: cosines between the modes obtained in successive iterations and the principal cosines between the subspaces computed in successive iterations. As indicated previously, we tried different sampling strategies from the second iteration. Table 2 lists the different sampling strategies we use in this article.
Table 2. Sampling Strategies used for Sampling Rows and Columns by Iterative Sampling Algorithms

| Strategy | Column sampling | Row sampling (when applicable) |
| L2N | \(L_2\) norm sampling of columns in all iterations | – |
| UNF | uniform sampling from second iteration | \(L_2\) norm sampling |
| ORT | from second iteration, sampling based on orthogonal complement of A projected onto current approximation of modes | \(L_2\) norm sampling |
| LS | uniform sampling from second iteration | leverage score sampling using current approximation of modes from second iteration |
For all sampling strategies, in the first iteration, columns (and rows in the case of column and row sampling) are sampled based on their respective \(L_2\) norms.
Using \(\epsilon\) and \(\delta\) around \(0.6-0.7\) resulted in a reasonable tradeoff between the number of columns (and rows) sampled in each iteration and the number of iterations. For the Faces dataset, we chose a higher \(\epsilon , \delta\) since the number of columns in the matrix is quite small. ORT was not used in the iterative algorithms for YF, since the overhead for computing the orthogonal complement in each iteration proved to be large. Table 3 lists the parameter values used for the iterative algorithms. In all cases, the tolerance parameter for convergence \(\tau\), is set to 0.99.
Table 3. Parameter Values (k, \(\epsilon\), \(\delta\)) used in the Iterative Algorithms for Various Datasets

| Case | \(V_{2D}\), CYF (k, \(\epsilon\), \(\delta\)) | Faces (k, \(\epsilon\), \(\delta\)) | YF (k, \(\epsilon\), \(\delta\)) |
| I1 | 10, 0.7, 0.6 | 10, 1, 0.75 | 20, 0.6, 0.6 |
| I2 | 2, 0.7, 0.6 | 2, 0.7, 0.6 | 5, 0.6, 0.6 |
Figure 1 shows the first k mode angles computed by the iterative algorithms using L2N, UNF, ORT, and LS. This is run with convergence of individual modes. As expected, the accuracies are significantly better than for a single round of sampling.
Fig. 1.
We also ran the algorithm with convergence of the subspace spanned by the modes as the criterion, with the same value for \(\tau\). Figure 2 shows the principal angles between the subspaces spanned by the approximate modes and the modes obtained using the truncated SVD. The algorithms give results similar to those in Figure 1 for all datasets. Moreover, it is clear that the accuracies are similar for all the sampling strategies, indicating it is sufficient to use a uniform distribution after the first round of sampling. This is advantageous because UNF is the least complex and computationally the most efficient of the column sampling strategies.
Fig. 2.
To evaluate the quality of the subspaces obtained, we computed the measure for the error in the subspaces using Equation (13). Table 4(a) shows the error measure for various cases. In general, the error measure is small, indicating good accuracies in the modes. In a few cases (\(V_{2D}\) (\(k=2\)) and YF (\(k=20\))), the measure is large. As discussed, this could happen if \(\sigma _k\) and \(\sigma _{k+1}\) are clustered together. Small changes in the singular values can result in large variations in the measure (as seen with L2N and UNF). Figure 7 in Appendix A shows the dominant singular values of all our test datasets. It can be seen that \(\sigma _2\) and \(\sigma _3\) of \(V_{2D}\) are very close in value, whereas \(\sigma _4\) is well separated. Therefore, when k is increased to 3, the error measure reduces significantly. This can be seen in Table 4(b).
Table 4.
The other case for which the measure is large is YF (UNF). In this case also, several singular values are clustered together, as seen in Figure 7. The error drops when k is increased, as seen in Table 4(b). If L2N is used for sampling, the measure is small. It turns out that all the columns were sampled in this case, leading to a low error. In general, it can be seen from Figures 1 and 2 that the accuracy of the modes is much better than indicated by the measure. Also note that, in most cases, our approximation of \(\omega\) worked well. The average error in \(\omega\) was found to be around 6%.
Figures 3 and 4 contain the speedup obtained when the iterative algorithms are run with convergence of individual modes and of the subspaces formed by the modes as the convergence criterion, respectively. Only the smaller datasets are included; YF is discussed later. Speedup is evaluated with respect to the time taken to compute k POD modes using SVD-\(A^TA\). We note that (a) very good speedup is obtained for CYF for both single and multi-threaded execution. The speedup is the least when ORT is used as the sampling strategy, which is as expected. Since the accuracies of all schemes are similar, it makes sense to use the UNF strategy as it gives the maximum speedup. (b) For small datasets such as the Faces dataset, it is better to use SVD-\(A^TA\). (c) \(V_{2D}\) has about 1,000 columns. The results show that it may be worthwhile to use the iterative algorithm if k is small.
Fig. 3.
Fig. 4.
We found that for \(k=10\) (I1), nearly all columns were sampled for the Faces, \(V_{2D}\), and CYF datasets when convergence was achieved, both in terms of mode and principal angles. For I2 (\(k=2\)), the number of columns sampled was substantially lower than n in all cases and the speedup is much larger. For the \(V_{2D}\) and Faces datasets, the speedup is almost the same for both convergence criteria. However, for CYF, the speedup obtained is substantially larger with convergence of the subspace for \(k=2\). This is because the first two singular values are almost identical. Therefore, more columns are sampled to capture the individual modes accurately than to capture the subspace spanned by the modes.
As expected, in comparison to the other sampling strategies, ORT is slower. UNF, LS, and L2N have very similar speedups in most cases. This reinforces our conclusion that uniform sampling from the second iteration is as good as the other sampling strategies we tried.
As indicated earlier, the mean centered dataset for YF occupies 40 GB of memory. We implemented Algorithm 1 on a system with 64 GB of RAM and compared the results obtained using UNF with (a) SVD(\(A^TA\)) and (b) the projection algorithms proposed in Erichson et al. [2017], Halko et al. [2011a], and Rokhlin et al. [2010]. We implemented two versions of the projection algorithms: one in which the rows are projected onto the random matrix (RSVD) [Halko et al. 2011a] and the other in which the columns are projected (COLRSVD) [Rokhlin et al. 2010; Erichson et al. 2017]. We report results for RSVD and COLRSVD for the YF dataset with three subspace iterations and \(s=k+5\), as suggested in Halko et al. [2011a]. The wall clock and computational times taken for the YF dataset using these algorithms are shown in Table 5. The wall clock time includes the time required to read the data from the disk. Figure 5 shows the mode angles obtained. We can see that for the Yale faces dataset with \(k=5\), the projection algorithms and the sampling algorithms have similar accuracy. The sampling algorithms are \(\sim 1.4\times\) faster with respect to wall clock times and \(\sim 2\times\) faster with respect to computational times. For \(k=20\), the wall clock times of the projection and sampling algorithms are similar (except for ICS-UNF), but the projection algorithms have a large error in comparison to the iterative sampling algorithms. Also note that the iterative sampling algorithms are much faster (\(7-30\times\)) than the truncated SVD.
Fig. 5.
Table 5. Wall Clock (WT) and Computational Times (CT) in Seconds (s) for Projection and Iterative Sampling Algorithms for YF Dataset

| Case | Algorithm | single thread WT(s) | single thread CT(s) | 8 threads WT(s) | 8 threads CT(s) |
| \(k=5\) (I2) | ICS-UNF | 217.79 | 50.78 | 224.062 | 38.57 |
| | ICRS-UNF | 205.71 | 42.11 | 222.226 | 41.73 |
| | RSVD | 298.75 | 108.43 | 283.5 | 108.8 |
| | COLRSVD | 286.44 | 109.83 | 281.78 | 108.97 |
| | SVD-\(A^TA\) | 7151.07 | 6964.7 | 2149.77 | 1962.77 |
| \(k=20\) (I1) | ICS-UNF | 626.18 | 445.34 | 299.05 | 130.59 |
| | ICRS-UNF | 379.16 | 198.19 | 278.74 | 111 |
| | RSVD | 335.12 | 160.77 | 290.7 | 159.32 |
| | COLRSVD | 346.89 | 159.98 | 290.18 | 159.48 |
| | SVD-\(A^TA\) | 7403.33 | 7216.96 | 2171.17 | 1984.17 |
Algorithms were run in a 64 GB system since YF does not fit in our test system.
4.2 Incremental-Iterative Algorithm for Matrices that do Not Fit in the RAM
As noted earlier, the mean-centered data matrix for the YF dataset requires 40 GB of memory and does not fit in a system that has 32 GB of RAM. There are two possible ways to compute the POD modes of this dataset on such a system:
(1)
Run the iterative algorithm (Algorithm (1)). In the first pass, data is read in chunks that fit in the memory to compute the sampling probabilities. In the next pass, the data is read again to form the sampled matrix. Assuming we have enough memory for the sampled matrix and its left singular vectors in the RAM, \(\tilde{U}\) and \(\tilde{\Sigma }\) of the matrix are computed. Subsequent iterations may require one or both passes on data depending on the sampling strategies used. The iteration is continued until Algorithm (1) converges.
(2)
Do an incremental computation by running the iterative algorithm on each block of the partitioned data as follows. Load the block into memory and use ICS/ICRS to compute \(\tilde{U}\), \(\tilde{\Sigma }\) of the block. Use the MAT algorithm to get an updated \(\tilde{U}\), \(\tilde{\Sigma }\) of the dataset. Delete the partition and the sampled columns and continue the process until all blocks of data have been processed. Algorithm (3) details the steps involved.
The first alternative requires \(i+1\) passes over the matrix for ICS, where i is the number of iterations for convergence. For ICRS, it takes a maximum of \(3i+1\) passes for i iterations. By passes, we mean accessing the entire matrix, though it may be done in blocks. The second alternative requires only one pass over the entire matrix, irrespective of whether ICS or ICRS is used to find \(\tilde{U}\), \(\tilde{\Sigma }\) of each partition. Clearly, the first alternative is severely bottlenecked by disk access times. For YF, we used the second option and performed an incremental computation after partitioning the matrix into four blocks. We ran the iterative algorithms on each block using the same parameters as given for YF in Table 3.
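The second option can be sketched as follows. This is a simplified illustration under our own naming: each block's modes are obtained here with a plain truncated SVD and the merge is a plain SVD of the stacked factors, whereas Algorithm 3 uses ICS/ICRS and the MAT operation; the .npy block files are an assumption.

```python
import numpy as np

# Incremental computation of the dominant POD modes, one block at a time.
def incremental_pod(block_files, k):
    r = 3 * k
    U, s = None, None
    for f in block_files:
        B = np.load(f)                               # one partition that fits in RAM
        Ub, sb, _ = np.linalg.svd(B, full_matrices=False)
        Ub, sb = Ub[:, :r], sb[:r]                   # modes of this block
        if U is None:
            U, s = Ub, sb
        else:                                        # merge and truncate
            Um, sm, _ = np.linalg.svd(np.hstack([U * s, Ub * sb]),
                                      full_matrices=False)
            U, s = Um[:, :r], sm[:r]
        del B                                        # discard the partition
    return U[:, :k], s[:k]
```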
Figure 6 shows that both mode angles and the principal angles are less than \(5^{\circ }\). Table 6 shows the wall clock time and computational time for ICS-UNF, ICRS-UNF with convergence of (a) modes and (b) subspace spanned by modes. It can be seen that for both convergence criteria, time taken is very similar. As expected, ICRS is faster than ICS. Therefore, we favor ICRS over ICS.
Fig. 6.
Table 6. Wall Clock (WT) and Computational Times (CT) in Seconds (s) for Incremental Computation of POD Modes for the Yale Faces Dataset, Run with a Single Thread in a 32 GB System

Convergence of individual modes
| Case | Sampling strategy | ICS WT(s) | ICS CT(s) | ICRS WT(s) | ICRS CT(s) |
| \(k=5\) (I2) | UNF | 146.73 | 51.14 | 134.09 | 35.92 |
| \(k=20\) (I1) | UNF | 302.06 | 205.27 | 217.22 | 121.41 |

Convergence of subspace spanned by modes
| Case | Sampling strategy | ICS WT(s) | ICS CT(s) | ICRS WT(s) | ICRS CT(s) |
| \(k=5\) (I2) | UNF | 146.59 | 50.87 | 125.76 | 30.69 |
| \(k=20\) (I1) | UNF | 299.73 | 204.11 | 211.72 | 116.45 |
5 Conclusion
In this article, we proposed an iterative algorithm to improve the modes/subspaces spanned by the modes using a previously proposed MAT algorithm. Unlike the earlier methods used for multiple rounds of sampling, we estimate the POD modes in each iteration and stop sampling when no further improvement is obtained in either the modes or subspaces spanned by the modes. The algorithm resulted in good accuracies with significant improvement in runtime even if all columns are sampled when convergence is achieved. We found that using column-norm sampling in the first round and uniform sampling in subsequent rounds resulted in good speedups, with accuracy comparable to using norm of the orthogonal component of the sampled columns.
We proposed a measure using Wedin's theorem to quantify the accuracy of the computed subspaces. In most cases, the measure was much less than its upper limit, corroborating the fact that the approximated modes were very accurate. In a few cases, the measure was large due to clustering of singular values. When the rank of the approximation, k, was increased so that this clustering is avoided, the error measure reduced to a large extent. In general, the accuracy of the modes is much better than indicated by the measure.
For large matrices that do not fit in the RAM, we used the iterative algorithm on each partition of the data (one that fits in the RAM) to approximate the dominant left singular vectors and singular values of the partition, and then used the MAT operation to get the modes of the entire dataset. As mentioned, computing the modes using incremental sampling and MAT is advantageous as it requires only one pass over the data. We obtained very good accuracy using this method when applied to our large dataset.
Footnotes
1
For “short and fat” matrices, we need to iteratively sample rows to get a runtime improvement when run on a single machine.
2
The trend for the error is similar in the expression derived in Iwen and Ong [2016] (see Theorem 3 in the reference). They find a bound on the Frobenius norm of the error that grows as \((1+\sqrt {2})^P\), which is slightly worse than ours. We used the spectral norm as it is more appropriate for noisy data, as argued in Li et al. [2017]. Our derivation is also simpler and more specific to the case we are looking at, namely, an iterative algorithm with a series of MAT operations.
3
In the case of CTSVD C1, we could not obtain 10 values with the filter (see line 13 of Algorithm (4)) and the values plotted are obtained without the filter.
A Experimental Evaluation of LTSVD and CTSVD
We evaluated the column norm based (LTSVD) and the column and row norm based (CTSVD) sampling algorithms. Details are contained in Algorithms (4) and (5). The parameters to be set in the algorithms are (a) \(\epsilon , \delta\): the error and failure probability parameters and (b) k: the desired number of POD modes. Based on these parameters, the number of columns/rows to be sampled, \(c/w\), is set according to Equations (3) and (5). We evaluated the algorithms based on (a) runtime and (b) accuracy for various sets of parameters.
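The sketch below illustrates the core of the LTSVD idea (a hedged illustration under our own naming, not Algorithm (4) verbatim): sampled columns are scaled so that the Frobenius norm is preserved in expectation, and the POD modes are recovered from the SVD of the small matrix \(C^TC\).

```python
import numpy as np

# LTSVD-style column sampling: sample c columns with replacement using
# column-norm probabilities, scale each draw by 1/sqrt(c*p_i), and take
# the top-k left singular vectors of the sampled matrix C.
def ltsvd_modes(A, k, c, rng=np.random.default_rng(0)):
    p = np.sum(A * A, axis=0) / np.sum(A * A)        # column-norm probabilities
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])              # scaled sampled columns
    # left singular vectors of C via SVD of the small matrix C^T C
    _, s2, Wt = np.linalg.svd(C.T @ C)
    H_k = (C @ Wt[:k].T) / np.sqrt(s2[:k])           # approximate POD modes
    return H_k, np.sqrt(s2[:k])                      # modes and singular value estimates
```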
A.1 Results
We would like the error parameters, \(\epsilon\) and \(\delta\), to be small (\(\lt 1\)). However, since the number of columns sampled depends inversely on powers of \(\epsilon\), it increases rapidly and often exceeds the number of columns in the matrix. This limits the parameter values that can be used. To get some representative results, we looked at three cases, namely, C1: \(c \approx n\) (lower for the YF dataset); C2: \(c \approx n/2\) (lower for CYF); and C3: \(c \ll n\) (higher for YF). For \(V_{2D}\), Faces, and CYF, we used \(k = 10,5,2\) for the three cases C1, C2, and C3, respectively. Since YF is a larger dataset, with a more gradual decay of singular values, we used \(k=20,10,5\) for the three cases. For these cases, \(k,\epsilon ,\delta ,c,w\) are shown in Table 7.
Table 7. Parameter Values (k, \(\epsilon\), \(\delta\)) used in LTSVD and CTSVD Algorithms for Various Datasets

LTSVD
| Case | \(V_{2D}\) | Faces | CYF | YF |
| C1 | 10, 0.7, 0.45 | 10, 0.75, 0.8 | 10, 0.46, 0.44 | 20, 0.35, 0.35 |
| C2 | 5, 1, 0.1 | 5, 0.75, 0.75 | 5, 1, 0.1 | 10, 0.3, 0.25 |
| C3 | 2, 1, 0.8 | 2, 1, 0.8 | 2, 1, 0.8 | 5, 0.3, 0.25 |

CTSVD
| Case | \(V_{2D}\) | Faces | CYF | YF |
| C1 | 10, 1.04, 0.94 | 10, 1.3, 1 | 10, 0.9, 0.7 | 20, 0.83, 0.9 |
| C2 | 5, 1, 0.1 | 5, 1, 0.1 | 5, 1, 1 | 10, 0.62, 0.9 |
| C3 | 2, 1, 0.8 | 2, 1, 0.8 | 2, 1, 0.8 | 5, 0.5, 0.9 |
Figure 7 shows the first 20 singular values of the datasets. \(V_{2D}\) has a dominant singular value, followed by a sharp decay. The most gradual decay of singular values is for the YF dataset, where the first two modes capture only 30% of the energy. For CYF, the first two singular values are very close. For all datasets, but especially for CYF, there is a clustering of the singular values beyond the first 10 values. Note that the singular values of \(V_{2D}\) are plotted on a log scale, while the others are on a linear scale.
Fig. 7.
We found that the error in approximating \(A_k\) is well below the theoretical error bound for the respective algorithms in all cases. Figure 8 shows the percentage error in the singular values.3 It is seen that the error is less than 5% for CYF and YF. The two dominant singular values are captured accurately in CYF even when the error parameter values are close to one. The error is slightly larger for \(V_{2D}\) and Faces dataset, but is still less than 20%, except for CTSVD (C3) for the \(V_{2D}\) dataset.
Fig. 8.
For many applications in fluid flow and in gene expression modeling, it is the POD modes rather than the singular values that are important. Figure 9 contains a few modes from the \(V_{2D}\) and Faces datasets. It is seen that there are visually apparent distortions in the approximated modes, even though the corresponding singular values match well. A measure of the accuracy of the POD modes is the cosine similarity between the approximate mode (\(\tilde{\mathbf {u}}_i\)) and the one obtained using a truncated SVD of the entire dataset (\(\mathbf {u}_i\)). Figure 10 shows the angle between the two. We refer to these angles as “mode angles”. It can be seen from Figure 10 that the modes are approximated very poorly. For reference, the corresponding mode angle is indicated in Figure 9, clearly indicating that a large mode angle implies potentially larger distortions.
Fig. 9.
Fig. 10.
The mode angles are known to be sensitive to the clustering of singular values. This is the reason why the first two mode angles are large for CYF (\(\sigma _1\sim \sigma _2\)). In the limiting case when the singular values are identical, it is only the eigenspace that matters and not the eigenvectors. Also, for applications such as facial recognition, it is the space spanned by the modes rather than the modes themselves that are of interest. For this reason, we also measure the accuracy of the subspaces spanned by k singular vectors rather than the accuracy of each mode. A measure of this accuracy is the principal or canonical angles between the subspaces, whose cosines are the singular values of \(\tilde{U}_k^TU_k\) [Björck and Golub 1973]. Figure 11 shows the k principal angles, \(\phi _i\), between the two k-dimensional subspaces. Although the accuracy of the subspace is better than the modes themselves, in some cases, there is significant error even for \(c\approx n\). The first two principal angles for CYF match well, indicating that the subspace is captured more accurately than the modes themselves (due to almost identical singular values).
Fig. 11.
References
Dimitris Achlioptas and Frank McSherry. 2007. Fast computation of low-rank matrix approximations. Journal of the ACM 54, 2 (April 2007), 9–es. DOI:
Zheng-Jian Bai, Raymond H. Chan, and Franklin T. Luk. 2005. Principal component analysis for distributed data sets with updating. In Advanced Parallel Processing Technologies. Jiannong Cao, Wolfgang Nejdl, and Ming Xu (Eds.), Springer, Berlin, 471–483.
C. G. Baker, K. A. Gallivan, and P. Van Dooren. 2012. Low-rank incremental methods for computing dominant singular subspaces. Linear Algebra and its Applications 436, 8 (2012), 2866–2888. DOI:
Ake Björck and Gene H. Golub. 1973. Numerical methods for computing angles between linear subspaces. Mathematics of Computation 27, 123 (1973), 579–594.
Aritra Bose, Vassilis Kalantzis, Eugenia-Maria Kontopoulou, Mai Elkady, Peristera Paschou, and Petros Drineas. 2019. TeraPCA: A fast and scalable software package to study genetic variation in tera-scale genotypes. Bioinformatics 35, 19 (04 2019), 3679–3683. DOI:
Matthew Brand. 2002. Incremental singular value decomposition of uncertain data with missing values. In Proceedings of the European Conference on Computer Vision. Springer, 707–720.
Michael B. Cohen, Yin Tat Lee, Cameron Musco, Christopher Musco, Richard Peng, and Aaron Sidford. 2015. Uniform sampling for matrix approximation. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science (ITCS’15). ACM, New York, NY, 181–190. DOI:
Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. 2006. Matrix approximation and projective clustering via volume sampling. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithm.
A. Deshpande and S. Vempala. 2006. Adaptive sampling and fast low-rank matrix approximation. In Proceedings of the 9th International Conference on Approximation Algorithms for Combinatorial Optimization Problems, and 10th International Conference on Randomization and Computation (APPROX’06/RANDOM’06). Springer-Verlag, Berlin, 292–303. DOI:
Petros Drineas, Eleni Drinea, and Patrick S. Huggins. 2003. An experimental evaluation of a monte-carlo algorithm for singular value decomposition. In Advances in Informatics, Yannis Manolopoulos, Skevos Evripidou, and Antonis C. Kakas (Eds.). Springer, Berlin, 279–296.
P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. 2004. Clustering large graphs via the singular value decomposition. Machine Learning 56, 1–3 (June 2004), 9–33. DOI:
Petros Drineas, Ravi Kannan, and Michael W. Mahoney. 2006. Fast monte carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing 36, 1 (2006), 158–183.
Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. 2008. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications 30, 2 (Sept. 2008), 844–881. DOI:
N. B. Erichson, S. L. Brunton, and J. N. Kutz. 2017. Compressed singular value decomposition for image and video processing. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW). 1880–1888.
N. Benjamin Erichson, Sergey Voronin, Steven L. Brunton, and J. Nathan Kutz. 2019. Randomized Matrix Decompositions Using R. Journal of Statistical Software 89, 11 (2019). DOI:
Alan Frieze, Ravi Kannan, and Santosh Vempala. 2004. Fast monte-carlo algorithms for finding low-rank approximations. Journal of the ACM 51, 6 (Nov. 2004), 1025–1041. DOI:
A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. 2001. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 6 (2001), 643–660.
Ming Gu and Stanley C. Eisenstat. 1993. A stable and fast algorithm for updating the singular value decomposition. Technical Report. YALEU/DCS/RR-966. Dept. of Computer Science, Yale University.
Nathan Halko, Per-Gunnar Martinsson, Yoel Shkolnisky, and Mark Tygert. 2011b. An algorithm for the principal component analysis of large data sets. SIAM Journal on Scientific Computing 33, 5 (Oct. 2011), 2580–2594. DOI:
N. Halko, P. G. Martinsson, and J. A. Tropp. 2011a. Finding structure with randomness:Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review 53, 2 (2011), 217–288.
M. A. Iwen and B. W. Ong. 2016. A distributed and incremental SVD algorithm for agglomerative data analysis on large networks. SIAM Journal on Matrix Analysis and Applications 37, 4 (2016), 1699–1718.
K. C. Lee, J. Ho, and D. Kriegman. 2005. Acquiring linear subspaces for face recognition under variable lighting. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 5 (2005), 684–698.
Huamin Li, George C. Linderman, Arthur Szlam, Kelly P. Stanton, Yuval Kluger, and Mark Tygert. 2017. Algorithm 971: An implementation of a randomized algorithm for principal component analysis. ACM Transactions on Mathematical Software 43, 3 (Jan. 2017), 14 pages. DOI:
Mu Li, Gary L. Miller, and Richard Peng. 2013. Iterative row sampling. In Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS’13). IEEE Computer Society, 127–136. DOI:
Ren-Cang Li. 1996. Relative Perturbation Theory: (II) Eigenspace and Singular Subspace Variations. Technical Report Lapack working note #85. Oak Ridge National Laboratories.
Yingyu Liang, Maria-Florina F. Balcan, Vandana Kanchanapally, and David Woodruff. 2014. Improved distributed principal component analysis. In Advances in Neural Information Processing Systems 27. Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), Curran Associates, Inc., 3113–3121. Retrieved from http://papers.nips.cc/paper/5619-improved-distributed-principal-component-analysis.pdf.
Aditya Krishna Menon and Charles Elkan. 2011. Fast algorithms for approximating the singular value decomposition. ACM Transactions on Knowledge Discovery from Data 5, 2 (Feb. 2011), 36 pages. DOI:
Nam H. Nguyen, Thong T. Do, and Trac D. Tran. 2009. A fast and efficient algorithm for low-rank approximation of a matrix. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC’09). ACM, New York, NY, 215–224. DOI:
Yongming Qu, George Ostrouchov, Nagiza Samatova, and Al Geist. 2002. Principal component analysis for dimension reduction in massive distributed data sets. In Proceedings of IEEE International Conference on Data Mining (ICDM), Vol. 1318. 1788.
V. Rokhlin, A. Szlam, and M. Tygert. 2010. A randomized algorithm for principal component analysis. SIAM Journal on Matrix Analysis and Applications 31, 3 (2010), 1100–1124. DOI:
T. Sarlos. 2006. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06). 143–152. DOI:
J. Sun, Yinglian Xie, Hui Zhang, and Christos Faloutsos. 2007. Less is more: Compact matrix decomposition for large sparse graphs. In Proceedings of the 2007 SIAM International Conference on Data Mining.DOI:
Vinita Vasudevan and M. Ramakrishna. 2017. A hierarchical singular value decomposition algorithm for low rank matrices. (October 2017). arXiv:1710.02812. Retrieved from https://arxiv.org/abs/1710.02812.
I. Yamazaki, S. Tomov, and J. Dongarra. 2017. Sampling algorithms to update truncated SVD. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data). 817–826. DOI: