Improved Distributed Principal Component Analysis

Balcan, Maria-Florina; Kanchanapally, Vandana; Liang, Yingyu; Woodruff, David

arXiv:1408.5823 (cs)

[Submitted on 25 Aug 2014 (v1), last revised 23 Dec 2014 (this version, v5)]

Abstract: We study the distributed computing setting in which there are multiple servers, each holding a set of points, that wish to compute functions on the union of their point sets. A key task in this setting is Principal Component Analysis (PCA), in which the servers would like to compute a low-dimensional subspace capturing as much of the variance of the union of their point sets as possible. Given a procedure for approximate PCA, one can use it to approximately solve $\ell_2$-error fitting problems such as $k$-means clustering and subspace clustering. The essential properties of an approximate distributed PCA algorithm are its communication cost and computational efficiency for a given desired accuracy in downstream applications. We give new algorithms and analyses for distributed PCA which lead to improved communication and computational costs for $k$-means clustering and related problems. Our empirical study on real-world data shows a speedup of orders of magnitude, preserving communication with only a negligible degradation in solution quality. Some of the techniques we develop, such as a general transformation from a constant success probability subspace embedding to a high success probability subspace embedding with a dimension and sparsity independent of the success probability, may be of independent interest.
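To make the setting in the abstract concrete, here is a minimal sketch of a simple baseline for distributed PCA: each server sends a small SVD-based summary of its points, and a coordinator runs PCA on the stacked summaries. This is an illustration under stated assumptions, not the improved algorithm of the paper. The function names (`local_summary`, `distributed_pca`, `projection_cost`) and the parameter `t` (number of singular directions each server sends) are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def local_summary(P, t):
    """Server side: compress local points P (n_i x d) to at most t rows.

    Keeps the top-t right singular vectors of P, each scaled by its
    singular value, so the summary costs O(t * d) numbers to send.
    """
    _, s, Vt = np.linalg.svd(P, full_matrices=False)
    t = min(t, len(s))
    return s[:t, None] * Vt[:t]

def distributed_pca(local_datasets, k, t):
    """Coordinator side: stack the summaries and take the top-k directions.

    Returns a d x k matrix with orthonormal columns spanning the
    approximate principal subspace of the union of all point sets.
    """
    stacked = np.vstack([local_summary(P, t) for P in local_datasets])
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    return Vt[:k].T

def projection_cost(P, V):
    """Squared Frobenius error of projecting the rows of P onto span(V)."""
    return np.linalg.norm(P - P @ V @ V.T, "fro") ** 2

# Hypothetical example: 4 servers, points near a shared 3-dimensional subspace.
rng = np.random.default_rng(0)
d, k = 20, 3
basis = rng.normal(size=(k, d))
servers = [rng.normal(size=(200, k)) @ basis + 0.01 * rng.normal(size=(200, d))
           for _ in range(4)]

V = distributed_pca(servers, k=k, t=5)
union = np.vstack(servers)
print("distributed cost:", projection_cost(union, V))
```

In this sketch the communication per server is governed by `t`: a larger `t` sends more numbers but tracks the centralized solution more closely, and with `t` equal to the full dimension the stacked summaries have the same Gram matrix as the union of the data.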

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:1408.5823 [cs.LG]
	(or arXiv:1408.5823v5 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.1408.5823

Submission history

From: Yingyu Liang
[v1] Mon, 25 Aug 2014 16:24:43 UTC (292 KB)
[v2] Mon, 1 Sep 2014 01:32:30 UTC (292 KB)
[v3] Thu, 13 Nov 2014 14:51:43 UTC (312 KB)
[v4] Sun, 21 Dec 2014 01:13:00 UTC (346 KB)
[v5] Tue, 23 Dec 2014 03:17:19 UTC (346 KB)
