
Faster dimension reduction

Published: 01 February 2010

Abstract

Data represented geometrically in high-dimensional vector spaces can be found in many applications. Images and videos are often represented by assigning a dimension to every pixel (and time). Text documents may be represented in a vector space where each word in the dictionary incurs a dimension. The need to manipulate such data in huge corpora such as the web, and to support various query types, gives rise to the question of how to represent the data in a lower-dimensional space to allow more space- and time-efficient computation. Linear mappings are an attractive approach to this problem because the mapped input can be readily fed into popular algorithms that operate on linear spaces (such as principal-component analysis, PCA) while avoiding the curse of dimensionality.
The fact that such mappings even exist became known in computer science following seminal work by Johnson and Lindenstrauss in the early 1980s. The underlying technique is often called "random projection." The complexity of the mapping itself, essentially the product of a vector with a dense matrix, did not attract much attention until recently. In 2006, we discovered a way to "sparsify" the matrix via a computational version of Heisenberg's Uncertainty Principle. This led to a significant speedup, which also retained the practical simplicity of the standard Johnson-Lindenstrauss projection. We describe the improvement in this article, together with some of its applications.
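
To make the matrix-vector picture above concrete, the sketch below implements the classical dense Johnson-Lindenstrauss projection in Python/NumPy. It is only an illustration of the mapping whose cost is discussed above; it is not the sparsified fast transform developed in this article, and the target dimension k in the usage example is an arbitrary illustrative choice.

    import numpy as np

    def random_projection(X, k, seed=None):
        """Map the rows of X (an n x d array) to k dimensions with a dense
        Gaussian matrix; pairwise distances are preserved up to a small
        distortion with high probability (Johnson-Lindenstrauss)."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        # i.i.d. N(0, 1/k) entries, so squared norms are preserved in expectation.
        R = rng.normal(loc=0.0, scale=1.0 / np.sqrt(k), size=(d, k))
        # The dense matrix-vector products cost O(n * d * k) in total --
        # this is the step that sparsification aims to speed up.
        return X @ R

    # Hypothetical usage: 1,000 points in 10,000 dimensions reduced to 200.
    X = np.random.default_rng(0).normal(size=(1000, 10000))
    Y = random_projection(X, k=200, seed=1)

For n input points this dense mapping takes O(ndk) arithmetic operations; the speedup described in the article comes from replacing the dense random matrix with a sparsified, structured one.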

References

[1]
Achlioptas, D. Database-friendly random projections: Johnson--Lindenstrauss with binary coins. J. Comput. Syst. Sci. 66, 4 (2003), 671--687.
[2]
Ailon, N., Chazelle, B. Approximate nearest neighbors and the fast Johnson--Lindenstrauss transform. SIAM J. Comput. 39, 1 (2009), 302--322.
[3]
Ailon, N., Liberty, E. Fast dimension reduction using Rademacher series on dual BCH codes. Discrete Comput. Geometry (2008).
[4]
Ailon, N., Liberty, E., Singer, A. Dense fast random projections and lean Walsh transforms. In Proceedings of APPROX-RANDOM (2008), 512--522.
[5]
Alon, N. Problems and results in extremal combinatorics-I. Discrete Math. 273, 1--3 (2003), 31--53.
[6]
Alon, N., Spencer, J. The Probabilistic Method. John Wiley, 2nd edition, 2000.
[7]
Arora, S. Polynomial time approximation schemes for Euclidean traveling salesman and other geometric problems. J. ACM 45, 5 (1998), 753--782.
[8]
Arya, S., Mount, D.M. Approximate nearest neighbor queries in fixed dimensions. In Proceedings of the 4th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (Austin, TX, United States, 1993), 271--280.
[9]
Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R., Wu, A.Y. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. J. ACM 45, 6 (1998), 891--923.
[10]
Bern, M.W. Approximate closest-point queries in high dimensions. Inf. Process. Lett. 45, 2 (1993), 95--99.
[11]
Bingham, E., Mannila, H. Random projection in dimensionality reduction: Applications to image and text data. In Knowledge Discovery and Data Mining, 2001, 245--250.
[12]
Borodin, A., Ostrovsky, R., Rabani, Y. Lower bounds for high dimensional nearest neighbor search and related problems. In Proceedings of the 31st Annual Symposium on the Theory of Computing (STOC) (1999), 312--321.
[13]
Chan, T.M. Approximate nearest neighbor queries revisited. Discrete Comput. Geometry 20, 3 (1998), 359--373.
[14]
Christofides, N. Worst-case analysis of a new heuristic for the travelling salesman problem. Technical Report 388, Graduate School of Industrial Administration, Carnegie-Mellon University, Pittsburgh, 1976.
[15]
Clarkson, K.L. An algorithm for approximate closest-point queries. In Proceedings of the 10th Annual ACM Symposium on Computational Geometry (SoCG) (1994), 160--164.
[16]
Clarkson, K.L. Nearest neighbor queries in metric spaces. Discrete Comput. Geometry 22, 1 (1999), 63--93.
[17]
Dasgupta, S., Gupta, A. An elementary proof of the Johnson--Lindenstrauss lemma. Technical Report, UC Berkeley, 1999, 99--106.
[18]
Drineas, P., Mahoney, M.W., Muthukrishnan, S. Sampling algorithms for ℓ2 regression and applications. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (Miami, FL, United States, 2006).
[19]
Farach-Colton, M., Indyk, P. Approximate nearest neighbor algorithms for Hausdorff metrics via embeddings. In Proceedings of the 40th Annual IEEE Symposium on Foundations of Computer Science (FOCS) (1999), 171--180.
[20]
Frankl, P., Maehara, H. The Johnson--Lindenstrauss lemma and the sphericity of some graphs. J. Comb. Theory Ser. A 44 (1987), 355--362.
[21]
Indyk, P. On approximate nearest neighbors in non-Euclidean spaces. In Proceedings of the 39th Annual IEEE Symposium on Foundations of Computer Science (FOCS) (1998), 148--155.
[22]
Indyk, P. Dimensionality reduction techniques for proximity problems. In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2000), 371--378.
[23]
Indyk, P. Nearest neighbors in high-dimensional spaces. In Handbook of Discrete and Computational Geometry. CRC, 2004.
[24]
Indyk, P., Motwani, R. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC) (1998), 604--613.
[25]
Johnson, W.B., Lindenstrauss, J. Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26 (1984), 189--206.
[26]
Kleinberg, J.M. Two algorithms for nearest-neighbor search in high dimensions. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing (STOC) (1997), 599--608.
[27]
Kushilevitz, E., Ostrovsky, R., Rabani, Y. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM J. Comput. 30, 2 (2000), 457--474.
[28]
Liberty, E., Woolfe, F., Martinsson, P.-G., Rokhlin, V., Tygert, M. Randomized algorithms for the low-rank approximation of matrices. Proc. Natl. Acad. Sci. (PNAS) 104, 51 (2007), 20167--20172.
[29]
Linial, N., London, E., Rabinovich, Y. The geometry of graphs and some of its algorithmic applications. Combinatorica 15, 2 (1995), 215--245.
[30]
Mitchell, J.S.B. Guillotine subdivisions approximate polygonal subdivisions: A simple new method for the geometric k-MST problem. SIAM J. Comput. 28, 4 (1999), 1298--1309.
[31]
Morgenstern, J. Note on a lower bound on the linear complexity of the fast Fourier transform. J. ACM 20, 2 (1973), 305--306.
[32]
Morgenstern, J. The linear complexity of computation. J. ACM 22, 2 (1975), 184--194.
[33]
Muthukrishnan, S., Sahinalp, S.C. Simple and practical sequence nearest neighbors with block operations. In Proceedings of the 13th Annual Symposium on Combinatorial Pattern Matching (CPM) (2002), 262--278.
[34]
Papadimitriou, C., Raghavan, P., Tamaki, H., Vempala, S. Latent semantic indexing: A probabilistic analysis. In Proceedings of the 17th ACM Symposium on Principles of Database Systems (PODS) (1998), 159--168.
[35]
Rokhlin, V., Tygert, M. A fast randomized algorithm for overdetermined linear least-squares regression. Proc. Natl. Acad. Sci. (PNAS) 105, 36 (2008), 13212--13217.
[36]
Sarlós, T. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS) (Berkeley, CA, 2006).
[37]
Schulman, L. Clustering for edge-cost minimization. In Proceedings of the 32nd Annual Symposium on Theory of Computing (STOC) (2000), 547--555.
[38]
Woolfe, F., Liberty, E., Rokhlin, V., Tygert, M. A fast randomized algorithm for the approximation of matrices. Appl. Comput. Harmon. Anal. 25 (2008), 335--366.
[39]
Yianilos, P.N. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the 4th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (1993), 311--321.
[40]
Yianilos, P.N. Locally lifting the curse of dimensionality for nearest neighbor search (extended abstract). In Proceedings of the 11th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (2000), 361--370.


Published In

Communications of the ACM, Volume 53, Issue 2
February 2010
147 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/1646353
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States



