Principal Component Analysis: A Natural Approach to Data Exploration

Published: 24 May 2021

Abstract

Principal component analysis (PCA) is often applied for analyzing data in the most diverse areas. This work reports, in an accessible and integrated manner, several theoretical and practical aspects of PCA. The basic principles underlying PCA, data standardization, possible visualizations of the PCA results, and outlier detection are subsequently addressed. Next, the potential of using PCA for dimensionality reduction is illustrated on several real-world datasets. Finally, we summarize PCA-related approaches and other dimensionality reduction techniques. All in all, the objective of this work is to assist researchers from the most diverse areas in using and interpreting PCA.
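As a rough illustration of the standardization and dimensionality-reduction steps mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It standardizes a small data table, builds the covariance matrix, and extracts the first principal component by power iteration. The toy data and all function names (standardize, covariance, leading_component) are hypothetical and chosen only for this example; a numerical library would normally be used instead of pure Python.

```python
import math
import random

# Hypothetical toy data: 5 samples, 2 features on very different scales.
data = [
    [1.0, 110.0],
    [2.0, 190.0],
    [3.0, 310.0],
    [4.0, 390.0],
    [5.0, 510.0],
]


def standardize(rows):
    """Center each column and scale it to unit sample standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(d)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of data whose columns are already centered."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def matvec(m, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def leading_component(cov, iterations=200, seed=0):
    """Approximate the leading eigenvector and eigenvalue by power iteration."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in cov]  # random start vector
    for _ in range(iterations):
        w = matvec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(cov, v)))
    return v, eigenvalue


z = standardize(data)
cov = covariance(z)
component, eigenvalue = leading_component(cov)

# Fraction of total variance captured by the first component.
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# One-dimensional representation: projection of each sample on the component.
scores = [sum(a * b for a, b in zip(row, component)) for row in z]

print("first component:", component)
print("explained variance ratio:", explained)
print("scores:", scores)
```

Without the standardization step, the second feature would dominate the covariance matrix simply because of its larger units, which is one reason standardization is usually discussed alongside PCA.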

Supplementary Material

a70-gewers-suppl.pdf (gewers.zip)
Supplemental movie, appendix, image, and software files for Principal Component Analysis: A Natural Approach to Data Exploration


Published In

ACM Computing Surveys, Volume 54, Issue 4
May 2022
782 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/3464463

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 May 2021
Accepted: 01 January 2021
Revised: 01 September 2020
Received: 01 August 2018
Published in CSUR Volume 54, Issue 4

Author Tags

  1. Statistical methods
  2. covariance and correlation
  3. data visualization
  4. dimensionality reduction
  5. principal component analysis

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Fundação de amparo à pesquisa do Estado de São Paulo
