
DPDR: A Novel Machine Learning Method for the Decision Process for Dimensionality Reduction

Original Research · SN Computer Science

Abstract

This paper addresses the critical decision of whether to select or extract features in a supervised learning context. Choosing a suitable dimensionality reduction method is often confusing: feature selection and feature extraction each have pros and cons depending on the nature of the data and the user's preferences. In particular, the user may want to prioritize data integrity, interpretability, or a specific data resolution. This paper proposes a new method, DPDR, for choosing the best dimensionality reduction approach in a supervised learning context. It also helps to drop or reconstruct features until a target resolution is reached; this target can be user-defined or determined automatically by the method. The method applies a regression or a classification, evaluates the results, and produces a diagnosis of the best dimensionality reduction process for the given supervised learning context. The main algorithms used are the random forest algorithm, principal component analysis, and the multilayer perceptron neural network. Six use cases are presented, each based on a well-known technique for generating synthetic data. The paper also discusses each choice that can be made in the process, aiming to clarify the entire decision process of selecting or extracting features.
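The diagnosis described above can be illustrated with a minimal sketch. This is not the authors' published code (which is available only on request); it is a hypothetical reconstruction, assuming scikit-learn implementations of the three named algorithms: rank features with a random forest (selection), project with PCA (extraction), fix a target resolution k, and let a multilayer perceptron's cross-validated score decide between the two reduced representations.

```python
# Hypothetical DPDR-style diagnosis sketch (not the authors' code):
# compare feature selection (random-forest importances) against feature
# extraction (PCA) at a target resolution k, scoring each with an MLP.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Synthetic data, in the spirit of the paper's use cases.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
k = 5  # target resolution (retained dimensions); user-defined here

# Feature selection: keep the k features the random forest ranks highest.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top_k = np.argsort(rf.feature_importances_)[::-1][:k]
X_sel = X[:, top_k]

# Feature extraction: project onto the first k principal components.
X_ext = PCA(n_components=k).fit_transform(X)

# Evaluate both reduced datasets with the same MLP and diagnose.
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
score_sel = cross_val_score(mlp, X_sel, y, cv=5).mean()
score_ext = cross_val_score(mlp, X_ext, y, cv=5).mean()
diagnosis = "feature selection" if score_sel >= score_ext else "feature extraction"
print(f"selection={score_sel:.3f}  extraction={score_ext:.3f}  -> {diagnosis}")
```

In the paper's framing, selection would favor interpretability (original features survive) while extraction favors integrity of the variance structure; the score comparison above is only one ingredient of that trade-off.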




Availability of Data and Materials

We used only datasets that are publicly available.

Code Availability

The code has not been published yet; it can be provided on request.


Acknowledgements

This work has been supported by the “Cellule d’expertise en robotique et intelligence artificielle” of the Cégep de Trois-Rivières and the Natural Sciences and Engineering Research Council.

Funding

This work has been supported by the Natural Sciences and Engineering Research Council.

Author information


Contributions

JSD: Conceptualization, Methodology, Software, Writing—Original Draft. DM: Conceptualization, Methodology, Validation, Resources, Writing—Review and Editing, Supervision, Project administration, Funding acquisition.

Corresponding author

Correspondence to Jean-Sébastien Dessureault.

Ethics declarations

Conflict of interest

The authors confirm there are no conflicts of interest.

Ethical approval

The work uses publicly available and non-identifiable information. No ethical approval was needed.

Consent to participate

Not applicable; no human participants were involved in this study.

Consent for publication

Not applicable; all datasets used in this study were released by third parties.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article


Cite this article

Dessureault, J.-S., Massicotte, D. DPDR: A Novel Machine Learning Method for the Decision Process for Dimensionality Reduction. SN COMPUT. SCI. 5, 124 (2024). https://doi.org/10.1007/s42979-023-02394-9
