
Predicting Salaries with Random-Forest Regression

Machine Learning and Data Analytics for Solving Business Problems

Part of the book series: Unsupervised and Semi-Supervised Learning (UNSESUL)

Abstract

For companies, it is essential to know the market salaries of their current and prospective employees. Predicting such salaries is challenging, as many factors need to be considered and large real datasets for learning are scarce. For this reason, research on salary prediction is comparatively rare and limited. In this study, we investigate whether and how an advanced machine-learning approach, namely ensembles of random-forest regression models, can achieve high-quality salary predictions. We use a large real dataset of more than three million employees and more than 300 professions. Our approach learns a separate random-forest regression model for each profession to predict salaries. In our evaluation, this approach achieves a mean absolute percentage error (MAPE) of 17.1% and performs better than related machine-learning work on salary prediction. We identify three key factors behind these results: reducing the number of possible values of categorical variables, training separate models, and outlier handling.
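
The idea of one model per profession, scored by MAPE, can be illustrated with a minimal sketch. This is not the authors' implementation: the use of scikit-learn and pandas here, the column names (profession, age, weekly_hours, salary), the hyperparameters, and the synthetic data are all assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; assumes no zero targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)


def fit_per_profession(df, features, target="salary", group="profession"):
    """Train one random forest per profession; return models and held-out MAPE."""
    models, scores = {}, {}
    for name, part in df.groupby(group):
        X_train, X_test, y_train, y_test = train_test_split(
            part[features], part[target], test_size=0.25, random_state=0
        )
        # Hyperparameters are illustrative, not taken from the chapter.
        model = RandomForestRegressor(n_estimators=50, random_state=0)
        model.fit(X_train, y_train)
        models[name] = model
        scores[name] = mape(y_test, model.predict(X_test))
    return models, scores


# Synthetic stand-in data (hypothetical features, not the real dataset).
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "profession": rng.choice(["nurse", "welder"], size=n),
    "age": rng.integers(20, 65, size=n),
    "weekly_hours": rng.integers(20, 45, size=n),
})
base = toy["profession"].map({"nurse": 3200.0, "welder": 2900.0})
noise = rng.normal(0, 150, size=n)
toy["salary"] = base + 15.0 * toy["age"] + 20.0 * toy["weekly_hours"] + noise

models, scores = fit_per_profession(toy, ["age", "weekly_hours"])
for name, score in scores.items():
    print(f"{name}: MAPE = {score:.1f}%")
```

Keeping the models in a dictionary keyed by profession means a prediction only consults the model trained on that profession's records, which is the separation the abstract names as one of its key factors.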

Notes

  1. While random forests are our main machine-learning technique for salary predictions, regression trees can be used in our approach as an alternative. They are less complex but perform worse than random forests. We compare the prediction performance of random forests and regression trees in detail in Sect. 1.4.2.

  2. We use such a strategy for the less predictive features, company industry and federal state, as described in Sect. 1.3.1.

Acknowledgements

We thank Professor Dr. Sven Overhage for his ongoing support when conducting this research.

Author information

Corresponding author

Correspondence to Frank Eichinger.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Eichinger, F., Mayer, M. (2022). Predicting Salaries with Random-Forest Regression. In: Alyoubi, B., Ben Ncir, CE., Alharbi, I., Jarboui, A. (eds) Machine Learning and Data Analytics for Solving Business Problems. Unsupervised and Semi-Supervised Learning. Springer, Cham. https://doi.org/10.1007/978-3-031-18483-3_1
