Predicting Salaries with Random-Forest Regression

Eichinger, Frank; Mayer, Moritz

doi:10.1007/978-3-031-18483-3_1

Part of the book series: Unsupervised and Semi-Supervised Learning (UNSESUL)

Abstract

For companies, it is essential to know the market price for the salaries of their current and prospective employees. Predicting such salaries is challenging because many factors need to be considered and large real-world datasets for learning are scarce. For this reason, research on salary prediction is comparatively rare and limited. In this study, we investigate whether and how an advanced machine-learning approach, namely ensembles of random-forest regression models, can achieve high-quality salary predictions. We use a large real-world dataset covering more than three million employees and more than 300 professions. For each profession, our approach learns a separate random-forest regression model to predict salaries. In our evaluation, this approach achieves a mean absolute percentage error (MAPE) of 17.1% and performs better than related machine-learning work on salary prediction. We identify three key factors behind these results: reducing the number of possible values of categorical variables, training separate models, and handling outliers.
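
To make the error metric concrete: MAPE averages the absolute prediction error relative to the true salary, expressed in percent. The following is a minimal Python sketch of that calculation, both overall and broken down by profession. The function names, the row layout, and the sample numbers are illustrative assumptions and are not taken from the chapter; the authors' actual evaluation code is not part of this preview.

```python
from collections import defaultdict


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes every actual value is non-zero (true for salaries).
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Each term is the absolute error relative to the true salary.
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


def mape_by_profession(rows):
    """Compute MAPE separately for each profession.

    rows: iterable of (profession, actual_salary, predicted_salary).
    This row layout is a hypothetical choice for illustration.
    """
    grouped = defaultdict(lambda: ([], []))
    for profession, actual, predicted in rows:
        grouped[profession][0].append(actual)
        grouped[profession][1].append(predicted)
    return {prof: mape(a, p) for prof, (a, p) in grouped.items()}


# Made-up numbers, purely for illustration (not from the chapter's dataset).
rows = [
    ("nurse", 3000.0, 3300.0),
    ("nurse", 4000.0, 3600.0),
    ("welder", 2500.0, 2500.0),
]
per_profession = mape_by_profession(rows)
overall = mape([r[1] for r in rows], [r[2] for r in rows])
```

Because each error is divided by the true salary, a fixed absolute error weighs more heavily on low salaries than on high ones, which is worth keeping in mind when comparing MAPE values across professions.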

Notes

1. While random forests are our main machine-learning technique for salary predictions, regression trees can be used in our approach as an alternative. They are less complex and perform worse than random forests. We compare the prediction performances of random forests versus regression trees in Sect. 1.4.2 in detail.
2. We use such a strategy for the less predictive features company industry and federal state as described in Sect. 1.3.1.

Acknowledgements

We thank Professor Dr. Sven Overhage for his ongoing support when conducting this research.

Author information

Authors and Affiliations

DATEV eG, Nuremberg, Germany
Frank Eichinger
DATEV eG, Nuremberg, Germany
Moritz Mayer
University of Bamberg, Bamberg, Germany
Moritz Mayer

Corresponding author

Correspondence to Frank Eichinger.

Editor information

Editors and Affiliations

University of Jeddah, Jeddah, Saudi Arabia
Bader Alyoubi
University of Jeddah, Jeddah, Saudi Arabia
Chiheb-Eddine Ben Ncir
University of Jeddah, Jeddah, Saudi Arabia
Ibraheem Alharbi
University of Sfax, Sfax, Tunisia
Anis Jarboui

About this chapter

Cite this chapter

Eichinger, F., Mayer, M. (2022). Predicting Salaries with Random-Forest Regression. In: Alyoubi, B., Ben Ncir, CE., Alharbi, I., Jarboui, A. (eds) Machine Learning and Data Analytics for Solving Business Problems. Unsupervised and Semi-Supervised Learning. Springer, Cham. https://doi.org/10.1007/978-3-031-18483-3_1

DOI: https://doi.org/10.1007/978-3-031-18483-3_1
Published: 23 September 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18482-6
Online ISBN: 978-3-031-18483-3
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
