Abstract
Companies need to know the market price of the salaries of their current and prospective employees. Predicting such salaries is challenging because many factors must be considered and large real-world datasets for learning are scarce. For this reason, research on salary prediction is comparatively rare and limited. In this study, we investigate whether and how an advanced machine-learning approach, namely ensembles of random-forest regression models, can achieve high-quality salary predictions. We use a large real-world dataset of more than three million employees and more than 300 professions. For each profession, our approach learns a separate random-forest regression model to predict salaries. In our evaluation, the approach achieves a mean absolute percentage error (MAPE) of 17.1% and performs better than related machine-learning work on salary prediction. We identify three key factors behind these results: reducing the number of possible values of categorical variables, training separate models, and outlier handling.
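To make the evaluation metric and two of the key factors concrete, the following is a minimal sketch and not the authors' implementation. It computes MAPE, groups rare categorical values into a single bucket, and fits one model per profession. As a loudly labeled simplification, each per-profession "model" is just the median salary, standing in for a random-forest regressor. All function names, the `OTHER` bucket, the `min_count` threshold, and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import median


def mape(actual, predicted):
    """Mean absolute percentage error in percent (actual values must be non-zero)."""
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def group_rare_values(values, min_count, other="OTHER"):
    """Reduce cardinality: replace values seen fewer than min_count times."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


def fit_per_profession(records):
    """Fit one model per profession.

    Assumption: the 'model' is the median salary of the profession,
    a stand-in for the per-profession random-forest regressor.
    """
    salaries = defaultdict(list)
    for rec in records:
        salaries[rec["profession"]].append(rec["salary"])
    return {prof: median(vals) for prof, vals in salaries.items()}


# Hypothetical toy data, not taken from the study.
train = [
    {"profession": "nurse", "salary": 3000.0},
    {"profession": "nurse", "salary": 3200.0},
    {"profession": "nurse", "salary": 3400.0},
    {"profession": "engineer", "salary": 5000.0},
    {"profession": "engineer", "salary": 6000.0},
]
held_out = [
    {"profession": "nurse", "salary": 4000.0},
    {"profession": "engineer", "salary": 5000.0},
]

# Train separate models, then predict with the model of each record's profession.
models = fit_per_profession(train)
predictions = [models[rec["profession"]] for rec in held_out]
error = mape([rec["salary"] for rec in held_out], predictions)

# Group industries that occur fewer than twice into one bucket.
industries = ["retail", "retail", "retail", "mining", "fishing"]
grouped = group_rare_values(industries, min_count=2)

print(f"MAPE: {error:.1f}%")
print(grouped)
```

In a fuller version, the median would be replaced by a regressor trained on the features of each profession's records, while the evaluation and grouping steps would stay the same.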
Notes
1. While random forests are our main machine-learning technique for salary predictions, regression trees can be used in our approach as an alternative. They are less complex and perform worse than random forests. We compare the prediction performances of random forests versus regression trees in Sect. 1.4.2 in detail.
2. We use such a strategy for the less predictive features company industry and federal state as described in Sect. 1.3.1.
Acknowledgements
We thank Professor Dr. Sven Overhage for his ongoing support when conducting this research.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Eichinger, F., Mayer, M. (2022). Predicting Salaries with Random-Forest Regression. In: Alyoubi, B., Ben Ncir, CE., Alharbi, I., Jarboui, A. (eds) Machine Learning and Data Analytics for Solving Business Problems. Unsupervised and Semi-Supervised Learning. Springer, Cham. https://doi.org/10.1007/978-3-031-18483-3_1
DOI: https://doi.org/10.1007/978-3-031-18483-3_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18482-6
Online ISBN: 978-3-031-18483-3
eBook Packages: Mathematics and Statistics (R0)