Machine Learning Models For Salary Prediction Dataset Using Python

This document discusses using machine learning models for salary prediction. It explores using linear regression, random forest, and neural networks on a dataset of over 20,000 salaries in the US. The neural network model achieved the highest accuracy at 83.2% while linear regression had the fastest training time of 0.363 seconds. Keywords included linear regression, machine learning, neural networks, random forest, salary prediction, and supervised learning.

2022 International Conference on Electrical and Computing Technologies and Applications (ICECTA)

Machine Learning Models for Salary Prediction Dataset using Python

Reham Kablaoui
Computer Engineering Department
Kuwait University
Khaldiya, Kuwait
reham.kablaoui@ku.edu.kw

Ayed Salman
Computer Engineering Department
Kuwait University
Khaldiya, Kuwait
ayed.salman@ku.edu.kw

2022 International Conference on Electrical and Computing Technologies and Applications (ICECTA) | 978-1-6654-5600-5/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICECTA57148.2022.9990316

Abstract: In today’s world, salary is the primary source of motivation for many regular employees, which makes salary prediction very important for both employers and employees. It helps employers and employees to estimate the expected salary. Fortunately, technological advancements like Data Science and Machine Learning (ML) have made salary prediction more realistic. In this paper, we exploit the benefits of data science to collect a dataset of more than 20,000 salary records in the USA. We then apply three supervised ML techniques to the obtained dataset to produce salary predictions. The learning models are logistic regression, random forest, and neural networks. The output of the three models is analyzed and compared to show the following: the neural network outperforms the other ML models in accuracy, with an accuracy level of 83.2%, and logistic regression has the fastest training time, at 0.363s.

Keywords: Linear Regression, Machine Learning, Neural Networks, Random Forest, Salary Prediction, Supervised Learning.

I. INTRODUCTION

For many people, the most common reason for resignation is salary; higher salaries motivate employees to stay longer in a company, and low or unraised ones encourage employees to move to a different company [1]. Usually, specific human traits, educational background, and work experience strongly affect one's salary. Salary prediction is needed to make one aware of their estimated salary. At the same time, it helps a company recognize what an employee is expecting from them [1]. Fortunately, the fast-emerging topics of Data Science and Machine Learning have allowed us to find large salary datasets and apply prediction techniques to them.

In many disciplines, data science has increased the discovery of probabilistic outcomes. Nowadays, data science is an emerging discipline that uses statistical methods and computer science knowledge to make meaningful predictions and insights in multiple conventional scholarly fields [2]. At the same time, another rapidly emerging technology is machine learning, which brings many disciplines and mechanisms that allow computers to learn without being explicitly programmed. Machine learning consists of two main concepts: supervised learning and unsupervised learning. In supervised learning, an algorithm learns a function from inputs to given outputs; that is, the data is labeled. In unsupervised learning, data is unlabeled; the algorithm works in such a way as to make the model learn the features on its own [3].

A simple supervised machine learning technique is logistic regression; it predicts the probability of an event occurring. It takes a set of independent inputs and predicts a categorical dependent value [4]. The dependent variable is a binary variable that contains data coded as 0 or 1.

Another simple supervised machine learning algorithm is the Random Forest (RF); it builds decision trees from multiple data samples. Then, it uses their majority vote for classification and their average for regression [5]. Random forest is more accurate than traditional decision trees as it prevents overfitting. Also, it can handle missing values effectively while producing predictions without hyperparameter tuning.

A major ML technique is the Neural Network (NN); it is a supervised machine learning algorithm widely used in many applications. It is a massively parallel distributed processor made up of simple processing units with a natural predisposition for learning to store and create experimental knowledge readily available for usage [6]. In brief, an NN is a network of multiple processors connected to mimic human behaviors. An NN works by building three types of layers: the input layer, the hidden layer, and the output layer.

Other supervised machine learning techniques are support vector machine (SVM), K-nearest neighbors (KNN), Naïve Bayes, and nearest centroid. SVM and KNN are ML algorithms that solve classification and regression problems. An SVM transforms a low-dimensional input space into a higher-dimensional space by means of a kernel [7]. KNN uses proximity to classify or anticipate how a set of individual data points will be arranged [8]. Naïve Bayes is a quick and simple machine learning approach for predicting the class of a dataset [9]. Lastly, the nearest centroid works by assigning each data point the label of the class whose centroid is closest [10].

In this paper, we build different supervised machine learning models on a large dataset to make salary predictions. The paper proposes the implementation of three ML algorithms: logistic regression, random forest, and neural networks. We implement our models using Python and run our algorithms on Google Colab. Then, a complete classification report and the accuracy of each of the models are reported and compared. The main aim of this paper is to exploit the use of data science and machine learning techniques to predict salaries and then provide feedback on the most suitable ML model for this type of data.

The flow of this paper is as follows: section II includes a complete literature review, and section III has the implementation of the three models. Then, section IV shows the results, and section V concludes this paper.

II. LITERATURE REVIEW

Many researchers are interested in salary prediction, so different researchers have applied different supervised ML techniques to it. In this section, we compare multiple ML models used for salary prediction by conducting a complete literature review. This helps us know the currently available research and its possible limitations. ML-supervised learning techniques work best for classified data, which is when input data maps to output data. Salary prediction datasets have been run through many supervised learning algorithms; these include regression, random forest, support vector machine, K-nearest neighbors, Naïve Bayes, and nearest centroid.

Salary prediction models that use linear regression are implemented by Lothe et al. [1] on a small dataset and based only on four factors, and by Mukherjee & Satyasaivani [11] based only on employee years of experience. The accuracy of both models is between 96% and 98%. Similarly, a regression ML model was implemented in [12], with salary predictions based only on the employee's job position. At the same time, regression and SVM models were applied in [13] to a huge dataset for salary prediction. The results show an accuracy of 79% for regression and 83% for SVM [13].

Furthermore, the regression, RF, SVM, KNN, and nearest centroid techniques have been used to predict salary based on different companies’ traits and positions for a huge dataset in the UK. The accuracy of the models is as follows: the regression model is between 73% and 74%, RF is 81%, SVM is 60%, KNN is 81%, and the nearest centroid is 65% [14]. Likewise, Martin et al. [15] apply linear regression, RF, SVM, and KNN models to a dataset including 4000 job posts in Spain, and the models have an accuracy of 58% for linear regression, 84% for RF, 66% for SVM, and 79% for KNN [15]. Another paper used KNN and Naïve Bayes to predict salary, with an accuracy of 75.9% for KNN and 93.3% for Naïve Bayes [16].

From this review, we can say that models with high accuracy either use a small dataset or make predictions based on a small number of parameters. The model’s accuracy rate is therefore higher because the dataset is simple. Nevertheless, for realistic salary prediction this will be inaccurate, as many factors affect one’s salary. So, the dataset should include as many parameters as possible.

III. IMPLEMENTATION

To implement our ML models, we first find the datasets, then apply preprocessing techniques, prepare the training and testing data, and start modeling the data.

A. Dataset

Our dataset comes from two sources: online research and an online survey. Firstly, from research, we found a dataset of 20,000+ records on Kaggle on the topic of salary prediction in the USA. Then, we conducted an online survey including questions similar to the ones in the dataset, and we distributed this survey to around 100 people in Kuwait with different backgrounds. We combined all the data into one CSV file to work on it.

The dataset focuses on predicting whether one’s salary is above 50,000 dollars per year by looking at specific traits. The traits are age, work sector, education, marital status, occupation, gender, hours per week, and country. The dataset also has a column for the salary; this shows whether the salary is less than or equal to 50,000 dollars or greater than 50,000 dollars. The dataset uses a cut-off of 50,000 dollars because it is close to the average household income per year [17].

The dataset mentioned above is tested over three ML algorithms to predict and check the accuracy of different ML models. The models are logistic regression, random forest, and neural network models.

B. Data preprocessing

The dataset obtained is in CSV format, which helped us to read and clean the data file easily. To begin with, we first had to drop all rows containing NaN values, as such values could affect our model prediction by having missing information. Then, we started applying preprocessing techniques.

Our model's learning ability is influenced by the quality of the data and the meaningful information the model can extract from it. For this reason, preprocessing is applied. We apply three preprocessing techniques: normalization, one hot encoding, and binary encoding. Each technique is needed to preprocess a different type of data included in our dataset.

1) Normalization: This is used for numeric data. In our dataset, this is the age and the number of working hours. In machine learning, normalization transforms the numeric data into a value within the range of 0 to 1. Normalization is needed to ensure that all numeric data is in the same format when used for training the model. The normalization formula is shown in equation (1), where $x$ is the original value and $x_{min}$ and $x_{max}$ are the minimum and maximum values of that feature.

$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} \tag{1}$$

2) One hot encoding: This is used for categorical data, which is string data that may be one of more than two options. In our dataset, this is the work class, education, marital status, occupation, and country. One hot encoding is a mechanism used to format the data in such a way as to ease the training process. It produces a binary vector for each categorical variable, with a value of 1 for each row having that option and 0 otherwise. We use the Pandas library to convert our columns into one hot encoding.

3) Binary encoding: This is used for Boolean data or data that can be one of two options. In our dataset, this is the gender and salary. Binary encoding converts the data in such a way that the new column is a statement; it has a value of 1 if the statement is true and 0 otherwise. We use Python functions to convert our columns into binary encoding.

C. Preparing the data

After cleaning the data, dropping unwanted rows, and applying the preprocessing algorithms, our dataset is ready to be used for our proposed learning models. We first use the train-test split mechanism to prepare the data for training and testing.

The train-test split is a mechanism used to evaluate the performance of learning algorithms. It splits the data into two sets: a train set and a test set. The training set contains inputs with their outputs; this is needed for the model to learn and to generalize the concept learned to other data. The testing set is a subset of the data that allows the model to try to predict the output from the given input after being trained on the training set.

The Scikit-Learn (Sklearn) library in Python can perform the train-test split. The train-test split function from the Sklearn library is used in our code to achieve the split. The test size given is 0.2; this means 20% of the dataset is used for testing. The random state integer is also specified as a SEED of 0 to initialize the random number generator to 0.

D. Data modeling

After preparing the data for learning, we designed our proposed models: logistic regression, random forest, and neural network.

1) Logistic regression model: This is implemented using the Sklearn library, as it already has a built-in logistic regression model. The model's implementation is simple, so we only need to call three functions: a logistic regression function, a "fit" function for training (takes the input and output train sets as parameters), and a "predict" function for predictions (takes the input test set as a parameter). The logistic regression function is given stochastic average gradient descent (sag) as a solver; it is a variation of gradient descent and incremental aggregated gradient methods. SAG uses a random sample of previous gradient values. After model prediction, we print the model's accuracy and other characteristics provided in the next section.

2) Random forest model: This is implemented using the Sklearn library, as it already has a built-in random forest model. The model's implementation is simple, so we only need three functions: a random forest classifier, a "fit" function for training (takes the input and output train sets as parameters), and a "predict" function for predictions (takes the input test set as a parameter). We set the number of trees in the random forest classifier to 500. After model prediction, we print the model's accuracy and other characteristics provided in the next section.

3) Neural network model: This is implemented using Tensorflow and Keras; they are libraries in Python for machine learning. To develop a deep learning or neural network model, we use the following functions from Tensorflow and Keras: sequential function, dense function, dropout function, compile function, fit function, and predict function.

The sequential function creates an empty linear stack of layers; this initiates our model. Then, we call multiple dense layers to fill our model. The dense function creates a fully connected layer of nodes. We provide it with the number of units, the activation function, and an input shape (for input layers only) based on the type of the layer. The activation function determines how input transforms into output. All layers implemented in our code are as follows:

• Input layer is given 20 units for higher accuracy, the activation function is the Rectified Linear Unit (relu), and the input shape is the reshaped input of the dataset.
• Hidden layers are all also given 20 units, and the activation function is relu. Our model consists of three hidden layers as the data is quite large.
• Output layer only contains 1 unit, and the activation function is sigmoid because our data is divided into binary and multilabel classifications.

Then, the dropout function drops some neurons from the input or hidden layer. It helps to avoid overfitting. The dropout function takes a rate of 0.3; this is the fraction of the input units to be dropped.

Then, we compile the neural network model based on a loss function, optimizer, and metrics using the compile function. The loss function finds the error in the learning process. We set the loss function to binary cross entropy, which is used for binary classification. The optimizer optimizes the input weights by comparing the loss function and the prediction; we assign it to Adam with a learning rate of 0.001. Adam is a stochastic gradient descent method using adaptive estimation of first and second-order moments. Last, the metrics evaluate the overall performance of the model. We set the metrics to accuracy.

Then, a fit function trains the model for a fixed number of passes given by the epochs. The fit function takes the input and output train data, epochs, batch size, and validation data. The number of epochs is 10; this is the number of iterations over the entire train dataset. The batch size is 10; this is the number of samples per gradient update. The validation data is the test sample data.

Last, a predict function predicts the output of the input test dataset (takes the input test set as a parameter). After model prediction, we print the model's accuracy and other characteristics provided in the next section.

IV. RESULTS

The three models are implemented in Google Colab using Python code with the needed libraries: Scikit-Learn, Keras, and TensorFlow. All models are used to train on and predict the data. In all of our models, we use the predict function to predict the output of a sample testing input. The result of the predict function and the test sample output are used to compute the accuracy, confusion matrix, and a complete classification report of a model.

The classification report function provides a report of the trained model that includes the precision, recall, f1-score, and support of the predicted output. Furthermore, the report provides the average, macro avg, and weighted avg of those metrics.

The confusion matrix function summarizes the model's overall performance on the test data. The output of this function is a 2x2 matrix, shown in Table I. It shows the actual result compared to the predicted output.

TABLE I. OUTPUT OF CONFUSION MATRIX.

N = total predictions    Actual: No        Actual: Yes
Predicted: No            True Negative     False Negative
Predicted: Yes           False Positive    True Positive

The accuracy score function provides the accuracy of the trained model's predictions. Furthermore, we calculate the training time of each model by using the time library in Python.

A. Logistic regression results

In the logistic regression model, the classification report is as shown in Fig. 1. In this model, the confusion matrix is [[3838 311] [626 746]], and the accuracy score achieved is around 83.0%. The time taken to train this model is around 0.363s.
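As a quick consistency check, the reported logistic regression accuracy can be recomputed from the confusion matrix quoted above. The snippet below is an illustrative sketch, not the authors' code: the helper name `accuracy_from_confusion` is hypothetical, and it relies only on the fact that correct predictions sit on the diagonal of the matrix, so the result does not depend on whether rows hold actual or predicted classes.

```python
def accuracy_from_confusion(matrix):
    """Return accuracy = correct predictions (diagonal) / all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Confusion matrix reported for the logistic regression model.
logistic_cm = [[3838, 311], [626, 746]]

accuracy = accuracy_from_confusion(logistic_cm)
print(f"Test samples: {sum(map(sum, logistic_cm))}")
print(f"Accuracy: {accuracy:.1%}")  # (3838 + 746) / 5521, about 83.0%
```

The matrix sums to 5,521 test samples, and the diagonal gives an accuracy that rounds to the 83.0% stated in the text.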


Fig. 1. Logistic regression model classification report.

B. Random forest results

In the random forest model, the classification report is as shown in Fig. 2. In this model, the confusion matrix is [[3693 456] [607 765]], and the accuracy score achieved is around 80.7%. The time taken to train this model is around 8.489s.

Fig. 2. Random forest model classification report.

C. Neural network results

In the neural network model, the classification report is as shown in Fig. 3. In this model, the confusion matrix is [[3730 419] [508 864]], and the accuracy score achieved is around 83.2%. The time taken to train this model is around 82.79s.

Fig. 3. Neural network model classification report.

D. Comparing results of the models

In terms of accuracy, from the three classification reports, confusion matrices, and accuracy scores, we can see that the neural network model has the best accuracy level, precision, recall, and f1-score. Then comes the logistic regression, and last is the random forest. In terms of time, the neural network is the slowest, then comes the random forest, and the fastest is the logistic regression. Table II summarizes the accuracy level and training time of the three trained models.

TABLE II. COMPARISON OF MODELS RESULTS.

Performance Metrics    Logistic regression    Random forest    Neural network
Accuracy               83.0%                  80.7%            83.2%
Time                   0.363s                 8.489s           82.79s

V. CONCLUSION

In conclusion, we have tested and compared three types of supervised machine learning models: logistic regression, random forest, and neural networks. The models are tested on a salary prediction dataset to see how one’s personal traits and educational background affect their salary. Our dataset’s output is a binary output of 1 if the salary is above 50,000 dollars per year or 0 otherwise. From the obtained results, we can say that, on such a dataset, the neural network model is the most accurate with 83.2% accuracy, but it is the slowest as it needs 82.79s to train the model. Random forest is the least accurate with 80.7% accuracy, and its training time is considered low as it takes 8.489s to train the model. Logistic regression has an accuracy level between the other two models, at 83.0%, but it is the fastest as it only needs 0.363s to train the model. Therefore, we can conclude that a neural network is best when accuracy is the main factor, and random forest or logistic regression is better when time is the main factor. The three models are supervised machine learning techniques used to train computers to predict the output of a given set of inputs; this makes predictions easier for any application.

The main limitation of this paper is that the obtained result is for only one type of dataset, so results on a dataset with fewer categories may differ slightly from those reported here. However, on a large dataset similar to the one used in this paper, the ML models’ accuracy and time are expected to remain close to those reported. In the future, we aim to combine different ML models to see how this could affect the overall accuracy of the ML algorithms.

VI. ACKNOWLEDGMENT

We wish to acknowledge the generous financial support from the Kuwait Foundation for the Advancement of Sciences (KFAS) to present this paper at the conference under the Research Capacity Building/Scientific Missions program.

REFERENCES

[1] M. D. Lothe, P. Tiwari, N. Patil, S. Patil, and V. Patil, “Salary Prediction using Machine Learning,” International Journal of Advance Scientific Research and Engineering Trends, vol. 6, issue 5, 2021, pp. 199-202.
[2] L. M. Brodie, “What Is Data Science?” In book: Applied Data Science, 2019, pp. 101-130.
[3] R. Bansal, J. Singh, and R. Kaur, “Machine learning and its applications: A Review,” JASC: Journal of Applied Science and Computations, 2020, pp. 1076-5131.
[4] H. A. Park, “An Introduction to Logistic Regression: From Basic Concepts to Interpretation with Particular Attention to Nursing Domain,” J Korean Acad Nurs, vol. 43, issue 2, 2013, pp. 154-164.
[5] M. Azhari, A. Alaoui, Z. Acharoui, B. Ettaki, and J. Zerouaoui, “Adaptation of the Random Forest Method: Solving the problem of Pulsar Search,” SCA '19: Proceedings of the 4th International Conference on Smart City Applications, 2019, pp. 1-6.
[6] A. Sharkawy, “Principle of Neural Network and Its Main Types: Review,” Journal of Advances in Applied & Computational Mathematics, vol. 7, issue 1, 2020, pp. 8-19.
[7] D. Srivastava and L. Bhambhu, “Data classification using support vector machine,” Journal of Theoretical and Applied Information Technology, vol. 12, issue 1, 2010, pp. 1-7.
[8] Z. Zhang, “Introduction to machine learning: K-nearest neighbors,” Annals of Translational Medicine, vol. 4, issue 11, 2016, pp. 218-218.
[9] T. N. Viet and L. M. Hoang, “The Naïve Bayes algorithm for learning data analytics,” Indian Journal of Computer Science and Engineering, vol. 12, issue 4, 2021, pp. 1038-1043.


[10] S. Johri, S. Debnath, A. Mocherla, A. Singh, A. Prakash, J. Kim, and I. Kerenidis, “Nearest Centroid Classification on a Trapped Ion Quantum Computer,” npj Quantum Information, vol. 7, issue 1, 2021.
[11] T. Mukherjee and B. Satyasaivani, “Employee’s Salary Prediction,” International Journal of Advance Research, Ideas and Innovations in Technology, vol. 8, issue 3, 2022, pp. 356-359.
[12] S. Das, R. Barik, and A. Mukherjee, “Salary Prediction Using Regression Techniques,” SSRN Electronic Journal, 2020.
[13] R. Voleti and B. Jana, “Predictive Analysis of HR Salary using Machine Learning Techniques,” International Journal of Engineering Research & Technology (IJERT), vol. 10, issue 1, 2022, pp. 34-37.
[14] L. Li, X. Liu, and Y. Zhou, “Prediction of Salary in UK,” Computer Science and Engineering department of UC San Diego, 2018.
[15] I. Martin, A. Mariello, R. Battiti, and J. A. Hernández, “Salary Prediction in the IT Job Market with Few High-Dimensional Samples: A Spanish Case Study,” International Journal of Computational Intelligence Systems, vol. 11, 2018, pp. 1192-1209.
[16] K. Gopal, A. Singh, H. Kumar, and S. Sagar, “Salary Prediction Using Machine Learning,” International Journal of Innovative Research in Technology (IJIRT), vol. 8, issue 1, 2021, pp. 380-383.
[17] U.S. household income distribution 2021. Percentage distribution of household income in the United States in 2021 (in U.S. dollars)* [Graph]. In Statistics.


Authorized licensed use limited to: Universidad Tecnica Federico Santa Maria. Downloaded on May 09,2023 at 12:25:21 UTC from IEEE Xplore. Restrictions apply.
