Modeling

4. Model Building
The modelling approach, and the steps taken to improve model performance, were guided by the
considerations set out below.
4.1. Modelling Approach
4.1.1. Regression Modelling:

- Linear Regression, Decision Trees, and Random Forests were chosen among others. This
selection is motivated by the project's focus on predicting salary values (Expected CTC), which
is a continuous numerical variable.
4.1.2. Machine Learning Algorithms:

- Gradient Boosting, Support Vector Regression, and Neural Networks were also considered.
These algorithms are capable of capturing non-linear relationships and complexities present in
salary prediction, making them suitable for the task.
4.2. Why This Approach?

- Complexity of Data: The project involves multiple variables with potential non-linear
relationships. Regression models and machine learning algorithms are well-suited to handle
such complexity.
- Interpretability: There's a need for a balance between model accuracy and interpretability.
Some models, like Linear Regression, offer easier interpretability, allowing insights into the
factors influencing salary predictions.
- Scalability: The chosen approaches can efficiently handle large datasets, ensuring scalability
as the dataset size grows.
- Feature Importance: The models can provide insights into the key predictors affecting
salary, helping Delta Limited make informed decisions.
- Previous Success: The selected models have demonstrated effectiveness in similar projects,
providing confidence in their suitability for this salary prediction task.
4.3. Model Selection Considerations
4.3.1. Cross-Validation:
- Cross-validation was employed to ensure the models' robustness and generalization. This
technique helps in estimating how well the models will perform on unseen data.
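
The idea behind cross-validation can be illustrated with a minimal pure-Python sketch. The report does not state the number of folds or the error metric, so the five folds, the mean absolute error, and the helper names `k_fold_indices`, `cross_val_score`, `fit`, and `predict` are all assumptions made for illustration.

```python
import random

def k_fold_indices(n_rows, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Spread rows evenly: the first n_rows % k folds receive one extra row.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_val_score(rows, targets, fit, predict, k=5):
    """Return the mean absolute error on the held-out part of each fold.

    fit(rows, targets) and predict(model, row) are hypothetical callables
    standing in for whichever regression model is being evaluated.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train_idx], [targets[i] for i in train_idx])
        errors = [abs(predict(model, rows[i]) - targets[i]) for i in test_idx]
        scores.append(sum(errors) / len(errors))
    return scores
```

Each row is used for testing exactly once, so the spread of the per-fold errors gives a rough indication of how stable a model is on unseen data.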
4.3.2. Hyperparameter Tuning:

- Hyperparameter tuning was performed to optimize model performance. Adjusting
hyperparameters helps fine-tune the models for better predictive accuracy.
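
One common way to tune hyperparameters is an exhaustive grid search. The sketch below is an assumption about how such a search could look, not a description of the procedure used in this project; the function `grid_search`, the scoring callable `score_fn`, and the grid values are hypothetical.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return (best_params, best_score).

    score_fn(params) is assumed to return a validation error, so lower is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid for a tree ensemble; the values are illustrative only.
example_grid = {"n_trees": [50, 100, 200], "max_depth": [3, 5, 8]}
```

In practice `score_fn` would train a model with the given settings and return its cross-validated error, so that the selected combination is not tuned to a single train/test split.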
4.3.3. Ensemble Methods:
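- Ensemble methods, such as Random Forests and Gradient Boosting, were considered because they
combine many decision trees for improved accuracy. Random Forests average trees trained on
bootstrap samples, while Gradient Boosting adds trees sequentially, each correcting the errors
of the previous ones.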
4.3.4. Bias-Variance Trade-off:

- Controlling overfitting and underfitting was a critical consideration. Ensuring the right
balance between bias and variance is essential for accurate predictions.
4.4. Challenges
4.4.1. Data Quality:

- Mitigating data-related issues, including missing values and outliers, was a challenge. Data
pre-processing steps were essential to address these challenges.
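
The report does not specify which replacement values or outlier rules were applied. As a minimal sketch, assuming median imputation and clipping at 1.5 times the interquartile range (both common defaults, not confirmed choices), the pre-processing could look like this:

```python
from statistics import median, quantiles

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def clip_outliers_iqr(values, factor=1.5):
    """Clip values outside [Q1 - factor * IQR, Q3 + factor * IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [min(max(v, low), high) for v in values]
```

Clipping keeps every row in the dataset while limiting the influence of extreme salaries; dropping the affected rows is an alternative when the extreme values are believed to be data-entry errors.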
4.4.2. Interpretability:
- Ensuring that the models provide meaningful insights into salary determinants was a priority.
Linear Regression, in particular, supports interpretability.
4.4.3. Resource Intensiveness:

- Handling the computational demands of certain machine learning algorithms, such as Neural
Networks, required careful resource management and optimization.
4.5. KNIME Models Used for Salary Prediction
The KNIME workflow incorporates several nodes for data pre-processing and modelling:
- Excel Reader: This node imports and prepares data from Excel files.
- Statistics Node: Conducts Exploratory Data Analysis (EDA) on numerical variables.
- Rank Correlation Node: Computes correlation coefficients to quantify relationships
between variables.
- Missing Value Node: Handles missing data by replacing missing entries with appropriate values.
- Normalizer Node: Ensures consistency in scale within numeric columns.
- Partitioning Node: Divides data into training and testing sets for model evaluation.
- Model Building Nodes: A combination of 'Learner', 'Predictor', and 'Scorer' nodes is used to
build and evaluate regression models, including Linear Regression, Decision Trees, Gradient
Boosted Trees, and Random Forests.
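
For readers unfamiliar with KNIME, the Normalizer and Partitioning steps listed above can be approximated in a few lines of Python. This is a sketch under assumptions: the report does not say which normalization method or split ratio was configured, so min-max scaling and an 80/20 split are used here purely for illustration, and the function names are hypothetical.

```python
import random

def min_max_normalize(column):
    """Rescale a numeric column to the range [0, 1]."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]  # constant column: no spread to rescale
    return [(v - low) / (high - low) for v in column]

def partition(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

A fixed seed makes the split reproducible, which matters when several models are compared on the same test set.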
The holistic approach outlined in this project is designed to address various challenges,
optimize model performance, and provide actionable insights into salary prediction for Delta
Limited.
Figure 5. KNIME models used for salary prediction
4.6. The Regression Model:
According to this model, utilizing the provided coefficients:
Table 3. Regression Modelling Results
Expected CTC = 108,722.149
    - 6,910.94 * Total Experience
    - 21,621.65 * Number of Companies
    + 7,771.29 * Number of Publications
    - 21,605.84 * Certifications
    + 1.3 * Current CTC
Total Experience, Current CTC, Number of Companies, Number of Publications, and
Certifications are statistically significant predictors of Expected CTC: each has a p-value
below 0.05.
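
To show how the fitted equation would be applied, the sketch below evaluates it for a single candidate. The coefficients are copied from the equation in this section; the dictionary keys, the function name `expected_ctc`, and the sample candidate are hypothetical, and the inputs are assumed to be in the same units as the data used to fit the model.

```python
# Coefficients copied from the fitted regression equation in Section 4.6.
INTERCEPT = 108_722.149
COEFFICIENTS = {
    "total_experience": -6_910.94,
    "number_of_companies": -21_621.65,
    "number_of_publications": 7_771.29,
    "certifications": -21_605.84,
    "current_ctc": 1.3,
}

def expected_ctc(candidate):
    """Apply the linear equation to one candidate (a dict of predictor values)."""
    return INTERCEPT + sum(
        coefficient * candidate[name] for name, coefficient in COEFFICIENTS.items()
    )

# Hypothetical candidate, for illustration only.
sample_candidate = {
    "total_experience": 5,
    "number_of_companies": 2,
    "number_of_publications": 1,
    "certifications": 0,
    "current_ctc": 1_000_000,
}
prediction = expected_ctc(sample_candidate)
```

Because the model is linear, each coefficient can be read as the change in Expected CTC for a one-unit increase in that predictor with the others held fixed.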
Table 4. Regression Statistics for Model Validity and Reliability