Explore how data science can be used to predict employee churn, allowing organizations to address retention issues proactively. This student presentation from the Boston Institute of Analytics showcases the methodology, insights, and implications of predicting employee turnover. Visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/ for more data science insights.
Predicting Employee Churn: A Data-Driven Approach Project Presentation
1. NAME: POOJA SHAH
Date of Assignment: 18/11/23
Date of Submission: 11/12/23
Project 2
Title: EMPLOYEE CHURN PREDICTION
2. Project Aim
To determine whether an employee will churn, as well as the loss incurred if they do.
Create a system to prevent such churn and support the sustainable operation of the company.
This capstone project aims to uncover the factors that lead to employee attrition and explore important questions by developing an employee churn prediction system.
3. Overview of Project
Predicting employee churn involves using machine learning models to forecast whether an employee is likely to leave a company in the near future. This is a crucial task for organizations, as it allows them to take preventive measures such as improving work conditions, offering incentives, or providing career development opportunities to retain valuable employees.
4. Project Contents-
• Problem Formulation
• Data collection
• Importing libraries, loading and
understanding the data
• Exploratory Data Analysis
• Data Preprocessing
• Data Visualization
• Graphs Analysis
• Checking imbalance in dataset
• Balancing the data using SMOTE
• Feature Scaling
• Feature Extraction using PCA
• Model building & Evaluation
• Logistic Regression
• KNN
• Decision Tree Classifier
• Random Forest
• AdaBoost
• Support Vector Classifier
• Comparing different models
• Conclusion
5. Importing libraries, loading and understanding the data
• We will be using the following libraries:
1) Pandas
2) NumPy
3) Seaborn
4) Matplotlib.pyplot
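As a minimal sketch, the imports and data load might look like the following (the CSV file name is an assumption; the slides do not show it):

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the HR dataset (file name assumed for illustration)
df = pd.read_csv("employee_churn.csv")
```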
8. Exploratory Data Analysis
• df.shape -
The shape attribute returns the overall number of rows and columns in the data.
9. Exploratory Data Analysis
• df.isnull() creates a DataFrame of the same shape as df, where each entry is True if the corresponding element in df is NaN (null), and False otherwise.
• .sum() then calculates the sum of True values along each column, resulting in a Series that contains the total number of missing values for each column.
• .to_frame() converts the Series into a DataFrame.
• .rename(columns={0: "Total No. of Missing Values"}) renames the column containing the totals to "Total No. of Missing Values".
• missing_data["% of Missing Values"] = df.isnull().mean()*100:
df.isnull().mean() calculates the proportion of missing values in each column by taking the mean of the Boolean values in the DataFrame; *100 converts the proportions into percentages. The result is assigned to a new column in the missing_data DataFrame called "% of Missing Values".
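Put together, the missing-value summary described above reads, as a sketch:

```python
# Overall size of the data
print(df.shape)

# Total and percentage of missing values per column
missing_data = df.isnull().sum().to_frame().rename(
    columns={0: "Total No. of Missing Values"}
)
missing_data["% of Missing Values"] = df.isnull().mean() * 100
print(missing_data)
```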
11. Exploratory Data Analysis
• df.duplicated()
This method returns a Boolean Series that marks rows which are duplicates of earlier rows.
• df.duplicated().mean()*100
Taking the mean of the Boolean Series and multiplying by 100 gives the percentage of duplicate rows.
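The duplicate check as a one-line sketch:

```python
# Percentage of duplicate rows in the data
print(df.duplicated().mean() * 100)
```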
12. Exploratory Data Analysis
• column_data_types = df.dtypes:
df.dtypes returns a Series containing the data type of each column in the DataFrame.
• Counting numerical and categorical columns:
A loop iterates through each column in the DataFrame and checks its data type.
• np.issubdtype(data_type, np.number) checks if the data type is a numerical type. If true, it increments numerical_count; otherwise, it increments categorical_count.
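A sketch of that loop, using the variable names from the slide:

```python
# Count numerical vs. categorical columns by dtype
column_data_types = df.dtypes
numerical_count, categorical_count = 0, 0
for data_type in column_data_types:
    if np.issubdtype(data_type, np.number):
        numerical_count += 1
    else:
        categorical_count += 1
print("Numerical:", numerical_count, "Categorical:", categorical_count)
```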
13. describe().T –
• It generates descriptive statistics of the DataFrame's numeric columns.
• .T is the transpose operation: it switches the rows and columns of the result obtained from describe().
• Count: the number of non-null values in each column.
• Mean: the average value of each column.
• Standard Deviation (std): indicates how much individual data points deviate from the mean.
• Minimum (min): the smallest value in each column.
• 25th Percentile (25%): also known as the first quartile; the value below which 25% of the data falls.
• Median (50%): also known as the second quartile; the middle value when the data is sorted. It represents the central tendency.
• 75th Percentile (75%): also known as the third quartile; the value below which 75% of the data falls.
• Maximum (max): the largest value in each column.
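As code, this is simply:

```python
# Descriptive statistics for the numeric columns, transposed for readability
print(df.describe().T)
```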
14. Pre-Processing
• df.rename(columns={"Attrition": "Employee_Churn"}, inplace=True)
This code uses the rename method in pandas to rename a column in the DataFrame.
• df.drop(columns=["Over18", "EmployeeCount", "EmployeeNumber", "StandardHours"], inplace=True)
After executing this code, the specified columns ("Over18", "EmployeeCount", "EmployeeNumber", and "StandardHours") will be removed from the DataFrame (df).
• df.columns
Returns the names of all columns.
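Collected into one runnable sketch:

```python
# Rename the target column and drop constant/identifier columns
df.rename(columns={"Attrition": "Employee_Churn"}, inplace=True)
df.drop(columns=["Over18", "EmployeeCount", "EmployeeNumber", "StandardHours"],
        inplace=True)
print(df.columns)
```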
15. Pre-Processing
We will see the names of the categorical and numerical columns in the DataFrame printed to the console. This information can be helpful for further analysis, preprocessing, or visualization tasks that may require handling different types of data separately.
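One common way to produce those lists is select_dtypes; this is an assumption, as the slide does not show the exact code:

```python
# Split column names by type (select_dtypes is assumed here)
categorical_cols = df.select_dtypes(include="object").columns.tolist()
numerical_cols = df.select_dtypes(include="number").columns.tolist()
print("Categorical columns:", categorical_cols)
print("Numerical columns:", numerical_cols)
```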
16. Pre-Processing
This code is a common approach for identifying and handling outliers in a dataset using the IQR method, and it also provides visualizations to assess the impact of the outlier handling process. It ensures that extreme outliers do not unduly affect the analysis of the data.
The result is a grid of boxplots, where each subplot corresponds to a numerical column in the DataFrame. This visualization is useful for understanding the distribution and variability of values in each numerical feature.
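A minimal sketch of IQR-based outlier handling with boxplots; capping values to the 1.5 x IQR fences is an assumption, since the slides do not show the exact rule:

```python
# Cap each numerical column to its IQR fences (capping rule assumed)
for col in numerical_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Grid of boxplots, one per numerical column
n_rows = (len(numerical_cols) + 2) // 3
fig, axes = plt.subplots(nrows=n_rows, ncols=3, figsize=(15, 4 * n_rows))
for ax, col in zip(axes.ravel(), numerical_cols):
    sns.boxplot(y=df[col], ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```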
20. VISUALISATION – UNIVARIATE ANALYSIS – Count plot & Pie chart subplot
• The result is a figure containing a count plot and a pie chart, both illustrating employee churn in terms of counts and percentages, respectively. The count plot shows the distribution of churn and non-churn instances, while the pie chart provides a visual representation of the churn rate as a percentage.
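A sketch of the two-panel figure, assuming the target column name from the earlier rename:

```python
# Count plot and pie chart of the churn distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
sns.countplot(x="Employee_Churn", data=df, ax=ax1)
ax1.set_title("Employee Churn Counts")
df["Employee_Churn"].value_counts().plot.pie(autopct="%1.1f%%", ax=ax2, ylabel="")
ax2.set_title("Employee Churn Rate (%)")
plt.show()
```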
22. VISUALISATION – BIVARIATE ANALYSIS – Count plot
• Bivariate analysis is a statistical analysis technique that involves the examination of the relationship between two variables. It is often used to understand how one variable affects or is related to another variable.
• We then create count plots for 2 categorical variables, as sketched below.
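A sketch of a bivariate count plot; the Department column is an assumption used only for illustration:

```python
# One categorical feature (assumed: Department) split by churn status
sns.countplot(x="Department", hue="Employee_Churn", data=df)
plt.title("Churn by Department")
plt.show()
```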
26. VISUALISATION – BIVARIATE ANALYSIS – Hist plot
• The provided code defines a function named hist_plot that creates a histogram with a kernel density estimate (KDE) for a specified column in a DataFrame (df).
• plt.show() is used to display all the created plots.
• Each histogram provides a visual representation of the distribution of the specified numerical columns, and the bars are colored based on whether an employee has churned or not (as indicated by the 'Employee_Churn' column). This allows a quick comparison of the distributions for employees who have churned versus those who haven't in terms of age, monthly income, and years at the company.
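A sketch of the hist_plot helper; the column names Age, MonthlyIncome, and YearsAtCompany follow the slide's examples and are assumptions:

```python
def hist_plot(column):
    """Histogram with KDE for one column, colored by churn status."""
    sns.histplot(data=df, x=column, kde=True, hue="Employee_Churn")
    plt.title(f"Distribution of {column} by Employee_Churn")
    plt.show()

for col in ["Age", "MonthlyIncome", "YearsAtCompany"]:  # assumed columns
    hist_plot(col)
```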
28. VISUALISATION – MULTIVARIATE ANALYSIS – Scatter plot
• Scatter plots are used to visualize the relationship between two continuous variables.
• Each data point is plotted on a graph, with one variable on the x-axis and the other on the y-axis.
• This helps you visualize patterns, trends, and potential correlations.
29. REPLACE
• df['Employee_Churn']:
This selects the 'Employee_Churn' column in the DataFrame df.
• .replace({'No': 0, 'Yes': 1}):
This method replaces values in the specified column according to the provided dictionary. In this case, it replaces 'No' with 0 and 'Yes' with 1.
30. LABEL ENCODER
• This code defines a function named labelencoder that uses scikit-learn's LabelEncoder to encode the categorical columns in a pandas DataFrame into numerical values.
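A sketch of both encoding steps; the body of labelencoder is reconstructed from the description above:

```python
from sklearn.preprocessing import LabelEncoder

# Map the binary target to 0/1
df["Employee_Churn"] = df["Employee_Churn"].replace({"No": 0, "Yes": 1})

def labelencoder(frame):
    """Label-encode every remaining object-typed column (reconstructed)."""
    le = LabelEncoder()
    for col in frame.select_dtypes(include="object").columns:
        frame[col] = le.fit_transform(frame[col])
    return frame

df = labelencoder(df)
```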
31. FEATURE SELECTION
This code is a useful way to visualize the pairwise correlations between features in your dataset. It helps identify relationships between variables and can be valuable for feature selection and understanding the underlying structure of your data.
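A sketch of that visualization, assuming a seaborn heatmap of the correlation matrix was used:

```python
# Heatmap of pairwise feature correlations
plt.figure(figsize=(14, 10))
sns.heatmap(df.corr(), cmap="coolwarm")
plt.title("Feature Correlation Matrix")
plt.show()
```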
32. Checking For Imbalance In Dataset
The code creates a pie chart to visually represent the imbalanced data, where the two slices represent the "Churn" and "Not Churn" classes with different explosion and colors to highlight the imbalance. The percentages of each class are displayed on the chart, and a legend is added for clarity.
33. Balancing The Data using SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) is applied to the training data to generate synthetic samples for the minority class (the class to over-sample is specified by the sampling_strategy parameter). This way, you can address class imbalance in your dataset and create a balanced training set for your machine learning models.
We split our data before using SMOTE.
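A sketch of the split followed by SMOTE (imblearn's SMOTE; the test size and random seeds are assumptions):

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split first so synthetic samples never leak into the test set
X = df.drop(columns=["Employee_Churn"])
y = df["Employee_Churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # split parameters assumed
)

# Over-sample only the training data
smote = SMOTE(sampling_strategy="minority", random_state=42)
x_sampled, y_sampled = smote.fit_resample(X_train, y_train)
```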
34. The bar plot provides a visual representation of the balanced (adjusted) distribution of classes in the target variable after SMOTE.
35. Standardization, a common form of feature scaling, is a preprocessing technique used in machine learning to bring all features or variables to a similar scale.
This process helps algorithms perform better by ensuring that no single feature dominates the learning process due to its larger magnitude.
Standardization is particularly important for algorithms that rely on distances or gradients, such as k-nearest neighbors.
The goal of standardization is to transform the features so that they have a mean of 0 and a standard deviation of 1.
This transformation does not change the shape of the distribution of the data; it simply scales and shifts the data to make it more suitable for modeling.
36. FEATURE SCALING
The purpose of standardization is to transform the features so that they have a mean of 0 and a standard deviation of 1. This is important, especially for algorithms that rely on distance measures, as it ensures that all features contribute equally to the computations.
In this case, the features in x_sampled are standardized using the StandardScaler, and the result is stored in the DataFrame standard_df. Each column in standard_df now represents a standardized version of the corresponding feature in the original dataset.
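A sketch of that step, reusing the names from the slide (x_sampled, standard_df):

```python
from sklearn.preprocessing import StandardScaler

# Standardize the over-sampled features to mean 0, std 1
scaler = StandardScaler()
standard_df = pd.DataFrame(
    scaler.fit_transform(x_sampled), columns=x_sampled.columns
)
```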
37. PCA – PRINCIPAL COMPONENT ANALYSIS
PCA stands for Principal Component Analysis. It is a dimensionality reduction technique commonly used in machine learning and statistics.
The main goal of PCA is to transform high-dimensional data into a new coordinate system, capturing the most important information while minimizing information loss.
PCA achieves this by finding a set of orthogonal axes (principal components) along which the data varies the most.
38. FEATURE EXTRACTION USING PCA
The standardized features in standard_df are then transformed with PCA, producing a reduced set of principal components that serves as the input for model building.
39. KEY STEPS IN PCA
• Standardization: Standardize the features (subtract the mean and divide by the standard deviation) to ensure that all features have a similar scale.
• Covariance Matrix: Compute the covariance matrix for the standardized data. The covariance matrix represents the relationships between pairs of features.
• Eigenvalue Decomposition: Perform eigenvalue decomposition on the covariance matrix. This yields a set of eigenvalues and corresponding eigenvectors.
• Principal Components: The eigenvectors represent the principal components. These are the directions in feature space along which the data varies the most. The corresponding eigenvalues indicate the amount of variance captured by each principal component.
• Projection: Project the original data onto the new coordinate system defined by the principal components. This results in a reduced-dimensional representation of the data.
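A sketch of the PCA step with scikit-learn; keeping enough components for 95% of the variance is an assumption, since the slides do not state the number retained:

```python
from sklearn.decomposition import PCA

# Project the standardized features onto the principal components
pca = PCA(n_components=0.95)  # retain 95% of the variance (assumed)
x_pca = pca.fit_transform(standard_df)
print("Components kept:", pca.n_components_)
print("Explained variance ratios:", pca.explained_variance_ratio_)
```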
41. TRAIN TEST SPLIT
By splitting your data into training and testing sets, you can use X_train and y_train to train your machine learning model and then use X_test to evaluate its performance. This is a common practice to assess how well your model generalizes to unseen data.
42. MODEL BUILDING, CLASSIFICATION REPORT & EVALUATION
• We will now build the following models (a combined sketch follows this list):
• Logistic Regression
• K-Nearest Neighbors
• Decision Tree Classifier
• Random Forest
• AdaBoost
• Support Vector Classifier
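A sketch that fits all six classifiers in one loop. The default hyperparameters are assumptions, as the slides do not list them, and X_train/y_train stand in for whichever feature set (per the deck, the PCA outputs) was actually used:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Hyperparameters are defaults and therefore assumptions
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "SVC": SVC(probability=True, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name}: accuracy = {accuracy_score(y_test, preds):.4f}")
```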
43. Classification Report
• A classification report is a summary of the performance metrics for a classification model.
• Precision: a measure of how many of the predicted positive instances were actually true positives.
Precision = (True Positives) / (True Positives + False Positives)
High precision indicates that the model makes fewer false positive errors.
• Recall (also known as Sensitivity or True Positive Rate): measures the proportion of actual positive instances that were correctly predicted by the model.
Recall = (True Positives) / (True Positives + False Negatives)
High recall indicates that the model captures a large portion of the positive instances.
• F1-Score: the harmonic mean of precision and recall. It balances the trade-off between precision and recall and is particularly useful when you want to consider both false positives and false negatives.
F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
The F1-Score ranges between 0 and 1, where a higher value indicates a better balance between precision and recall.
• Support: the number of instances of each class in the test dataset. It gives you an idea of the distribution of data across the different classes.
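scikit-learn computes all four metrics in one call, continuing the variables from the loop above:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, F1, and support for one fitted model
preds = models["SVC"].predict(X_test)
print(classification_report(y_test, preds))
```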
44. AUC-ROC CURVE
• The ROC curve helps you visualize the performance of a model in terms of its ability to discriminate between the positive and negative classes. The higher the AUC score, the better the model's performance.
Interpreting the AUC:
• 0.5 (Random Classifier): the model's performance is no better than random chance; it cannot distinguish between positive and negative cases effectively.
• < 0.5 (Worse than Random): the model's performance is worse than random chance; it is misclassifying cases in the opposite direction.
• > 0.5 (Better than Random): the model is performing better than random chance. The higher the AUC, the better the model is at discriminating between the classes.
• 1.0 (Perfect Classifier): an AUC of 1.0 represents a perfect classifier, achieving perfect discrimination by correctly classifying all positive cases while avoiding false positives.
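A sketch of the ROC curve and AUC for one model; the SVC is used here because the conclusion highlights it:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Positive-class probabilities, then the ROC curve and AUC
probs = models["SVC"].predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)

plt.plot(fpr, tpr, label=f"SVC (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```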
45. Logistic Regression – Modelling & Classification Report
• Logistic regression is a statistical and machine learning model used for binary classification, i.e., when the target variable (the variable you want to predict) has two possible outcomes or classes.
• Classification Report
             Class 0   Class 1
  Precision    0.79      0.83
  Recall       0.86      0.74
  F1 Score     0.83      0.78
47. K-Nearest Neighbour (KNN) – Modelling & Classification Report
• KNN operates on the principle that similar data points tend to have similar labels or values.
• It is a non-parametric algorithm, which means it makes no assumptions about the underlying data distribution.
• KNN considers all available training data when making predictions, which can be advantageous in some cases but might be computationally expensive for large datasets.
• Classification Report
             Class 0   Class 1
  Precision    0.94      0.83
  Recall       0.83      0.94
  F1 Score     0.88      0.88
49. Decision Tree – Modelling & Classification Report
• A Decision Tree is a popular supervised ML algorithm used for both classification and regression tasks. It is a non-parametric, non-linear model that makes predictions by recursively partitioning the dataset into subsets based on the most significant attribute(s) at each node.
• Classification Report
             Class 0   Class 1
  Precision    0.78      0.73
  Recall       0.74      0.76
  F1 Score     0.76      0.74
51. Random Forest – Modelling & Classification Report
• Random Forest is an ensemble machine learning algorithm that is widely used for both classification and regression tasks. It is a powerful and versatile algorithm known for its high accuracy and robustness. Random Forest builds multiple decision trees during training and combines their predictions to produce more reliable and generalizable results.
• Classification Report
             Class 0   Class 1
  Precision    0.82      0.91
  Recall       0.93      0.78
  F1 Score     0.87      0.84
54. AdaBoost – Modelling & Classification Report
• AdaBoost, short for Adaptive Boosting, is an ensemble learning method used for classification and regression tasks. It is particularly effective at improving the performance of weak learners (models that perform only slightly better than random chance). The basic idea behind AdaBoost is to combine multiple weak learners into a strong classifier.
• Classification Report
             Class 0   Class 1
  Precision    0.79      0.80
  Recall       0.83      0.76
  F1 Score     0.81      0.78
56. Support Vector Classifier – Modelling & Classification Report
• SVMs are adaptable and efficient in a variety of applications because they can handle high-dimensional data and nonlinear relationships.
• The SVM algorithm can ignore outliers and find the hyperplane that maximizes the margin, which makes SVM robust to outliers.
• Classification Report
             Class 0   Class 1
  Precision    0.85      0.90
  Recall       0.92      0.82
  F1 Score     0.89      0.85
58. COMPARING CLASSIFICATION REPORT & AUC SCORE OF VARIOUS MODELS
• A dictionary is created to compare the classification report and AUC score of the different models.
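A sketch of that comparison dictionary, reusing the fitted models from the earlier loop:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Collect accuracy and AUC for every model in one dictionary
comparison = {}
for name, model in models.items():
    preds = model.predict(X_test)
    probs = model.predict_proba(X_test)[:, 1]
    comparison[name] = {
        "Accuracy": accuracy_score(y_test, preds),
        "AUC Score": roc_auc_score(y_test, probs),
    }
print(pd.DataFrame(comparison).T)
```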
60. Conclusion-
• In this employee churn prediction process, we started by examining a dataset with 1470 rows and 35 columns. It contained numerical and categorical variables, and we noticed an imbalance in the Employee_Churn column.
• To address the data's characteristics, we performed data preprocessing.
• We split the data into categorical and numerical columns to find any outliers using boxplots.
• Visualization was done in three ways:
Univariate Analysis – Count plot & Pie chart
Bivariate Analysis – Count plots & Hist plots
Multivariate Analysis – Scatter diagram
• Later, we balanced the imbalanced data using SMOTE.
• Standardization was used to scale the features for a better model.
• Principal Component Analysis, a dimensionality reduction technique, was used to transform the high-dimensional data into a new coordinate system, capturing the most important information while minimizing information loss.
61. Conclusion-
• We divided the dataset into training and testing sets and explored six different machine learning models: Logistic Regression, K-Nearest Neighbour, Decision Tree Classifier, Random Forest, AdaBoost, and Support Vector Classifier.
• Our focus was on the SVC, which showed the best performance in terms of accuracy, AUC score, and precision.
• The chosen Support Vector Classifier model achieved an accuracy of approximately 87.387%, a precision score of 0.9047, and an AUC score of 0.9524. This result helps ensure that the company can minimize potential employee churn.
• After applying PCA, the top two influential factors were identified for PC1 and PC3.
• In conclusion, by systematically preprocessing the data and selecting the right model, we successfully built a model that predicts employee churn accurately. This helps the company make more informed retention decisions and reduces the chances of financial setbacks.