Draft Xai


OUR MODEL

Data Acquisition → Data Pre-processing → Feature Selection → Training Classification Methods → Testing Data
DATA ACQUISITION:
• The aim of this step is to identify and obtain all the data needed for the problem.
• During this step, identify the various data sources, as data are often collected from sources such as files and databases. The size and quality of the collected data determine the efficiency of the output.
• The larger the number of data points, the more accurate the prediction is likely to be.
• The dataset for cardiovascular disease was chosen from Kaggle, a reputable source, to ensure a robust and reliable foundation for subsequent analysis and predictive modeling.
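A minimal sketch of this step in Python, assuming the Kaggle cardiovascular disease dataset has been downloaded locally; the file name and semicolon delimiter are assumptions, not details given in the slides:

import pandas as pd

# Load the cardiovascular disease dataset downloaded from Kaggle.
# The file name and delimiter below are assumptions; adjust to the actual download.
data = pd.read_csv("cardio_train.csv", sep=";")

# More data points generally mean more accurate predictions, so check the size first.
print(data.shape)
print(data.head())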
Data Pre-Processing:
• In this step we understand the nature, characteristics, format, and quality of the data.
• The data needs to be cleaned and converted into a usable format. This is the process of cleaning the data points, selecting the variables to use, and converting the data into a format suitable for analysis in the next step.
• Cleaning of data points is required to deal with quality issues. In this pre-processing step, we usually check for null values and replace them with the mean of the feature.
• We also identify duplicate data points and drop them from the dataset.
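A minimal sketch of this pre-processing, assuming the data is already in a pandas DataFrame named data (as in the loading sketch above):

# Check for null values in each feature.
print(data.isnull().sum())

# Replace null values with the mean of the corresponding feature (numeric columns).
numeric_cols = data.select_dtypes(include="number").columns
data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())

# Identify duplicate data points and drop them from the dataset.
data = data.drop_duplicates()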
Feature Selection:
• The aim of this step is to identify the best features, which give high efficiency.
• To get the best features, find the outliers present in the dataset and adjust them using the interquartile range (IQR). After handling outliers, scale the features.
• Feature scaling is the final step of data pre-processing. It standardizes the independent variables of the dataset into a specific range. In feature scaling, we put the variables on the same range and scale so that no variable dominates another.
• For feature scaling, we use the StandardScaler class from the sklearn.preprocessing library. The author also used the What-If Tool, an Explainable AI tool from Google.
• The What-If Tool is a visual interface designed to help you understand your datasets. It lets you edit data points to see how the model reacts to changes and is also useful for comparing multiple machine learning models.
• The features identified are age, height, weight, systolic pressure, diastolic pressure and cholesterol.
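A minimal sketch of the IQR outlier adjustment and StandardScaler step; the column names (including the target column "cardio") are assumptions about how the listed features appear in the dataset:

from sklearn.preprocessing import StandardScaler

# Features named in the slides; the exact column names are assumed.
features = ["age", "height", "weight", "ap_hi", "ap_lo", "cholesterol"]

# Adjust outliers using the interquartile range: clip values outside
# [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
for col in features:
    q1, q3 = data[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    data[col] = data[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Standardize the independent variables so that no variable dominates another.
scaler = StandardScaler()
X = scaler.fit_transform(data[features])
y = data["cardio"]  # target column name is an assumption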
Training Classification Methods

Classifiers used: Decision Tree, Random Forest, Extreme Gradient Boost
DECISION TREE
• Decision Tree is a supervised learning technique preferred for solving classification problems; our main goal is to classify whether a person has cardiovascular disease or not.
• Decision Tree is a tree-structured classification algorithm, where internal nodes represent the columns or features of a dataset, branches represent the decisions that have to be made, and each leaf node represents the output.
• The features are selected based on an Attribute Selection Measure (ASM) such as Entropy and the Gini index, calculated using the formulas below:
Entropy = -∑_{i=1}^{n} p_i · log(p_i)   ---equation (1)
Gini index = 1 - ∑_{i=1}^{n} p_i²   ---equation (2)
where n is the number of classes and p_i is the probability of class i.
The basic idea behind the decision tree algorithm is as follows:
Step-1: Select the best attribute using an Attribute Selection Measure (ASM), equation (1) or equation (2), to split the records.
Step-2: Make the selected attribute a decision node and split the dataset into smaller subsets.
Step-3: Build the tree by repeating Step-1 and Step-2 recursively for each child until either all the tuples belong to the same attribute value or there are no remaining attributes.
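A minimal sketch of this algorithm using scikit-learn's DecisionTreeClassifier, assuming X and y from the feature-selection sketch above; the hold-out split and hyperparameters are illustrative:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out part of the data for the testing phase described later.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# criterion="entropy" corresponds to equation (1); "gini" (the default) to equation (2).
dt_model = DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=42)
dt_model.fit(X_train, y_train)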
RANDOM FOREST
• Random Forest is a supervised learning technique based on ensemble learning. Ensemble learning combines multiple decision trees to form a more accurate prediction model.
• Because it is a combination of multiple decision trees, resulting in a forest of trees, it has been given the name "Random Forest".
• The random forest algorithm is not biased, as it depends on majority voting to produce the final prediction. Random Forest uses the same formulas, equation (1) and equation (2), as the Decision Tree.
The Random Forest algorithm is as follows:
Step 1: Select random samples from the given training dataset.
Step 2: Construct a decision tree for every sample using the decision tree algorithm; each decision tree produces an outcome.
Step 3: Perform voting for every outcome produced.
Step 4: Select the most voted outcome as the final prediction result.
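A minimal sketch using scikit-learn's RandomForestClassifier, reusing the training split from the decision tree sketch; the number of trees is illustrative:

from sklearn.ensemble import RandomForestClassifier

# Each tree is trained on a random sample of the data; the final
# prediction is decided by majority voting across the trees.
rf_model = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=42)
rf_model.fit(X_train, y_train)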
EXTREME GRADIENT BOOSTING
• Extreme Gradient Boosting, also referred to as XGBoost, is an optimized gradient boosting library with many benefits that make the model highly efficient, flexible and portable.
• The algorithm is implemented to be efficient in computing time and memory usage. The design goal of XGBoost is to make the best use of available resources to train the model.
• Some key algorithm implementation features include:
1. Sparse-aware implementation with automatic handling of missing data values.
2. Block structure to support the parallelization of tree construction.
3. Continued training, so that you can further boost an already fitted model on new data.
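A minimal sketch using the xgboost library's scikit-learn interface, again reusing the training split; the hyperparameters are illustrative:

from xgboost import XGBClassifier

# Gradient-boosted trees; XGBoost is sparse-aware (handles missing values
# automatically) and parallelizes tree construction.
xgb_model = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=5)
xgb_model.fit(X_train, y_train)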
Testing Data:
• After the Cardiovascular Disease Prediction model has been trained on the cardiovascular dataset, the next phase is to evaluate it rigorously.
• This evaluation examines the model's correctness and accuracy by introducing a distinct test dataset that was not used during training.
• The primary objective is to determine whether the model performs well on data points it has not seen before.
• This phase serves as a critical assessment of the model's efficacy and its capacity for refinement.
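A minimal sketch of this evaluation, assuming the held-out test split and the three trained models from the sketches above:

from sklearn.metrics import accuracy_score

# Evaluate each trained model on the distinct test dataset.
for name, model in [("Decision Tree", dt_model),
                    ("Random Forest", rf_model),
                    ("XGBoost", xgb_model)]:
    predictions = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, predictions))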
DATASET
