
Lecture - 32 - 33


Learning

Dr. Ashish Kumar


Associate Professor-CSE
Manipal University Jaipur
What is Learning?
• Herbert Simon: “Learning is any process by which a system improves performance from experience.”
• “A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
– Tom Mitchell
• Learning is the process of converting experience into expertise or knowledge.
• Learning is essential for unknown environments, i.e., when the designer lacks omniscience.
• Learning is useful as a system construction method, i.e., exposing the agent to reality rather than trying to
write everything down.
• Learning modifies the agent's decision mechanisms to improve performance.
• Learning is used when:
• Human expertise does not exist (navigating on Mars),
• Humans are unable to explain their expertise (speech recognition)
• Solution changes with time (routing on a computer network)
• Solution needs to be adapted to particular cases (user biometrics)
Types of Learning
• Learning can be broadly classified into four categories, as mentioned below, based on the nature of the
learning data and the interaction between the learner and the environment.
• Supervised Learning – Supervised learning is a type of machine learning where an algorithm learns
from labeled training data, and makes predictions or decisions based on that learning. In supervised
learning, the algorithm is provided with input-output pairs, and the goal is to learn a mapping
function from inputs to outputs. It is called 'supervised' because the process of an algorithm learning
from the training dataset can be compared to a teacher supervising the learning process.
• Example: Image recognition, language translation.

• Unsupervised Learning – Unsupervised learning involves training algorithms with unlabeled data. The
system tries to learn the patterns and the structure from the data without any supervision. The
algorithms work to explore the data and can automatically find patterns or relationships within it.
• Example: Clustering for customer segmentation, anomaly detection.
• Semi-supervised Learning – Semi-supervised learning is a hybrid approach where the algorithm is
trained on a dataset that contains both labeled and unlabeled data. It's useful when obtaining labeled
data is expensive or time-consuming. The algorithm learns from the small amount of labeled data and
the larger amount of unlabeled data to make predictions.
• Example: Sentiment analysis, speech recognition.

• Reinforcement Learning – Reinforcement learning is a type of machine learning where an agent learns
how to behave in an environment by performing certain actions and receiving rewards or penalties in
return. The agent learns to achieve a goal in an uncertain, potentially complex environment.
Reinforcement learning is based on the idea of agents taking actions in an environment to maximize
some notion of cumulative reward.
• Example: Game playing (e.g., AlphaGo), robotic control.
Learning Classification
• Two techniques of learning: inductive learning and deductive learning.
Inductive Learning
• Inductive learning is a machine learning technique that trains a model to generate predictions based
on examples or observations. During inductive learning, the model picks up knowledge from particular
examples or instances and generalizes it so that it can predict outcomes for brand-new data.
• When using inductive learning, a rule or method is not explicitly programmed into the model. Instead,
the model is trained to spot trends and connections in the input data and then uses this knowledge to
predict outcomes from fresh data. The aim of inductive learning is a model that can accurately predict
the result of subsequent instances.
• In supervised learning situations, where the model is trained using labeled data, inductive learning is
frequently utilized. A series of samples with the proper output labels are used to train the model. The
model then creates a mapping between the input data and the output data using this training data. The
output for fresh instances may be predicted using the model after it has been trained.
• Inductive learning is used by a number of well-known machine learning algorithms, such as decision
trees, k-nearest neighbors, and neural networks.
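As an illustration of generalizing from labeled examples, here is a minimal k-nearest-neighbors sketch. The function name `knn_predict`, the two-feature points, and the labels are hypothetical and chosen only for this example; it is not a production implementation.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training examples."""
    # Sort the training examples by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda example: math.dist(example[0], query))
    # Collect the labels of the k closest examples.
    nearest_labels = [label for _, label in by_distance[:k]]
    # Generalize: the most common label among the neighbours is the prediction.
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical labeled examples: (features, label).
training_data = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((2.0, 1.5), "small"),
    ((6.0, 6.5), "large"),
    ((7.0, 6.0), "large"),
    ((6.5, 7.5), "large"),
]

print(knn_predict(training_data, (1.2, 1.4)))
print(knn_predict(training_data, (6.8, 6.9)))
```

No rule such as "points near the origin are small" is written anywhere in the code; the prediction comes entirely from the stored examples.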
Advantages of Inductive Learning
• Flexibility − Because inductive learning models are flexible and adaptive, they are well suited to
handling difficult, complex, and dynamic information.

• Finding hidden patterns and relationships in data − Inductive learning models are well suited to
tasks like pattern recognition and classification because they can identify links and patterns in data
that may not be immediately apparent to humans.

• Huge datasets − Inductive learning models are suitable for applications requiring the processing of
massive quantities of data because they can efficiently handle enormous volumes of data.

• Appropriate for situations where the rules are ambiguous − Since inductive learning models may learn
from examples without explicit programming, they are suitable for situations when the rules are not
precisely described or understood beforehand.
Disadvantages of Inductive Learning
• May overfit to particular data − Inductive learning models that have overfit to specific training data, or
that have learned the noise in the data rather than the underlying patterns, may perform badly on fresh
data.
• Potentially computationally costly − Inductive learning models can be expensive to train and run,
especially on complex datasets, which may constrain their use in real-time applications.
• Limited interpretability − It may be difficult to understand how inductive learning models arrive at
their predictions, which is a problem in applications where the decision-making process must be
transparent and explicable.
• Data dependence − Inductive learning models are only as good as the data they are trained on; if the
data is inaccurate or inadequate, the model may not perform effectively.
Deductive Learning
• Deductive learning is a method of machine learning in which a model is built using a series of logical
principles and steps. In deductive learning, the model is specifically designed to adhere to a set of
guidelines and processes in order to produce predictions based on brand-new, unexplored data.
• In rule-based systems, expert systems, and knowledge-based systems, where the rules and processes are
clearly set by domain experts, deductive learning is frequently utilized. The model is trained to adhere
to the guidelines and processes in order to derive judgments or predictions from the input data.
• Deductive learning begins with a set of rules and processes and utilizes these rules to generate
predictions on incoming data, in contrast to inductive learning, which learns from particular examples.
Making a model that can precisely adhere to a set of guidelines and processes in order to generate
predictions is the aim of deductive learning.
• Deductive reasoning is used in rule-based systems and expert systems, and also whenever an
already-constructed decision tree is applied to a new case (building the tree from examples is inductive).
Deductive learning is a crucial machine learning strategy because it enables the development of models
that generate predictions in accordance with predetermined rules and guidelines.
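To make the contrast with inductive learning concrete, here is a minimal sketch of a rule-based system that derives conclusions by forward chaining. The function `forward_chain`, the facts, and the rules are all hypothetical and stand in for rules a domain expert would supply.

```python
def forward_chain(facts, rules):
    """Apply IF-THEN rules repeatedly until no new fact can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires when all of its premises are known and its conclusion is new.
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

# Hypothetical expert-written rules: (set of premises, conclusion).
rules = [
    ({"has_feathers"}, "is_bird"),
    ({"is_bird", "cannot_fly", "swims"}, "is_penguin"),
]

result = forward_chain({"has_feathers", "cannot_fly", "swims"}, rules)
print(sorted(result))
```

Nothing here is learned from data: the output is only as good as the rules, which matches the advantages and disadvantages listed for deductive learning.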
Advantages of Deductive Learning
• More efficient − Since deductive learning begins with broad concepts and applies them to particular
cases, it is frequently quicker than inductive learning.

• Deductive learning can sometimes yield more accurate findings than inductive learning since it starts
with certain principles and applies them to the data.

• Deductive learning is more practical when data are sparse or challenging to collect since it requires
fewer data than inductive learning.
Disadvantages of Deductive Learning
• Deductive learning is constrained by the rules that are currently in place, which may be insufficient or
obsolete.

• Deductive learning is not appropriate for complicated issues that lack precise rules or correlations
between variables, nor is it appropriate for ambiguous problems.

• Results that are biased − The quality of the rules and knowledge base, which might add biases and
mistakes to the results, determines how accurate deductive learning is.
Inductive Vs Deductive Learning
Supervised vs. Unsupervised Learning
Prediction Problems: Classification vs. Numeric Prediction
• Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the values (class labels) in a
classifying attribute and uses it in classifying new data
• Numeric Prediction
• models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
• Credit/loan approval:
• Medical diagnosis: if a tumor is cancerous or benign
• Fraud detection: if a transaction is fraudulent
• Web page categorization: which category it is
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label
attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: for classifying future or unknown objects
• Estimate accuracy of the model
• The known label of test sample is compared with the classified result from the model
• Accuracy rate is the percentage of test set samples that are correctly classified by the model
• Test set is independent of training set (otherwise the accuracy estimate is inflated by overfitting)
• If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set
Process (1): Model Construction
The training data is fed to a classification algorithm, which produces the classifier (model).

Training Data:
NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classifier (Model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction
The classifier is applied to the testing data to estimate accuracy, and then to unseen data.

Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4). Tenured?
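The two steps above can be sketched in a few lines. The rule and the testing data are taken from the slides; the function name `predict_tenured` and the tuple layout are assumptions made for this sketch.

```python
def predict_tenured(rank, years):
    """Rule from the model-construction step: professor OR more than 6 years."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, actual label).
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Accuracy rate: fraction of test samples whose predicted label matches the known label.
correct = sum(predict_tenured(rank, years) == actual
              for _, rank, years, actual in test_set)
accuracy = correct / len(test_set)
print(f"accuracy = {accuracy:.2f}")

# Classify the unseen tuple (Jeff, Professor, 4).
print(predict_tenured("Professor", 4))
```

The rule misclassifies Merlisa (7 years, but not tenured), so 3 of the 4 test samples are correct; whether 75% is acceptable depends on the application.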
Decision Trees
• Decision trees are a popular machine learning technique used for both classification and regression
tasks.
• Decision trees are a type of supervised machine learning (that is, the training data specifies both the
input and the corresponding output) in which the data is repeatedly split according to a certain
parameter. A decision tree is a graphical representation of all the possible solutions to a
problem/decision based on given conditions.
• The tree can be explained by two entities, namely decision nodes and leaves. The leaves are the decisions
or the final outcomes. And the decision nodes are where the data is split. The decisions or the test are
performed on the basis of features of the given dataset.
• It is called a decision tree because, similar to a tree, it starts with the root node, which expands into
further branches and forms a tree-like structure. One common algorithm for building a tree is CART,
which stands for Classification and Regression Tree.
• A decision tree simply asks a question and, based on the answer (Yes/No), further splits into
subtrees.
Decision Tree Terminologies
• Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
• Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further once a leaf node is
reached.
• Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to
the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches from the tree.
• Parent/Child node: A node that is split into sub-nodes is called the parent node of those sub-nodes, and
the sub-nodes are called its child nodes. The root node has no parent.
Decision Tree algorithm
• Step-1: Begin the tree with the root node, say S, which contains the complete dataset.

• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).

• Step-3: Divide S into subsets, one for each possible value of the best attribute.

• Step-4: Generate the decision tree node, which contains the best attribute.

• Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue
this process until a stage is reached where the nodes cannot be classified further; such a final
node is called a leaf node.
Example
• Suppose there is a candidate who has a job offer and wants to decide whether to accept the offer or not.
To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM).
The root node splits into the next decision node (distance from the office) and one leaf node, based on the
corresponding labels. That decision node further splits into one decision node (Cab facility) and one
leaf node. Finally, the last decision node splits into two leaf nodes (Accepted offer and Declined offer).
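The job-offer tree can be written down as a nested structure and walked from root to leaf. This is only a sketch: the order of the attributes follows the example, but the yes/no question names and the choice of which branch ends in a leaf are assumptions made for illustration.

```python
# Decision nodes are dicts holding a question and its yes/no branches; leaves are strings.
# The question names are hypothetical; only the attribute order comes from the example.
job_offer_tree = {
    "question": "salary_acceptable",          # root node (Salary)
    "no": "Declined offer",
    "yes": {
        "question": "office_nearby",          # decision node (distance from the office)
        "no": "Declined offer",
        "yes": {
            "question": "cab_facility",       # decision node (Cab facility)
            "no": "Declined offer",
            "yes": "Accepted offer",
        },
    },
}

def classify(tree, answers):
    """Walk from the root to a leaf, following the yes/no answer at each decision node."""
    node = tree
    while isinstance(node, dict):
        branch = "yes" if answers[node["question"]] else "no"
        node = node[branch]
    return node

offer = {"salary_acceptable": True, "office_nearby": True, "cab_facility": False}
print(classify(job_offer_tree, offer))
```

Each loop iteration is one question asked by the tree, so the number of questions is at most the depth of the tree.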
Attribute Selection Measures
• While implementing a decision tree, the main issue is how to select the best attribute for the root node
and for the sub-nodes. To solve this problem, there is a technique called Attribute Selection Measure,
or ASM. With this measure, we can select the best attribute for each node of the tree.
• There are three popular techniques for ASM, which are:
1. Information Gain
2. Gain Ratio
3. Gini Index
Why is ASM required?
• Curse of Dimensionality: High-dimensional datasets often suffer from the curse of dimensionality,
where the complexity of the model increases with the number of features, leading to overfitting.
• Improved Model Performance: Selecting relevant features can lead to simpler, more interpretable
models, and often results in improved predictive performance on unseen data.
• Reduced Training Time: Fewer features mean less computational time and resources required for
training the model.
Information Gain
• Information gain, denoted by IG(S, A) for a set S, is the effective change in entropy after deciding on a
particular attribute A. It is closely related to the Kullback-Leibler divergence and is sometimes referred
to by that name.
• It measures the relative change in entropy with respect to the independent variables.
• It calculates how much information a feature provides us about a class.
• According to the value of information gain, we split the node and build the decision tree.
• A decision tree algorithm always tries to maximize the value of information gain, and a node/attribute
having the highest information gain is split first.
• The formula for Information Gain IG(S, A) for an attribute A with values {a_1, a_2, ..., a_V} on dataset S is:

$$IG(S, A) = H(S) - \sum_{v=1}^{V} \frac{|S_v|}{|S|} \, H(S_v)$$

• Where H(S) is the entropy of S, |S| is the size of dataset S, V is the number of values for attribute A,
S_v is the subset of S where attribute A has the value a_v, and H(S_v) is the entropy of subset S_v.
Entropy
• Entropy, also called Shannon entropy and denoted by H(S) for a finite set S, is the measure of the
amount of uncertainty or randomness in data. Intuitively, it tells us about the predictability of a
certain event.
• The formula for Entropy is:

$$H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

where p_i is the probability of randomly selecting an example in class i and c is the number of classes.
• Interpretation :
• Higher Entropy means Higher Uncertainty
• Lower Entropy means Lower Uncertainty
• For a two-class dataset, entropy reaches its maximum of “1” (one) when the dataset contains an equal
proportion of both classes and decreases towards “0” (zero) as the dataset becomes pure. With c classes,
the maximum is log2(c).
• Entropy is also used in data compression algorithms like Huffman coding. Less probable symbols carry
more information and are assigned longer codes, so sources with higher entropy require more bits per
symbol on average.
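The interpretation above can be checked numerically. The sketch below computes Shannon entropy from a list of class labels; the function name `entropy` and the example label lists (including the 9-to-5 split) are hypothetical.

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    # Sum -p_i * log2(p_i) over the classes that actually occur in the list.
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

pure = ["yes"] * 4                      # a single class: no uncertainty
balanced = ["yes"] * 2 + ["no"] * 2     # two classes in equal proportion
skewed = ["yes"] * 9 + ["no"] * 5       # two classes, unequal proportion

print(entropy(pure))
print(entropy(balanced))
print(round(entropy(skewed), 3))
```

The pure list has zero entropy, the balanced two-class list has entropy 1, and the skewed list falls in between.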
ID3 Algorithm
• One of the many Algorithms used to build Decision Trees.
• ID3 stands for Iterative Dichotomiser 3 and is so named because the algorithm iteratively (repeatedly)
dichotomizes (divides) features into two or more groups at each step.
• Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision tree.
• In simple words, the top-down approach means that we start building the tree from the top and the
greedy approach means that at each iteration we select the best feature at the present moment to create
a node.
• ID3 can overfit the training data (to avoid overfitting, smaller decision trees should be preferred over
larger ones).
• This algorithm usually produces small trees, but it does not always produce the smallest possible tree.
• ID3 is harder to use on continuous data (if the values of any given attribute are continuous, then there
are many more places to split the data on this attribute, and searching for the best value to split by can
be time-consuming).
ID3 Algorithm
• a) Take the Entire dataset as an input.

• b) Calculate the entropy of the target variable, as well as of the predictor attributes.

• c) Calculate the information gain of all attributes.

• d) Choose the attribute with the highest information gain as the Root Node

• e) Repeat the same procedure on every branch until the decision node of each branch is finalized.
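Steps a) to e) can be sketched as a short recursive function. The dataset below is an assumption: it is the widely used 14-row PlayTennis table from Tom Mitchell's textbook, entered here without the Day column, since no data table appears in the text of these slides. The function names are likewise chosen for this sketch.

```python
from collections import Counter
import math

def entropy(rows, target):
    """Entropy (in bits) of the target column over the given rows."""
    counts = Counter(row[target] for row in rows)
    return sum(-(n / len(rows)) * math.log2(n / len(rows)) for n in counts.values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3(rows, attributes, target):
    """Return a nested dict {attribute: {value: subtree}} or a class label for a leaf."""
    labels = [row[target] for row in rows]
    # Stop when the node is pure or no attributes remain; use the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    return {best: {value: id3([r for r in rows if r[best] == value], remaining, target)
                   for value in sorted({r[best] for r in rows})}}

# Assumed data: the classic PlayTennis table (Day column omitted).
COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
RAW = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
data = [dict(zip(COLUMNS, line.split())) for line in RAW.strip().splitlines()]
attributes = COLUMNS[:-1]

for attribute in attributes:
    print(attribute, round(information_gain(data, attribute, "PlayTennis"), 3))

tree = id3(data, attributes, "PlayTennis")
print(tree)
```

On this table Outlook has the highest information gain, so it becomes the root; the Overcast branch is already pure, while the Sunny and Rain branches are split again on Humidity and Wind respectively.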
ID3 Algo Example
In this dataset,
• The class is PlayTennis (Yes / No).
• There are 5 attributes in total.
• The “Day” attribute will not be considered, because its values are all unique.
• Choose the attribute with the highest information gain as the root node and draw the tree for
all its possible values.
• For the “Sunny” branch, choose the attribute with the highest information gain as the root of that
subtree and draw the tree for all its possible values.
• For the “Rain” branch, choose the attribute with the highest information gain as the root of that
subtree and draw the tree for all its possible values.
• The final decision tree is complete when there are only leaf nodes at the bottom.
Problem with Information Gain
• Information gain is biased towards features with many outcomes. For example, suppose we have two
features, “Color” and “Size”, and we want to build a decision tree to predict the type of fruit based on
them. The “Color” feature has three outcomes (red, green, yellow) and the “Size” feature has two
outcomes (small, large). Because splitting on more outcomes produces smaller and therefore purer
subsets, the information gain method tends to favour “Color”. This could be a problem because “Size”
could be the better feature to split on: it is less ambiguous and has fewer outcomes.

• To overcome this problem, we can use the gain ratio method.

Gain Ratio
• Gain Ratio is an alternative to Information Gain that is used to select the attribute for splitting in a
decision tree.
• It is used to overcome the problem of bias towards the attribute with many outcomes.
• Gain Ratio is a measure that takes into account both the information gain and the number of outcomes
of a feature to determine the best feature to split on.
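• We use Gain Ratio to normalize the Information Gain by the Split Info:

$$\text{Split Info} = -\sum_{i} D(i) \log_2 D(i)$$

where D(i) is the probability of each outcome i in the attribute, that is, the fraction of examples that
take outcome i. For the same Information Gain, the lower the value of Split Info, the better the split
is considered to be.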
• Now, the formula for gain ratio: Gain Ratio = Information Gain / Split Info
• In decision tree algorithm, the feature with the highest gain ratio is considered as the best feature for
splitting.
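A small sketch shows how the normalization penalizes a feature with many outcomes. The function names and the four-row toy data (a `record_id` feature with a unique value per row and a two-valued `size` feature) are hypothetical.

```python
from collections import Counter
import math

def entropy(values):
    """Shannon entropy (in bits) of a list of values."""
    total = len(values)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(values).values())

def gain_ratio(feature_values, labels):
    """Information gain of a feature divided by its Split Info."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [label for f, label in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    info_gain = entropy(labels) - remainder
    # Split Info is the entropy of the feature's own outcome distribution.
    split_info = entropy(feature_values)
    # A feature with a single outcome cannot split the data; report 0 for it.
    return info_gain / split_info if split_info > 0 else 0.0

labels = ["yes", "yes", "no", "no"]
record_id = ["a", "b", "c", "d"]             # unique per row: many outcomes
size = ["small", "small", "large", "large"]  # two outcomes

print(gain_ratio(record_id, labels))
print(gain_ratio(size, labels))
```

Both features separate the classes perfectly and so have the same information gain, but `record_id` has the larger Split Info and therefore the lower gain ratio, so `size` is preferred.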
Gini Index
• The last measurement is the Gini Index, which originates in a different discipline.
• The Gini coefficient was first introduced to measure the wealth distribution of a nation’s
residents.
• In decision trees, the Gini Index refers to Gini impurity, a related measure of how mixed or impure a
dataset is.
• The Gini impurity ranges between 0 and 1, where ‘0’ represents a pure dataset and larger values
represent more impure datasets. In a pure dataset, all the samples belong to the same class or category.
On the other hand, an impure dataset contains a mixture of different classes or categories. The impurity
is largest when the classes are equally represented: for c classes the maximum is 1 − 1/c, so for two
classes a Gini Index of ‘0.5’ denotes elements equally distributed between the classes, and the value
approaches 1 only as the number of classes grows.
• In decision tree algorithms, a pure dataset is considered ideal as it can be easily divided into subsets of
the same class. An impure dataset is more difficult to divide, as it contains a mixture of different classes.
• This means that an attribute with a lower Gini index should be preferred, i.e., the lower the Gini
impurity, the better the feature is for splitting the dataset.
• The Gini Index is calculated with the formula below:

$$\text{Gini} = 1 - \sum_{i} p(i)^2$$

where p(i) is the probability of a specific class, and the summation is done for all classes present in the
dataset.
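The formula can be sketched directly. To compare candidate features, the sketch also computes the size-weighted average impurity of the subsets a split produces, which is the usual convention and is assumed here rather than stated in the slides; the function names and toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_of_split(feature_values, labels):
    """Weighted average Gini impurity of the subsets produced by splitting on a feature."""
    total = len(labels)
    weighted = 0.0
    for value in set(feature_values):
        subset = [label for f, label in zip(feature_values, labels) if f == value]
        weighted += len(subset) / total * gini(subset)
    return weighted

labels = ["yes", "yes", "no", "no"]
wind = ["weak", "weak", "strong", "strong"]   # separates the classes perfectly
shift = ["day", "night", "day", "night"]      # leaves both subsets mixed

print(gini(labels))
print(gini_of_split(wind, labels))
print(gini_of_split(shift, labels))
```

The feature with the lower weighted impurity (`wind` in this toy data) is the better choice for the split.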
Gini Index Example
Decision Tree
Advantages:
• Interpretability: Decision trees are easy to understand and interpret, making them valuable for
decision-making processes.
• Feature Importance: They can rank features based on their contribution to the decision-making
process.

Disadvantages:
• Overfitting: Without proper handling, decision trees can memorize the training data, leading to poor
generalization on unseen data.
• Instability: Small variations in the data can result in a completely different tree, making them sensitive
to input data changes.
Handling Overfitting:
• Decision trees tend to overfit the data, capturing noise in the training set. Techniques like pruning
(removing branches that don't provide significant predictive power) and setting a minimum number of
samples required to split a node help mitigate overfitting.

Ensemble Methods:
• To enhance performance, decision trees are often used in ensemble methods like Random Forest
(multiple decision trees voting on the output) and Gradient Boosting (decision trees added one after
another, each correcting the errors of the previous ones).
In summary, decision trees are powerful, intuitive, and versatile machine learning techniques with
applications ranging from business and finance to healthcare and natural language processing. Their
effectiveness, especially when combined with ensemble methods, makes them a fundamental tool in the
machine learning toolkit.
“Thank you for being such an
engaged audience during my
presentation.”
- Dr. Ashish Kumar
