

Supervised Learning

Adane Letta Mamuye (PhD)


adothebigg@gmail.com

May 2019
Outline

• Classification-Introduction
• Decision tree classification
• Tree induction
• Decision tree practical issues
Classification
• Classification is the task of assigning objects to one of several
predefined categories.
• Given a collection of records (training set):
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as
accurately as possible.
• A test set is used to determine the accuracy of the model.
Classification Tasks
Image Classification Example
• Optical character recognition: recognize character sets from their images. In this case there are multiple classes, as many as there are characters we would like to recognize. The case where the characters are handwritten is especially interesting:
– People have different handwriting styles
– Characters may be written small or large, with pen or pencil
– There are many possible images corresponding to the same character
Image Classification Example
• Face recognition: the input is an image, the classes are people
to be recognized, and the learning classification program
should learn to associate the face images to identities.
• More difficult than OCR since:
– There are more classes
– Input images are larger
– There are age differences
– The face is three dimensional, and lighting causes significant changes in the image
Two Types of Learners
• Classifiers can be lazy learners or eager learners.
– Lazy learners simply store the training data and wait until test data appears
• Less time spent training, but more time spent predicting
• Examples: k-nearest neighbor, case-based reasoning
– Eager learners construct a classification model from the given training data before receiving data to classify
• Take a long time to train, but less time to predict
• Examples: decision trees, Naive Bayes, artificial neural networks
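
As an illustration (not from the slides), a short scikit-learn sketch contrasting the two learner types; the dataset, parameters, and variable names are assumptions chosen only for the example:

# Lazy vs. eager learners, sketched with scikit-learn on the Iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Lazy learner: "training" only stores the data; the work happens at prediction time.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Eager learner: builds an explicit model (the tree) before seeing any test data.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("k-NN accuracy:", knn.score(X_test, y_test))
print("Tree accuracy:", tree.score(X_test, y_test))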
Classification Techniques
• Decision Tree based Methods
• Naive Bayes and Bayesian Belief Networks
• Neural Networks
• Support Vector Machines
• ……
Decision Tree
• A decision tree is a hierarchical model in which a local region of the input space is identified through a sequence of recursive splits, each involving a small number of steps.

• It implements the divide-and-conquer strategy.
Decision Tree
• A decision tree classifier builds classification models from an input data
set.

[Figure: there could be more than one tree for the same data]
[Figure: how a decision tree is used for classification]

• Each leaf, in the case of classification, is the class code; in regression it is a numeric value.
Building a Decision Tree

• Two-step method:
– Tree construction: determine the best split to find all the branches and the leaf nodes of the tree.
– Tree pruning (optimization): identify and remove branches in the decision tree that are not useful for classification
• Pre-pruning
• Post-pruning
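
A minimal construction sketch, assuming scikit-learn and the Iris data; pruning, the second step, is illustrated after the "Tree pruning" slide further below:

# Step one, tree construction: grow the tree by repeatedly choosing the best split.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# The constructed tree, printed as nested if-then splits ending in leaf classes.
print(export_text(tree, feature_names=["sepal len", "sepal wid",
                                       "petal len", "petal wid"]))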
How to Determine the Best Split
• The goodness of a split is quantified by a node impurity measure:
– Information gain (entropy): attributes are assumed to be categorical
– Gain ratio: an extension of information gain
• Corrects the tendency of information gain to favor an attribute simply because it has many distinct values
– Gini index: attributes are assumed to be continuous
• Assumes there exist several possible split values for each attribute
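
A small, self-contained sketch (my own illustration, not from the slides) of the two impurity measures named above, computed from the class labels at a node:

import math
from collections import Counter

def entropy(labels):
    # Info(D) = -sum_i p_i * log2(p_i), over the class proportions p_i
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini(D) = 1 - sum_i p_i^2, over the class proportions p_i
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

node = ["yes"] * 9 + ["no"] * 5        # e.g. a node holding 9 positive, 5 negative tuples
print(round(entropy(node), 3))         # ~0.940 bits
print(round(gini(node), 3))            # ~0.459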
Information Gain
• The encoding information that would be gained by branching on A:
Gain(A) = Info(D) - InfoA(D)
• Gain(A) tells us how much would be gained by branching on A.
• The attribute A with the highest information gain is chosen as the splitting attribute at node N.
Information Gain

Entropy in information theory specifies the minimum number of bits needed to encode the classification of an instance.
Information Gain
• Suppose we were to partition the tuples in D on some
attribute A having v distinct values, {a1, a2, …, av}
• Attribute A can be used to split D into v partitions or subsets,
{D1,D2,…, Dv}, where Dj contains those tuples in D that have
outcome aj of A.
• These partitions would correspond to the branches grown
from node N.
• The expected information required to classify a tuple from D
based on the partitioning by A:
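
For reference, the standard formulas behind these quantities (the slide's own equation images did not survive extraction, so these are supplied from the usual information-gain formulation, with m classes and p_i the fraction of tuples in D belonging to class C_i):

Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Info(D_j)

Gain(A) = Info(D) - Info_A(D)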
Play Tennis Example
Which attribute do we take first?
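
A worked computation, assuming the standard 14-example Play Tennis data (9 "yes", 5 "no"; Outlook splits them as sunny 2/3, overcast 4/0, rain 3/2):

Info(D) = -(9/14)\log_2(9/14) - (5/14)\log_2(5/14) \approx 0.940
Info_{Outlook}(D) = (5/14)(0.971) + (4/14)(0.0) + (5/14)(0.971) \approx 0.694
Gain(Outlook) = 0.940 - 0.694 \approx 0.246

The gains for Humidity, Wind, and Temperature come out lower (roughly 0.151, 0.048, and 0.029), so Outlook is taken first.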
DT Induction Algorithm
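
The slide's algorithm listing did not survive extraction; the recursive skeleton below is a generic sketch of Hunt-style decision-tree induction, with record layout and helper names that are my own assumptions:

import math
from collections import Counter

def entropy(records):
    counts = Counter(r["class"] for r in records)
    n = len(records)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def partition(records, attr):
    # Split the records into groups, one per distinct value of the attribute.
    groups = {}
    for r in records:
        groups.setdefault(r[attr], []).append(r)
    return groups

def information_gain(records, attr):
    split = partition(records, attr)
    remainder = sum(len(s) / len(records) * entropy(s) for s in split.values())
    return entropy(records) - remainder

def induce_tree(records, attributes):
    # Stop when the node is pure or no attributes remain; return the majority class.
    if len({r["class"] for r in records}) == 1 or not attributes:
        return Counter(r["class"] for r in records).most_common(1)[0][0]
    # Choose the best splitting attribute, branch on its values, and recurse.
    best = max(attributes, key=lambda a: information_gain(records, a))
    children = {value: induce_tree(subset, [a for a in attributes if a != best])
                for value, subset in partition(records, best).items()}
    return (best, children)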
Stopping Criteria for Tree Induction

• Stop expanding a node when all the records belong to the same class
• Stop expanding a node when all the records have similar attribute values
Tree pruning

• A node is not split further if the number of training instances reaching it is smaller than a certain threshold.
– Prepruning: stop tree construction early, before the tree is fully grown
– Postpruning: grow the full tree, then find and prune unnecessary subtrees to obtain a simpler tree
• Comparing prepruning and postpruning:
– Prepruning is faster, but postpruning leads to more accurate trees.
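
A hedged scikit-learn sketch of both pruning styles; the dataset and the threshold values are assumptions chosen only for illustration:

# Prepruning vs. postpruning with scikit-learn decision trees.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Prepruning: stop growth early via thresholds on depth and node size.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                             random_state=0).fit(X_tr, y_tr)

# Postpruning: grow fully, then prune back with cost-complexity pruning.
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

print("prepruned  test accuracy:", pre.score(X_te, y_te))
print("postpruned test accuracy:", post.score(X_te, y_te))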
Issues in decision trees
• Overfitting happens when the model learns the detail and noise in the training data, which negatively impacts performance on new data.
• A decision tree is said to overfit the training data if
– It results in poor accuracy when classifying test samples
– It has too many branches that reflect anomalies

• Avoiding overfitting:
– Prune the tree: leaf nodes (sub-trees) are removed from the tree as long as the pruned tree performs better on the test data than the larger tree.
Issues in decision trees
• Underfitting refers to a model that can neither model the training data nor generalize to new data.

• Underfitting occurs when the model is too simple; both training and test errors are large.

• It is easy to detect given a good performance metric.


DT Algorithms
• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– ID3, C4.5
– CART
– CHAID (Chi-squared Automatic Interaction Detector)
– SLIQ, SPRINT, MARS
Decision Tree Classifier Advantages

• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees: they can be converted to if-then rules that are easily understandable
• Accuracy is comparable to other classification techniques for
many simple data sets
Decision Tree Classifier Disadvantages
• Prone to overfitting.
• Require some kind of measurement as to how well they are
doing.
• Need to be careful with parameter tuning.
• Can create biased learned trees if some classes dominate.
Neural Network
• An ANN is just a parallel computational system consisting of many simple processing elements connected together in a specific way.

• The fundamental processing element of a neural network is the neuron.
OR
• Networks of non-linear elements, interconnected through adjustable weights.
• Each non-linear element takes as input a weighted sum of the outputs of other elements, much as a network of biological neurons does.
Why are ANNs Worth Studying?

• They are extremely powerful computational devices

• Massive parallelism makes them very efficient

• They are particularly fault tolerant


How ANN works
• Receives inputs from other sources

• Combines them in some way

• Performs a generally nonlinear operation on the result

• Outputs the final result


How ANN works

• The three basic components of the (artificial) neuron are:
– Connecting links (synapses)
– An adder function: for computing the weighted sum of the inputs
• θj is called the bias, a numerical value associated with the neuron.
– An activation function: for transforming the weighted sum into the neuron's output, typically through a non-linear squashing function
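
A minimal sketch of a single neuron j built from these three components; the input values, weights, and bias below are arbitrary numbers chosen only for illustration:

import math

def neuron_output(inputs, weights, bias):
    # Adder: net input I_j = sum_i w_i * x_i + theta_j
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Activation: squash the net input into (0, 1) with the logistic (sigmoid) function
    return 1.0 / (1.0 + math.exp(-net))

print(neuron_output([0.5, 0.2], [0.4, -0.7], bias=0.1))   # one forward step of neuron j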
How ANN works
ANN Learning
• The inputs are fed simultaneously into the input layer.
• The weighted outputs of these units are fed into the hidden layer.
• The weighted outputs of the last hidden layer are inputs to the units making up the output layer.
ANN Classification
• Input: classification data; it contains a class attribute
– The data is divided, as in any classification problem, into training data and testing data
• All attribute values in the database are changed to contain values in the interval [0, 1]
• Two basic normalization techniques:
– Max-min normalization
– Decimal scaling normalization
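
A short sketch of the two normalization techniques (my own illustration; the helper names and sample values are assumptions):

def max_min_normalize(values):
    # v' = (v - min) / (max - min); assumes max > min, maps values into [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def decimal_scaling_normalize(values):
    # v' = v / 10^k, with k chosen so that every scaled magnitude is below 1
    k = len(str(int(max(abs(v) for v in values))))
    return [v / (10 ** k) for v in values]

ages = [25, 32, 47, 51, 60]
print(max_min_normalize(ages))           # e.g. 25 -> 0.0, 60 -> 1.0
print(decimal_scaling_normalize(ages))   # e.g. 60 -> 0.60 (k = 2)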
A multilayer Feed Forward Network
• INPUT: records without the class attribute, with normalized attribute values.

• INPUT VECTOR: X = {x1, x2, ..., xn}, where n is the number of (non-class) attributes.

• INPUT LAYER: there are as many nodes as non-class attributes, i.e. as many as the length of the input vector.

• HIDDEN LAYER: the number of nodes in the hidden layer and the number of hidden layers depend on the implementation.

• OUTPUT LAYER: corresponds to the class attribute.


A multilayer Feed Forward Network

• In the out put layer there are as many nodes as classes (values
of the class attribute).

• Network is fully connected, i.e. each unit provides input to


each unit in the next forward layer.
Classification By Back Propagation
• Back propagation learns by iteratively processing a set of training data (samples).
• For each sample, the weights are modified to minimize the error between the network's classification and the actual classification.
• These modifications are made in the "backwards" direction, i.e., from the output layer down to the first hidden layer, hence the name back propagation.
Steps in Back Propagation Algorithm
• STEP ONE: initialize the weights and biases.
• The weights in the network are initialized to random numbers in the interval [-1, 1] or [-0.5, 0.5].
• The biases are similarly initialized to small random numbers.


Steps in Back Propagation Algorithm
• STEP TWO: propagate the input forward.
• The inputs pass through the input units, unchanged.
• Next, the net input and output of each unit in the hidden and output layers are computed.
• Each unit in the hidden and output layers takes its net input and then applies an activation function to it.
– The logistic, or sigmoid, function is used.
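
The standard formulas for this step (supplied here because the slide's equations were images; O_i is the output of unit i in the previous layer, w_{ij} the weight on the connection from i to unit j, and \theta_j the bias of unit j):

I_j = \sum_i w_{ij} O_i + \theta_j
O_j = \frac{1}{1 + e^{-I_j}}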
Steps in Back Propagation Algorithm
• STEP THREE: back propagate the error.
• The error is propagated backward by updating the weights
and biases to reflect the error of the network’s prediction.
• For a unit j in the output layer, the error Errj is computed as:
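
The usual error terms, supplied from the standard backpropagation formulation since the slide's equations did not survive extraction (T_j is the true target value for output unit j):

For an output unit j:  Err_j = O_j (1 - O_j)(T_j - O_j)
For a hidden unit j:   Err_j = O_j (1 - O_j) \sum_k Err_k \, w_{jk}, summing over the units k in the next layer.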
Steps in Back Propagation Algorithm
• STEP FOUR: update the weights and biases.
• The weights and biases are updated to reflect the propagated errors (the update rules are given below).
• The variable l is the learning rate, a constant typically having a value between 0.0 and 1.0.
• If the learning rate is too small, learning will occur at a very slow pace.
• If the learning rate is too large, oscillation between inadequate solutions may occur.
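
The corresponding update rules, again from the standard backpropagation formulation:

\Delta w_{ij} = l \cdot Err_j \cdot O_i,  and  w_{ij} \leftarrow w_{ij} + \Delta w_{ij}
\Delta \theta_j = l \cdot Err_j,  and  \theta_j \leftarrow \theta_j + \Delta \theta_j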
Steps in Back Propagation Algorithm
• STEP FIVE: terminating conditions. Training stops when:
• All weight changes Δwij in the previous epoch are so small as to be below some specified threshold, or
• The percentage of tuples misclassified in the previous epoch is below some threshold, or
• A prespecified number of epochs has expired.

• In practice, several thousand epochs may be required before the weights converge.
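
A compact end-to-end sketch of steps one through five, training a tiny network on XOR with the rules given above; the network size, learning rate, and epoch budget are assumptions made only for illustration:

import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Training data: inputs already normalized to [0, 1]; the target is the class (0 or 1).
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
n_hidden = 3
l = 0.5                                   # learning rate (an assumed value)

# STEP ONE: initialize weights and biases to random numbers in [-0.5, 0.5].
w_hid = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(n_hidden)]
b_hid = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hidden)]
b_out = random.uniform(-0.5, 0.5)

def forward(x):
    # STEP TWO: propagate the input forward through the hidden and output layers.
    h = [sigmoid(sum(w_hid[j][i] * x[i] for i in range(2)) + b_hid[j])
         for j in range(n_hidden)]
    o = sigmoid(sum(w_out[j] * h[j] for j in range(n_hidden)) + b_out)
    return h, o

for epoch in range(10000):                # STEP FIVE: stop after a fixed epoch budget
    for x, target in data:
        h, o = forward(x)
        # STEP THREE: back propagate the error (output unit first, then hidden units).
        err_o = o * (1 - o) * (target - o)
        err_h = [h[j] * (1 - h[j]) * err_o * w_out[j] for j in range(n_hidden)]
        # STEP FOUR: update weights and biases (w += l * Err * O, theta += l * Err).
        for j in range(n_hidden):
            w_out[j] += l * err_o * h[j]
            for i in range(2):
                w_hid[j][i] += l * err_h[j] * x[i]
            b_hid[j] += l * err_h[j]
        b_out += l * err_o

for x, target in data:
    print(x, target, round(forward(x)[1], 3))   # outputs should approach the targets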
ANN Weakness
• The complex internal structure leads to black-box behavior: it is very hard to get an idea of the meaning of the internal computations.
• Another feature of neural networks is their random behavior.
– The training process contains random elements; when it is repeated, the same input set may yield very different networks.
– Sometimes they differ in performance, one showing good behavior while others behave badly.
Thank you
