
CSEg4301: Introduction to Data Mining

Week-4 Part-1
Classification & Prediction
4th Year, Semester-1 (Elective)
Data mining algorithms: Classification and Prediction
Used to extract models describing important data classes or to predict future data trends.
- E.g., Classification model: built to categorize bank loan applications as either safe or risky.
- E.g., Prediction model: built to predict the expenditures of potential customers on computer equipment, given their income and occupation.
What is classification?
Definition: Classification is the process of sub-dividing a data set with regard to a number of specific outcomes.
Classification Problem: Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the classification problem is to define a mapping f: D → C where each ti is assigned to one class.
The problem is to create classes that classify the data with the help of a given set of labelled data called the training set.
Classification is the operation most commonly supported by commercial data mining tools. It is an operation that enables organisations to discover patterns in large or complex data sets in order to solve specific business problems.
• It predicts categorical class labels.
• It classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data.
• It classifies future or unknown data; a systematic approach based on mathematical techniques.
• For example, we might want to classify our customers into ‘high’ and ‘low’ categories with
regard to credit risk. The category or ‘class’ into which each customer is placed is the
‘outcome’ of our classification.
A crude method would be to classify customers by whether their income is above or below a
certain amount.
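As an illustration, a minimal Python sketch of that crude income-threshold rule; the threshold value and function name are made up for the example, not taken from the slides.

```python
# A minimal sketch of the crude income-threshold classification described above.
# The 30,000 threshold and the function name are illustrative assumptions.
def credit_risk_class(income, threshold=30_000):
    """Map a customer to the 'low' or 'high' credit-risk class by income alone."""
    return "low" if income >= threshold else "high"

print(credit_risk_class(45_000))   # -> 'low'  (income above the threshold)
print(credit_risk_class(18_000))   # -> 'high' (income below the threshold)
```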
More Examples:
• Teachers classify students’ grades as A,B, C, D, or F.
• Identify mushrooms as poisonous or edible.
• Identify individuals with credit risks.
Classification contd….
• It can be defined as the process of finding a model (or
function) that describes and distinguishes data classes
or concepts, for the purpose of being able to use the
model to predict the class of objects whose class label
is unknown.
• The derived model is based on the analysis of a set of
training data (i.e., data objects whose class label is
known).
Why Classification? A motivating application

Credit approval
• A bank wants to classify its customers based on whether they are expected to pay back their approved loans. The history of past customers is used to train the classifier.
• The classifier provides rules, which identify potentially reliable future customers.
Rules
• Credit approval
Classification rule:
If age = “31...40” and income = high then credit_rating = excellent
Future customers:
• Suhas: age = 35, income = high ⇒ excellent credit rating
• Heena: age = 20, income = medium ⇒ fair credit rating
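A small Python sketch of this classification rule applied to the two future customers; the fallback to a “fair” rating when the rule does not fire is an assumption based on the Heena example.

```python
# Sketch of the credit-rating rule above; the default "fair" branch is assumed
# from the Heena example and is not part of the stated rule.
def credit_rating(age, income):
    if 31 <= age <= 40 and income == "high":
        return "excellent"
    return "fair"

print(credit_rating(35, "high"))     # Suhas -> excellent credit rating
print(credit_rating(20, "medium"))   # Heena -> fair credit rating
```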
Classification — A Two-Step Process

Model construction:
• Describing a set of predetermined classes (Excellent and Fair) using the training set.
• The model is represented using classification rules.

Classification Process (1): Model Construction
• Training data are fed to a classification algorithm, which produces the classifier (model), e.g. the rule:
  IF rank = ‘professor’ OR years > 6 THEN teach = ‘yes’

Classification Process (2): Model Construction
Model construction example.
Classification contd… (Example)
• An airport security screening station is used to determine if passengers are potential terrorists or criminals. To do this, the face of each passenger is scanned and its basic pattern (distance between eyes, size and shape of mouth, head, etc.) is identified. This pattern is compared to entries in a database to see if it matches any patterns that are associated with known offenders.
Classification algorithms
• Decision tree
• Rule based induction
• Neural network
• Bayesian network
• Genetic algorithm
Classification process phases
A) Learning phase
B) Classification phase
Learning phase: training data → learning algorithm → classifier (model).
Classification phase: the model is then applied to predict the class of records whose class labels are unknown.
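A minimal two-phase sketch using scikit-learn; the library, the toy encoding of the records, and the specific values are assumptions, with the earlier rank/years rule used only to label the toy data.

```python
# A minimal two-phase sketch: (A) the learning phase fits a classifier on labelled
# training data, (B) the classification phase predicts labels for unseen records.
# scikit-learn and the toy encoding [rank_is_professor, years] are assumptions.
from sklearn.tree import DecisionTreeClassifier

# A) Learning phase: training data with known class labels ("teach" yes/no)
X_train = [[1, 7], [0, 2], [1, 3], [0, 8]]
y_train = ["yes", "no", "yes", "yes"]
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# B) Classification phase: predict the class of a previously unseen record
print(clf.predict([[0, 5]]))   # label assigned by the learned model
```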
Data Mining - Decision Tree Induction
• A decision tree is a structure that includes a root node, branches, and leaf
nodes. Each internal node denotes a test on an attribute, each branch
denotes the outcome of a test, and each leaf node holds a class label. The
topmost node in the tree is the root node.
• The following decision tree is for the concept buys_computer, which indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents a class.
The benefits of having a decision tree are as follows
• It does not require any domain knowledge.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.
• Decision Trees (DT) are supervised classification algorithms.
They are:
• easy to interpret (due to the tree structure): decision trees extract predictive information in the form of human-understandable tree rules, which can help explain the model’s logic with readable “If … Then …” rules (see the sketch below);
• reliable and robust;
• simple to implement.
They can:
• work on categorical attributes,
• handle many attributes (including large-p, smaller-n cases).
• Each decision in the tree can be seen as a test on a feature.
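As a sketch of how tree rules can be read back out, the snippet below fits a small tree and prints its “if/then” structure; scikit-learn and the Iris data are stand-ins, not part of the slides.

```python
# Sketch: fit a shallow tree and print its human-readable if/then structure.
# The Iris dataset is only a stand-in example.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))
```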
Decision tree Examples
• A decision tree is a flow-chart-like tree structure.
• Each node denotes a test on an attribute value.
• Each branch represents an outcome of the test.
• Leaves represent decisions.
[Figure: a root node with branches leading to leaf nodes; each leaf holds one of the set of possible answers.]
A classification model can be represented in various forms, such as rule based:
1) IF-THEN rules, e.g.
student(class, "undergraduate") AND concentration(level, "high") ==> class A
student(class, "undergraduate") AND concentration(level, "low") ==> class B
student(class, "post graduate") ==> class C
Example of a Decision Tree

Training data (Refund: categorical, Marital Status: categorical, Taxable Income: continuous, Cheat: class label):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: decision tree built on the splitting attributes:
Refund?
  Yes → NO
  No  → MarSt?
          Married          → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES


Example of Decision Tree contd..

The same training data also fit a different tree, this one splitting on MarSt first:
MarSt?
  Married          → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Training Set (class labels known):
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: the training set is fed to a tree induction algorithm, which learns the model (a decision tree).

Test Set (class labels unknown):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: the learned model is applied to the test set to assign the missing class labels.
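A compact sketch of this induction/deduction workflow on the toy tables above; pandas, scikit-learn, and the one-hot encoding step are assumptions, not part of the slides.

```python
# Induction = fit a tree on the labelled training set, deduction = apply it to the
# test set. Column names and values are taken from the slide tables; the library
# choice and one-hot encoding are assumptions.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "Attrib1": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Attrib2": ["Large", "Medium", "Small", "Medium", "Large",
                "Medium", "Large", "Small", "Medium", "Small"],
    "Attrib3": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in thousands
    "Class":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
test = pd.DataFrame({
    "Attrib1": ["No", "Yes", "Yes", "No", "No"],
    "Attrib2": ["Small", "Medium", "Large", "Small", "Large"],
    "Attrib3": [55, 80, 110, 95, 67],
})

# One-hot encode the categorical attributes so the tree can split on them.
X_train = pd.get_dummies(train.drop(columns="Class"))
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

model = DecisionTreeClassifier(random_state=0)   # tree induction algorithm
model.fit(X_train, train["Class"])               # learn the model
print(model.predict(X_test))                     # deduction: predicted class labels
```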
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Walking the tree: Refund = No → MarSt = Married → assign Cheat to “No”.
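The same walk can be written as a few nested conditions; this is a hand-coded sketch of the slide’s tree, with an assumed function name and income expressed in thousands.

```python
# Hand-coded sketch of the slide's tree: Refund -> MarSt -> TaxInc.
# Income is given in thousands; the function name is an assumption.
def predict_cheat(refund, marital_status, taxable_income_k):
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income at 80K
    return "Yes" if taxable_income_k > 80 else "No"

print(predict_cheat("No", "Married", 80))   # -> "No", as assigned on the slide
```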
Typical Applications(classification)
• credit approval
• target marketing
• medical diagnosis
• treatment effectiveness analysis
• Image processing
Classification contd….
Classification—A Two-Step Process
1)First Step: (Model Construction)
- Model built describing a pre-determined set of data classes or concepts.
- A class label attribute determines which tuple belongs to which pre-determined class.
- This model is built by analyzing a training data set.
- Each tuple in a training data set is called a training sample.
- This model is represented in the form of:
o Classification Rules
o Decision Trees
o Mathematical Formulae
- E.g., a database holding customer credit info:
o Classification Rule – identifies customers with an ‘Excellent credit rating’ or a ‘Fair credit rating’.
Classification contd….
2) Second Step: (Using the model in prediction)
- The model is used for classification.
o The predictive accuracy of the model is estimated first. There are several methods for this; the holdout method is the simplest (see the sketch below):
• It uses a test set of class-labelled samples.
• These are randomly selected and are independent of the training samples.
• The accuracy of the model on a given test set is the % of test set samples that are correctly classified by the model.
• If the accuracy of the model is acceptable, the model can be used to classify future data tuples for which the class label is not known.
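A minimal holdout sketch; scikit-learn, the Wine dataset, and the 30% split are assumptions used only to illustrate an independent, randomly selected test set.

```python
# Holdout sketch: hold out a random, independent test set and report the
# percentage of its samples that the model classifies correctly.
# scikit-learn, the Wine data, and the 30% split are illustrative assumptions.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```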
Machine learning techniques
Supervised vs. Unsupervised Learning
Supervised learning (classification)
o Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
o New data are classified based on the training set.
Unsupervised learning
o The class labels of the training data are unknown.
o Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
o Ex. Clustering (a small sketch contrasting the two settings follows below).
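A small side-by-side sketch of the two settings; the dataset and the choice of k-means with three clusters are assumptions.

```python
# Supervised vs. unsupervised in one sketch: the classifier uses the labels y,
# the clustering algorithm discovers groups from X alone.
# The Iris data and k = 3 are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)                      # supervised: trained with class labels
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)   # unsupervised: no labels used

print(clf.predict(X[:3]))   # predicted class labels
print(clusters[:3])         # discovered cluster ids (arbitrary numbering)
```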
Use cases for supervised
learning
Supervised Learning
Usecase2
Supervised learning usecase2
cont’d…
Supervised learning use case
questions
Points to remember (supervised learning)
Unsupervised learning
Usecase
Unsupervised learning contd..
Unsupervised learning contd..
Unsupervised learning contd..
Unsupervised learning contd..
Points to remember(unsupervised learning)
Training sets
• Given a collection of records (instances); each record contains a set of attributes, one of which is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Examples of Classification Task
• Predicting tumor cells as benign or malignant.
• Classifying credit card transactions as legitimate or fraudulent.
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil.
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Algorithms
The creation of a tree is a quest for:
• purity (a pure node contains only “yes” or only “no” examples)
• the smallest tree
At each level, choose the attribute that produces the “purest” nodes, i.e. the attribute with the highest information gain (a small entropy/information-gain sketch follows the list below).
Algorithms:
• One Rule (1R)
• ID3
• Bayes
• Decision Trees
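A small sketch of the entropy and information-gain computation behind that attribute choice; the 9-yes/5-no counts and the outlook split come from the classic weather data used in the 1R/ID3 use cases later in these slides.

```python
# Sketch of entropy and information gain: pick the split whose child nodes
# are "purest", i.e. the split with the highest gain.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(parent_labels, child_groups):
    """Gain = entropy(parent) - weighted average entropy of the child groups."""
    n = len(parent_labels)
    remainder = sum(len(g) / n * entropy(g) for g in child_groups)
    return entropy(parent_labels) - remainder

# Weather data: 9 "yes" / 5 "no"; splitting on outlook gives
# sunny = 2 yes / 3 no, overcast = 4 yes, rainy = 3 yes / 2 no.
play = ["yes"] * 9 + ["no"] * 5
outlook_groups = [["yes"] * 2 + ["no"] * 3, ["yes"] * 4, ["yes"] * 3 + ["no"] * 2]
print(round(info_gain(play, outlook_groups), 3))   # ~0.247 bits for outlook
```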
One R Usecase contd..
Consider outlook (weather dataset): Frequency Table
One R Usecase contd..
Usecase cont’d…
Building the Tree
Induction Decision Tree (ID3)
Training Dataset(Use case)
Output: ID3 for “buys_computer”
Another Use case: Marks
Algorithm for ID3
Algorithm for ID3
Advantages of ID3
Information Theory
Information Theory
Entropy
Information Gain (ID3)
Information Gain in Decision Tree
induction
Attribute selection based on
highest gain
Attribute selection
Continue to split tree further
Result: Decision Trees
One R Usecase-Test data
HANDLING Numerical attributes
NUMERICAL WEATHER
DATA SET
ONE R ALGORITHM FOR NUMERICAL ATTRIBUTES
ONE R ALGORITHM FOR NUMERICAL ATTRIBUTES
Frequency Tables
The best predictor is..
PREDICTOR CONTRIBUTION
End of Week-4 Part-1
