
CSEg4301: Introduction to Data Mining

Week-4 Part-1
Classification & Prediction
4th Year, Semester-1 (Elective)
Data mining algorithms: Classification and Prediction
Used to extract models describing important data classes or to predict future data trends.
- E.g., Classification model: built to categorize bank loan applications as either safe or risky.
- E.g., Prediction model: built to predict the expenditures of potential customers on computer equipment, given their income and occupation.
What is classification?
Definition: Classification is the process of sub-dividing a data set with regard to a number of specific outcomes.
Classification Problem: Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the classification problem is to define a mapping f: D → C where each ti is assigned to one class.
The problem is to create classes that classify the data with the help of a given set of labelled data called the training set.
Classification is the operation most commonly supported by commercial data mining tools. It is an operation that enables organisations to discover patterns in large or complex data sets in order to solve specific business problems.
• It predicts categorical class labels.
• It classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data.
• It classifies future or unknown data; a systematic approach based on mathematical techniques.
• For example, we might want to classify our customers into ‘high’ and ‘low’ categories with
regard to credit risk. The category or ‘class’ into which each customer is placed is the
‘outcome’ of our classification.
A crude method would be to classify customers by whether their income is above or below a
certain amount.
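As an illustration, a minimal Python sketch of that crude income-threshold rule; the threshold value and function name are made up for the example, not taken from the slides.

```python
# A minimal sketch of the crude income-threshold classification described above.
# The 30,000 threshold and the function name are illustrative assumptions.
def credit_risk_class(income, threshold=30_000):
    """Map a customer to the 'low' or 'high' credit-risk class by income alone."""
    return "low" if income >= threshold else "high"

print(credit_risk_class(45_000))   # -> 'low'  (income above the threshold)
print(credit_risk_class(18_000))   # -> 'high' (income below the threshold)
```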
More Examples:
• Teachers classify students’ grades as A,B, C, D, or F.
• Identify mushrooms as poisonous or edible.
• Identify individuals with credit risks.
Classification contd….
• It can be defined as the process of finding a model (or
function) that describes and distinguishes data classes
or concepts, for the purpose of being able to use the
model to predict the class of objects whose class label
is unknown.
• The derived model is based on the analysis of a set of
training data (i.e., data objects whose class label is
known).
Why Classification? A motivating application

Credit approval
• A bank wants to classify its customers based on whether they are expected to pay back their approved loans. The history of past customers is used to train the classifier.
• The classifier provides rules, which identify potentially reliable future customers.
Rules
• Credit approval
Classification rule:
If age = “31...40” and income = high then credit_rating = excellent
Future customers:
• Suhas: age = 35, income = high ⇒ excellent credit rating
• Heena: age = 20, income = medium ⇒ fair credit rating
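A small Python sketch of this classification rule applied to the two future customers; the fallback to a “fair” rating when the rule does not fire is an assumption based on the Heena example.

```python
# Sketch of the credit-rating rule above; the default "fair" branch is assumed
# from the Heena example and is not part of the stated rule.
def credit_rating(age, income):
    if 31 <= age <= 40 and income == "high":
        return "excellent"
    return "fair"

print(credit_rating(35, "high"))     # Suhas -> excellent credit rating
print(credit_rating(20, "medium"))   # Heena -> fair credit rating
```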
Classification — A Two-Step Process

Model construction:
• Describing a set of predetermined classes (Excellent and Fair) using the training set.
• The model is represented using classification rules.

Classification Process (1): Model Construction
• Training data are fed to a classification algorithm, which produces the classifier (model), e.g. the rule:
  IF rank = ‘professor’ OR years > 6 THEN teach = ‘yes’

Classification Process (2): Model Construction
Model construction example.
Classification contd… (Example)
• An airport security screening station is used to determine if passengers are potential terrorists or criminals. To do this, the face of each passenger is scanned and its basic pattern (distance between eyes, size and shape of mouth, head, etc.) is identified. This pattern is compared to entries in a database to see if it matches any patterns that are associated with known offenders.
Classification algorithms
• Decision tree
• Rule based induction
• Neural network
• Bayesian network
• Genetic algorithm
Classification process phases
A) Learning phase
B) Classification phase
Learning phase: training data → learning algorithm → classifier (model).
Classification phase: the model is then applied to predict the class of records whose class labels are unknown.
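A minimal two-phase sketch using scikit-learn; the library, the toy encoding of the records, and the specific values are assumptions, with the earlier rank/years rule used only to label the toy data.

```python
# A minimal two-phase sketch: (A) the learning phase fits a classifier on labelled
# training data, (B) the classification phase predicts labels for unseen records.
# scikit-learn and the toy encoding [rank_is_professor, years] are assumptions.
from sklearn.tree import DecisionTreeClassifier

# A) Learning phase: training data with known class labels ("teach" yes/no)
X_train = [[1, 7], [0, 2], [1, 3], [0, 8]]
y_train = ["yes", "no", "yes", "yes"]
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# B) Classification phase: predict the class of a previously unseen record
print(clf.predict([[0, 5]]))   # label assigned by the learned model
```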
Data Mining - Decision Tree Induction
• A decision tree is a structure that includes a root node, branches, and leaf
nodes. Each internal node denotes a test on an attribute, each branch
denotes the outcome of a test, and each leaf node holds a class label. The
topmost node in the tree is the root node.
• The following decision tree is for the concept buys_computer, which indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents a class.
The benefits of having a decision tree are as follows
• It does not require any domain knowledge.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and fast.
• Decision Trees (DT) are supervised classification algorithms.
They are:
• easy to interpret (due to the tree structure): decision trees extract predictive information in the form of human-understandable tree rules, which can help explain the model’s logic with readable “If … Then …” rules (see the sketch below);
• reliable and robust;
• simple to implement.
They can:
• work on categorical attributes,
• handle many attributes (including large-p, smaller-n cases).
• Each decision in the tree can be seen as a test on a feature.
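As a sketch of how tree rules can be read back out, the snippet below fits a small tree and prints its “if/then” structure; scikit-learn and the Iris data are stand-ins, not part of the slides.

```python
# Sketch: fit a shallow tree and print its human-readable if/then structure.
# The Iris dataset is only a stand-in example.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))
```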
Decision tree Examples
• A decision tree is a flow-chart-like tree structure.
• Each node denotes a test on an attribute value.
• Each branch represents an outcome of the test.
• Leaves represent decisions.
[Figure: a root node with branches leading to leaf nodes; each leaf holds one of the set of possible answers.]
A classification model can be represented in various forms, such as rule based:
1) IF-THEN rules, e.g.
student(class, "undergraduate") AND concentration(level, "high") ==> class A
student(class, "undergraduate") AND concentration(level, "low") ==> class B
student(class, "post graduate") ==> class C
Example of a Decision Tree

Training data (Refund: categorical, Marital Status: categorical, Taxable Income: continuous, Cheat: class label):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: decision tree built on the splitting attributes:
Refund?
  Yes → NO
  No  → MarSt?
          Married          → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES


Example of Decision Tree contd..

The same training data also fit a different tree, this one splitting on MarSt first:
MarSt?
  Married          → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Training Set (class labels known):
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: the training set is fed to a tree induction algorithm, which learns the model (a decision tree).

Test Set (class labels unknown):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Deduction: the learned model is applied to the test set to assign the missing class labels.
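A compact sketch of this induction/deduction workflow on the toy tables above; pandas, scikit-learn, and the one-hot encoding step are assumptions, not part of the slides.

```python
# Induction = fit a tree on the labelled training set, deduction = apply it to the
# test set. Column names and values are taken from the slide tables; the library
# choice and one-hot encoding are assumptions.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

train = pd.DataFrame({
    "Attrib1": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "Attrib2": ["Large", "Medium", "Small", "Medium", "Large",
                "Medium", "Large", "Small", "Medium", "Small"],
    "Attrib3": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],   # in thousands
    "Class":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})
test = pd.DataFrame({
    "Attrib1": ["No", "Yes", "Yes", "No", "No"],
    "Attrib2": ["Small", "Medium", "Large", "Small", "Large"],
    "Attrib3": [55, 80, 110, 95, 67],
})

# One-hot encode the categorical attributes so the tree can split on them.
X_train = pd.get_dummies(train.drop(columns="Class"))
X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

model = DecisionTreeClassifier(random_state=0)   # tree induction algorithm
model.fit(X_train, train["Class"])               # learn the model
print(model.predict(X_test))                     # deduction: predicted class labels
```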
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Walking the tree: Refund = No → MarSt = Married → assign Cheat to “No”.
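The same walk can be written as a few nested conditions; this is a hand-coded sketch of the slide’s tree, with an assumed function name and income expressed in thousands.

```python
# Hand-coded sketch of the slide's tree: Refund -> MarSt -> TaxInc.
# Income is given in thousands; the function name is an assumption.
def predict_cheat(refund, marital_status, taxable_income_k):
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income at 80K
    return "Yes" if taxable_income_k > 80 else "No"

print(predict_cheat("No", "Married", 80))   # -> "No", as assigned on the slide
```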
Typical Applications(classification)
• credit approval
• target marketing
• medical diagnosis
• treatment effectiveness analysis
• Image processing
Classification contd….
Classification—A Two-Step Process
1)First Step: (Model Construction)
- Model built describing a pre-determined set of data classes or concepts.
- A class label attribute determines which tuple belongs to which pre-determined class.
- This model is built by analyzing a training data set.
- Each tuple in a training data set is called a training sample.
- This model is represented in the form of:
o Classification Rules
o Decision Trees
o Mathematical Formulae
- E.g., a database holding customer credit info:
o Classification Rule – identifies customers with an ‘Excellent credit rating’ or a ‘Fair credit rating’.
Classification contd….
2) Second Step: (Using the model in prediction)
- The model is used for classification.
o The predictive accuracy of the model is estimated first. There are several methods for this; the holdout method is the simplest (see the sketch below):
• It uses a test set of class-labelled samples.
• These are randomly selected and are independent of the training samples.
• The accuracy of the model on a given test set is the % of test set samples that are correctly classified by the model.
• If the accuracy of the model is acceptable, the model can be used to classify future data tuples for which the class label is not known.
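A minimal holdout sketch; scikit-learn, the Wine dataset, and the 30% split are assumptions used only to illustrate an independent, randomly selected test set.

```python
# Holdout sketch: hold out a random, independent test set and report the
# percentage of its samples that the model classifies correctly.
# scikit-learn, the Wine data, and the 30% split are illustrative assumptions.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```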
Machine learning techniques
Supervised vs. Unsupervised Learning
Supervised learning (classification)
o Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
o New data are classified based on the training set.
Unsupervised learning
o The class labels of the training data are unknown.
o Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
o Ex. Clustering (a small sketch contrasting the two settings follows below).
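A small side-by-side sketch of the two settings; the dataset and the choice of k-means with three clusters are assumptions.

```python
# Supervised vs. unsupervised in one sketch: the classifier uses the labels y,
# the clustering algorithm discovers groups from X alone.
# The Iris data and k = 3 are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)                      # supervised: trained with class labels
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)   # unsupervised: no labels used

print(clf.predict(X[:3]))   # predicted class labels
print(clusters[:3])         # discovered cluster ids (arbitrary numbering)
```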
Use cases for supervised
learning
Supervised Learning
Usecase2
Supervised learning usecase2
cont’d…
Supervised learning use case
questions
Points to remember (supervised learning)
Unsupervised learning
Usecase
Unsupervised learning contd..
Unsupervised learning contd..
Unsupervised learning contd..
Unsupervised learning contd..
Points to remember(unsupervised learning)
Training sets
• Given a collection of records (instances); each record contains a set of attributes, one of which is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Examples of Classification Task
• Predicting tumor cells as benign or malignant.
• Classifying credit card transactions as legitimate or fraudulent.
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil.
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Algorithms
The creation of a tree is a quest for:
• purity (a pure node contains only “yes” or only “no” examples)
• the smallest tree
At each level, choose the attribute that produces the “purest” nodes, i.e. the attribute with the highest information gain (a small entropy/information-gain sketch follows the list below).
Algorithms:
• One Rule (1R)
• ID3
• Bayes
• Decision Trees
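A small sketch of the entropy and information-gain computation behind that attribute choice; the 9-yes/5-no counts and the outlook split come from the classic weather data used in the 1R/ID3 use cases later in these slides.

```python
# Sketch of entropy and information gain: pick the split whose child nodes
# are "purest", i.e. the split with the highest gain.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(parent_labels, child_groups):
    """Gain = entropy(parent) - weighted average entropy of the child groups."""
    n = len(parent_labels)
    remainder = sum(len(g) / n * entropy(g) for g in child_groups)
    return entropy(parent_labels) - remainder

# Weather data: 9 "yes" / 5 "no"; splitting on outlook gives
# sunny = 2 yes / 3 no, overcast = 4 yes, rainy = 3 yes / 2 no.
play = ["yes"] * 9 + ["no"] * 5
outlook_groups = [["yes"] * 2 + ["no"] * 3, ["yes"] * 4, ["yes"] * 3 + ["no"] * 2]
print(round(info_gain(play, outlook_groups), 3))   # ~0.247 bits for outlook
```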
One R Usecase contd..
Consider outlook (weather dataset): Frequency Table
One R Usecase contd..
Usecase cont’d…
Building the Tree
Induction Decision Tree (ID3)
Training Dataset(Use case)
Output: ID3 for “buys_computer”
Another Use case: Marks
Algorithm for ID3
Algorithm for ID3
Advantages of ID3
Information Theory
Information Theory
Entropy
Information Gain (ID3)
Information Gain in Decision Tree
induction
Attribute selection based on
highest gain
Attribute selection
Continue to split tree further
Result: Decision Trees
One R Usecase-Test data
HANDLING Numerical attributes
NUMERICAL WEATHER
DATA SET
ONE R ALGORITHM FOR NUMERICAL ATTRIBUTES
ONE R ALGORITHM FOR NUMERICAL ATTRIBUTES
Frequency Tables
The best predictor is..
PREDICTOR CONTRIBUTION
End of Week-4 Part-1
