02-Data Mining Functionalities-2

This document introduces key concepts in data mining including: - Data mining can be used on various data types including databases, data streams, time series data, graphs, and more. - Data mining consists of descriptive and predictive tasks like characterization, discrimination, classification, prediction, and clustering. - Popular algorithms include those for mining frequent patterns, associations, correlations, and outliers. - Classification and prediction techniques include rules, decision trees, and neural networks. - Interestingness measures the usefulness of patterns found through data mining.

Uploaded by

Lakshmi Priya B

Chapter 1.

Introduction

◼ Data Mining: On what kind of data?

◼ Data mining functionality
◼ Classification of data mining systems
◼ Top-10 most popular data mining algorithms
◼ Major issues in data mining
◼ Overview of the course

July 19, 2021 Data Mining: Concepts and Techniques 1

Data Mining: On What Kinds of Data?
◼ Database-oriented data sets and applications
◼ Relational database, data warehouse, transactional database
◼ Advanced data sets and advanced applications
◼ Data streams and sensor data
◼ Time-series data, temporal data, sequence data (incl. bio-sequences)
◼ Structured data, graphs, social networks and multi-linked data
◼ Object-relational databases
◼ Heterogeneous databases and legacy databases
◼ Spatial data and spatiotemporal data
◼ Multimedia database
◼ Text databases
◼ The World-Wide Web


Data Mining Functionalities
Data mining tasks can be classified into two categories:
◼ Descriptive mining – Characterizes the general properties of the data
in the database.
◼ Predictive mining – Performs inference on the current data in order to
make predictions.

Concept/Class Description: Characterization and Discrimination

◼ Data can be associated with classes or concepts.

◼ Individual classes and concepts can be described in summarized, concise, and
precise terms.

◼ Such descriptions of a class or concept are called class/concept
descriptions.
Data characterization
◼ It is a summarization of the general characteristics or features of a target class
of data.
◼ The data corresponding to the user-specified class are typically collected by a
database query.

◼ There are several methods for effective data summarization and
characterization, such as simple data summaries based on statistical measures and
the data cube-based OLAP roll-up operation.

◼ The output of data characterization can be presented in various formats: pie
charts, bar charts, curves, multidimensional data cubes, and
multidimensional tables, including crosstabs.

◼ Example: the user may like to study the characteristics of software products
whose sales increased by 10% in the last year.
Data Discrimination

◼ It is a comparison of the general features of target class data objects with the
general features of objects from one or a set of contrasting classes.

◼ The target and contrasting classes can be specified by the user, and the
corresponding data objects are retrieved through database queries.

◼ The forms of output presentation are similar to those for characteristic
descriptions.

◼ Discrimination descriptions expressed in rule form are referred to as
discriminant rules.

◼ Example: the user may like to compare the general features of software
products whose sales increased by 10% in the last year with those whose sales
decreased by at least 30% during the same period.
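
The comparison in the example above can be sketched in a few lines of Python. This is only an illustration: the product records, the field layout, and the choice of mean price as the compared feature are assumptions, and the target class is read as "sales increased by at least 10%".

```python
from statistics import mean

# Hypothetical product records: (name, price, sales_change_percent).
products = [
    ("editor", 40.0, 12.0),
    ("antivirus", 55.0, 15.0),
    ("game", 30.0, -35.0),
    ("photo_tool", 25.0, -40.0),
]

def summarize(rows):
    """Return simple summary statistics for one class of products."""
    prices = [price for _, price, _ in rows]
    return {"count": len(rows), "mean_price": mean(prices)}

# Target class (assumed reading): sales increased by at least 10%.
target = [p for p in products if p[2] >= 10.0]
# Contrasting class: sales decreased by at least 30%.
contrast = [p for p in products if p[2] <= -30.0]

target_summary = summarize(target)
contrast_summary = summarize(contrast)
print("target:", target_summary)
print("contrast:", contrast_summary)
```

Summarizing the target class alone corresponds to characterization. Placing the two summaries side by side corresponds to discrimination.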
Mining Frequent Patterns, Associations, and Correlations

◼ Frequent patterns are patterns that occur frequently in data.

◼ There are many kinds of frequent patterns, including itemsets, subsequences, and
substructures.

◼ A frequent itemset refers to a set of items that frequently appear together in a
transactional data set, such as milk and bread.

◼ A frequently occurring subsequence, such as the pattern that customers tend
to purchase first a PC, followed by a digital camera, and then a memory card,
is a (frequent) sequential pattern.

◼ A substructure can refer to different structural forms, such as graphs, trees, or
lattices, which may be combined with itemsets or subsequences. If a
substructure occurs frequently, it is called a (frequent) structured pattern.

◼ Mining frequent patterns leads to the discovery of interesting associations and
correlations within data.
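
As a rough illustration of frequent itemsets, the sketch below counts every itemset of size one or two in a handful of transactions and keeps those that reach a minimum count. The transactions and the `min_count` value are made up, and the brute-force counting is a simplification chosen for readability, not an efficient mining algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "jam"},
]

def frequent_itemsets(transactions, min_count, max_size=2):
    """Count every itemset up to max_size; keep those seen at least min_count times."""
    counts = Counter()
    for basket in transactions:
        for size in range(1, max_size + 1):
            # sorted() gives each itemset one canonical tuple form.
            for itemset in combinations(sorted(basket), size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_count}

frequent = frequent_itemsets(transactions, min_count=3)
print(frequent)
```

With these toy transactions, milk and bread appear together in three of the four baskets, so the pair qualifies as a frequent itemset.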
Associations Analysis
◼ Goal: find the items that are frequently purchased together.
◼ Example: a rule mined from a transactional database
◼ Single-dimensional association:
◼ buys(T, “computer”) ==> buys(T, “software”)
[support = 1%, confidence = 75%]
“T” is a variable representing a customer, confidence is the certainty of the rule, and buys is the predicate.
◼ Multi-dimensional association:
◼ age(X, “20..29”) AND income(X, “20..29K”) ==> buys(X, “PC”)
[support = 2%, confidence = 60%]
The association involves more than one predicate.
◼ Association rules are discarded as uninteresting if they do not satisfy both a
minimum support threshold and a minimum confidence threshold.
◼ Additional analysis can be performed to uncover statistical correlations
between associated attribute-value pairs.
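
The threshold test described above can be sketched as follows. The transactions, the function names, and the threshold values are illustrative assumptions. Support is computed as the fraction of all transactions containing both sides of the rule, and confidence as the fraction of antecedent transactions that also contain the consequent.

```python
# Hypothetical transactions, one set of purchased items per customer transaction.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "software"},
]

def support(antecedent, consequent, transactions):
    """Fraction of all transactions that contain both sides of the rule."""
    both = antecedent | consequent
    return sum(both <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also contain the consequent."""
    both = antecedent | consequent
    with_antecedent = sum(antecedent <= t for t in transactions)
    with_both = sum(both <= t for t in transactions)
    return with_both / with_antecedent if with_antecedent else 0.0

def is_interesting(antecedent, consequent, transactions, min_support, min_confidence):
    """Keep a rule only if it meets both thresholds."""
    return (support(antecedent, consequent, transactions) >= min_support
            and confidence(antecedent, consequent, transactions) >= min_confidence)

rule_support = support({"computer"}, {"software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
kept = is_interesting({"computer"}, {"software"}, transactions,
                      min_support=0.5, min_confidence=0.7)
print(rule_support, rule_confidence, kept)
```

Raising `min_confidence` above the rule's confidence would cause the same rule to be discarded as uninteresting.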
Classification and Prediction

◼ Classification is the process of finding a model (or function) that

describes and distinguishes data classes or concepts.
◼ Use the model to predict the class of objects whose class label is
unknown.
◼ The derived model is based on the analysis of a set of training data
(i.e., data objects whose class label is known).
◼ A decision tree is a flow-chart-like tree structure, where each
node denotes a test on an attribute value, each branch represents an
outcome of the test, and tree leaves represent classes or class
distributions.
◼ Decision trees can easily be converted to classification rules.
◼ Neural Network : Collection of neuron-like processing units
with weighted connections between the units.
Classification and Prediction
A classification model can be represented in different forms: a) if-then rules, b) a
decision tree, c) a neural network.
Example: classifying customers based on age and income

If-then rules:
age(X, “youth”) AND income(X, “high”) ==> class(X, “A”)
age(X, “youth”) AND income(X, “low”) ==> class(X, “B”)
age(X, “middle_aged”) ==> class(X, “C”)
age(X, “senior”) ==> class(X, “C”)

Decision tree:
age?
    youth: income?
        high: class A
        low: class B
    middle_aged, senior: class C
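
The same model can be written as a small function, which makes the equivalence between the rules and the tree easy to see: each nested test is a tree node and each return value is a leaf. The function name and the error handling for uncovered values are assumptions added for the sketch.

```python
def classify(age, income):
    """Apply the if-then rules for the age/income example; returns "A", "B", or "C"."""
    if age == "youth":
        # For youth, the income test decides between class A and class B.
        if income == "high":
            return "A"
        if income == "low":
            return "B"
        raise ValueError(f"no rule covers income={income!r} for youth")
    if age in ("middle_aged", "senior"):
        # Income is not tested on this branch.
        return "C"
    raise ValueError(f"no rule covers age={age!r}")

print(classify("youth", "high"))
print(classify("senior", "low"))
```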
Classification and Prediction

◼ Classification predicts categorical labels, whereas regression models continuous-valued
functions.

◼ Prediction is used to predict missing or unavailable numerical data values
rather than class labels.

◼ Regression analysis is a statistical methodology that is most often used for
numeric prediction.

◼ Regression also encompasses the identification of distribution trends based on
the available data.
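
A minimal sketch of numeric prediction with simple linear regression follows. The slides do not name a specific regression method, so ordinary least squares on one input variable is an assumption here, and the data points are invented so that they lie exactly on a line.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)

# Numeric prediction for an unseen x value.
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```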
Clustering

◼ Clustering analyzes data objects without consulting a known class label.

◼ The data labels are not present in the training data because they are not known
to begin with. (Unsupervised learning)

◼ Clustering can be used to generate such labels.

◼ The objects are clustered or grouped based on the principle of maximizing the
intraclass similarity and minimizing the interclass similarity.

◼ Figure: a 2-D plot of customer data with respect to customer locations in a city.
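
To make the grouping idea concrete, here is a small k-means sketch over 2-D customer locations. The slides do not prescribe a clustering algorithm, so k-means is a swapped-in choice, and the points, the starting centroids, and the iteration count are all hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical customer locations (x, y) in a city.
locations = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = kmeans(locations, centroids=[locations[0], locations[3]])
print(labels)
print(centroids)
```

No class labels are supplied. The labels returned by the function are generated by the grouping itself, which is the sense in which clustering is unsupervised.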

Cluster Analysis

◼ Outlier Analysis : A database may contain data objects that do not comply
with the general behavior or model of the data. These data objects are outliers.

◼ Most data mining methods discard outliers as noise or exceptions.

◼ However, in some applications such as fraud detection, the rare events can be
more interesting than the more regularly occurring ones.

◼ The analysis of outlier data is referred to as outlier mining.

◼ Example: Outlier analysis may uncover fraudulent usage of credit cards by
detecting purchases of extremely large amounts for a given account number in
comparison to regular charges incurred by the same account.

◼ Outlier values may also be detected with respect to the location and type of
purchase, or the purchase frequency.
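
The credit card example can be sketched with a simple robust rule: flag any charge that lies far from the account's median, measured in units of the median absolute deviation (MAD). The charges, the function name, and the threshold of 5 are assumptions. The slides do not specify a detection method.

```python
from statistics import median

def find_outliers(amounts, threshold=5.0):
    """Flag amounts whose distance from the median exceeds threshold * MAD."""
    center = median(amounts)
    mad = median([abs(a - center) for a in amounts])
    if mad == 0:
        # No spread to compare against; this simple rule flags nothing.
        return []
    return [a for a in amounts if abs(a - center) / mad > threshold]

# Hypothetical charges on one account, with one extremely large purchase.
charges = [42.0, 55.0, 38.0, 61.0, 47.0, 50.0, 2500.0]
outliers = find_outliers(charges)
print(outliers)
```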
Evolution Analysis

Data evolution analysis describes and models regularities or trends for objects
whose behavior changes over time.

Example: A data mining study of stock exchange data may identify stock
evolution regularities for overall stocks and for the stocks of particular
companies.
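
One simple way to expose a trend in time-ordered data is a moving average, sketched below. The price series and window size are hypothetical, and a moving average is only one of many possible techniques for evolution analysis.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical daily closing prices for one stock.
prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0]
trend = moving_average(prices, window=3)
print(trend)
```

The smoothed series rises from start to end even though the raw prices dip once, which is the kind of regularity evolution analysis looks for.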
Interestingness of Patterns

◼ A data mining system has the potential to generate thousands of patterns or
rules, but only a small fraction of the patterns potentially generated would
actually be of interest to any given user.
◼ An interesting pattern represents knowledge.
◼ Interestingness measures
◼ A pattern is interesting if it is easily understood by humans, valid on
new or test data with some degree of certainty, potentially useful,
novel, or validates some hypothesis that a user seeks to confirm
◼ Objective versus subjective interestingness measures
◼ Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
◼ Subjective: based on user’s belief in the data, e.g., unexpectedness,
novelty, actionability, etc.
Interestingness of Patterns – Objective Measures
◼ An objective measure for association rules of the form X ==> Y is rule
support.

◼ Support: the percentage of transactions from a transaction
database that the given rule satisfies.

◼ Another objective measure for association rules is confidence: the degree
of certainty of the detected association.

support(X ==> Y) = P(X ∪ Y)
confidence(X ==> Y) = P(Y | X)

Here P(X ∪ Y) is the probability that a transaction contains both X and Y.

support(X ==> Y) = (number of tuples containing both X and Y) / (total number of tuples)

confidence(X ==> Y) = (number of tuples containing both X and Y) / (number of tuples containing X)
Interestingness of Patterns

◼ Objective measures
◼ Accuracy and coverage for if-then-rules
◼ Accuracy: percentage of data correctly classified by a rule.
◼ Coverage: similar to support; the percentage of data to which a rule
applies.

◼ Subjective measures:
◼ Based on user beliefs in the data: these measures find patterns
interesting if the patterns are unexpected or provide strategic
information on which the user can act. The latter are referred to as actionable.
◼ Can a data mining system generate all of the interesting patterns? (Completeness)
◼ Can a data mining system generate only interesting patterns? (An optimization problem)
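
The accuracy and coverage measures for if-then rules listed above can be sketched as follows. The records are invented, and the sketch assumes accuracy is measured over the records the rule covers, which is the usual convention for rule-based classifiers but is not spelled out on the slide.

```python
# Hypothetical labeled records: (age, income, actual_class).
records = [
    ("youth", "high", "A"),
    ("youth", "high", "A"),
    ("youth", "high", "B"),
    ("youth", "low", "B"),
    ("senior", "low", "C"),
]

def rule_applies(record):
    """Antecedent of the rule: age is youth AND income is high."""
    age, income, _ = record
    return age == "youth" and income == "high"

PREDICTED_CLASS = "A"  # consequent of the rule

covered = [r for r in records if rule_applies(r)]
correct = [r for r in covered if r[2] == PREDICTED_CLASS]

coverage = len(covered) / len(records)  # share of records the rule applies to
accuracy = len(correct) / len(covered)  # share of covered records classified correctly
print(coverage, accuracy)
```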
Classification of Data Mining Systems
Data mining is an interdisciplinary field, drawing on database systems, statistics, machine
learning, visualization, and information science.
Classification according to the kinds of databases mined:
◼ If classifying according to the special types of data handled, we may have time-series,
text, stream data, multimedia, or World Wide Web data mining systems.
Classification according to the kinds of knowledge mined:
◼ Based on data mining functionalities such as characterization, discrimination,
association and correlation analysis, classification, clustering, prediction, outlier and
evolution analysis.
◼ Based on levels of abstraction including generalized knowledge (high level of
abstraction), primitive-level knowledge (raw data level), knowledge at multiple levels
(several levels of abstraction)
Classification according to the kinds of techniques utilized:
◼ Data mining systems can be categorized according to the underlying data mining
techniques employed or degree of user interaction.
Classification according to the applications adapted:
◼ For example, data mining systems may be tailored specifically for finance,
telecommunications, DNA, stock markets, e-mail, and so on.
Data Mining Task Primitives

◼ A data mining query is defined in terms of data mining task primitives.

◼ These primitives allow the user to communicate interactively with the data
mining system during discovery in order to direct the mining process or
examine the findings from different angles or depths.

◼ The data mining primitives specify the following:

◼ The set of task-relevant data to be mined: This specifies the portions of the
database or the set of data in which the user is interested. This includes the
database attributes or data warehouse dimensions of interest.

◼ The kind of knowledge to be mined: This specifies the data mining functions
to be performed, such as characterization, discrimination, association or
correlation analysis, classification, prediction, clustering, outlier analysis, or
evolution analysis.
Data Mining Task Primitives

◼ The background knowledge to be used in the discovery process: This
knowledge about the domain to be mined is useful for guiding the
knowledge discovery process and for evaluating the patterns found.
◼ The interestingness measures and thresholds for pattern evaluation: They
may be used to guide the mining process or, after discovery, to evaluate
the discovered patterns.
◼ The expected representation for visualizing the discovered patterns: This
refers to the form in which discovered patterns are to be displayed,
which may include rules, tables, charts, graphs, decision trees, and
cubes.
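
One way to picture the five task primitives is as the fields of a single query object. The class below is purely hypothetical: the class name, field names, and example values are assumptions for illustration and do not correspond to any actual data mining query language.

```python
from dataclasses import dataclass, field

@dataclass
class MiningTask:
    """Hypothetical container for the five data mining task primitives."""
    task_relevant_data: dict   # e.g. table and attributes of interest
    knowledge_kind: str        # e.g. "association", "classification"
    background_knowledge: dict = field(default_factory=dict)
    interestingness: dict = field(default_factory=dict)  # measure name -> threshold
    presentation: list = field(default_factory=lambda: ["rules"])

task = MiningTask(
    task_relevant_data={"table": "sales", "attributes": ["age", "income", "item"]},
    knowledge_kind="association",
    interestingness={"min_support": 0.01, "min_confidence": 0.75},
    presentation=["rules", "tables"],
)
print(task)
```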
Major Issues in Data Mining

The issues in data mining regarding mining methodology are given below.

◼ Mining different kinds of knowledge in databases:

◼ Because different users can be interested in different kinds of knowledge, data
mining should cover a wide spectrum of data analysis and knowledge discovery
tasks, including data characterization, discrimination, association and correlation
analysis, classification, prediction, clustering, outlier analysis, and evolution
analysis.
◼ These tasks may use the same database in different ways and require the development of
numerous data mining techniques.
◼ Interactive mining of knowledge at multiple levels of abstraction:
◼ The data mining process should be interactive.
◼ Interactive mining allows users to focus the search for patterns, providing and
refining data mining requests based on returned results. Specifically, knowledge
should be mined by drilling down, rolling up, and pivoting through the data space and
knowledge space interactively.
◼ Mining knowledge in multidimensional space:
◼ We should explore data in multidimensional space
◼ Search for the knowledge among combinations of dimensions at varying
levels of abstraction
Major Issues in Data Mining

Data mining as an interdisciplinary effort:

The power of data mining can be enhanced by integrating new methods from multiple
disciplines. For example, mining of software bugs benefits from the incorporation of software
engineering knowledge into the data mining process.

Handling noisy or incomplete data:

The data stored in a database may reflect noise, exceptional cases, or incomplete data
objects. When mining data regularities, these objects may confuse the process, causing
the knowledge model constructed to overfit the data.

Pattern evaluation-the interestingness problem:

A data mining system can uncover thousands of patterns. Techniques are needed to
assess the interestingness of discovered patterns based on interestingness measures.

Incorporation of background knowledge:

Domain knowledge related to databases, such as integrity constraints and deduction
rules, can help focus and speed up a data mining process, or judge the interestingness
of discovered patterns.
Major Issues in Data Mining

Data mining query languages and ad hoc data mining:

Data mining query languages need to be developed to allow users to describe ad hoc
data mining tasks by facilitating the specification of the relevant sets of data for
analysis, the domain knowledge, the kinds of knowledge to be mined, and the
conditions and constraints to be enforced on the discovered patterns.

Presentation and visualization of data mining results:

Discovered knowledge should be expressed in high-level languages, visual representations,
or other expressive forms so that the knowledge can be easily understood and directly
usable by humans

◼ Applications and social impacts:

◼ Domain-specific data mining & invisible data mining
◼ Protection of data security, integrity, and privacy
SUMMARY
◼ Data Source
◼ Data Mining Functionalities
◼ Data Mining Task Primitives
◼ Classification of Data Mining Systems
◼ Data Mining – Major Issues

