Chapter 1. Introduction: December 8, 2021 Data Mining: Concepts and Techniques
Introduction
[Figure: knowledge discovery process. Recoverable labels: Databases; Data Cleaning; Data Integration; Warehouse; Data Selection; Task-relevant Data]
CRISP-DM
Six Sigma - DMAIC
Define. Concerned with the definition of project goals and
boundaries, and the identification of issues that need to be
addressed to achieve the higher sigma level.
“Then, it doesn’t matter which way you go,” said the Cat.
Business Intelligence
[Figure: layers with increasing potential to support business decisions. Recoverable labels: Decision Making (End User); Data Exploration (Statistical Summary, Querying, and Reporting)]
[Figure: disciplines surrounding Data Mining. Recoverable labels: Database Technology; Statistics; Machine Learning; Visualization; Pattern Recognition; Algorithms; Other Disciplines]
General functionality
Descriptive data mining
Predictive data mining
Different views lead to different classifications
Data view: Kinds of data to be mined
Knowledge view: Kinds of knowledge to be discovered
Method view: Kinds of techniques utilized
Application view: Kinds of applications adapted
Outlier analysis
Outlier: Data object that does not comply with the general behavior
of the data
Noise or exception? Useful in fraud detection, rare events analysis
Periodicity analysis
Similarity-based analysis
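The outlier definition above (a data object that does not comply with the general behavior of the data) can be illustrated with a minimal sketch. It assumes a simple z-score rule on one numeric attribute, which is only one of many possible techniques and is not prescribed by these slides; the function name `find_outliers`, the threshold of 2.0, and the sample amounts are all hypothetical.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds `threshold`.

    Assumption: "general behavior" is approximated by the mean and the
    population standard deviation of a single numeric attribute.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing deviates from the rest.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts; one of them is far from the others.
amounts = [10, 12, 11, 13, 12, 11, 95]
print(find_outliers(amounts))  # prints [95]
```

Whether a flagged value is noise or a meaningful exception (for example a fraudulent transaction) still has to be judged by an analyst; the rule only points at candidates.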
Fallacy 1. There are data mining tools that we can turn loose on
our data repositories and use to find answers to our problems.
◦ Reality. There are no automatic data mining tools that will solve problems
mechanically “while you wait.” Rather, data mining is a process.
◦ Reality. The return rates vary, depending on the startup costs, analysis
personnel costs, data warehousing preparation costs, and so on.
Fallacies of Data Mining (2)
Fallacy 4. Data mining software packages are intuitive and easy to
use.
◦ Reality. Ease of use varies, and data analysts must combine subject
matter knowledge with an analytical mind and a familiarity with the overall
business or research model.
Conferences
practices of Knowledge Discovery and Data Mining (PKDD)
Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
Journals
Data Mining and Knowledge Discovery (DAMI or DMKD)
IEEE Trans. on Knowledge and Data Eng. (TKDE)
KDD Explorations
ACM Trans. on KDD
Where to Find References? DBLP, CiteSeer, Google
Task-relevant data
Database or data warehouse name
Database tables or data warehouse cubes
Condition for data selection
Relevant attributes or dimensions
Data grouping criteria
Type of knowledge to be mined
Characterization, discrimination, association, classification,
prediction, clustering, outlier analysis, other data mining tasks
Background knowledge
Pattern interestingness measurements
Visualization/presentation of discovered patterns
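One way to picture the primitives listed above is as fields of a single task specification. The sketch below is an assumption made for illustration, not a format defined by these slides: the class name `MiningTask`, its field names, and the example values (`sales_db`, `customer`, and so on) are all hypothetical.

```python
from dataclasses import dataclass, field

# Kinds of knowledge named in the list above. The slides also allow
# "other data mining tasks", so this set is illustrative, not exhaustive.
KNOWLEDGE_KINDS = {
    "characterization", "discrimination", "association", "classification",
    "prediction", "clustering", "outlier analysis",
}


@dataclass
class MiningTask:
    # Task-relevant data
    source: str                      # database or data warehouse name
    tables: list                     # database tables or warehouse cubes
    condition: str                   # condition for data selection
    attributes: list                 # relevant attributes or dimensions
    group_by: list = field(default_factory=list)   # data grouping criteria
    # Type of knowledge to be mined
    knowledge_kind: str = "characterization"
    # Background knowledge, e.g. concept hierarchies
    background: dict = field(default_factory=dict)
    # Pattern interestingness measurements, e.g. thresholds
    interestingness: dict = field(default_factory=dict)
    # Visualization/presentation of discovered patterns
    presentation: str = "table"

    def __post_init__(self):
        if self.knowledge_kind not in KNOWLEDGE_KINDS:
            raise ValueError(f"unknown knowledge kind: {self.knowledge_kind}")


# Hypothetical example task.
task = MiningTask(
    source="sales_db",
    tables=["customer", "purchase"],
    condition="purchase.amount > 100",
    attributes=["age", "income", "item_category"],
    knowledge_kind="association",
    interestingness={"min_support": 0.05, "min_confidence": 0.7},
)
print(task.knowledge_kind, task.interestingness)
```

Grouping the primitives in one object makes it easy to see which parts of a mining request have been specified and which are still left at their defaults.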
Primitive 3: Background Knowledge
Simplicity
e.g., (association) rule length, (decision) tree size
Certainty
e.g., confidence, P(A|B) = #(A and B)/ #(B), classification
reliability or accuracy, certainty factor, rule strength, rule quality,
discriminating weight, etc.
Utility
potential usefulness, e.g., support (association), noise threshold
(description)
Novelty
not previously known, surprising (used to remove redundant
rules, e.g., Illinois vs. Champaign rule implication support ratio)
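The certainty and utility measures above can be made concrete with a small sketch that counts transactions. It computes confidence as P(A|B) = #(A and B) / #(B), following the formula above, and support as the fraction of all transactions that contain both A and B. The function names and the toy market-basket data are hypothetical.

```python
def count_containing(transactions, items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)


def confidence(transactions, a, b):
    """P(A|B) = #(A and B) / #(B); returns 0.0 when B never occurs."""
    n_b = count_containing(transactions, b)
    if n_b == 0:
        return 0.0
    return count_containing(transactions, a | b) / n_b


def support(transactions, a, b):
    """Fraction of all transactions that contain both A and B."""
    if not transactions:
        return 0.0
    return count_containing(transactions, a | b) / len(transactions)


# Hypothetical toy data: each transaction is a set of purchased items.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
A, B = {"milk"}, {"bread"}
print(confidence(baskets, A, B))  # 2 of the 3 bread baskets contain milk
print(support(baskets, A, B))     # 2 of the 4 baskets contain both
```

A rule can then be kept or discarded by comparing these values with user-chosen thresholds.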
Motivation
A DMQL can provide the ability to support ad-hoc and
interactive data mining
By providing a standardized language like SQL
The hope is to achieve an effect similar to the one SQL has had on
relational databases
Foundation for system development and evolution
Facilitate information exchange, technology transfer,
commercialization and wide acceptance
Design
DMQL is designed with the primitives described earlier
[Figure: data mining system architecture. Recoverable labels: Pattern Evaluation; Data Mining Engine; Knowledge-Base; Database or Data Warehouse Server]