Intro of Data Mining
Introduction
Motivation: Why data mining?
What is data mining?
Data Mining: On what kind of data?
Data mining functionality
Are all the patterns interesting?
Classification of data mining systems
Major issues in data mining
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: Remote sensing, bioinformatics, scientific simulation, …
Society and everyone: news, digital cameras, …
We are drowning in data, but starving for knowledge!
“Necessity is the mother of invention”: data mining is the automated analysis of
massive data sets
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web
databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
What Is Data Mining?
Why Data Mining?—Potential Applications
Other Applications
Text mining (news group, email, documents) and Web
mining
Stream data mining
Bioinformatics and bio-data analysis
Market Analysis and Management
Cross-market analysis
Associations/correlations between product sales, and
prediction based on such associations
Customer profiling
What types of customers buy what products
Customer requirement analysis
Identifying the best products for different customers
Predict what factors will attract new customers
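A minimal sketch of customer profiling ("what types of customers buy what products") and of picking the best product per customer group. The purchase records, segment names, and the helper functions `profile_customers` and `best_product` are hypothetical, made up for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical purchase records: (customer_segment, product).
purchases = [
    ("student", "laptop"),
    ("student", "coffee"),
    ("student", "coffee"),
    ("retiree", "tea"),
    ("retiree", "book"),
    ("retiree", "tea"),
    ("professional", "laptop"),
    ("professional", "laptop"),
    ("professional", "coffee"),
]

def profile_customers(records):
    """Count, for each customer segment, how often each product was bought."""
    profile = defaultdict(Counter)
    for segment, product in records:
        profile[segment][product] += 1
    return profile

def best_product(profile, segment):
    """Return the most frequently bought product for one segment."""
    return profile[segment].most_common(1)[0][0]

profile = profile_customers(purchases)
for segment in sorted(profile):
    print(segment, dict(profile[segment]), "top:", best_product(profile, segment))
```

A real system would work on far larger data and richer customer attributes; the sketch only shows the counting idea behind a profile.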
Fraud Detection & Mining Unusual Patterns
Data Mining: A KDD Process
[Figure: KDD process diagram; recoverable stages: Databases, Data Cleaning, Data Integration, Data Warehouse, Data Selection, Task-relevant Data]
Steps of a KDD Process
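The KDD steps named in this deck (data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, knowledge presentation) can be sketched as a chain of small functions. This is a toy illustration under stated assumptions: the two data sources, the field names, and the `min_qty` threshold are hypothetical, and "mining" here is just a sum per item.

```python
from collections import Counter

# Two hypothetical sources; None marks a missing value.
store_a = [{"id": 1, "item": "milk", "qty": 2}, {"id": 2, "item": None, "qty": 1}]
store_b = [{"id": 3, "item": "milk", "qty": 1}, {"id": 4, "item": "bread", "qty": 5},
           {"id": 5, "item": "Milk ", "qty": 3}]

def clean(records):
    """Data cleaning: drop records that have a missing field."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: merge several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Transformation: normalize item names to a common form."""
    return [{**r, "item": r["item"].strip().lower()} for r in records]

def mine(records):
    """Data mining (toy): total quantity sold per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["qty"]
    return totals

def evaluate(patterns, min_qty):
    """Pattern evaluation: keep patterns that meet a simple threshold."""
    return {item: qty for item, qty in patterns.items() if qty >= min_qty}

def present(patterns):
    """Knowledge presentation: render patterns as readable lines."""
    return [f"{item}: {qty} units" for item, qty in sorted(patterns.items())]

data = integrate(clean(store_a), clean(store_b))
data = transform(select(data, ["item", "qty"]))
report = present(evaluate(mine(data), min_qty=5))
print("\n".join(report))
```

In practice the process is iterative: results from pattern evaluation often send the analyst back to an earlier step.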
Architecture: Typical Data Mining System
[Figure: system architecture diagram; recoverable components: Pattern evaluation, Databases, Data Warehouse]
Data Mining: On What Kinds of Data?
Relational database
Data warehouse
Transactional database
Advanced database and information repository
Spatial and temporal data
Time-series data
Stream data
Multimedia database
Data Mining Functionalities
Concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics
Association (correlation and causality)
Diaper → Beer [support = 0.5%, confidence = 75%]
Classification and Prediction
Construct models (functions) that describe and distinguish classes
or concepts for future prediction
Presentation: decision-tree, classification rule, neural network
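To make the association example concrete, the sketch below computes the two numbers usually attached to a rule such as diaper-implies-beer, assuming the bracketed pair on the slide follows the common [support, confidence] convention. The transactions are hypothetical toy data, so the resulting values differ from the slide's 0.5% and 75%.

```python
# Hypothetical toy transactions; each set is the content of one basket.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"diaper", "beer", "chips"},
    {"milk", "bread"},
    {"diaper", "milk"},
    {"bread", "chips"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent) from the baskets."""
    both = support(antecedent | consequent, baskets)
    return both / support(antecedent, baskets)

s = support({"diaper", "beer"}, transactions)
c = confidence({"diaper"}, {"beer"}, transactions)
print(f"diaper -> beer [support={s:.1%}, confidence={c:.1%}]")
```

Support measures how often the items occur together across all baskets; confidence measures how often the consequent appears among baskets that already contain the antecedent.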
Data Mining Functionalities
Cluster analysis
Class label is unknown: group data to form new classes, e.g., …
Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
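The two functionalities above can be illustrated with small stand-in techniques that the slides themselves do not name: a one-dimensional k-means for grouping unlabeled values, and a median-absolute-deviation rule for flagging outliers. The data, the starting centroids, and the threshold of 3.0 are all assumptions chosen for the example.

```python
from statistics import mean, median

def kmeans_1d(values, centroids, iterations=10):
    """Assign each value to its nearest centroid, then recompute centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

def mad_outliers(values, threshold=3.0):
    """Flag values far from the median, measured in median absolute deviations."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [v for v in values if mad and abs(v - med) / mad > threshold]

# Hypothetical unlabeled values: no class labels are given, groups are discovered.
prices = [100, 110, 120, 300, 310, 320]
centroids, clusters = kmeans_1d(prices, centroids=[100, 320])
print("centroids:", centroids, "clusters:", clusters)

# Hypothetical readings where one value departs from the general behavior.
readings = [10, 12, 11, 13, 12, 95]
print("outliers:", mad_outliers(readings))
```

Whether a flagged value is noise to discard or the interesting event itself depends on the application, for example in fraud detection.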
Are All the “Discovered” Patterns Interesting?
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Systems, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]
Data Mining: Classification Schemes
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream,
object-oriented/relational, active, spatial, time-series,
text, multi-media, heterogeneous, WWW
Knowledge to be mined
Characterization, discrimination, association,
classification, clustering, trend/deviation, outlier
analysis, etc.
Multiple/integrated functions and mining at multiple
levels
Multi-Dimensional View of Data Mining
Techniques utilized
Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis,
bio-data mining, stock market analysis, Web mining,
etc.
OLAP Mining: Integration of Data Mining and Data Warehousing
Major Issues in Data Mining
Mining methodology
Mining different kinds of knowledge from diverse data
types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of discovered knowledge with existing
knowledge: knowledge fusion
Major Issues in Data Mining
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of
abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy
Summary
Data mining: discovering interesting patterns from large amounts of
data
A natural evolution of database technology, in great demand, with
wide applications
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
Data mining systems and architectures
Major issues in data mining
Where to Find References?
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
Data mining and KDD
Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
Journals: Data Mining and Knowledge Discovery, KDD Explorations
Database systems
Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
Journals: ACM-TODS, IEEE-TKDE, JIIS, J. ACM, etc.
AI & Machine Learning
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), etc.
Journals: Machine Learning, Artificial Intelligence, etc.
Statistics
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of Statistics, etc.
Visualization
Conference proceedings: CHI, ACM SIGGRAPH, etc.
Journals: IEEE Trans. Visualization and Computer Graphics, etc.