01 - Data Mining Introduction
DATA MINING
What Is Data Mining?
• Alternative names:
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
• Is everything “data mining”?
– No. Simple search and query processing, for example, are not data mining.
Knowledge discovery from databases
• This is a view from typical database systems
and data warehousing communities
• Data mining plays an essential role in the
knowledge discovery process
Example: A Web Mining Framework
[Figure: pyramid with increasing potential to support business decisions
toward the top. Layers, top to bottom:
– Decision Making (End User)
– Data Presentation: Visualization Techniques (Business Analyst)
– Data Mining: Information Discovery (Data Analyst)
– Data Exploration: Statistical Summary, Querying, and Reporting]
Multi-Dimensional View of Data Mining
• Data to be mined
– Database data (extended-relational, object-oriented,
heterogeneous, legacy), transactional data, stream, time-series,
sequence, text and web, multi-media, graphs & social and
information networks.
• Knowledge to be mined (or: Data mining functions)
– Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
– Descriptive vs. predictive data mining
– What is the difference between a predictive and a descriptive model?
Descriptive mining summarizes the past data stored in databases and
reports what has already happened. Predictive mining identifies
patterns in past and transactional data to estimate risks and
future outcomes.
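A minimal sketch of this contrast, assuming a hypothetical list of monthly sales figures (the data and the names `monthly_sales`, `summary`, and `forecast_next` are invented for illustration). The descriptive step only summarizes what already happened; the predictive step fits a least-squares line to the same data and extrapolates one month ahead.

```python
from statistics import mean

# Hypothetical monthly sales figures (illustrative data, not from the slides).
monthly_sales = [100.0, 110.0, 120.0, 130.0, 140.0]
months = list(range(1, len(monthly_sales) + 1))

# Descriptive: summarize the past data.
summary = {
    "mean": mean(monthly_sales),
    "min": min(monthly_sales),
    "max": max(monthly_sales),
}

# Predictive: fit y = slope * x + intercept by ordinary least squares,
# then extrapolate to the next (unseen) month.
x_bar, y_bar = mean(months), mean(monthly_sales)
numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(months, monthly_sales))
denominator = sum((x - x_bar) ** 2 for x in months)
slope = numerator / denominator
intercept = y_bar - slope * x_bar
forecast_next = slope * (len(months) + 1) + intercept

print("descriptive summary:", summary)
print("predicted sales for next month:", forecast_next)
```

A straight line is only one possible predictive model; it is used here because it keeps the sketch short.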
Multi-Dimensional View of Data Mining
• Techniques utilized
– Data warehousing, machine learning, statistics, pattern
recognition, visualization, high-performance computing, etc.
• Applications adapted
– telecommunication, banking, fraud analysis, bio-data
mining, stock market analysis, text mining, Web
mining, etc.
Data Mining: On What Kinds of Data?
Data Mining Function: (5) Outlier Analysis
• Outlier analysis
– Outlier: A data object that does not comply with the general
behavior of the data
– Noise or exception?
– Methods: by-product of clustering or regression analysis, …
– Useful in fraud detection, rare events analysis
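As an illustration of flagging objects that do not comply with the general behavior of the data, here is a minimal sketch using the modified z-score based on the median absolute deviation (MAD). This is a swapped-in technique chosen for brevity, not one of the methods named above. The function name `mad_outliers`, the sample `amounts`, and the conventional 3.5 cutoff are assumptions for the example.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median of the absolute deviations from the median.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median, so nothing can be scored
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical transaction amounts; one is far from the rest.
amounts = [20.0, 22.0, 19.0, 21.0, 23.0, 20.0, 500.0]
flagged = mad_outliers(amounts)
print(flagged)
```

Whether a flagged value is noise or a genuine exception (for example, fraud) still has to be decided by someone who knows the domain.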
Time and Ordering: Sequential Pattern,
Trend and Evolution Analysis
• Sequence, trend and evolution analysis
– Trend, time-series, and deviation analysis: e.g., regression and
value prediction
– Sequential pattern mining
○ e.g., first buy digital camera, then buy large SD memory
cards
– Periodicity analysis
– Motifs and biological sequence analysis
○ Approximate and consecutive motifs
– Similarity-based analysis
• Mining data streams
– Ordered, time-varying, potentially infinite, data streams
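The sequential-pattern example above (camera first, memory card later) can be sketched as a support count over ordered purchase histories. This only measures one given pattern and is not a full sequential pattern mining algorithm. The helper names `contains_in_order` and `support` and the `purchases` data are hypothetical.

```python
def contains_in_order(sequence, pattern):
    """True if the items of `pattern` occur in `sequence` in the same order,
    not necessarily consecutively."""
    it = iter(sequence)
    # `item in it` advances the iterator, so each match must come after
    # the previous one.
    return all(item in it for item in pattern)

def support(sequences, pattern):
    """Fraction of sequences that contain `pattern` in order."""
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)

# Hypothetical purchase histories, one ordered list per customer.
purchases = [
    ["digital camera", "tripod", "SD card"],
    ["SD card", "digital camera"],
    ["digital camera", "SD card"],
    ["laptop", "mouse"],
]
pattern = ["digital camera", "SD card"]
pattern_support = support(purchases, pattern)
print(pattern_support)
```

The second customer bought both items but in the opposite order, so that history does not count toward the pattern.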
Structure and Network Analysis
• Graph mining
– Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
• Information network analysis
– Social networks: actors (objects, nodes) and relationships (edges)
○ e.g., author networks in CS, terrorist networks
– Multiple heterogeneous networks
○ A person could be in multiple information networks: friends, family,
classmates, …
– Links carry a lot of semantic information: Link mining
• Web mining
– Web is a big information network: from PageRank to Google
– Analysis of Web information networks
○ Web community discovery, opinion mining, usage mining, …
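To make the idea of the Web as an information network concrete, here is a minimal power-iteration sketch of PageRank on a tiny hypothetical link graph. The graph `web`, the damping factor 0.85, and the fixed iteration count are assumptions for the example. It illustrates the basic idea only, not how any search engine implements it.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}.

    Every link target must also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            for target in links[p]:
                new_rank[target] += damping * rank[p] / len(links[p])
        rank = new_rank
    return rank

# Hypothetical four-page web: nodes are pages, edges are hyperlinks.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C is linked to by three of the four pages and ends up with the highest rank, while D, which nothing links to, keeps only the baseline share.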
Evaluation of Knowledge
Applications of Data Mining
• Web page analysis: from web page classification, clustering to PageRank &
HITS algorithms
• Collaborative analysis & recommender systems
• Basket data analysis to targeted marketing
• Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
• Data mining and software engineering (e.g., IEEE Computer, Aug. 2009
issue)
• From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server
Analysis Manager, Oracle Data Mining Tools) to invisible data mining
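For the basket data analysis item in the list above, a minimal sketch of the two usual measures for an association rule, support and confidence. The function name `rule_metrics` and the `baskets` data are hypothetical.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = fraction of baskets containing both sides of the rule
    confidence = fraction of baskets with the antecedent that also
                 contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_antecedent = sum(antecedent <= basket for basket in baskets)
    n_both = sum((antecedent | consequent) <= basket for basket in baskets)
    support = n_both / len(baskets)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
rule_support, rule_confidence = rule_metrics(baskets, {"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```

A rule with high support and confidence, such as "customers who buy X also tend to buy Y", is the kind of pattern a targeted marketing campaign could act on.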
Major Issues in Data Mining (2)