Module1 IntroToDataMining
— Chapter 1 —
Chapter 1. Introduction
■ Why Data Mining?
■ What Is Data Mining?
■ A Multi-Dimensional View of Data Mining
■ What Kinds of Data Can Be Mined?
■ What Kinds of Patterns Can Be Mined?
■ Summary
Why Data Mining?
■ The Explosive Growth of Data: from terabytes to petabytes
■ Data collection and data availability
■ … computerized society
■ Major sources of abundant data
■ … simulation, …
■ Society and everyone: news, digital cameras, YouTube
What Is Data Mining?
■ Alternative names
■ Knowledge discovery (mining) in
databases (KDD), knowledge extraction,
data/pattern analysis, information
harvesting, business intelligence, etc.
■ Watch out: Is everything “data mining”?
■ Simple search and query processing
■ (Deductive) expert systems
Knowledge Discovery (KDD) Process
■ This is a view from typical database systems and data warehousing communities
■ Data mining plays an essential role in the knowledge discovery process
Figure: KDD process, with stages labeled Databases, Data Cleaning, Data Integration, Task-relevant Data, Data Mining, Pattern Evaluation
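The stages named on this slide can be sketched as a chain of small functions. This is a toy illustration only: the two source lists, the field names, and the `min_count` threshold are assumptions made up for the example, and the "mining" step is plain frequency counting rather than any particular algorithm.

```python
from collections import Counter

# Two hypothetical source "databases" (toy data, not from the slides).
store_a = [{"customer": "c1", "item": "milk"}, {"customer": "c2", "item": None}]
store_b = [{"customer": "c3", "item": "milk"}, {"customer": "c4", "item": "bread"}]

def clean(records):
    """Data cleaning: drop records whose item is missing."""
    return [r for r in records if r["item"] is not None]

def integrate(*sources):
    """Data integration: combine several cleaned sources into one list."""
    return [r for source in sources for r in source]

def select(records):
    """Selection: keep only the task-relevant attribute."""
    return [r["item"] for r in records]

def mine(items):
    """Data mining: here, simply count how often each item occurs."""
    return Counter(items)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet a minimum count."""
    return {item: n for item, n in patterns.items() if n >= min_count}

def kdd(min_count=2):
    data = integrate(clean(store_a), clean(store_b))
    return evaluate(mine(select(data)), min_count)

print(kdd())
```

Each function maps to one stage, so a stage can be swapped out (for example, a real mining algorithm in place of `mine`) without touching the others.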
Example: A Web Mining Framework
■ Data mining
■ … knowledge-base
Data Mining in Business Intelligence
Figure: layers ordered by increasing potential to support business decisions. Recoverable labels: End User; Decision Making; Data Exploration (Statistical Summary, Querying, and Reporting)
Which View Do You Prefer?
■ Which view do you prefer?
■ KDD vs. ML/Stat. vs. Business Intelligence
■ Depending on the data, applications, and your focus
■ Data Mining vs. Data Exploration
■ Business intelligence view
■ Warehouse, data cube, reporting but not much mining
■ Business objects vs. data mining tools
■ Supply chain example: mining vs. OLAP vs. presentation tools
■ Data presentation vs. data exploration
Multi-Dimensional View of Data Mining
■ Data to be mined
■ Database data (extended-relational, object-oriented, heterogeneous,
legacy), data warehouse, transactional data, stream, spatiotemporal,
time-series, sequence, text and web, multi-media, graphs & social and
information networks
■ Knowledge to be mined (or: Data mining functions)
■ Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
■ Descriptive vs. predictive data mining
■ Multiple/integrated functions and mining at multiple levels
■ Techniques utilized
■ Data-intensive, data warehouse (OLAP), machine learning, statistics,
pattern recognition, visualization, high-performance, etc.
■ Applications adapted
■ Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
Chapter 1. Introduction (Week 2)
Data Mining: On What Kinds of Data?
■ Database-oriented data sets and applications
■ Relational database, data warehouse, transactional database
■ Object-relational databases, Heterogeneous databases and legacy databases
■ Advanced data sets and advanced applications
■ Data streams and sensor data
■ Time-series data, temporal data, sequence data (incl. bio-sequences)
■ Structure data, graphs, social networks and information networks
■ Spatial data and spatiotemporal data
■ Multimedia database
■ Text databases
■ The World-Wide Web
Data Mining Function: (1) Generalization
■ Information integration and data warehouse construction
■ Data cleaning, transformation, integration, and multidimensional data model
■ Data cube technology
■ Scalable methods for computing (i.e., materializing) multidimensional aggregates
■ OLAP (online analytical processing)
■ Multidimensional concept description: Characterization and discrimination
■ Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
Figure: multidimensional data cube, commonly used for data warehousing
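To make "materializing multidimensional aggregates" concrete, the sketch below computes every cuboid of a tiny two-dimensional cube by brute force. The fact table, the dimension names, and the `compute_cube` helper are hypothetical, and the scalable methods the slide refers to are far more refined than this enumeration.

```python
from itertools import combinations
from collections import defaultdict

# Toy fact table: (region, season, rainfall_mm); the values are made up.
facts = [
    ("north", "summer", 10),
    ("north", "winter", 30),
    ("south", "summer", 5),
    ("south", "winter", 15),
]
DIMENSIONS = ("region", "season")

def compute_cube(rows, dims):
    """Materialize every cuboid: one sum per combination of dimension values."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(range(len(dims)), k):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in group)
                cuboid[key] += row[-1]  # the measure is the last column
            cube[tuple(dims[i] for i in group)] = dict(cuboid)
    return cube

cube = compute_cube(facts, DIMENSIONS)
print(cube[()])            # apex cuboid: grand total
print(cube[("region",)])   # roll-up over season
```

With d dimensions there are 2^d cuboids, which is why materializing a full cube becomes expensive as dimensionality grows.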
Data Mining Function: (2) Association and
Correlation Analysis
■ Frequent patterns (or frequent itemsets)
■ What items are frequently purchased together in your
Walmart?
■ Association, correlation vs. causality
■ A typical association rule
■ Diaper → Beer [0.5%, 75%] (support, confidence)
■ Are strongly associated items also strongly correlated?
■ How to mine such patterns and rules efficiently in large
datasets?
■ How to use such patterns for classification, clustering, and
other applications?
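The two numbers attached to a rule such as Diaper → Beer can be computed directly from transaction data. The sketch below uses eight made-up transactions, so its support differs from the 0.5% on the slide. It also computes lift, one common correlation measure that the slide does not name, to illustrate that association and correlation are separate questions.

```python
# Toy transactions; the data is invented for illustration.
transactions = [
    {"diaper", "beer"},
    {"diaper", "beer"},
    {"diaper", "beer", "milk"},
    {"diaper", "milk"},
    {"beer"},
    {"milk"},
    {"bread"},
    {"milk", "bread"},
]

def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence divided by the consequent's support; 1.0 means independence."""
    return confidence(transactions, antecedent, consequent) / support(
        transactions, consequent
    )

print(support(transactions, {"diaper", "beer"}))       # support of the rule
print(confidence(transactions, {"diaper"}, {"beer"}))  # confidence of the rule
print(lift(transactions, {"diaper"}, {"beer"}))        # correlation check
```

A rule can have high confidence simply because its consequent is very common, so a lift near or below 1.0 is a hint that a strongly associated pair is not positively correlated.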
Data Mining Function: (3) Classification
■ Classification and label prediction
■ Construct models (functions) based on some training examples
■ Describe and distinguish classes or concepts for future prediction
■ E.g., classify countries based on (climate), or classify cars based
on (gas mileage)
■ Predict some unknown class labels
■ Typical methods
■ Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification,
pattern-based classification, logistic regression, …
■ Typical applications:
■ Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
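As a minimal sketch of "construct a model from training examples, then predict unknown labels", the code below fits a one-level decision tree (a decision stump) on a single numeric feature, echoing the gas mileage example. The mileage values, the class names, and the helper functions are assumptions for illustration; real decision tree learners use measures such as information gain and handle many attributes.

```python
def train_stump(examples):
    """Fit a one-level decision tree on one numeric feature.

    examples: list of (value, label) with at least two distinct values.
    Returns (threshold, left_label, right_label).
    """
    values = sorted({v for v, _ in examples})
    labels = sorted({y for _, y in examples})
    best = None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        for left in labels:
            for right in labels:
                errors = sum(
                    1
                    for v, y in examples
                    if (left if v <= threshold else right) != y
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, left, right)
    return best[1:]

def predict(model, value):
    """Apply the stump: values at or below the threshold get the left label."""
    threshold, left, right = model
    return left if value <= threshold else right

# Hypothetical training data: (miles per gallon, class label).
training = [
    (15, "guzzler"), (18, "guzzler"), (22, "guzzler"),
    (30, "economy"), (35, "economy"), (40, "economy"),
]
model = train_stump(training)
print(model)
print(predict(model, 33))
```

The stump picks the split with the fewest training errors, which is the same idea a full decision tree applies recursively at every node.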
Data Mining Function: (4) Cluster Analysis
Data Mining Function: (5) Outlier Analysis
■ Outlier analysis
■ Outlier: A data object that does not comply with the general
behavior of the data
■ Noise or exception? ― One person’s garbage could be
another person’s treasure
■ Methods: by-product of clustering or regression analysis, …
■ Useful in fraud detection, rare events analysis
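One way outliers fall out of regression analysis is to fit a line and flag points whose residual is unusually large. The sketch below does this with ordinary least squares. The data, the `threshold` of two residual standard deviations, and the function names are assumptions chosen for the example, not a method prescribed by the slides.

```python
from statistics import mean, pstdev

def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def residual_outliers(points, threshold=2.0):
    """Return points whose residual exceeds threshold times the residual spread."""
    slope, intercept = fit_line(points)
    residuals = [y - (slope * x + intercept) for x, y in points]
    spread = pstdev(residuals)
    if spread == 0:
        return []
    return [p for p, r in zip(points, residuals) if abs(r) > threshold * spread]

# Points on y = 2x + 1, except one injected exception at x = 5.
points = [(x, 2 * x + 1) for x in range(10)]
points[5] = (5, 40)
print(residual_outliers(points))
```

Whether the flagged point is noise to discard or the interesting exception depends on the application, which is the "garbage or treasure" question raised above.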
Time and Ordering: Sequential Pattern,
Trend and Evolution Analysis
■ Sequence, trend and evolution analysis
■ Trend, time-series, and deviation analysis: e.g., regression
■ … cards
■ Periodicity analysis
■ Similarity-based analysis
Structure and Network Analysis
■ Graph mining
■ Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
■ Information network analysis
■ Social networks: actors (objects, nodes) and relationships (edges)
■ e.g., author networks in CS, terrorist networks
■ … classmates, …
■ Links carry a lot of semantic information: Link mining
■ Web mining
■ Web is a big information network: from PageRank to Google
■ Analysis of Web information networks
■ Web community discovery, opinion mining, usage mining, …
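Since PageRank is mentioned as an example of Web link analysis, here is a simplified textbook power-iteration version on a four-page toy graph. The graph, the damping factor of 0.85, the fixed iteration count, and the uniform handling of pages without out-links are all assumptions of this sketch, not a description of any production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict of page -> rank; ranks sum to 1.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

# Hypothetical link graph: page "c" receives the most in-links.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
print(sorted(ranks, key=ranks.get, reverse=True))
```

A page ranks highly when many pages, or a few highly ranked pages, link to it, which is one sense in which links carry semantic information.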
Evaluation of Knowledge
■ Is all mined knowledge interesting?
■ One can mine a tremendous amount of “patterns”
…)
■ Some may not be representative, may be transient, …
■ Coverage
■ Accuracy
■ Timeliness
■ …
Data Mining: Confluence of Multiple Disciplines
Figure: disciplines feeding into data mining. Recoverable labels: Machine Learning, Pattern Recognition, Statistics
Why Confluence of Multiple Disciplines?
■ High-dimensionality of data
■ Micro-array may have tens of thousands of dimensions
Applications of Data Mining
■ Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
■ Collaborative analysis & recommender systems
■ Basket data analysis to targeted marketing
■ Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
■ Data mining and software engineering
■ From major dedicated data mining systems/tools (e.g., SAS, MS
SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible
data mining
Summary
■ Data mining: Discovering interesting patterns and knowledge from massive amounts of data
■ A natural evolution of science and information technology, in great demand,
with wide applications
■ A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
■ Mining can be performed on a variety of data
■ Data mining functionalities: characterization, discrimination, association,
classification, clustering, trend and outlier analysis, etc.
■ Data mining technologies and applications
■ Major issues in data mining
* Data Mining: Concepts and Techniques
Major Issues in Data Mining (1)
■ Mining Methodology
■ Mining various and new kinds of knowledge
■ Mining knowledge in multi-dimensional space
■ Data mining: An interdisciplinary effort
■ Boosting the power of discovery in a networked environment
■ Handling noise, uncertainty, and incompleteness of data
■ Pattern evaluation and pattern- or constraint-guided mining
■ User Interaction
■ Interactive mining
■ Incorporation of background knowledge
■ Presentation and visualization of data mining results
Major Issues in Data Mining (2)
A Brief History of Data Mining Society
Conferences and Journals on Data Mining
Where to Find References? DBLP, CiteSeer, Google