01 - Data Mining Introduction

Uploaded by

salehaalsaleh602

Introduction

DATA MINING

Dr. Mohammad Alsaudi


Why Data Mining?

The Explosive Growth of Data: from terabytes to petabytes


– Data collection and data availability
○ Automated data collection tools, database systems, Web,
computerized society.
– Major sources of data generation
○ Web, e-commerce, transactions, stocks, …
○ Remote sensing, bioinformatics, scientific simulation, etc.
○ News, digital cameras, YouTube.

What Is Data Mining?

• Data mining (knowledge discovery from data)
– Extraction of interesting (previously unknown and potentially
useful) patterns or knowledge from huge amounts of data.
– Data mining: a misnomer?

• Alternative names:
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
• Is everything “data mining”?
– No: simple search and query processing, for example, are not data mining.

Knowledge discovery from databases
• This is a view from typical database systems
and data warehousing communities
• Data mining plays an essential role in the
knowledge discovery process

Example: A Web Mining Framework

• Web mining usually involves


– Data cleaning
– Data integration from multiple sources
– Warehousing the data. A data warehouse is an electronic system
for storing information in a manner that is secure, reliable, easy
to retrieve, and easy to manage.
– Data cube construction
– Data selection for data mining
– Data mining
– Presentation of the mining results
– Patterns and knowledge to be used or stored in a knowledge base
Data Mining in Business Intelligence

Layers, from top to bottom (potential to support business decisions
increases toward the top); the role involved is given in parentheses:

• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments,
Database Systems
KDD Process: A Typical View from ML and
Statistics
• This is a view from typical machine learning and statistics communities

Input Data → Data Pre-Processing → Data Mining → Post-Processing

• Data Pre-Processing
– Data integration
– Normalization
– Feature selection
– Dimension reduction
• Data Mining
– Pattern discovery
– Association & correlation
– Classification
– Clustering
– Outlier analysis
– …
• Post-Processing
– Pattern evaluation
– Pattern selection
– Pattern interpretation
– Pattern visualization
Multi-Dimensional View of Data Mining

• Data to be mined
– Database data (extended-relational, object-oriented,
heterogeneous, legacy), transactional data, stream, time-series,
sequence, text and web, multi-media, graphs & social and
information networks.
• Knowledge to be mined (or: Data mining functions)
– Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc.
– Descriptive vs. predictive data mining
– What is the difference between a predictive and a descriptive model?
Descriptive mining summarizes the past data stored in databases and
reports on what it contains. Predictive mining identifies patterns in
past and transactional data to estimate risks and future outcomes.
Multi-Dimensional View of Data Mining

• Techniques utilized
– Data warehousing, machine learning, statistics, pattern
recognition, visualization, high-performance computing, etc.
• Applications adapted
– Telecommunication, banking, fraud analysis, bio-data
mining, stock market analysis, text mining, Web
mining, etc.
Data Mining: On What Kinds of Data?

• Database-oriented data sets and applications


– Relational database, data warehouse, transactional database

• Advanced data sets and advanced applications


– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
– Structured data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World-Wide Web
Data Mining Function: (1) Generalization

• Information integration and data warehouse construction


– Data cleaning, transformation, integration, and multidimensional
data model
• Data cube technology
– Scalable methods for computing (i.e., materializing)
multidimensional aggregates
– OLAP (online analytical processing)

• Multidimensional concept description: Characterization and discrimination
– Generalize, summarize, and contrast data characteristics, e.g.,
dry vs. wet region
Data Mining Function: (2) Association and
Correlation Analysis

• Frequent patterns (or frequent itemsets)


– What items are frequently purchased together in your Walmart?

• Association, correlation vs. causality


– A typical association rule
○ Diaper → Beer [0.5%, 75%] (support, confidence)
– Are strongly associated items also strongly correlated?

• How to mine such patterns and rules efficiently in large datasets?
• How to use such patterns for classification, clustering,
and other applications?
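
Support and confidence for a rule such as Diaper → Beer can be computed directly from their definitions. The sketch below is a minimal illustration, not an efficient mining algorithm: the transactions are made up, and the function names are hypothetical. It also computes lift, one common correlation measure, as a way to probe whether an association is also a correlation.

```python
# Hypothetical market-basket transactions (one set of items per basket).
TRANSACTIONS = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"diaper", "beer", "chips"},
    {"milk", "bread"},
    {"diaper", "milk"},
    {"bread", "chips"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent); 0.0 if antecedent never occurs."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / base


def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's support; > 1 suggests positive correlation."""
    base = support(consequent, transactions)
    if base == 0:
        return 0.0
    return confidence(antecedent, consequent, transactions) / base


rule_support = support({"diaper", "beer"}, TRANSACTIONS)
rule_confidence = confidence({"diaper"}, {"beer"}, TRANSACTIONS)
rule_lift = lift({"diaper"}, {"beer"}, TRANSACTIONS)
```

On this toy data the rule has support 3/8 and confidence 3/5. A rule can have high confidence and still a lift near or below 1 when the consequent is common on its own, which is why association and correlation are treated as separate questions.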
Data Mining Function: (3) Classification

• Classification and label prediction


– Construct models (functions) based on some training examples
– Describe and distinguish classes or concepts for future prediction
○ E.g., classify countries based on climate, or classify cars based
on gas mileage
– Predict some unknown class labels
• Typical methods
– Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-based
classification, logistic regression, …
• Typical applications:
– Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
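
As a minimal sketch of building a model from training examples and then predicting unknown labels, the code below trains a one-level decision tree (a decision stump) on the cars-by-gas-mileage example. The training data, the class names "low" and "high", and the function names are assumptions made for illustration; a real decision tree learner uses several attributes and a purity measure such as information gain.

```python
from collections import Counter

# Hypothetical training examples: (gas mileage in mpg, class label).
TRAINING = [
    (15, "low"), (18, "low"), (21, "low"), (24, "low"),
    (30, "high"), (33, "high"), (38, "high"), (41, "high"),
]


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(examples):
    """Choose the threshold that misclassifies the fewest training examples.

    Candidate thresholds are midpoints between consecutive distinct values,
    so the examples must contain at least two distinct values.
    """
    values = sorted({x for x, _ in examples})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in examples if x <= threshold]
        right = [y for x, y in examples if x > threshold]
        left_label, right_label = majority(left), majority(right)
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return threshold, left_label, right_label


def predict(model, x):
    """Predict the class label of an unseen value."""
    threshold, left_label, right_label = model
    return left_label if x <= threshold else right_label


model = train_stump(TRAINING)
label_for_20 = predict(model, 20)
label_for_35 = predict(model, 35)
```

The learned model is just a threshold and two labels, which makes the two phases of classification visible: `train_stump` constructs the model, and `predict` applies it to data whose label is unknown.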
Data Mining Function: (4) Cluster Analysis

• Unsupervised learning (i.e., Class label is unknown)


• Group data to form new categories (i.e., clusters), e.g.,
cluster houses to find distribution patterns
• Principle: Maximizing intra-class similarity & minimizing
interclass similarity
• Many methods and applications
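
One of the many clustering methods is k-means, sketched below on the house example. This is an illustration under stated assumptions: the house coordinates and the starting centroids are invented, and the function name `kmeans` is hypothetical. No class labels are used; each point is assigned to its nearest centroid, and each centroid is then moved to the mean of its points.

```python
import math

# Hypothetical house locations as (x, y) coordinates.
HOUSES = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
    (8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
]


def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid update for a fixed number of rounds."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


centroids, clusters = kmeans(HOUSES, [(0.0, 0.0), (10.0, 10.0)])
```

Points within each resulting cluster are close to one another and far from the other cluster, which is the principle of high intra-class and low inter-class similarity. The result of k-means depends on the starting centroids, so fixed ones are used here to keep the sketch deterministic.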

Data Mining Function: (5) Outlier Analysis

• Outlier analysis
– Outlier: A data object that does not comply with the general
behavior of the data
– Noise or exception?
– Methods: by-product of clustering or regression analysis, …
– Useful in fraud detection, rare events analysis
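
A very simple way to flag objects that do not comply with the general behavior of the data is a z-score test. Note that this is a different and simpler technique than the clustering-based or regression-based methods named above; it is shown only as a minimal sketch. The transaction amounts, the threshold of 2.0 and the function name are assumptions.

```python
import statistics

# Hypothetical transaction amounts; one value is far from the rest.
AMOUNTS = [10, 11, 9, 10, 12, 10, 11, 50]


def find_outliers(values, threshold=2.0):
    """Return values lying more than threshold standard deviations from the mean."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]


outliers = find_outliers(AMOUNTS)
```

Whether a flagged value is noise to discard or an exception worth investigating (for example, a fraudulent transaction) is a judgment the test itself cannot make. Because an extreme value inflates both the mean and the standard deviation, this test can miss outliers in small samples.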

Time and Ordering: Sequential Pattern,
Trend and Evolution Analysis
• Sequence, trend and evolution analysis
– Trend, time-series, and deviation analysis: e.g., regression and
value prediction
– Sequential pattern mining
○ e.g., first buy digital camera, then buy large SD memory
cards
– Periodicity analysis
– Motifs and biological sequence analysis
○ Approximate and consecutive motifs
– Similarity-based analysis
• Mining data streams
– Ordered, time-varying, potentially infinite, data streams
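
The camera-then-memory-card example can be made concrete by counting how many purchase histories contain a pattern in the given order. The sketch below only measures the support of one candidate pattern; it does not search for patterns. The purchase histories and function names are hypothetical.

```python
# Hypothetical purchase histories, each listed in time order.
HISTORIES = [
    ["digital camera", "tripod", "SD card"],
    ["SD card", "digital camera"],
    ["digital camera", "SD card"],
    ["laptop", "mouse"],
]


def contains_in_order(sequence, pattern):
    """True if the pattern's items appear in sequence in order, gaps allowed."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the match, so the
    # next pattern item is only searched for after the previous one.
    return all(item in remaining for item in pattern)


def sequence_support(pattern, sequences):
    """Fraction of sequences that contain the pattern in order."""
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)


camera_then_card = sequence_support(["digital camera", "SD card"], HISTORIES)
card_then_camera = sequence_support(["SD card", "digital camera"], HISTORIES)
```

Unlike a frequent itemset, a sequential pattern is sensitive to order: the two patterns above contain the same items but have different support.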
Structure and Network Analysis

• Graph mining
– Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
• Information network analysis
– Social networks: actors (objects, nodes) and relationships (edges)
○ e.g., author networks in CS, terrorist networks
– Multiple heterogeneous networks
○ A person could be in multiple information networks: friends, family,
classmates, …
– Links carry a lot of semantic information: Link mining
• Web mining
– Web is a big information network: from PageRank to Google
– Analysis of Web information networks
○ Web community discovery, opinion mining, usage mining, …
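
To illustrate how links carry information, here is a simplified PageRank computed by power iteration. It is a sketch under stated assumptions, not a description of any production system: the four-page link graph, the damping factor of 0.85, the iteration count and the function name are all chosen for illustration, and every link target is assumed to appear as a key in the graph.

```python
# Hypothetical link graph: page -> list of pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by repeated redistribution of rank."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page without outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


ranks = pagerank(LINKS)
top_page = max(ranks, key=ranks.get)
```

Page C ends up with the highest score because three pages link to it, while D, which nobody links to, keeps only the baseline share. The scores sum to 1, so they can be read as a probability distribution over pages.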
Evaluation of Knowledge

• Is all mined knowledge interesting?
– One can mine a tremendous amount of “patterns” and knowledge
– Some may fit only a certain dimension space (time, location, …)
– Some may not be representative, may be transient, …

• Evaluation of mined knowledge → directly mine only interesting knowledge?
– Descriptive vs. predictive
– Coverage
– Typicality vs. novelty
– Accuracy
– Timeliness
– … 18
Data Mining: Confluence of Multiple Disciplines

Data mining draws on:
• Machine Learning
• Pattern Recognition
• Statistics
• Applications
• Visualization
• Algorithm
• Database Technology
• High-Performance Computing
Applications of Data Mining

• Web page analysis: from web page classification, clustering to PageRank &
HITS algorithms
• Collaborative analysis & recommender systems
• Basket data analysis to targeted marketing
• Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
• Data mining and software engineering (e.g., IEEE Computer, Aug. 2009
issue)
• From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server
Analysis Manager, Oracle Data Mining Tools) to invisible data mining
Major Issues in Data Mining (2)

• Efficiency and Scalability


– Efficiency and scalability of data mining algorithms
– Parallel, distributed, stream, and incremental mining methods
• Diversity of data types
– Handling complex types of data
– Mining dynamic, networked, and global data repositories
• Data mining and society
– Social impacts of data mining
– Privacy-preserving data mining
– Invisible data mining