CH 2
Motivation:
“Necessity is the Mother of Invention”
Data Explosion Problem
Automated data collection tools and mature database technology
lead to tremendous amounts of data stored in databases, data
warehouses and other information repositories
We are drowning in data, but starving for knowledge
[Figure: Data Mining at the center, drawing on Database Technology,
Statistics, Machine Learning, Visualization, Pattern Recognition,
Algorithm, and Other Disciplines]
Data Mining Tasks
Prediction Tasks
Use some variables to predict unknown or future values of other
variables.
Description Tasks
Characterize the general properties of the data in the database.
Common data mining tasks
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation/Anomaly Detection [Predictive]
Data mining functionalities
Data characterization: a summarization of the general
characteristics or features of a target class of data. The data
corresponding to the user-specified class are typically collected
by a database query.
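As a small illustration of characterization through a database query,
the sketch below summarizes one target class with Python's built-in
sqlite3 module. The table name customers, its columns, the segment
labels, and all values are hypothetical and chosen only for this
example.

```python
import sqlite3

# Hypothetical in-memory table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (name TEXT, segment TEXT, age INTEGER, spend REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        ("a", "big_spender", 40, 1200.0),
        ("b", "big_spender", 50, 1800.0),
        ("c", "regular", 30, 200.0),
    ],
)


def characterize(conn, segment):
    """Summarize the target class selected by the query."""
    row = conn.execute(
        "SELECT COUNT(*), AVG(age), AVG(spend) FROM customers WHERE segment = ?",
        (segment,),
    ).fetchone()
    return {"count": row[0], "avg_age": row[1], "avg_spend": row[2]}


summary = characterize(conn, "big_spender")
print(summary)
```

The WHERE clause plays the role of the user-specified class, and the
aggregates are one possible choice of "general characteristics".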
Typical applications of classification
Credit/loan approval
Target marketing
Medical diagnosis: whether a tumor is cancerous or benign
Fraud detection: whether a transaction is fraudulent
Web page categorization: which category a page belongs to
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute.
The set of tuples used for model construction is the training set.
The model is represented as classification rules, decision trees, or
mathematical formulae.
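To make model construction concrete, here is a minimal sketch that
learns classification rules from a training set. It uses a 1R-style
learner (one attribute, majority class per value), which is a
swapped-in technique chosen for brevity, not a method named in these
slides. The loan-approval tuples and the attribute names income,
student, and approved are hypothetical.

```python
from collections import Counter, defaultdict


def learn_one_rule(training_set, attributes, label):
    """Pick the single attribute whose value -> majority-class rules
    make the fewest errors on the training set (1R-style)."""
    best = None
    for attr in attributes:
        # Count class labels for every value of this attribute.
        counts = defaultdict(Counter)
        for row in training_set:
            counts[row[attr]][row[label]] += 1
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for row in training_set if rules[row[attr]] != row[label])
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best[0], best[1]


# Hypothetical training tuples; "approved" is the class label attribute.
training_set = [
    {"income": "high", "student": "no", "approved": "yes"},
    {"income": "high", "student": "yes", "approved": "yes"},
    {"income": "low", "student": "no", "approved": "no"},
    {"income": "low", "student": "yes", "approved": "no"},
    {"income": "medium", "student": "no", "approved": "yes"},
]

attribute, rules = learn_one_rule(training_set, ["income", "student"], "approved")
print(attribute, rules)
```

The returned dictionary is the model in rule form: each entry reads
"if income is X then approved is Y".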
Clustering Algorithms
K-means and its variants
Hierarchical clustering
Density-based clustering
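As an illustration of the first algorithm in the list, the following is
a minimal sketch of plain k-means on 2-D points. It assumes the first k
points seed the centroids so the run is deterministic; practical
implementations usually pick seeds more carefully. The sample points
are made up for the example.

```python
import math


def k_means(points, k, iterations=10):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean = tuple(sum(c) / len(cluster) for c in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

No class labels are used anywhere, which is what makes clustering a
descriptive (unsupervised) task.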
Supervised vs. Unsupervised Learning
Supervised learning (classification):
The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
New data is classified based on the training set
Outlier values may also be detected with respect to the location and
type of purchase, or the purchase frequency.
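One simple way to look for such outliers is to flag records whose
location or purchase type is rare among all records. The sketch below
assumes a frequency-based test with a hypothetical min_share cutoff;
the purchase records and field names are invented for illustration.

```python
from collections import Counter


def rare_value_outliers(records, field, min_share=0.2):
    """Flag records whose value for `field` accounts for less than
    `min_share` of all records (a simple frequency-based test)."""
    counts = Counter(r[field] for r in records)
    total = len(records)
    return [r for r in records if counts[r[field]] / total < min_share]


# Hypothetical purchase history for one customer.
purchases = [
    {"id": 1, "location": "city_a", "type": "grocery"},
    {"id": 2, "location": "city_a", "type": "grocery"},
    {"id": 3, "location": "city_a", "type": "fuel"},
    {"id": 4, "location": "city_a", "type": "grocery"},
    {"id": 5, "location": "city_b", "type": "jewellery"},
    {"id": 6, "location": "city_a", "type": "fuel"},
]

by_location = rare_value_outliers(purchases, "location")
by_type = rare_value_outliers(purchases, "type")
print([r["id"] for r in by_location], [r["id"] for r in by_type])
```

The same function could be applied to a binned purchase-frequency
field; the cutoff would need tuning for real data.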
Data Mining Architecture
Data mining is the process of extracting potentially useful and
previously unknown information from large volumes of data.
There are a number of components involved in the data mining
process. These components constitute the architecture of a data
mining system.
The major components of any data mining system are the data
sources, database or data warehouse server, data mining engine,
pattern evaluation module, graphical user interface, and
knowledge base.
Data Sources
Databases, data warehouses, the World Wide Web (WWW), text files, and
other documents are the actual sources of data. Large volumes of
historical data are needed for data mining to be successful.
Different Processes
The data needs to be cleaned, integrated and selected before
passing it to the database or data warehouse server.
Database or Data Warehouse Server
The database or data warehouse server contains the actual data that
is ready to be processed. Hence, the server is responsible for
retrieving the relevant data based on the data mining request of the
user.
Data Mining Engine
The engine consists of a number of modules for performing data mining
tasks, including association, classification, characterization,
clustering, prediction, and time-series analysis.
Pattern Evaluation Module
The pattern evaluation module is mainly responsible for measuring the
interestingness of a pattern against a threshold value. It interacts
with the data mining engine to focus the search towards interesting
patterns.
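The thresholding idea can be sketched in a few lines. The example
below assumes each candidate pattern already carries a numeric score
and uses rule confidence as the interestingness measure; both the
measure and the sample patterns are assumptions made for illustration.

```python
def evaluate_patterns(patterns, measure, threshold):
    """Return the patterns whose interestingness score meets the
    threshold, most interesting first."""
    scored = [(measure(p), p) for p in patterns]
    kept = [(score, p) for score, p in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in kept]


# Hypothetical candidate patterns produced by a mining engine.
candidate_patterns = [
    {"rule": "bread -> butter", "confidence": 0.80},
    {"rule": "milk -> batteries", "confidence": 0.90},
    {"rule": "tea -> sugar", "confidence": 0.60},
]

interesting = evaluate_patterns(
    candidate_patterns,
    measure=lambda p: p["confidence"],  # assumed interestingness measure
    threshold=0.7,
)
print([p["rule"] for p in interesting])
```

Passing the measure as a function keeps the filter independent of any
particular definition of interestingness.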
Graphical User Interface
The graphical user interface module communicates between the user
and the data mining system. This module helps the user use the system
easily and efficiently without knowing the real complexity behind the
process. When the user specifies a query or a task, this module interacts
with the data mining system and displays the result in an easily
understandable manner.
Knowledge Base
The knowledge base is helpful in the whole data mining process. It
might be useful for guiding the search or evaluating the
interestingness of the result patterns. The knowledge base might
even contain user beliefs and data from user experiences that can be
useful in the process of data mining. The data mining engine might
get inputs from the knowledge base to make the result more
accurate and reliable.
Data Mining Issues
Major issues in data mining fall into five groups:
Mining methodology
User interaction
Efficiency and scalability
Diversity of data types
Data mining and society
Mining methodology
Mining different kinds of knowledge in databases
Mining knowledge at multiple levels of abstraction
Handling noisy or incomplete data
Pattern evaluation
User interaction
Interactive mining
Incorporation of background knowledge
Query languages and ad hoc mining
Presentation and visualization of data mining results
Efficiency and scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed, and incremental mining algorithms
Diversity of data types
Handling of relational and complex types of data
Mining information from heterogeneous databases and global
information systems
Data mining and society
Social impact of data mining
Privacy-preserving data mining
Invisible data mining
Data Mining Applications
Data mining is widely used in the following areas:
Healthcare and Insurance
Measuring Treatment Effectiveness – This application of data mining
involves comparing and contrasting symptoms, causes and courses of
treatment to find the most effective course of action for a certain illness or
condition. For example, patient groups who are treated with different drug
regimens can be compared to determine which treatment plans work best
and save the most money.
Retail Industry
Data mining in the retail industry helps identify customer buying
patterns and trends, which leads to improved customer service quality
and better customer retention and satisfaction.
Analysis of effectiveness of sales campaigns.
Customer Retention.
Product recommendation and cross-referencing of items.
Market basket analysis
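Market basket analysis can be illustrated with a minimal sketch that
counts how often item pairs occur together and derives support and
confidence for each pair rule. The baskets, the min_support value, and
the restriction to pairs are assumptions made to keep the example
short.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for every
    item pair whose support meets min_support."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in set(b))
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(set(b)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        # Confidence of x -> y is count(x and y) / count(x).
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return rules


# Hypothetical transactions.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["butter", "jam"],
]

rules = pair_rules(baskets)
rules_by_pair = {(a, b): (s, c) for a, b, s, c in rules}
for a, b, s, c in rules:
    print(f"{a} -> {b}: support={s:.2f}, confidence={c:.2f}")
```

Rules with high confidence, such as milk -> bread in this toy data,
are the kind of result that feeds product recommendation.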
Banking/Finance
Financial data in the banking and financial industry is generally reliable
and of high quality, which facilitates systematic data analysis and data
mining. Some typical cases are as follows:
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
Intrusion Detection
Data mining can help improve intrusion detection by adding a level of focus
to anomaly detection. It helps an analyst distinguish unusual activity from
common everyday network activity.
Monitoring and analyzing traffic
Identifying abnormal activity
Other applications are:
Bioinformatics
Crime agencies
Scientific Applications