CH 2

Unit 3
Motivation:
“Necessity is the Mother of Invention”
 Data Explosion Problem
 Automated data collection tools and mature database technology
lead to tremendous amounts of data stored in databases, data
warehouses and other information repositories
 We are drowning in data, but starving for knowledge
 Solution: Data warehousing and data mining
 Data warehousing and on-line analytical processing
 Extraction of interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
Why data mining?
 Commercial point of view
 Data has become the key competitive advantage of companies
 Examples: Facebook, Google, Amazon
 Being able to extract useful information out of the data is key for
exploiting them commercially.
 Scientific point of view
 Scientists are in an unprecedented position: they can collect terabytes of
information
 Examples: Sensor data, astronomy data, social network data, gene data
 We need the tools to analyze such data to get a better understanding of
the world and advance science
 Scale (in data size and feature dimension)
 Why not use traditional analytic methods?
 Enormity of data, curse of dimensionality
 The amount and complexity of the data do not allow for manual
processing. We need automated techniques.
What is Data Mining?
 Data mining (knowledge discovery in databases):
 Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) patterns from data in large databases
 Alternative names and their “inside stories”:
 Data mining: a misnomer?
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, business
intelligence, etc.
Knowledge Discovery in Databases (KDD)
The KDD process comprises the following steps:
 Data cleaning: also known as data cleansing, it is a phase in which
noisy and irrelevant data are removed from the collection.
 Data integration: at this stage, multiple data sources, often
heterogeneous, may be combined in a common source.
 Data selection: at this step, the data relevant to the analysis is decided
on and retrieved from the data collection.
 Data transformation: also known as data consolidation, it is a phase
in which the selected data is transformed into forms appropriate for the
mining procedure.
 Data mining: the crucial step in which clever techniques are
applied to extract potentially useful patterns.
 Pattern evaluation: in this step, the truly interesting patterns
representing knowledge are identified based on given measures.
 Knowledge representation: is the final phase in which the discovered
knowledge is visually represented to the user. This essential step uses
visualization techniques to help users understand and interpret the
data mining results.
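The KDD steps listed above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two sales sources, their field names, and the threshold `min_count` are all invented here, and "mining" is reduced to counting how often each item is purchased.

```python
from collections import Counter

# Two hypothetical, heterogeneous sources (invented for illustration).
store_sales = [
    {"customer": "c1", "item": "computer", "amount": "1200"},
    {"customer": "c2", "item": "software", "amount": None},  # incomplete record
    {"customer": "c3", "item": "computer", "amount": "950"},
]
web_sales = [
    {"cust_id": "c4", "product": "software", "total": "80"},
    {"cust_id": "c5", "product": "computer", "total": "1100"},
]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(store, web):
    """Data integration: map both sources onto one common schema."""
    unified = list(store)
    for r in web:
        unified.append(
            {"customer": r["cust_id"], "item": r["product"], "amount": r["total"]}
        )
    return unified

def select(records, fields):
    """Data selection: keep only the attributes relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: convert amounts from text to numbers."""
    return [{**r, "amount": float(r["amount"])} for r in records]

def mine(records):
    """Data mining: count how often each item is purchased."""
    return Counter(r["item"] for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet the threshold."""
    return {item: n for item, n in patterns.items() if n >= min_count}

data = transform(
    select(integrate(clean(store_sales), clean(web_sales)), ["item", "amount"])
)
patterns = evaluate(mine(data), min_count=2)

# Knowledge representation: present the result to the user.
for item, n in patterns.items():
    print(f"{item}: purchased {n} times")
```

In a real system each stage would be far more involved; the point of the sketch is only the order in which the stages feed one another.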
Data Mining: Classification Schemes
 Decisions in data mining
 Kinds of databases to be mined
 Kinds of knowledge to be discovered
 Kinds of techniques utilized
 Kinds of applications adapted

Classification Criteria in Data Mining
 Databases to be mined
 Relational, transactional, object-oriented, object-relational, active,
spatial, time-series, text, multi-media, heterogeneous, legacy,
WWW, etc.
 Knowledge to be mined
 Characterization, discrimination, association, classification,
clustering, trend, deviation and outlier analysis, etc.
 Multiple/integrated functions and mining at multiple levels
 Techniques utilized
 Database-oriented, data warehouse (OLAP), machine learning,
statistics, visualization, neural network, etc.
 Applications adapted
 Retail, telecommunication, banking, fraud analysis, DNA mining,
stock market analysis, Web mining, Weblog analysis, etc.
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the center, drawing on database technology, statistics, machine learning, visualization, pattern recognition, algorithms, and other disciplines]
Data Mining Tasks
 Prediction Tasks
 Use some variables to predict unknown or future values of other
variables.
 Description Tasks
 Characterize the general properties of the data in the database.
Common data mining tasks
 Classification [Predictive]
 Clustering [Descriptive]
 Association Rule Discovery [Descriptive]
 Sequential Pattern Discovery [Descriptive]
 Regression [Predictive]
 Deviation/Anomaly Detection [Predictive]
Data mining functionalities
 Data characterization: Data characterization is a
summarization of the general characteristics or features of a
target class of data. The data corresponding to the user-specified
class are typically collected by a database query.
 For example, one may wish to characterize the customers of a
store who regularly rent more than 30 movies a year. With a data
cube containing summarization of data, simple OLAP
operations fit the purpose of data characterization.
 Discrimination: Data discrimination produces what are called
discriminant rules and is basically the comparison of the general
features of objects between two classes referred to as the target class
and the contrasting class.
 For example, one may wish to compare the general characteristics of
the customers who rented more than 30 movies in the last year with
those whose rental count is lower than 5.
 The techniques used for data discrimination are similar to the
techniques used for data characterization, with the exception
that data discrimination results include comparative measures.
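A minimal sketch of the movie-rental example, assuming a handful of invented customer records and using the attribute mean as the summary measure (a real system would typically run OLAP operations on a data cube instead). Characterization summarizes the target class; discrimination adds a comparative measure against the contrasting class.

```python
from statistics import mean

# Hypothetical customer records (invented for illustration).
customers = [
    {"age": 24, "rentals_per_year": 42, "member_years": 3},
    {"age": 30, "rentals_per_year": 35, "member_years": 5},
    {"age": 27, "rentals_per_year": 50, "member_years": 4},
    {"age": 52, "rentals_per_year": 3, "member_years": 1},
    {"age": 47, "rentals_per_year": 2, "member_years": 2},
    {"age": 39, "rentals_per_year": 12, "member_years": 6},
]

def characterize(records, attributes):
    """Characterization: summarize a class by the mean of each attribute."""
    return {a: mean(r[a] for r in records) for a in attributes}

# Target class: more than 30 rentals a year. Contrasting class: fewer than 5.
target = [c for c in customers if c["rentals_per_year"] > 30]
contrast = [c for c in customers if c["rentals_per_year"] < 5]

attributes = ["age", "member_years"]
target_summary = characterize(target, attributes)
contrast_summary = characterize(contrast, attributes)

# Discrimination: the same summaries plus a comparative measure.
difference = {a: target_summary[a] - contrast_summary[a] for a in attributes}

print("target class:     ", target_summary)
print("contrasting class:", contrast_summary)
print("difference:       ", difference)
```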
Mining Frequent Patterns, Associations, and Correlations
 Frequent patterns, as the name suggests, are patterns that occur
frequently in data. There are many kinds of frequent patterns,
including:
 Itemsets
 Subsequences
 Substructures
 Mining frequent patterns leads to the discovery of interesting
associations and correlations within data.
 Association analysis: Association analysis studies the
frequency of items occurring together in transactional databases,
and based on a threshold called support, identifies the frequent
itemsets. Another threshold, confidence, which is the
conditional probability that an item appears in a transaction
when another item appears, is used to pinpoint association rules.
 Association analysis is widely used for market basket or
transaction data analysis.
 An example of such a rule, mined from the AllElectronics transactional
database, is
buys(X, “computer”) => buys(X, “software”) [support = 1%,
confidence = 50%]
 where X is a variable representing a customer. A confidence, or
certainty, of 50% means that if a customer buys a computer,
there is a 50% chance that she will buy software as well. A 1%
support means that 1% of all of the transactions under analysis
showed that computer and software were purchased together.
 This association rule involves a single attribute or predicate (i.e.,
buys) that repeats. Association rules that contain a single
predicate are referred to as single-dimensional association rules.
 A data mining system may also find association rules like
age(X, “20...29”) ^ income(X, “20K...29K”) => buys(X, “CD player”)
[support = 2%, confidence = 60%]
 Note that this is an association between more than one attribute, or
predicate (i.e., age, income, and buys). Because each attribute is referred to as a
dimension, the above rule can be referred to as a multidimensional
association rule.
 Typically, association rules are discarded as uninteresting if they do not
satisfy both a minimum support threshold and a minimum confidence
threshold. Additional analysis can be performed to uncover interesting
statistical correlations between associated attribute-value pairs.
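Support and confidence, as defined above, can be computed directly from a list of transactions. The sketch below uses eight invented transactions (not the AllElectronics data), and the thresholds passed to `is_interesting` are hypothetical.

```python
# Hypothetical transactions (invented for illustration).
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"software"},
    {"computer", "software", "printer"},
    {"printer"},
    {"computer"},
    {"software", "printer"},
    {"computer", "software"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional probability of the consequent given the antecedent."""
    antecedent_support = support(antecedent, transactions)
    if antecedent_support == 0:
        raise ValueError("antecedent never occurs in the transactions")
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / antecedent_support

def is_interesting(antecedent, consequent, transactions,
                   min_support, min_confidence):
    """Keep a rule only if it meets both minimum thresholds."""
    both = set(antecedent) | set(consequent)
    return (support(both, transactions) >= min_support
            and confidence(antecedent, consequent, transactions) >= min_confidence)

# Rule: buys(X, "computer") => buys(X, "software")
s = support({"computer", "software"}, transactions)
c = confidence({"computer"}, {"software"}, transactions)
print(f"support = {s:.1%}, confidence = {c:.1%}")
print(is_interesting({"computer"}, {"software"}, transactions, 0.3, 0.5))
```

Here 3 of the 8 transactions contain both items and 5 contain a computer, so the rule has support 3/8 and confidence 3/5.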
 Classification: Classification is the process of finding a model (or
function) that describes and distinguishes data classes or concepts, for
the purpose of being able to use the model to predict the class of
objects whose class label is unknown. The derived model is based on
the analysis of a set of training data (i.e., data objects whose class label
is known).
 A classification model can be represented in various forms, such as:
 IF-THEN rules
 A decision tree
 A neural network
 A support vector machine (SVM)
 Bayesian classification
Classification vs. Prediction
 Classification predicts categorical class labels (discrete or
nominal). It constructs a model from the training set and the
values (class labels) of a classifying attribute, and uses that model
to classify new data.
 Prediction models continuous-valued functions, i.e., it predicts
unknown or missing values.
 Typical applications
 Credit/loan approval
 Target Marketing
 Medical diagnosis: whether a tumor is cancerous or benign
 Fraud detection: whether a transaction is fraudulent
 Web page categorization: which category a page belongs to
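The distinction between classification and prediction drawn above can be illustrated with one small data set. In this sketch the numbers are invented, prediction is shown as a least-squares line fit (one common technique, chosen here as an assumption), and the cut-off in `classify` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations (invented for illustration).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)

def predict(x):
    """Prediction: returns a continuous value."""
    return slope * x + intercept

def classify(x, cutoff=8.0):
    """Classification: returns a categorical label (hypothetical cut-off)."""
    return "high" if predict(x) >= cutoff else "low"

print(predict(6))   # a number
print(classify(6))  # a label
```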
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes.
 Each tuple/sample is assumed to belong to a predefined class, as determined
by the class label attribute.
 The set of tuples used for model construction is the training set.
 The model is represented as classification rules, decision trees, or
mathematical formulae.
 Model usage: classifying future or unknown objects.
 Estimate the accuracy of the model: the known label of each test sample
is compared with the model's classification result.
 The accuracy rate is the percentage of test set samples that are correctly
classified by the model.
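The two steps can be sketched with a nearest-centroid classifier. This model type is an assumption made for brevity; it is not one of the representations listed earlier. The feature values and the labels "approve" and "reject" are invented. Step 1 builds the model from the training set, and step 2 estimates its accuracy on a separate test set before classifying a new object.

```python
from math import dist

# Hypothetical labeled samples: (two numeric features, class label).
training_set = [
    ((1.0, 1.2), "approve"),
    ((1.5, 0.8), "approve"),
    ((1.2, 1.0), "approve"),
    ((4.0, 3.5), "reject"),
    ((4.5, 4.0), "reject"),
    ((3.8, 3.9), "reject"),
]
test_set = [
    ((1.1, 0.9), "approve"),
    ((4.2, 3.7), "reject"),
    ((1.4, 1.3), "approve"),
    ((3.2, 2.6), "approve"),  # lies closer to the "reject" centroid
]

def train(samples):
    """Step 1, model construction: compute one centroid per class."""
    sums = {}
    for features, label in samples:
        total, count = sums.get(label, ([0.0] * len(features), 0))
        sums[label] = ([t + f for t, f in zip(total, features)], count + 1)
    return {label: tuple(t / n for t in total)
            for label, (total, n) in sums.items()}

def classify(model, features):
    """Step 2, model usage: assign the class of the nearest centroid."""
    return min(model, key=lambda label: dist(model[label], features))

def accuracy(model, samples):
    """Fraction of samples whose known label matches the model's result."""
    correct = sum(classify(model, f) == label for f, label in samples)
    return correct / len(samples)

model = train(training_set)
print(f"accuracy on test set: {accuracy(model, test_set):.0%}")
print(classify(model, (4.1, 4.2)))  # classify a new, unlabeled object
```

Three of the four test samples are classified correctly, so the estimated accuracy rate is 75%.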
Clustering
 Clustering: Clustering is the process of partitioning a set of data (or
objects) into a set of meaningful sub-classes, called clusters.
 Unlike classification, in clustering the class labels are unknown,
and it is up to the clustering algorithm to discover acceptable classes.
 Clustering is also called unsupervised classification, because the
classification is not dictated by given class labels.
 There are many clustering approaches, all based on the principle of
maximizing the similarity between objects in the same class (intra-class
similarity) and minimizing the similarity between objects of different
classes (inter-class similarity).
Applications of Cluster Analysis
 Understanding
 Group related documents for browsing
 Group genes and proteins that have similar functionality
 Group stocks with similar price fluctuations
Clustering Algorithms
 K-means and its variants
 Hierarchical clustering
 Density-based clustering
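As an illustration of the first algorithm in the list, here is a minimal k-means sketch. Several choices are assumptions made for this example: the first k points serve as initial centroids (real implementations usually pick them randomly or with a smarter seeding scheme), distance is Euclidean, and the six points are invented.

```python
from math import dist

def k_means(points, k, max_iterations=100):
    """Partition points into k clusters; returns (centroids, assignments)."""
    centroids = list(points[:k])  # assumption: first k points as the start
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assignments

# Hypothetical two-dimensional points (invented for illustration).
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)
print(assignments)
print(centroids)
```

No class labels are supplied; the two groups emerge from the distances alone, which is what makes this unsupervised.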
Supervised vs. Unsupervised Learning
 Supervised learning (classification):
 The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
 New data is classified based on the training set
 Unsupervised learning (clustering):
 The class labels of the training data are unknown
 Given a set of measurements, observations, etc., the aim is to
establish the existence of classes or clusters in the data
 Outlier analysis: Outliers are data elements that cannot be grouped in
a given class or cluster. Also known as exceptions or surprises, they are
often very important to identify. While outliers can be considered noise
and discarded in some applications, they can reveal important
knowledge in other domains, and thus can be very significant and their
analysis valuable.
 Outlier analysis may uncover fraudulent usage of credit cards by
detecting purchases of extremely large amounts for a given account
number in comparison to regular charges incurred by the same
account.
 Outlier values may also be detected with respect to the location and
type of purchase, or the purchase frequency.
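The credit card example can be sketched by flagging charges that lie far from the account's typical charge. The measure used here, deviation from the median scaled by the median absolute deviation, is one robust choice among many and is an assumption of this sketch, as are the cut-off value and the charge amounts.

```python
from statistics import median

def find_outliers(charges, cutoff=5.0):
    """Return charges whose deviation from the median exceeds
    cutoff times the median absolute deviation (MAD).
    The default cutoff is a hypothetical value chosen for illustration."""
    center = median(charges)
    mad = median(abs(c - center) for c in charges)
    if mad == 0:
        return []  # all charges are (nearly) identical; nothing to flag
    return [c for c in charges if abs(c - center) / mad > cutoff]

# Hypothetical charges on one account (invented for illustration).
charges = [42.0, 55.5, 38.0, 61.0, 47.5, 2500.0, 52.0, 44.0]
print(find_outliers(charges))
```

A median-based measure is used instead of the mean and standard deviation because a single extreme charge inflates both of those, which can hide the very value being looked for.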
Data Mining Architecture
 Data mining is a very important process where potentially useful
and previously unknown information is extracted from large
volumes of data.
 There are a number of components involved in the data mining
process. These components constitute the architecture of a data
mining system.
 The major components of any data mining system are the data
source, data warehouse server, data mining engine, pattern
evaluation module, graphical user interface, and
knowledge base.
 Data Sources
 Database, data warehouse, World Wide Web (WWW), text files and
other documents are the actual sources of data. You need large
volumes of historical data for data mining to be successful.
 Different Processes
 The data needs to be cleaned, integrated and selected before
passing it to the database or data warehouse server.
 Database or Data Warehouse Server
 The database or data warehouse server contains the actual data that
is ready to be processed. Hence, the server is responsible for
retrieving the relevant data based on the data mining request of the
user.
 Data Mining Engine
 It consists of a number of modules for performing data mining tasks
including association, classification, characterization, clustering,
prediction, time-series analysis etc.
 Pattern Evaluation Module
 The pattern evaluation module is mainly responsible for measuring the
interestingness of a pattern against a threshold value. It interacts
with the data mining engine to focus the search on interesting
patterns.
 Graphical User Interface
 The graphical user interface module communicates between the user
and the data mining system. This module helps the user use the system
easily and efficiently without knowing the real complexity behind the
process. When the user specifies a query or a task, this module interacts
with the data mining system and displays the result in an easily
understandable manner.
 Knowledge Base
 The knowledge base is helpful in the whole data mining process. It
might be useful for guiding the search or evaluating the
interestingness of the result patterns. The knowledge base might
even contain user beliefs and data from user experiences that can be
useful in the process of data mining. The data mining engine might
get inputs from the knowledge base to make the result more
accurate and reliable.
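How these components interact can be sketched with a few small classes. Every class name, method name, and record below is hypothetical and chosen only to mirror the component names above; the "mining task" is reduced to counting attribute values, and the knowledge base holds nothing but a threshold.

```python
from collections import Counter

class KnowledgeBase:
    """Background knowledge; here only an interestingness threshold."""
    def __init__(self, min_count):
        self.min_count = min_count

class DataWarehouseServer:
    """Holds cleaned, integrated data and retrieves what a request needs."""
    def __init__(self, records):
        self.records = records

    def fetch(self, attribute):
        return [r[attribute] for r in self.records if attribute in r]

class PatternEvaluationModule:
    """Measures interestingness against the knowledge base threshold."""
    def __init__(self, knowledge_base):
        self.knowledge_base = knowledge_base

    def is_interesting(self, count):
        return count >= self.knowledge_base.min_count

class DataMiningEngine:
    """Runs a mining task and keeps only the interesting patterns."""
    def __init__(self, server, evaluator):
        self.server = server
        self.evaluator = evaluator

    def frequent_values(self, attribute):
        counts = Counter(self.server.fetch(attribute))
        return {v: n for v, n in counts.items()
                if self.evaluator.is_interesting(n)}

class UserInterface:
    """Passes the user's request to the engine and formats the result."""
    def __init__(self, engine):
        self.engine = engine

    def run(self, attribute):
        patterns = self.engine.frequent_values(attribute)
        return [f"{attribute}={value} occurs {count} times"
                for value, count in sorted(patterns.items())]

# Hypothetical data source (invented for illustration).
records = [{"item": "computer"}, {"item": "software"}, {"item": "computer"},
           {"item": "printer"}, {"item": "software"}, {"item": "computer"}]

ui = UserInterface(
    DataMiningEngine(DataWarehouseServer(records),
                     PatternEvaluationModule(KnowledgeBase(min_count=2)))
)
report = ui.run("item")
print("\n".join(report))
```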
Data Mining Issues
 Major issues in data mining fall into five
groups:
 Mining methodology
 User interaction
 Efficiency and scalability
 Diversity of data types
 Data mining and society
Data Mining Issues
 Mining methodology
 Mining different kinds of knowledge in databases
 Mining knowledge at multiple levels of abstraction
 Handling noisy or incomplete data
 Pattern evaluation
 User interaction
 Interactive mining
 Incorporation of background knowledge
 Query languages and ad hoc mining
 Presentation and visualization of data mining results
 Efficiency and scalability
 Efficiency and scalability of data mining algorithms
 Parallel, distributed, and incremental mining algorithms
 Diversity of data types
 Handling of relational and complex types of data
 Mining information from heterogeneous databases and global
information systems
 Data mining and society
 Social impact of data mining
 Privacy-preserving data mining
 Invisible data mining
Data Mining Applications
Data mining is widely used in the following areas:
 Healthcare and Insurance
 Measuring Treatment Effectiveness – This application of data mining
involves comparing and contrasting symptoms, causes and courses of
treatment to find the most effective course of action for a certain illness or
condition. For example, patient groups who are treated with different drug
regimens can be compared to determine which treatment plans work best
and save the most money.
 Detecting Fraud and Abuse – This involves establishing normal patterns,
then identifying unusual patterns of medical claims by clinics, physicians,
labs, or others. This application can also be used to identify inappropriate
referrals or prescriptions and insurance fraud and fraudulent medical
claims. The Texas Medicaid Fraud and Abuse Detection System is a good
example of a business using data mining to detect fraud.
 Education
 Data mining in education is concerned with developing methods that discover
knowledge from data originating in educational environments.
 Its goals include predicting students’ future learning behavior and
studying the effects of educational support. An institution can use data
mining to make well-informed decisions and to predict student results.
With these results, the institution can focus on what to teach and
how to teach it.
 Retail Industry
Data mining in retail industry helps in identifying customer buying patterns
and trends that lead to improved quality of customer service and good
customer retention and satisfaction.
 Analysis of effectiveness of sales campaigns.
 Customer Retention.
 Product recommendation and cross-referencing of items.
 Market basket analysis
 Banking/Finance
Financial data in the banking and financial industry is generally reliable and
of high quality, which facilitates systematic data analysis and data mining.
Some typical cases are as follows:
 Loan payment prediction and customer credit policy analysis.
 Classification and clustering of customers for targeted marketing.
 Detection of money laundering and other financial crimes.
 Intrusion Detection
Data mining can help improve intrusion detection by adding a level of focus
to anomaly detection. It helps an analyst distinguish unusual activity from
common everyday network activity.
 Monitoring and analyzing traffic
 Identifying abnormal activity
 Other applications are:
 Bioinformatics
 Crime agencies
 Scientific Applications
