IT326 - Ch1
What is Data Mining?
Alternative names:
Knowledge discovery (mining) from data (KDD), knowledge extraction,
data/pattern analysis, business intelligence, etc.
What is the difference between data mining and database query?
Database query:
◼ Find all credit applicants with the last name of Smith.
◼ Identify customers who have purchased more than $10,000 in the last month.
◼ Find all customers who have purchased milk.
Data mining:
◼ Find all credit applicants who are high credit risks. (classification)
◼ Identify customers with similar buying habits. (clustering)
◼ Find all items which are frequently purchased with milk. (association rules)
Knowledge Discovery (KDD) Process
2.
3. Data cleaning [Data pre-processing]
4. Data integration [Data pre-processing]
5. Data transformation [Data pre-processing]
6. Data mining
7. Pattern evaluation
8. Knowledge presentation
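As a rough illustration, the steps above can be chained as small functions. This is only a sketch: the records, the field names (`id`, `spend`), the 0.5 threshold, and the "high spender" mining step are all made up for the example, and pattern evaluation and presentation are reduced to printing the result.

```python
# Hypothetical toy records from two sources; None marks a missing value.
store_a = [{"id": 1, "spend": 120.0}, {"id": 2, "spend": None}]
store_b = [{"id": 3, "spend": 80.0}, {"id": 4, "spend": 4000.0}]

def clean(records):
    """Data cleaning: drop records with a missing spend value."""
    return [r for r in records if r["spend"] is not None]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def transform(records):
    """Data transformation: divide each spend by the largest spend."""
    top = max(r["spend"] for r in records)
    return [{**r, "scaled": r["spend"] / top} for r in records]

def mine(records, threshold=0.5):
    """Data mining (stand-in): return ids whose scaled spend exceeds threshold."""
    return [r["id"] for r in records if r["scaled"] > threshold]

def kdd(*sources):
    """Run cleaning, integration, transformation, and mining in order."""
    data = transform(integrate(*(clean(s) for s in sources)))
    return mine(data)

big_spenders = kdd(store_a, store_b)
# Pattern evaluation and knowledge presentation are reduced to a print here.
print("High spenders:", big_spenders)
```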
Data Mining Tasks
◼ Classification [Predictive]
◼ Clustering [Descriptive]
◼ Frequent Pattern and Association [Descriptive]
◼ Outlier Analysis [Predictive]
A simple metaphor for the concept of classification
Direct Marketing:
Goal: Reduce cost of mailing/advertising by targeting a set of consumers likely to
buy a new product.
Approach:
◼ Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class label.
◼ Collect various demographic, lifestyle, and company-interaction related information about all such customers.
  ◼ Type of business, where they live, how much they earn, etc.
◼ Use this information as input attributes to train a classifier model.
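A minimal sketch of this approach, assuming a k-nearest-neighbor classifier as a stand-in (the slides do not name a specific model). The customers, the two attributes (income in thousands, age), and the prospect names are hypothetical.

```python
from collections import Counter
import math

# Hypothetical history for a similar earlier product:
# (annual income in thousands, age) -> class label.
training = [
    ((85, 34), "buy"),
    ((92, 41), "buy"),
    ((78, 29), "buy"),
    ((30, 52), "don't buy"),
    ((25, 47), "don't buy"),
    ((38, 60), "don't buy"),
]

def predict(attributes, k=3):
    """Return the majority class label among the k nearest training customers."""
    nearest = sorted(training, key=lambda row: math.dist(row[0], attributes))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Mail only to new customers the classifier predicts will buy.
prospects = {"p1": (88, 38), "p2": (28, 55)}
mailing_list = [name for name, attrs in prospects.items() if predict(attrs) == "buy"]
print(mailing_list)
```

In practice the attributes would be rescaled before measuring distances, so that no single attribute (such as income) dominates.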
Kids' sorting/grouping game as a clustering problem
[Figure: ungrouped (unclustered) data, then the same data grouped by color]
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers where any subset may
conceivably be selected as a market target to be reached with a distinct
marketing mix.
Approach:
◼ Collect different attributes of customers based on their geographical and lifestyle related information.
◼ Find clusters of similar customers.
◼ Measure the clustering quality by comparing the buying patterns of customers in the same cluster with those of customers in different clusters.
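The three steps above can be sketched with plain k-means, which is one possible clustering method (the slides do not prescribe one). The customers, their attributes (age, income in thousands), the monthly spend figures, and the starting centroids are all hypothetical.

```python
import math
from statistics import mean

# Hypothetical customers: (age, income in thousands) plus monthly spend.
customers = [
    ((22, 30), 40), ((25, 32), 45), ((24, 28), 38),
    ((55, 90), 200), ((58, 95), 210), ((60, 88), 190),
]
points = [attrs for attrs, _ in customers]

def kmeans(points, centroids, rounds=10):
    """Plain k-means: assign each point to its nearest centroid, then update centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(rounds):
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(mean(dim) for dim in zip(*members))
    return labels

# Start from two of the customers as initial centroids (an arbitrary choice).
labels = kmeans(points, [points[0], points[3]])

# Quality check: spending should vary little within a cluster
# and differ clearly between clusters.
spend_by_cluster = {}
for (_, spend), label in zip(customers, labels):
    spend_by_cluster.setdefault(label, []).append(spend)

within = mean(max(s) - min(s) for s in spend_by_cluster.values())
between = abs(mean(spend_by_cluster[0]) - mean(spend_by_cluster[1]))
print(labels, within, between)
```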
Frequent patterns, Association and Correlation Analysis
Rule: Bread ⇒ Milk
Support = 50%
(means that 50% of all the transactions under analysis show
that bread and milk are purchased together).
Confidence = 75%
(means that if a customer buys bread, there is a 75% chance
that she will buy milk as well).
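To make the two measures concrete, here is a small sketch that computes them from a list of transactions. The six baskets are hypothetical and were chosen so that the rule bread ⇒ milk comes out at the 50% support and 75% confidence quoted above.

```python
# Hypothetical market-basket data (one set of items per transaction).
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"eggs"},
]

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions)

def support(itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Fraction of transactions with the antecedent that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

print(support({"bread", "milk"}))       # 0.5
print(confidence({"bread"}, {"milk"}))  # 0.75
```

Three of the six baskets contain both items (support 3/6), and three of the four baskets with bread also contain milk (confidence 3/4).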
Outlier Analysis
Outlier: A data object that does not comply with the general behavior of
the data.
Useful in fraud detection, and rare events analysis.
Example: “Find unusual activity in a client’s banking transactions.”
Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of
unusually large amounts for a given account number in comparison to regular charges
incurred by the same account. Outlier values may also be detected with respect to the
locations and types of purchase, or the purchase frequency.
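One simple way to flag unusually large charges on an account is a modified z-score based on the median absolute deviation (MAD). This is only a sketch of one possible technique, not the method the slides prescribe; the charges are hypothetical, and the 0.6745 constant and 3.5 cutoff are a common rule of thumb rather than fixed values.

```python
from statistics import median

# Hypothetical charges for a single account.
charges = [42.0, 18.5, 63.0, 25.0, 37.5, 51.0, 2950.0, 29.0]

def outliers(amounts, cutoff=3.5):
    """Return amounts whose MAD-based modified z-score exceeds cutoff."""
    mid = median(amounts)
    mad = median([abs(a - mid) for a in amounts])
    if mad == 0:
        return []  # no spread to compare against
    return [a for a in amounts if 0.6745 * abs(a - mid) / mad > cutoff]

suspicious = outliers(charges)
print(suspicious)
```

The median is used instead of the mean so that the single very large charge does not distort the baseline it is compared against.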
Summary