CIS 674 Introduction to Data Mining
Srinivasan Parthasarathy srini@[Link] Office Hours: TTH 2-3:18PM DL317
Prentice Hall 1
Introduction Outline
Goal: Provide an overview of data mining.
Define data mining
Data mining vs. databases
Basic data mining tasks
Data mining development
Data mining issues
Introduction
Data is produced at a phenomenal rate
Our ability to store has grown
Users expect more sophisticated information
How?
UNCOVER HIDDEN INFORMATION
DATA MINING
Data Mining
Objective: Fit data to a model
Potential Result: Higher-level meta information that may not be obvious when looking at raw data
Similar terms
Exploratory data analysis
Data driven discovery
Deductive learning
Data Mining Algorithm
Objective: Fit Data to a Model
Descriptive
Predictive
Preferential Questions
Which technique to choose?
ARM (association rule mining)/Classification/Clustering
Answer: Depends on what you want to do with the data.
Search Strategy
Technique to search the data
Interface? Query Language? Efficiency
Database Processing vs. Data Mining Processing
          Database Processing            Data Mining Processing
Query     Well defined; SQL              Poorly defined; no precise query language
Output    Precise; subset of database    Fuzzy; not a subset of database
Query Examples
Database
Find all credit applicants with last name of Smith.
Identify customers who have purchased more than $10,000 in the last month.
Find all customers who have purchased milk.
Data Mining
Find all credit applicants who are poor credit risks. (classification)
Identify customers with similar buying habits. (clustering)
Find all items which are frequently purchased with milk. (association rules)
Data Mining Models and Tasks
Basic Data Mining Tasks
Classification maps data into predefined groups or classes.
Supervised learning
Pattern recognition
Prediction
Regression maps a data item to a real-valued prediction variable.
Clustering groups similar data together into clusters.
Unsupervised learning
Segmentation
Partitioning
Basic Data Mining Tasks (contd)
Summarization maps data into subsets with associated simple descriptions.
Characterization
Generalization
Link Analysis uncovers relationships among data.
Affinity Analysis
Association Rules
Sequential Analysis determines sequential patterns.
Ex: Time Series Analysis
Example: Stock Market
Predict future values
Determine similar patterns over time
Classify behavior
Data Mining vs. KDD
Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.
Knowledge Discovery Process
Data mining: the core of the knowledge discovery process.
[Figure: KDD pipeline with stages labeled Databases, Data Integration, Data Cleaning, Preprocessed Data, Selection, Data Transformations, Task-relevant Data, Data Mining, Interpretation, Knowledge]
KDD Process Ex: Web Log
Selection:
Select log data (dates and locations) to use
Preprocessing:
Remove identifying URLs
Remove error logs
Transformation:
Sessionize (sort and group)
Data Mining:
Identify and count patterns
Construct data structure
Interpretation/Evaluation:
Identify and display frequently accessed sequences.
Potential User Applications:
Cache prediction
Personalization
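The web log steps above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the log records, the `SESSION_GAP` timeout, and the choice to count consecutive page pairs are all assumptions made for the example.

```python
from collections import Counter
from itertools import groupby

# Hypothetical log records: (user, timestamp in seconds, url, HTTP status).
LOG = [
    ("u1", 0, "/home", 200),
    ("u1", 30, "/products", 200),
    ("u1", 60, "/cart", 200),
    ("u2", 10, "/home", 200),
    ("u2", 40, "/products", 200),
    ("u2", 50, "/missing", 404),
    ("u1", 5000, "/home", 200),
    ("u1", 5030, "/products", 200),
]

SESSION_GAP = 1800  # assumed inactivity timeout, in seconds


def preprocess(records):
    """Preprocessing: drop error entries (status 400 and above)."""
    return [r for r in records if r[3] < 400]


def sessionize(records, gap=SESSION_GAP):
    """Transformation: sort by user and time, then split on long gaps."""
    sessions = []
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    for _user, rows in groupby(ordered, key=lambda r: r[0]):
        current, last_time = [], None
        for _u, ts, url, _status in rows:
            if last_time is not None and ts - last_time > gap:
                sessions.append(current)
                current = []
            current.append(url)
            last_time = ts
        if current:
            sessions.append(current)
    return sessions


def count_pairs(sessions):
    """Data mining: count consecutive page pairs across all sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


sessions = sessionize(preprocess(LOG))
pair_counts = count_pairs(sessions)
# Interpretation: show the most frequently accessed sequence.
print(pair_counts.most_common(1))
```

Counting only adjacent pairs keeps the sketch short; longer sequences would need a richer data structure, which is what the "Construct data structure" step refers to.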
Data Mining Development
[Figure: areas contributing to DATA MINING]
Relational Data Model, SQL, Association Rule Algorithms, Data Warehousing, Scalability Techniques
Similarity Measures, Hierarchical Clustering, IR Systems, Imprecise Queries, Textual Data, Web Search Engines
Bayes Theorem, Regression Analysis, EM Algorithm, K-Means Clustering, Time Series Analysis
Algorithm Design Techniques, Algorithm Analysis, Data Structures
Neural Networks, Decision Tree Algorithms
HIGH PERFORMANCE
KDD Issues
Human Interaction
Overfitting
Outliers
Interpretation
Visualization
Large Datasets
High Dimensionality
KDD Issues (contd)
Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration
Application
Social Implications of DM
Privacy
Profiling
Unauthorized use
Data Mining Metrics
Usefulness
Return on Investment (ROI)
Accuracy
Space/Time
Database Perspective on Data Mining
Scalability
Real World Data
Updates
Ease of Use
Outline of Today's Class
Statistical Basics
Point Estimation
Models Based on Summarization
Bayes Theorem
Hypothesis Testing
Regression and Correlation
Similarity Measures
Point Estimation
Point Estimate: estimate a population parameter.
May be made by calculating the parameter for a sample.
May be used to predict value for missing data.
Ex:
R contains 100 employees
99 have salary information
Mean salary of these is $50,000
Use $50,000 as value of the remaining employee's salary.
Is this a good idea?
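The salary example can be tried directly. The sketch below is an illustration under assumptions: the individual salary values are hypothetical (the slide gives only the mean), and `impute_mean` is a name introduced here.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    estimate = mean(observed)
    filled = [estimate if v is None else v for v in values]
    return filled, estimate


# Hypothetical data: 99 known salaries averaging 50,000 and one missing.
salaries = [40_000, 50_000, 60_000] * 33 + [None]
filled, estimate = impute_mean(salaries)
print(estimate)
```

Whether this is a good idea depends on the missing employee: the mean is a reasonable guess for a typical record but a poor one if the missing value belongs to an unusual case, such as an executive.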
Estimation Error
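Bias: Difference between expected value and actual value: $\mathrm{Bias} = E[\hat{\theta}] - \theta$
Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value: $\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$
Why square?
Root Mean Square Error (RMSE): $\mathrm{RMSE}(\hat{\theta}) = \sqrt{\mathrm{MSE}(\hat{\theta})}$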
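A small sketch of these error measures, assuming we have several estimates of the same parameter and know the true value. The expectation is approximated by an average over the estimates, and the numbers are hypothetical.

```python
import math


def bias(estimates, true_value):
    """Average estimate minus the true value."""
    return sum(estimates) / len(estimates) - true_value


def mse(estimates, true_value):
    """Average squared difference between estimate and true value."""
    return sum((e - true_value) ** 2 for e in estimates) / len(estimates)


def rmse(estimates, true_value):
    """Square root of the MSE, in the same units as the parameter."""
    return math.sqrt(mse(estimates, true_value))


estimates = [48.0, 52.0, 50.0, 54.0]  # hypothetical repeated estimates
true_value = 50.0
print(bias(estimates, true_value), mse(estimates, true_value))
```

One common answer to "Why square?": errors of -2 and +2 cancel in the bias, but both contribute 4 to the MSE, so squaring keeps over- and under-estimates from hiding each other.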
Jackknife Estimate
Jackknife Estimate: estimate of parameter is obtained by omitting one value from the set of observed values.
Treat the data like a population
Take samples from this population
Use these samples to estimate the parameter
Let $\hat{\theta}$ be an estimate on the entire population.
Let $\hat{\theta}_{(j)}$ be an estimator of the same form with observation j deleted.
Allows you to examine the impact of outliers!
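A minimal sketch of the leave-one-out idea, using the mean as the estimator. The data values are hypothetical, and averaging the leave-one-out estimates is one simple way to combine them, assumed here for illustration.

```python
from statistics import mean


def leave_one_out(values, estimator=mean):
    """Estimate with observation j deleted, for every j."""
    return [estimator(values[:j] + values[j + 1:]) for j in range(len(values))]


def jackknife(values, estimator=mean):
    """Average of the leave-one-out estimates."""
    loo = leave_one_out(values, estimator)
    return sum(loo) / len(loo)


data = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical; the last value is an outlier
loo = leave_one_out(data)
print(loo)
print(jackknife(data))
```

The estimate obtained with the outlier deleted (2.5) sits far from the others (about 27), which is how the leave-one-out estimates expose an influential observation.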
Maximum Likelihood Estimate (MLE)
Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model.
The joint probability of observing the sample data is obtained by multiplying the individual probabilities.
Likelihood function: $L(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta)$
Maximize L.
MLE Example
Coin toss five times: {H,H,H,H,T}
Assuming a perfect coin with H and T equally likely, the likelihood of this sequence is: $L = 0.5^5 \approx 0.03$
However if the probability of a H is 0.8 then: $L = 0.8^4 \times 0.2 \approx 0.08$
MLE Example (contd)
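General likelihood formula: $L(p \mid x_1, \ldots, x_5) = \prod_{i=1}^{5} p^{x_i} (1 - p)^{1 - x_i}$, with $x_i = 1$ for H and $x_i = 0$ for T.
Estimate for p is then 4/5 = 0.8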
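The coin example can be checked numerically. This sketch encodes H as 1 and T as 0 (an assumption of the example), evaluates the likelihood at p = 0.5 and p = 0.8, and confirms the maximum with a coarse grid search alongside the closed-form estimate.

```python
def likelihood(p, tosses):
    """Product of p for each head (1) and 1 - p for each tail (0)."""
    result = 1.0
    for x in tosses:
        result *= p if x == 1 else (1.0 - p)
    return result


def mle_bernoulli(tosses):
    """Closed-form maximizer: the fraction of heads."""
    return sum(tosses) / len(tosses)


tosses = [1, 1, 1, 1, 0]  # {H, H, H, H, T}
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=lambda p: likelihood(p, tosses))

print(likelihood(0.5, tosses))  # about 0.03
print(likelihood(0.8, tosses))  # about 0.08
print(best_p, mle_bernoulli(tosses))
```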
Expectation-Maximization (EM)
Solves estimation with incomplete data.
Obtain initial estimates for parameters.
Iteratively use estimates for missing data and continue until convergence.
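A deliberately simple instance of this loop is estimating a mean when some values are missing. The data, the initial guess, and the function name `em_mean` below are illustrative assumptions, not taken from the slides.

```python
def em_mean(observed, n_missing, initial, tol=1e-9, max_iter=1000):
    """Estimate a mean when n_missing values are unobserved."""
    mu = initial
    for _ in range(max_iter):
        # Expectation step: stand in the current estimate for each missing value.
        completed_sum = sum(observed) + n_missing * mu
        # Maximization step: re-estimate the mean from the completed data.
        new_mu = completed_sum / (len(observed) + n_missing)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


observed = [1.0, 5.0, 10.0, 4.0]  # hypothetical observed values
estimate = em_mean(observed, n_missing=2, initial=3.0)
print(round(estimate, 4))
```

Each pass moves the estimate closer to the mean of the observed values (5 here), which is the fixed point of the update; the starting guess affects only how many iterations are needed.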
EM Example
EM Algorithm
Bayes Theorem Example
Credit authorizations (hypotheses):
h1 = authorize purchase
h2 = authorize after further identification
h3 = do not authorize
h4 = do not authorize but contact police
Assign twelve data values for all combinations of credit and income:
Income  Excellent  Good  Bad
1       x1         x5    x9
2       x2         x6    x10
3       x3         x7    x11
4       x4         x8    x12
From training data: P(h1) = 60%; P(h2) = 20%; P(h3) = 10%; P(h4) = 10%.
Bayes Example(contd)
Training Data:
ID  Income  Credit     Class  xi
1   4       Excellent  h1     x4
2   3       Good       h1     x7
3   2       Excellent  h1     x2
4   3       Good       h1     x7
5   4       Good       h1     x8
6   2       Excellent  h1     x2
7   3       Bad        h2     x11
8   2       Bad        h2     x10
9   3       Bad        h3     x11
10  1       Bad        h4     x9
Bayes Example(contd)
Calculate P(xi|hj) and P(xi).
Ex: P(x7|h1) = 2/6; P(x4|h1) = 1/6; P(x2|h1) = 2/6; P(x8|h1) = 1/6; P(xi|h1) = 0 for all other xi.
Predict the class for x4:
Calculate P(hj|x4) for all hj.
Place x4 in class with largest value.
Ex: P(h1|x4) = P(x4|h1) P(h1) / P(x4) = (1/6)(0.6)/0.1 = 1.
x4 in class h1.
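The calculation can be reproduced from the ten training rows. In this sketch the helper names (`cell`, `posteriors`) and the `CREDIT_OFFSET` mapping are introduced for illustration; the mapping encodes the x1..x12 numbering from the credit/income grid.

```python
from collections import Counter

# (income, credit, class) for training IDs 1..10.
TRAINING = [
    (4, "Excellent", "h1"), (3, "Good", "h1"), (2, "Excellent", "h1"),
    (3, "Good", "h1"), (4, "Good", "h1"), (2, "Excellent", "h1"),
    (3, "Bad", "h2"), (2, "Bad", "h2"), (3, "Bad", "h3"), (1, "Bad", "h4"),
]

# x1..x4 are Excellent, x5..x8 are Good, x9..x12 are Bad.
CREDIT_OFFSET = {"Excellent": 0, "Good": 4, "Bad": 8}


def cell(income, credit):
    """Map an (income, credit) pair to its data value label."""
    return f"x{CREDIT_OFFSET[credit] + income}"


def posteriors(x, rows=TRAINING):
    """P(h | x) for every class h, using counts from the training rows."""
    n = len(rows)
    class_counts = Counter(h for _, _, h in rows)
    x_count = sum(1 for inc, cr, _ in rows if cell(inc, cr) == x)
    result = {}
    for h, h_count in class_counts.items():
        joint = sum(1 for inc, cr, c in rows if c == h and cell(inc, cr) == x)
        prior = h_count / n            # P(h)
        likelihood = joint / h_count   # P(x | h)
        evidence = x_count / n         # P(x)
        result[h] = likelihood * prior / evidence if x_count else 0.0
    return result


post = posteriors("x4")
predicted = max(post, key=post.get)
print(predicted, post)
```

With only ten rows the posterior for x4 is exactly 1 for h1 because x4 appears once, in class h1; on data values never seen in training the sketch returns 0 for every class, which a practical classifier would need to smooth.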
Other Statistical Measures
Chi-Squared: $\chi^2 = \sum \frac{(O - E)^2}{E}$
O: observed value
E: expected value based on hypothesis.
Jackknife Estimate
estimate of parameter is obtained by omitting one value from the set of observed values.
Regression
Predict future values based on past values
Linear Regression assumes linear relationship exists.
Find values to best fit the data
y = c0 + c1 x1 + ... + cn xn
Correlation
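The measures on this slide can be sketched with the standard library. Assumptions in this example: the regression is restricted to a single predictor (n = 1) fitted by least squares, "Correlation" is taken to mean Pearson's coefficient, and all data values are hypothetical.

```python
import math


def chi_squared(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def _centered_sums(xs, ys):
    """Sums of squares and cross-products about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return mx, my, sxy, sxx, syy


def fit_line(xs, ys):
    """Least-squares c0 and c1 for y = c0 + c1 * x."""
    mx, my, sxy, sxx, _ = _centered_sums(xs, ys)
    c1 = sxy / sxx
    return my - c1 * mx, c1


def correlation(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    _, _, sxy, sxx, syy = _centered_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # hypothetical points lying on y = 1 + 2x
c0, c1 = fit_line(xs, ys)
print(c0, c1, correlation(xs, ys))
print(chi_squared([45, 55], [50, 50]))
```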
Similarity Measures
Determine similarity between two objects.
Similarity characteristics:
Alternatively, distance measures measure how unlike or dissimilar objects are.
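The slide's own formulas did not survive extraction, so the sketch below shows three commonly used similarity measures (Jaccard, Dice, cosine) as an illustration. They may differ from the measures the lecture listed, and the function names are chosen here.

```python
import math


def jaccard(a, b):
    """Shared items divided by all distinct items."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dice(a, b):
    """Twice the shared items divided by the total size of both sets."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 1.0


def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


basket_a = {"milk", "bread", "eggs"}
basket_b = {"bread", "eggs", "jam"}
print(jaccard(basket_a, basket_b), dice(basket_a, basket_b))
print(cosine([1, 2], [2, 4]), cosine([1, 0], [0, 1]))
```

All three return 1 when an object is compared with itself and 0 when the objects share nothing, which is the behaviour usually expected of a similarity measure.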
Similarity Measures
Distance Measures
Measure dissimilarity between objects
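As an illustration, two widely used distance measures are Euclidean and Manhattan distance. The slide's formulas were lost in extraction, so treating these two as the intended examples is an assumption.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(u, v))


p, q = (0, 0), (3, 4)
print(euclidean(p, q), manhattan(p, q))
```

Unlike a similarity score, a distance is 0 for identical objects and grows as they become less alike.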
Information Retrieval
Information Retrieval (IR): retrieving desired information from textual data.
Library Science
Digital Libraries
Web Search Engines
Traditionally keyword based
Sample query:
Find all documents about data mining.
DM: Similarity measures; Mine text/Web data.
Information Retrieval (contd)
Similarity: measure of how close a query is to a document.
Documents which are close enough are retrieved.
Metrics:
Precision = |Relevant and Retrieved| / |Retrieved|
Recall = |Relevant and Retrieved| / |Relevant|
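Both metrics reduce to set operations. A minimal sketch, with hypothetical document IDs and a function name chosen for the example:

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for one query result."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |Relevant and Retrieved|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}   # hypothetical relevant documents
retrieved = {"d2", "d3", "d5"}        # hypothetical query result
precision, recall = precision_recall(relevant, retrieved)
print(precision, recall)
```

Here two of the three retrieved documents are relevant (precision 2/3) and two of the four relevant documents were found (recall 1/2). Returning 0.0 for an empty set is a convention chosen for this sketch.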
IR Query Result Measures and Classification
[Figure: two panels, IR and Classification]