CIS 674 Introduction to Data Mining

Srinivasan Parthasarathy
srini@cse.ohio-state.edu
Office Hours: TTH 2-3:18PM, DL317

Prentice Hall 1

Introduction Outline
Goal: Provide an overview of data mining.
  Define data mining
  Data mining vs. databases
  Basic data mining tasks
  Data mining development
  Data mining issues

Introduction
Data is produced at a phenomenal rate.
Our ability to store it has grown.
Users expect more sophisticated information.
How?
  UNCOVER HIDDEN INFORMATION
  DATA MINING

Data Mining
Objective: Fit data to a model
Potential result: Higher-level meta information that may not be obvious when looking at raw data
Similar terms:
  Exploratory data analysis
  Data driven discovery
  Deductive learning

Data Mining Algorithm

Objective: Fit data to a model
  Descriptive
  Predictive
Preferential questions: which technique to choose?
  ARM (association rule mining) / Classification / Clustering
  Answer: it depends on what you want to do with the data.
Search strategy: the technique used to search the data
  Interface? Query language? Efficiency

Database Processing vs. Data Mining Processing

          Database Processing          Data Mining Processing
Query     Well defined; SQL            Poorly defined; no precise query language
Output    Precise; subset of database  Fuzzy; not a subset of database

Query Examples
Database
  Find all credit applicants with last name of Smith.
  Identify customers who have purchased more than $10,000 in the last month.
  Find all customers who have purchased milk.

Data Mining
  Find all credit applicants who are poor credit risks. (classification)
  Identify customers with similar buying habits. (clustering)
  Find all items which are frequently purchased with milk. (association rules)
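
As a rough illustration of the contrast between the two kinds of query, the sketch below runs a database-style exact lookup and a mining-style co-occurrence count over a small, made-up list of transactions. The transactions, the function names and the `min_support` threshold are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical market-basket data: one set of items per transaction.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"milk", "cereal", "bread"},
    {"bread", "eggs"},
]

def customers_who_bought(item, baskets):
    """Database-style query: exact indices of the baskets containing `item`."""
    return [i for i, basket in enumerate(baskets) if item in basket]

def frequently_bought_with(item, baskets, min_support=0.5):
    """Mining-style question: items that co-occur with `item` in at least
    `min_support` of the baskets that contain `item`."""
    with_item = [b for b in baskets if item in b]
    if not with_item:
        return {}
    counts = Counter(other for b in with_item for other in b if other != item)
    return {other: n / len(with_item)
            for other, n in counts.items()
            if n / len(with_item) >= min_support}

milk_baskets = customers_who_bought("milk", transactions)
milk_partners = frequently_bought_with("milk", transactions)
print(milk_baskets)
print(milk_partners)
```

The first call returns a precise subset of the data; the second returns a derived pattern whose content depends on the chosen threshold.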

Data Mining Models and Tasks

Basic Data Mining Tasks

Classification maps data into predefined groups or classes.
  Supervised learning
  Pattern recognition
  Prediction
Regression maps a data item to a real-valued prediction variable.
Clustering groups similar data together into clusters.
  Unsupervised learning
  Segmentation
  Partitioning

Basic Data Mining Tasks (contd)

Summarization maps data into subsets with associated simple descriptions.
  Characterization
  Generalization
Link Analysis uncovers relationships among data.
  Affinity analysis
  Association rules
  Sequential analysis determines sequential patterns.

Ex: Time Series Analysis

Example: Stock market
  Predict future values
  Determine similar patterns over time
  Classify behavior

Data Mining vs. KDD

Knowledge Discovery in Databases (KDD): the process of finding useful information and patterns in data.
Data Mining: the use of algorithms to extract the information and patterns derived by the KDD process.

Knowledge Discovery Process

Data mining: the core of the knowledge discovery process.
[Figure: knowledge discovery pipeline. Labels on the slide: Databases, Data Integration, Data Cleaning, Preprocessed Data, Selection, Data Transformations, Task-relevant Data, Data Mining, Interpretation, Knowledge]

KDD Process Ex: Web Log

Selection:
  Select log data (dates and locations) to use
Preprocessing:
  Remove identifying URLs
  Remove error logs
Transformation:
  Sessionize (sort and group)
Data Mining:
  Identify and count patterns
  Construct data structure
Interpretation/Evaluation:
  Identify and display frequently accessed sequences
Potential User Applications:
  Cache prediction
  Personalization
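
The web log steps above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the log records are invented, each user is treated as a single session, and "patterns" are taken to be consecutive page pairs. A real log would need a proper session timeout and richer pattern types.

```python
from collections import Counter, defaultdict

# Hypothetical raw log records: (user, timestamp_seconds, url, http_status).
raw_log = [
    ("u1", 10, "/home", 200),
    ("u1", 20, "/products", 200),
    ("u2", 15, "/home", 200),
    ("u1", 30, "/cart", 200),
    ("u2", 25, "/products", 200),
    ("u2", 40, "/missing", 404),
    ("u3", 50, "/home", 200),
    ("u3", 60, "/products", 200),
]

# Preprocessing: drop error entries (assumed here to be any status >= 400).
clean = [record for record in raw_log if record[3] < 400]

# Transformation: sessionize by sorting on (user, time) and grouping per user.
sessions = defaultdict(list)
for user, _ts, url, _status in sorted(clean, key=lambda r: (r[0], r[1])):
    sessions[user].append(url)

# Data mining: count consecutive page pairs across all sessions.
pair_counts = Counter()
for pages in sessions.values():
    pair_counts.update(zip(pages, pages[1:]))

# Interpretation/evaluation: report the most frequently accessed sequences.
top_pairs = pair_counts.most_common(2)
print(top_pairs)
```

A frequent pair such as `/home` followed by `/products` is the kind of pattern a cache could use to prefetch the likely next page.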

Data Mining Development

[Figure: techniques contributing to DATA MINING]
Relational Data Model, SQL, Association Rule Algorithms, Data Warehousing, Scalability Techniques
Similarity Measures, Hierarchical Clustering, IR Systems, Imprecise Queries, Textual Data, Web Search Engines
Bayes Theorem, Regression Analysis, EM Algorithm, K-Means Clustering, Time Series Analysis
Neural Networks, Decision Tree Algorithms
Algorithm Design Techniques, Algorithm Analysis, Data Structures
HIGH PERFORMANCE

KDD Issues
Human Interaction
Overfitting
Outliers
Interpretation
Visualization
Large Datasets
High Dimensionality

KDD Issues (contd)

Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration
Application

Social Implications of DM
Privacy
Profiling
Unauthorized use

Data Mining Metrics

Usefulness
Return on Investment (ROI)
Accuracy
Space/Time

Database Perspective on Data Mining

Scalability
Real-world data
Updates
Ease of use

Outline of Today's Class

Statistical Basics
  Point Estimation
  Models Based on Summarization
  Bayes Theorem
  Hypothesis Testing
  Regression and Correlation
Similarity Measures

Point Estimation
Point estimate: an estimate of a population parameter.
May be made by calculating the parameter for a sample.
May be used to predict the value of missing data.
Ex:
  R contains 100 employees.
  99 have salary information.
  The mean salary of these is $50,000.
  Use $50,000 as the value of the remaining employee's salary.
  Is this a good idea?
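
The salary example can be sketched as mean imputation. The individual salaries below are made up so that the 99 known values average $50,000, and `assumed_true_salary` is a hypothetical value chosen to show how far off the estimate can be when the missing record is atypical.

```python
from statistics import mean

# Hypothetical salaries: 99 known values whose mean is $50,000,
# plus one missing value (None), mirroring the 100-employee example.
salaries = [40_000] * 33 + [50_000] * 33 + [60_000] * 33 + [None]

known = [s for s in salaries if s is not None]
point_estimate = mean(known)  # sample mean used as the point estimate

# Fill the gap with the point estimate.
imputed = [point_estimate if s is None else s for s in salaries]

# Assumed "true" salary of the missing employee (illustrative only).
assumed_true_salary = 250_000
imputation_error = assumed_true_salary - point_estimate

print(point_estimate, imputation_error)
```

Whether the imputation is a good idea depends on why the value is missing: if the missing employee resembles the others the error is small, but nothing in the sample guarantees that.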

Estimation Error
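Bias: difference between the expected value of the estimate and the actual value:
  $\mathrm{Bias} = E[\hat{\theta}] - \theta$
Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value:
  $\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$
Why square?
Root Mean Square Error (RMSE):
  $\mathrm{RMSE}(\hat{\theta}) = \sqrt{\mathrm{MSE}(\hat{\theta})}$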
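
One way to see why the error is squared is to compute the three quantities for a handful of estimates. In this sketch the averages over a small, made-up list of estimates stand in for the expected values; the numbers and function names are assumptions for illustration.

```python
import math

def bias(estimates, actual):
    """Average estimate minus the actual value."""
    return sum(estimates) / len(estimates) - actual

def mse(estimates, actual):
    """Average squared difference between each estimate and the actual value."""
    return sum((e - actual) ** 2 for e in estimates) / len(estimates)

def rmse(estimates, actual):
    """Square root of the MSE, in the same units as the data."""
    return math.sqrt(mse(estimates, actual))

# Hypothetical estimates of a parameter whose actual value is 50.
estimates = [48.0, 52.0, 47.0, 53.0]
actual = 50.0

print(bias(estimates, actual))  # errors of opposite sign cancel
print(mse(estimates, actual))   # squared errors cannot cancel
print(rmse(estimates, actual))
```

Here the bias is zero because the over- and under-estimates cancel, while the MSE is 6.5: squaring keeps every error positive, so it still registers the spread.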

Jackknife Estimate
Jackknife estimate: an estimate of a parameter obtained by omitting one value from the set of observed values.
  Treat the data like a population
  Take samples from this population
  Use these samples to estimate the parameter
Let $\hat{\theta}$ be an estimate on the entire population.
Let $\hat{\theta}_{(j)}$ be an estimator of the same form with observation $j$ deleted.
Allows you to examine the impact of outliers!
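
A minimal sketch of the leave-one-out idea, using the mean as the estimator. The sample values are made up, with one deliberate outlier, and `jackknife_estimates` is a hypothetical helper name.

```python
from statistics import mean

def jackknife_estimates(values, estimator=mean):
    """Recompute the estimator once per observation, leaving that one out."""
    return [estimator(values[:j] + values[j + 1:]) for j in range(len(values))]

data = [1, 5, 10, 4, 100]  # hypothetical sample; 100 is an outlier

full_estimate = mean(data)                 # estimate on the entire sample
leave_one_out = jackknife_estimates(data)  # one estimate per deleted observation
jackknife_mean = mean(leave_one_out)       # average of the leave-one-out estimates

print(full_estimate)
print(leave_one_out)
```

Four of the five leave-one-out means stay between 27.5 and 29.75, but deleting the value 100 drops the mean to 5, which is how the procedure exposes the impact of an outlier.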

Maximum Likelihood Estimate (MLE)

Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model.
The joint probability of observing the sample data is found by multiplying the individual probabilities.
Likelihood function:
  $L(\Theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \Theta)$
Maximize $L$.

MLE Example
Coin toss five times: {H,H,H,H,T}
Assuming a perfect coin with H and T equally likely, the likelihood of this sequence is:
  $L(p = 0.5) = 0.5^5 \approx 0.03$
However, if the probability of an H is 0.8, then:
  $L(p = 0.8) = 0.8^4 \times 0.2 \approx 0.08$

MLE Example (contd)

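General likelihood formula (with $x_i = 1$ for H and $x_i = 0$ for T):
  $L(p \mid x_1, \ldots, x_5) = \prod_{i=1}^{5} p^{x_i} (1 - p)^{1 - x_i} = p^{\sum x_i} (1 - p)^{5 - \sum x_i}$

The estimate for $p$ is then 4/5 = 0.8.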
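
The coin example can be checked numerically. This sketch encodes H as 1 and T as 0 and uses a grid search over candidate values of p as a stand-in for maximizing the likelihood analytically; the grid resolution and function name are assumptions.

```python
def likelihood(p, tosses):
    """Joint probability of the tosses, with H encoded as 1 and T as 0."""
    result = 1.0
    for x in tosses:
        result *= p if x == 1 else (1 - p)
    return result

tosses = [1, 1, 1, 1, 0]  # {H, H, H, H, T}

fair = likelihood(0.5, tosses)    # 0.5 ** 5
biased = likelihood(0.8, tosses)  # 0.8 ** 4 * 0.2

# Grid search over p = 0.00, 0.01, ..., 1.00 for the largest likelihood.
grid = [i / 100 for i in range(101)]
p_hat = max(grid, key=lambda p: likelihood(p, tosses))

print(fair, biased, p_hat)
```

The fair-coin likelihood is 0.03125 and the p = 0.8 likelihood is about 0.082, and the grid maximum lands on 0.8, matching the 4/5 estimate.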

Expectation-Maximization (EM)
Solves estimation with incomplete data.
Obtain initial estimates for parameters.
Iteratively use estimates for missing data and continue until convergence.
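
A toy sketch of the iteration, assuming the parameter is a mean and some values are missing. The data, the initial guess, the tolerance and the function name `em_mean` are all assumptions for illustration and are not taken from the slides.

```python
def em_mean(observed, n_missing, initial_guess, tol=1e-6, max_iter=1000):
    """Estimate a mean when `n_missing` values are unobserved.

    E-step: stand in the current estimate for each missing value.
    M-step: recompute the mean over observed plus filled-in values.
    """
    n = len(observed) + n_missing
    estimate = initial_guess
    for _ in range(max_iter):
        updated = (sum(observed) + n_missing * estimate) / n
        if abs(updated - estimate) < tol:  # converged
            return updated
        estimate = updated
    return estimate

# Hypothetical data: four observed values and two missing ones.
first_step = (1 + 5 + 10 + 4 + 2 * 3.0) / 6
mu = em_mean([1, 5, 10, 4], n_missing=2, initial_guess=3.0)
print(first_step, mu)
```

Starting from a guess of 3, the first update gives about 4.33, and the iteration settles near 5, the mean of the observed values. In this simple setting the missing values carry no extra information, so that is the expected fixed point.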

EM Example

EM Algorithm

Bayes Theorem Example

Credit authorizations (hypotheses):
  h1 = authorize purchase
  h2 = authorize after further identification
  h3 = do not authorize
  h4 = do not authorize but contact police
Assign twelve data values for all combinations of credit and income:

  Credit \ Income   1    2    3    4
  Excellent         x1   x2   x3   x4
  Good              x5   x6   x7   x8
  Bad               x9   x10  x11  x12

From training data: P(h1) = 60%; P(h2) = 20%; P(h3) = 10%; P(h4) = 10%.

Bayes Example(contd)
Training Data:

  ID  Income  Credit     Class  xi
  1   4       Excellent  h1     x4
  2   3       Good       h1     x7
  3   2       Excellent  h1     x2
  4   3       Good       h1     x7
  5   4       Good       h1     x8
  6   2       Excellent  h1     x2
  7   3       Bad        h2     x11
  8   2       Bad        h2     x10
  9   3       Bad        h3     x11
  10  1       Bad        h4     x9
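
The priors quoted for h1 to h4 can be recomputed from the ten training tuples, and Bayes Theorem then gives the posterior for any cell that appears in the training data. This is a sketch under stated assumptions: the cell numbering (Excellent = x1..x4, Good = x5..x8, Bad = x9..x12, ordered by income) is inferred from the xi column, and the helper names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

CREDIT_ROW = {"Excellent": 0, "Good": 1, "Bad": 2}

def cell(income, credit):
    """Map an (income, credit) pair to its index i in x_i (1..12)."""
    return CREDIT_ROW[credit] * 4 + income

# (income, credit, class) for the ten training tuples.
training = [
    (4, "Excellent", "h1"), (3, "Good", "h1"), (2, "Excellent", "h1"),
    (3, "Good", "h1"), (4, "Good", "h1"), (2, "Excellent", "h1"),
    (3, "Bad", "h2"), (2, "Bad", "h2"), (3, "Bad", "h3"), (1, "Bad", "h4"),
]

class_counts = Counter(c for _, _, c in training)
priors = {h: Fraction(n, len(training)) for h, n in class_counts.items()}

def likelihood(i, h):
    """P(x_i | h): share of the class-h tuples that fall in cell i."""
    hits = sum(1 for inc, cr, c in training if c == h and cell(inc, cr) == i)
    return Fraction(hits, class_counts[h])

def posterior(i):
    """P(h | x_i) for every h. Only valid for cells seen in training:
    an unseen cell has zero evidence and would divide by zero."""
    joint = {h: likelihood(i, h) * priors[h] for h in priors}
    evidence = sum(joint.values())
    return {h: p / evidence for h, p in joint.items()}

post_x4 = posterior(4)
post_x11 = posterior(11)
print(priors)
print(post_x4)
print(post_x11)
```

Cell x4 occurs only with h1, so its posterior puts all the mass on h1. Cell x11 occurs once under h2 and once under h3, and the posterior splits evenly between them.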

Other Statistical Measures

Chi-Squared
  $\chi^2 = \sum \frac{(O - E)^2}{E}$
  O: observed value
  E: expected value based on hypothesis
Jackknife Estimate
  Estimate of a parameter obtained by omitting one value from the set of observed values.
Regression
  Predict future values based on past values.
  Linear regression assumes a linear relationship exists:
    $y = c_0 + c_1 x_1 + \cdots + c_n x_n$
  Find values $c_0, \ldots, c_n$ that best fit the data.
Correlation
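
The chi-squared statistic, a one-variable linear regression and a correlation coefficient can each be computed in a few lines. This is a sketch with made-up numbers: the regression is restricted to a single predictor fitted by least squares, and the correlation shown is Pearson's coefficient, which is one common choice and not necessarily the one intended on the slide.

```python
import math

def chi_squared(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def _centered_sums(xs, ys):
    """Return (sxx, syy, sxy, mean_x, mean_y) for paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, syy, sxy, mx, my

def fit_line(xs, ys):
    """Least-squares (c0, c1) for y = c0 + c1 * x with one predictor."""
    sxx, _syy, sxy, mx, my = _centered_sums(xs, ys)
    c1 = sxy / sxx
    return my - c1 * mx, c1

def correlation(xs, ys):
    """Pearson correlation coefficient of the paired samples."""
    sxx, syy, sxy, _mx, _my = _centered_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical observed vs. expected counts.
chi2 = chi_squared([18, 22, 20], [20, 20, 20])

# Hypothetical points lying exactly on y = 1 + 2x.
xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]
c0, c1 = fit_line(xs, ys)
r = correlation(xs, ys)
print(chi2, c0, c1, r)
```

Because the sample points lie exactly on a line, the fit recovers c0 = 1 and c1 = 2 and the correlation is 1; real data would give a coefficient strictly between -1 and 1.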

Similarity Measures
Determine similarity between two objects.
Similarity characteristics:

Alternatively, distance measures quantify how unlike or dissimilar objects are.
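
Two widely used similarity measures, sketched for illustration. Jaccard similarity (on sets) and cosine similarity (on vectors) are assumed here as examples; the slide does not say which measures it has in mind, and the sample inputs are made up.

```python
import math

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

same = jaccard({"data", "mining"}, {"data", "mining"})
partial = jaccard({"data", "mining"}, {"data", "warehouse"})
orthogonal = cosine([1, 0], [0, 1])
parallel = cosine([1, 2], [2, 4])
print(same, partial, orthogonal, parallel)
```

Both measures return 1 for objects that match (identical sets, or vectors pointing the same way) and 0 for objects with nothing in common, with partial overlap falling in between.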

Similarity Measures

Distance Measures
Measure dissimilarity between objects
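
Two standard distance measures, sketched for illustration. Euclidean and Manhattan distance are assumed here as representative examples; the slide's own formulas did not survive extraction, and the sample points are made up.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def manhattan(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(u, v))

origin, point = [0, 0], [3, 4]
d_euclid = euclidean(origin, point)
d_manhattan = manhattan(origin, point)
print(d_euclid, d_manhattan)
```

Unlike a similarity score, a distance is 0 for identical objects and grows as they become more dissimilar.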

Information Retrieval
Information Retrieval (IR): retrieving desired information from textual data.
  Library science
  Digital libraries
  Web search engines
  Traditionally keyword based
Sample query:
  Find all documents about "data mining".
DM: similarity measures; mine text/Web data.
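
A keyword-based query of the kind described can be sketched as a simple term match. The document collection and the function name `keyword_search` are made up for illustration; real IR systems add indexing, ranking and similarity scoring on top of this.

```python
# Hypothetical document collection keyed by document id.
documents = {
    "d1": "Data mining uncovers hidden information in large databases",
    "d2": "Web search engines rank pages for keyword queries",
    "d3": "Text mining applies data mining techniques to documents",
}

def keyword_search(query, docs):
    """Return ids of the documents that contain every query keyword."""
    terms = query.lower().split()
    return [doc_id for doc_id, text in docs.items()
            if all(term in text.lower().split() for term in terms)]

hits = keyword_search("data mining", documents)
print(hits)
```

The match is exact on whole words, so a document about "knowledge discovery" would be missed even though it is relevant, which is the gap that similarity measures are meant to narrow.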

Information Retrieval (contd)

[Figure: IR Query Result Measures and Classification. Panels: IR, Classification]
