Vinee
Overview
Introduction
Explanation of Data Mining Techniques
Advantages
Applications
Privacy
KDD: KDD stands for Knowledge Discovery in Databases. KDD identifies hidden correlations.
Data collection (1960): using
Data warehouse & decision
Data Warehousing
Data Warehouse:
A repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
Collect data
Store in a single repository
Allows for easier query development, as a single repository can be queried
DATA WAREHOUSING:
Data warehousing supports OLAP operations
OLAP stands for Online Analytical Processing
It stores historical data
Its access pattern is read-only
It deals with long-term operations
OLAP operations include roll-up
OLAP provides a very good view of what is happening, but it cannot predict what will happen in the future or explain why it is happening
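As an illustration of a roll-up, the sketch below aggregates city-level sales up to the country level. The rows, the column meanings, and the `roll_up` helper are hypothetical and only meant to show the idea of moving to a coarser level of detail.

```python
from collections import defaultdict

# Hypothetical fact rows: (country, city, sales amount).
sales = [
    ("India", "Delhi", 120),
    ("India", "Mumbai", 200),
    ("USA", "Boston", 150),
    ("USA", "Austin", 90),
]


def roll_up(rows):
    """Aggregate city-level sales up to the country level."""
    totals = defaultdict(int)
    for country, _city, amount in rows:
        totals[country] += amount
    return dict(totals)


by_country = roll_up(sales)
```

The data is only read, never modified, which matches the read-only pattern described above.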
Discovery of Knowledge
Steps:
Business Understanding: what problem are we trying to solve? What is the business trying to achieve?
Data Understanding: do we have the data to be able to answer these questions? If not, what is the cost of acquiring that additional information?
Data Preparation: all data is dirty and needs to be cleaned and transformed. This is the heavy lifting stage.
Analysis & Modeling: the tools must be chosen based on what the business is trying to understand and the data available.
Evaluate Outcomes: how well does the model actually work from a statistical point of view (significance) and from a business point of view (actionability)?
Deployment: driving the insight into the business.
Association Rules
Classification
Given the past instances (training instances) with their associated class, classification is the process of predicting the class of a new item, that is, identifying which class the new item belongs to.
Example: A bank wants to classify its Home Loan Customers into groups according to their response to bank advertisements. The bank might use the classifications Responds Rarely, Responds Sometimes, Responds Frequently. The bank will then attempt to find rules about the customers that respond Frequently and Sometimes. The rules could be used to predict needs of potential customers.
[Figure: decision tree that splits on profession (e.g. Doctor, Carpenter) and then on income thresholds, with leaves labeled Good or Bad.]
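A minimal sketch of classification, assuming a single income attribute and Good/Bad classes: it learns a decision stump (a one-split decision tree) from training instances and then classifies a new item. The training data, the function names `fit_stump` and `predict`, and the choice of a stump are illustrative assumptions, not the method behind the original figure.

```python
def fit_stump(incomes, labels):
    """Pick the income threshold that misclassifies the fewest training instances.

    The stump predicts "Good" when income >= threshold, otherwise "Bad".
    """
    best_threshold, best_errors = None, len(incomes) + 1
    for candidate in sorted(set(incomes)):
        errors = sum(
            ("Good" if x >= candidate else "Bad") != y
            for x, y in zip(incomes, labels)
        )
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold


def predict(threshold, income):
    """Classify a new item with the learned threshold."""
    return "Good" if income >= threshold else "Bad"


# Hypothetical training instances: income in thousands, with known class.
train_incomes = [25, 35, 45, 60, 80, 95]
train_labels = ["Bad", "Bad", "Bad", "Good", "Good", "Good"]

threshold = fit_stump(train_incomes, train_labels)
new_item_class = predict(threshold, 70)
```

A full decision tree repeats this kind of split recursively on further attributes, such as profession.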
Clustering
Clustering algorithms find groups of items that are similar. They divide a data set so that records with similar content are in the same group, and groups are as different as possible from each other. (2)
Example: An insurance company could use clustering to group clients by their age, location and types of insurance purchased. The categories are unspecified, and this is referred to as unsupervised learning.
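One common clustering algorithm is k-means, sketched below in plain Python. The client data (age, number of policies), the naive initialization from the first k points, and the fixed iteration count are assumptions made for illustration only.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means on tuples of numbers; returns (centroids, assignments)."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            distances = [
                sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
            ]
            assignments[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return centroids, assignments


# Hypothetical clients: (age, number of policies held).
clients = [(22, 1), (25, 1), (24, 2), (58, 4), (61, 5), (63, 4)]
centroids, assignments = kmeans(clients, k=2)
```

No class labels are supplied; the two groups (younger clients with few policies, older clients with more) emerge from the data alone.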
Regression
Regression deals with the prediction of a value, rather than a class.
Example: Find out if there is a relationship between smoking and cancer-related illness in patients.
It can be used to smooth noisy data
Given values: X1, X2, ..., Xn
Objective: predict variable Y
One way is to estimate coefficients a0, a1, a2, ..., an such that
Y = a0 + a1X1 + a2X2 + ... + anXn
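For the single-variable case Y = a0 + a1X1, the coefficients can be estimated with ordinary least squares. The sketch below uses the closed-form solution; the sample values and the name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a0 + a1 * x; returns (a0, a1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx            # slope
    a0 = mean_y - a1 * mean_x  # intercept
    return a0, a1


# Hypothetical observations of one predictor X1 and the target Y.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

a0, a1 = fit_line(xs, ys)
predicted = a0 + a1 * 5.0  # predict Y for a new value X1 = 5
```

With several predictors X1, ..., Xn the same idea applies, but the coefficients are usually found by solving a linear system rather than with this two-term formula.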
Regression
Example graphs: Line of Best Fit, Curve Fitting
Association Rules
An association algorithm creates rules that describe
of the time they will buy nails.
Ex: computer => antivirus software [support = 2%, confidence = 60%]
Support(A => B) = P(A ∪ B), the probability that a transaction contains both A and B
Confidence(A => B) = P(B | A)
Association Rules
Support: a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
Support is one measure of rule interestingness.
A support of 2% means that 2% of the transactions under analysis show that the computer and the antivirus software are purchased together.
Association Rules
Confidence: a measure of how often the consequent is true when the antecedent is true.
It is another measure of rule interestingness.
Example: A confidence of 60% means that 60% of the customers who purchased the computer also purchased the software.
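Support and confidence can be computed directly from a list of transactions. The following is a small sketch; the transactions and the function names are hypothetical, and the resulting percentages differ from the 2% and 60% in the example above.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"computer", "antivirus"},
    {"computer", "antivirus", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "antivirus"},
]

rule_support = support(transactions, {"computer", "antivirus"})
rule_confidence = confidence(transactions, {"computer"}, {"antivirus"})
```

Here three of the five transactions contain both items, and three of the four transactions with a computer also contain antivirus software.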
ADVANTAGES:
Provides new knowledge from existing data
Public databases
Government sources
Company databases
Weather forecast
Insurance
Government
Health care
New knowledge can be used to improve services or products
Risk Assessment
Identify Customers that pose high credit risk
Fraud Detection
Identify people misusing the system. E.g. People who
Customer Care
Identify customers likely to change providers
Identify customer needs
Manufacturing industry
Telecommunication industry
Biological data analysis
Scientific application
Retail
Intrusion detection
Reduced direct mail costs by 30% while garnering 95% of the campaign's revenue.
Privacy Concerns
Effective data mining requires large sources of data
To achieve a wide spectrum of data, multiple data sources are linked
Linking sources can be problematic for privacy, for example if the following histories of a customer were linked:
Shopping history
Credit history
Bank history
Employment history
collected data
References
1. Silberschatz, Korth, Sudarshan, Database System Concepts, 5th Edition, McGraw-Hill, 2005
2. Two Crows, Data Mining Glossary, http://www.twocrows.com/glossary.htm
3. Wikipedia, Data mining, http://en.wikipedia.org/wiki/Data_mining
4. http://phoenix.phys.clemson.edu/tutorials/excel/regression.html
5. http://wwwmaths.anu.edu.au/~steve/pdcn.pdf