
Oracle Data Mining Case Study: Xerox


Session ID: S283051

[Graphic: Oracle 10g DB, comprising Data Warehousing, ETL, OLAP, Statistics, and Data Mining]

Charlie Berger
Sr. Dir. Product Management, Life & Health Sciences Industry & Data Mining Technologies
Oracle Corporation
charlie.berger@oracle.com

Tracy E. Thieret
Principal Scientist, Imaging and Systems Technology Center
Xerox Innovation Group
Webster, New York

Copyright © 2006 Oracle Corporation


What is Data Mining?
• Process of sifting through massive amounts of data to find hidden patterns and discover new insights
• Data Mining can provide valuable results:
  • Identify factors most associated with a target attribute (Attribute Importance)
  • Predict individual behavior (Classification)
  • Find profiles of targeted people or items (Decision Trees)
  • Segment a population (Clustering)
  • Determine important relationships within the population (Associations)
  • Find fraud or rare “events” (Anomaly Detection)



Data Mining: Find hidden Patterns
• Data Mining can find previously hidden patterns and relationships to help you:
  • Make informed predictions and…
  • Better understand customers
• Data Mining can help answer questions such as:
  • Which customers are likely to churn or attrite?
  • Which customers are likely to respond to this offer?
  • Which employees are likely to leave?
  • What “next product” should I recommend to this customer?
  • Which factors are most associated with a target attribute, e.g. high-value customers?
  • Which customers or transactions are most “unnatural” or possibly suspicious?



Data Mining: Discover New Insights
• Data Mining uncovers hidden patterns and relationships to help you:
  • Discover new segments, clusters, and subgroups and …
• Data Mining can help answer questions such as:
  • What are the profiles of subpopulations or items of interest, e.g. churners, profitable customers, defective products, etc.?
  • What natural segments or clusters exist in my data?
  • Which items are typically purchased together?
  • Which items seem to fail together?
  • Which genes are most associated with this disease?



Oracle Data Mining 10gR2
Oracle in-Database Mining Engine
• Oracle Data Miner (GUI)
  • Simplified, guided data mining
• Spreadsheet Add-In for Predictive Analytics
  • “1-click data mining” from a spreadsheet
• PL/SQL API & Java (JDM) API
  • Develop advanced analytical applications
• Wide range of algorithms
  • Anomaly detection
  • Attribute importance
  • Association rules
  • Clustering
  • Classification & regression
  • Nonnegative matrix factorization
  • Structured & unstructured data (text mining)
  • BLAST (life sciences similarity search algorithm)
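
As a rough illustration of the PL/SQL API listed above, the following sketch builds a classification model with DBMS_DATA_MINING.CREATE_MODEL. The model name CD_BUYERS_DT, the tables CD_BUYERS_TRAIN and CD_BUYERS_DT_SETTINGS, and the columns CUST_ID and BUY_CD are hypothetical names chosen for this example; none of them come from the presentation.

```sql
-- Hypothetical settings table: one row per model setting.
CREATE TABLE cd_buyers_dt_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(128)
);

BEGIN
  -- Ask for the decision tree algorithm instead of the default classifier.
  INSERT INTO cd_buyers_dt_settings (setting_name, setting_value)
  VALUES (DBMS_DATA_MINING.ALGO_NAME, DBMS_DATA_MINING.ALGO_DECISION_TREE);
  COMMIT;

  -- Build the model inside the database from the (hypothetical) training table.
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'CD_BUYERS_DT',
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'CD_BUYERS_TRAIN',
    case_id_column_name => 'CUST_ID',
    target_column_name  => 'BUY_CD',
    settings_table_name => 'CD_BUYERS_DT_SETTINGS');
END;
/
```

Once built, a model of this kind is scored from ordinary SQL, so no data leaves the database.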

10g Statistics & SQL Analytics
FREE (Included in Oracle SE & EE)

• Ranking functions
  • rank, dense_rank, cume_dist, percent_rank, ntile
• Window Aggregate functions (moving and cumulative)
  • Avg, sum, min, max, count, variance, stddev, first_value, last_value
• LAG/LEAD functions
  • Direct inter-row reference using offsets
• Reporting Aggregate functions
  • Sum, avg, min, max, variance, stddev, count, ratio_to_report
• Statistical Aggregates
  • Correlation, linear regression family, covariance
• Linear regression
  • Fitting of an ordinary-least-squares regression line to a set of number pairs.
  • Frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions.
• Descriptive Statistics
  • average, standard deviation, variance, min, max, median (via percentile_cont), mode, group-by & roll-up
  • DBMS_STAT_FUNCS: summarizes numerical columns of a table and returns count, min, max, range, mean, stats_mode, variance, standard deviation, median, quantile values, +/- n sigma values, top/bottom 5 values
• Correlations
  • Pearson’s correlation coefficients, Spearman's and Kendall's (both nonparametric).
• Cross Tabs
  • Enhanced with % statistics: chi squared, phi coefficient, Cramer's V, contingency coefficient, Cohen's kappa
• Hypothesis Testing
  • Student t-test, F-test, Binomial test, Wilcoxon Signed Ranks test, Chi-square, Mann Whitney test, Kolmogorov-Smirnov test, One-way ANOVA
• Distribution Fitting
  • Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential
• Pareto Analysis (documented)
  • 80:20 rule, cumulative results table

Note: Statistics and SQL Analytics are included in Oracle Database Standard Edition
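
To make a few of the functions listed above concrete, here is a minimal sketch using made-up figures that are inlined with a WITH clause, so it needs no existing tables. The SALES and PAIRS data sets and all column names are assumptions for illustration only.

```sql
-- Hypothetical monthly sales, inlined so the query is self-contained.
WITH sales AS (
  SELECT 'East' AS region, 1 AS month_no, 100 AS amount FROM dual UNION ALL
  SELECT 'East', 2, 120 FROM dual UNION ALL
  SELECT 'East', 3,  90 FROM dual UNION ALL
  SELECT 'West', 1,  80 FROM dual UNION ALL
  SELECT 'West', 2, 110 FROM dual UNION ALL
  SELECT 'West', 3, 130 FROM dual
)
SELECT region,
       month_no,
       amount,
       -- Ranking function: rank of each month within its region, largest first.
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS amount_rank,
       -- Window aggregate: cumulative total per region.
       SUM(amount) OVER (PARTITION BY region ORDER BY month_no
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
         AS running_total,
       -- LAG: the previous month's amount (NULL for the first month).
       LAG(amount, 1) OVER (PARTITION BY region ORDER BY month_no) AS prev_amount,
       -- Reporting aggregate: each month's share of its region's total.
       RATIO_TO_REPORT(amount) OVER (PARTITION BY region) AS share_of_region
FROM   sales
ORDER  BY region, month_no;

-- Hypothetical (x, y) pairs for the statistical aggregates.
WITH pairs AS (
  SELECT 1 AS x, 2.1 AS y FROM dual UNION ALL
  SELECT 2, 3.9 FROM dual UNION ALL
  SELECT 3, 6.2 FROM dual UNION ALL
  SELECT 4, 7.8 FROM dual
)
SELECT CORR(x, y)           AS pearson_r,     -- Pearson correlation
       CORR_S(x, y)         AS spearman_rho,  -- Spearman rank correlation
       REGR_SLOPE(y, x)     AS slope,         -- least-squares slope of y on x
       REGR_INTERCEPT(y, x) AS intercept,
       MEDIAN(y)            AS median_y
FROM   pairs;
```

Both queries are plain SELECT statements, so their results can be filtered, joined, or aggregated further like any other query result.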



In-Database Analytics
Advantages
• Data remains in the database at all times, with appropriate access security control mechanisms; fewer moving parts
• Straightforward inclusion within interesting and arbitrarily complex queries
• Real-world scalability; available for mission-critical applications
• Enables pipelining of results without costly materialization
• Scalable & performant
  • Real-time scoring: 2.5 million records scored in 6 seconds on a single-CPU system
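
The point about including mining results within larger queries can be sketched as follows. The query reuses the model CD_BUYERS76485_DT and the table CBERGER.CD_BUYERS that appear in Example #1 of this presentation; the grouping by marital status is only an illustrative choice, and treating class 1 as "buyer" is an assumption.

```sql
-- Score each customer in place, then aggregate the scores in the same statement.
SELECT marital_status,
       COUNT(*)  AS likely_buyers,
       AVG(prob) AS avg_probability
FROM  (SELECT a.marital_status,
              -- Probability that the model assigns class 1 to this row.
              PREDICTION_PROBABILITY(CD_BUYERS76485_DT, 1 USING a.*) AS prob
       FROM   CBERGER.CD_BUYERS a)
WHERE  prob > 0.6
GROUP  BY marital_status
ORDER  BY likely_buyers DESC;
```

Because scoring is just another SQL expression, the predictions feed straight into the GROUP BY without being written to an intermediate table.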



Oracle Data Mining 10g
DEMONSTRATION



Oracle Data Mining
Oracle Data Mining provides summary statistical information prior to data mining



Oracle Data Mining
Oracle Data Mining provides model performance and evaluation viewers

Oracle Data Mining’s Activity Guides simplify & automate data mining for business users



Oracle Data Mining

Apply model viewers

Additional model evaluation viewers



Example #1:
Simple, Predictive SQL

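• Select customers who are more than 60% likely to purchase a 6-month CD and display their marital status

-- Score every row of CBERGER.CD_BUYERS with the model CD_BUYERS76485_DT,
-- then keep the customers whose probability for class 1 exceeds 0.6.
SELECT *
FROM  (SELECT A.CUST_ID,
              A.MARITAL_STATUS,
              PREDICTION_PROBABILITY(CD_BUYERS76485_DT, 1 USING A.*) prob
       FROM   CBERGER.CD_BUYERS A)
WHERE  prob > 0.6;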



Oracle Data Mining 10g R2
Decision Trees
Problem: Find customers likely to buy a new car and their profiles
• Decision Trees
  • Classification
  • Prediction
  • Customer “profiling”

[Figure: decision tree. The root splits on Income (>$50K / <=$50K); the next level splits on Gender (M / F) and Age (>35 / <=35); the third level splits on Status (Married / Single), Gender (F / M), and HH Size (>4 / <=4); each leaf is Buy = 0 or Buy = 1]

IF (Income >50K AND Gender=F AND Status >Single… ), THEN P(Buy Car=1)
Confidence = .77
Support = 250
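
One common reading of the two figures attached to the rule above is that support is the number of cases matching the rule's conditions and confidence is the fraction of those cases with the predicted outcome. The slide does not define the terms, so that reading is an assumption. Under it, both can be computed with plain SQL. The CUSTOMERS data below is invented, and the rule is shortened to two conditions because the slide elides the rest.

```sql
-- Hypothetical customers, inlined so the query is self-contained.
WITH customers AS (
  SELECT 1 AS cust_id, 62000 AS income, 'F' AS gender, 1 AS buy_car FROM dual UNION ALL
  SELECT 2, 75000, 'F', 1 FROM dual UNION ALL
  SELECT 3, 58000, 'F', 0 FROM dual UNION ALL
  SELECT 4, 91000, 'M', 0 FROM dual UNION ALL
  SELECT 5, 43000, 'F', 0 FROM dual
)
SELECT COUNT(*)     AS support,     -- cases that satisfy the rule's conditions
       AVG(buy_car) AS confidence   -- share of those cases with buy_car = 1
FROM   customers
WHERE  income > 50000
AND    gender = 'F';
```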



Oracle Data Mining 10g R2
Anomaly Detection
Problem: Detect rare cases
• “One-Class” SVM Models
  • Fraud, noncompliance
  • Outlier detection
  • Network intrusion detection
  • Disease outbreaks
  • Rare events, true novelty

[Figure: two scatter plots with axes X1 and X2]
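
A one-class SVM model can be applied from SQL in the same way as a classifier. In the sketch below, the model CLAIMS_ANOMALY_SVM, the table CLAIMS, and the column CLAIM_ID are hypothetical. It relies on the convention that an anomaly detection model predicts 1 for typical rows and 0 for anomalous ones.

```sql
-- List the rows the model considers anomalous, most suspicious first.
SELECT claim_id,
       -- Probability that the row belongs to class 0 (anomalous).
       PREDICTION_PROBABILITY(CLAIMS_ANOMALY_SVM, 0 USING *) AS prob_anomalous
FROM   claims
WHERE  PREDICTION(CLAIMS_ANOMALY_SVM USING *) = 0
ORDER  BY prob_anomalous DESC;
```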



Oracle Data Mining
Algorithm Summary 10gR2

Problem               Algorithm                         Applicability
Classification        Decision Tree                     Popular / Rules / transparency
                      Naïve Bayes                       Embedded app
                      Support Vector Machine            Wide / narrow data
                      Adaptive Bayes Network            Rules / transparency
Regression            Support Vector Machine            Wide / narrow data
Attribute Importance  Minimum Description Length (MDL)  Attribute reduction; Identify useful data; Reduce data noise
Association Rules     Apriori                           Market basket analysis; Link analysis
Clustering            Hierarchical K-Means              Product grouping; Text mining
                      Hierarchical O-Cluster            Gene and protein analysis
Feature Extraction    NMF                               Text analysis; Feature reduction
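
For the clustering row of the table above, segment sizes can be obtained with the CLUSTER_ID SQL function. This is a minimal sketch; the model CUSTOMER_SEGMENTS_KM and the table CUSTOMERS are hypothetical names, not taken from the presentation.

```sql
-- Assign every customer to a cluster and count the members of each cluster.
SELECT CLUSTER_ID(CUSTOMER_SEGMENTS_KM USING *) AS cluster_id,
       COUNT(*)                                 AS customers
FROM   customers
GROUP  BY CLUSTER_ID(CUSTOMER_SEGMENTS_KM USING *)
ORDER  BY customers DESC;
```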



Integration with Oracle BI EE

Oracle Data Mining reveals important relationships, patterns, predictions & insights to the business users

[Screenshot: Create Categories of Customers]



Spreadsheet Add-In for Predictive Analytics

• Enables Excel users to “mine” Oracle or Excel data using “one click” Predict and Explain predictive analytics features
• Users select a table or view, or point to data in Excel, and select a target attribute



Oracle Data Miner 10gR2
Code Generation Release



Oracle Data Miner (GUI)
10gR2 Summer OTN Release

• PL/SQL code generation for Mining Activities





Analytics vs.
In-database:
1. In-Database Analytics Engine
  • Basic Statistics (Free)
  • Data Mining
  • Text Mining
2. Development Platform
  • Java (standard)
  • SQL (standard)
  • J2EE (standard)
3. Costs (ODM: $20K per CPU)
  • Simplified environment
  • Single server
  • Security

External:
1. External Analytical Engine
  • Basic Statistics
  • Data Mining
  • Text Mining (separate: SAS EM for Text)
  • Advanced Statistics
2. Development Platform
  • SAS Code (proprietary)
3. Costs (SAS EM: $150K/5 users)
  • Annual Renewal Fee (~40% each year)



Partners



SAP Business Warehouse
Connector (ODM-BW Connector)

• Seamless integration for SAP customers
• Secure
• Data remains in database
• Single version of truth
• Easy to use



SPSS Clementine
• NASDAQ-listed, top 25 software company
• 35+ year heritage in analytic technologies
• Operations in over 60 countries
• More than 95% of FORTUNE 1000 are SPSS customers
• Combine SPSS Clementine ease of use with ODM in-Database functionality & scalability
• Build, store, browse and score models in the Database for optimal performance
• For more information:
  • SPSS – Roger Lonsberry, (312) 651-3475 or rlonsberry@spss.com
  • Oracle – Alan Manewitz, (925) 984-9910 or alan.manewitz@oracle.com
  • Oracle – Charlie Berger, (781) 744-0324 or charlie.berger@oracle.com



InforSense -- A Single Optimized Environment for Real Time Business Analytics within the Database

[Diagram: Oracle Data Sources and an Oracle Decision Tree Model feed an analytic workflow, with InforSenseService Deployment]
• Deploy the analytic workflow as a service embedding to BPEL, SFA, CRM
• Interact with (visualize) data at any step in the workflow
• Deploy the analytic workflow as an Oracle Portal

Oracle Functionalities:
• Data Mining
• Preprocess
• Statistics
• Text
• OLAP
• Scheduler

• SAS free analytics: leverage Oracle analytics
• SQL free analytics: drag-drop application build
• Visual analytics: interactive visualisation
• Integrative analytics: unified analytical environment
• Automated analytics: deploy to Oracle Portal and BPEL



Oracle Real-Time Decision Engine
For enabling Operational Business Intelligence

[Diagram, top to bottom]
• Industries: Telco, Fins, Retail, Health, Travel, Others
• Channels: Contact Center, Front Office, IVR, Web, ATM, Kiosk
• Oracle Real-Time Decision Engine (RTD): Eligibility Engine, Prediction / Scoring Engine, Learning Engine
• Supporting systems: Campaign Management, Business Intelligence, Oracle Data Mining
• Data Warehouse



Benefits of Oracle’s Approach
In-Database Analytics                   Benefit
• Platform for Analytical               • Eliminates data movement and security exposure
  Applications                          • Fastest: Data → Information
• Wide range of data mining             • Supports most analytical problems
  algorithms & statistical functions
• Runs on multiple platforms            • Applications may be developed and deployed
• Built on Oracle Technology            • Grid, RAC, integrated BI,…
                                        • SQL & PL/SQL available
                                        • Leverage existing skills



“This presentation is for informational purposes only and may not be incorporated into a contract or agreement.”
