Data Mining: Business Intelligence

Outline
 Data Mining and KDD
 Why Data Mining
 Applications of Data Mining
 Data Preprocessing
 Data Mining techniques
 Visualization of the results
 Summary
2
Data Mining and KDD
3
Looking for knowledge
 The Explosive Growth of Data
 The World Wide Web
 Business: e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation
 Society and everyone: news, digital cameras, YouTube, forums,
blogs, Google & Co
 We are drowning in data, but starving for knowledge!
 Avoid data tombs
 “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets.
4
What is Data Mining?
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amount of data
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
5
Knowledge Discovery (KDD) Process
[Figure: the KDD process as a pipeline. Data sources → data cleaning and data integration → data warehouse → data selection → task-relevant data → data mining → pattern evaluation.]
6
Data Mining and Business Intelligence
[Figure: pyramid of business-intelligence layers. The potential to support business decisions increases toward the top; the quantity of data increases toward the bottom. Layers from top to bottom, with the typical role in parentheses:]
• Decision Making (End User)
• Data Presentation: Visualization Techniques (Business Analyst)
• Data Mining: Information Discovery (Data Analyst)
• Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
7
Data Mining: confluence of multiple disciplines
[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithms, and Other Disciplines.]
8
Why Data Mining?
9
Why is Data Mining so complex? A matter of data dimensions
• Tremendous amount of data
  • Walmart – Customer buying patterns – a data warehouse of 7.5 terabytes in 1995
  • VISA – Detecting credit card interoperability issues – 6800 payment transactions per second
• High dimensionality of data
  • Many dimensions to be combined together
  • Data cube example: time, location, product → sales
• High complexity of data
  • Time-series data, temporal data, sequence data
  • Structured data, graphs, social networks and multi-linked data
  • Spatial, spatiotemporal, multimedia, text and Web data
10
What does Data Mining provide me with? (1)
• Multidimensional concept description: characterization and discrimination
  • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
  • Characterization describes things in the same class; discrimination describes how to separate different classes
• Frequent patterns, association, correlation vs. causality
  • Wine → Spaghetti [0.3% of all basket cases, 75% of cases when tomato sauce is bought]
  • Is this correlation or not?
11
What does Data Mining provide me with? (2)
 Classification and prediction
 Construct models (functions) that describe and distinguish
classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based
on gas mileage
 Predict some unknown or missing numerical values
 Cluster analysis
 Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
 Maximizing intra-class similarity & minimizing interclass
similarity
12
What does Data Mining provide me with? (3)
• Outlier analysis
  • Outlier: a data object that does not comply with the general behavior of the data
  • Fraud detection is the main application area
  • Noise or exception?
• Trend and evolution analysis
  • Trend and deviation: e.g., regression analysis
  • Sequential pattern mining: e.g., digital camera → large SD memory
  • Periodicity analysis
  • Similarity-based analysis
13
Applications of Data Mining
Market Analysis and Management
 Data sources:
 credit card transactions, loyalty cards, smart cards, discount
coupons, ...
 Target marketing
 Find clusters of “model” customers who share the same
characteristics:
• Geographics (lives in Rome, lives in Trentino)
• Demographics (married, between 21-35, at least one child, family income
more than 40.000€/year)
• Psychographics (likes new products, consistently uses the Web)
• Behaviors (searches info in Internet, always defends her decisions)
 Determine customer purchasing patterns over time
14
Market Analysis and Management
 Cross-market analysis
 Find associations between product sales, and predict based on
such association
 Compare the sales in the US and in Italy, find associations in
old products and predict if new ones will have success
 Customer profiling
 What types of customers buy what products
 Customers with age between 20-30 and income > 20K€ will buy
product A
 Customer requirement analysis
 Identify the best products for different groups of customers
 Predict what factors will attract new customers
15
Corporate Analysis
 Finance Planning and Asset Evaluation
 Cash flow prediction and analysis
 Cross-sectional and time-series analysis (financial ratio, trend
analysis)
 Resource Planning
 summarize and compare the resources and spending
 Competition
 monitor competitors and market directions
 group customers into classes and a class-based pricing procedure
 set pricing strategy in a highly competitive market
 Other examples?
16
What’s next?
• Data Preprocessing
  • Why is it needed?
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Discretization and concept hierarchy
• Frequent patterns, association rules
• Cluster Analysis
• Summary
Are you sleeping?
17
Data Preprocessing
18
Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., occupation=“ ”, birthdate=“31/12/2099”
 noisy: containing errors or outliers
• e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997” (we are in 2007!!)
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records. In one copy of the data
customer A has to pay 200.000€, in the second copy of the data A does not
have to pay anything.
19
Why is data dirty?
 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was
collected and when it is analyzed.
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)
20
Why Is Data Preprocessing
Important?
21
Data Preprocessing
1. Data cleaning – missing values
“Data cleaning is one of the three biggest problems in data
warehousing”— Ralph Kimball
 Fill in missing values

 Name=“John”, Occupation=“Lawyer”, Age=“28”, Salary=“”
 Ignore the record (is it always feasible?)
 Manually filling missing attributes
 Automatically insert a constant
 Automatically insert the mean value (relative to the record class)
 Most probable value: make some inference!
22
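The "mean value relative to the record class" strategy listed on the missing-values slide can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name `fill_with_class_mean`, the choice of `occupation` as the class attribute, and every record other than John's are assumptions made for the example.

```python
from statistics import mean

def fill_with_class_mean(records, attribute, class_attribute):
    """Replace missing (None) values of `attribute` with the mean of the
    known values found in records of the same class."""
    known = {}
    for r in records:
        if r[attribute] is not None:
            known.setdefault(r[class_attribute], []).append(r[attribute])
    class_means = {c: mean(values) for c, values in known.items()}
    filled = []
    for r in records:
        r = dict(r)  # copy, so the input records are left untouched
        if r[attribute] is None:
            # Raises KeyError if the class has no known value at all.
            r[attribute] = class_means[r[class_attribute]]
        filled.append(r)
    return filled

# John comes from the slide; the other records are hypothetical.
records = [
    {"name": "John", "occupation": "Lawyer", "age": 28, "salary": None},
    {"name": "Ann", "occupation": "Lawyer", "age": 35, "salary": 60000},
    {"name": "Bob", "occupation": "Lawyer", "age": 41, "salary": 80000},
    {"name": "Eve", "occupation": "Clerk", "age": 30, "salary": 25000},
]
filled = fill_with_class_mean(records, "salary", "occupation")
print(filled[0]["salary"])  # mean salary of the other lawyers
```

Inserting a constant or ignoring the record are simpler, but the class mean keeps the filled value close to what similar records look like.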
Data Preprocessing
1. Data cleaning – binning
 Handle noisy data
 Binning, clustering, regression (not details)
 Binning
1. Sort data by price (€): 4, 8, 9, 15, 21, 21, 24, 25, 26
2. Partition into equal-frequency (equi-depth) bins:
 Bin 1: 4, 8, 9
 Bin 2: 15, 21, 21
 Bin 3: 24, 25, 26
3. Smoothing by bin means:
 Bin 1: 7, 7, 7
 Bin 2: 19, 19, 19
 Bin 3: 25, 25, 25
23
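The three binning steps on the slide (sort, partition into equal-frequency bins, smooth by bin means) can be reproduced with a short sketch. The function name `smooth_by_bin_means` is an assumption, and the sketch assumes the number of values divides evenly by the number of bins.

```python
def smooth_by_bin_means(values, n_bins):
    """Equal-frequency binning followed by smoothing by bin means."""
    data = sorted(values)                      # step 1: sort
    size = len(data) // n_bins                 # assumes an even split
    smoothed = []
    for start in range(0, len(data), size):    # step 2: partition
        bin_values = data[start:start + size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.append([bin_mean] * len(bin_values))  # step 3: smooth
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26]
bins = smooth_by_bin_means(prices, 3)
print(bins)  # [[7.0, 7.0, 7.0], [19.0, 19.0, 19.0], [25.0, 25.0, 25.0]]
```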
Data Preprocessing
1. Data cleaning – clustering
[Figure: data points grouped into clusters; the points lying outside the clusters are labeled as noise.]
24
Data Preprocessing
2. Integration and transformation
• Data integration combines data from multiple sources (D1, D2, D3) into a coherent store (D1,2,3)
• Schema integration
  • Integrate metadata from different sources
  • A.cust-id ≡ B.cust-number
• Entity identification problem:
  • Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
  • For the same real-world entity, attribute values from different sources are different (e.g., cm vs. inch)
25
Data Preprocessing
• Data integration can lead to redundant attributes
  • Same object (A.house = B.residence)
  • Derived attributes (A.annualIncome = B.salary + C.rentalIncome)
• Redundant attributes can be discovered via correlation analysis
  • A mathematical method for detecting the correlation between two attributes
  • Correlation coefficient (Pearson’s product-moment coefficient): the larger its absolute value, the stronger the correlation between the attributes
  • χ² (chi-square) test
  • No details on these methods here
26
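Although the slides skip the details, a small sketch shows how Pearson's coefficient can flag a redundant attribute. The function name `pearson` and all attribute values below are hypothetical; the annual income is deliberately derived from the monthly salary, so the coefficient is (up to rounding) 1.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's product-moment correlation coefficient of two attributes."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical attribute values.
monthly_salary = [1000, 2000, 3000, 4000]
annual_income = [12 * s for s in monthly_salary]   # derived attribute
shoe_size = [3, 1, 4, 2]                           # unrelated attribute

r_redundant = pearson(monthly_salary, annual_income)
r_unrelated = pearson(monthly_salary, shoe_size)
print(r_redundant, r_unrelated)
```

A coefficient near +1 or -1 suggests that one of the two attributes can be dropped; a coefficient near 0 suggests no linear relationship.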
Data Preprocessing
• Aggregation:
  • Sum the sales of different branches (in different data sources) to compute the company sales
• Generalization:
  • Concept hierarchy climbing
  • From integer attribute age to classes of age (children, adult, old)
• Normalization: values are scaled to fall within a small, specified range
  • Change the range from [-∞, +∞] to [-1, +1]
  • {-13, -6, -3, 10, 100} → {-0.13, -0.06, -0.03, 0.1, 1}
27
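The normalization example on the slide is consistent with dividing every value by the largest absolute value in the set. The slide does not name the method, so treating it as max-abs scaling is an assumption, as is the function name `scale_to_unit_range`.

```python
def scale_to_unit_range(values):
    """Scale values into [-1, +1] by dividing by the largest absolute value."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]

scaled = scale_to_unit_range([-13, -6, -3, 10, 100])
print(scaled)  # [-0.13, -0.06, -0.03, 0.1, 1.0]
```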
Data Preprocessing
3. Data reduction
• Data reduction
  • Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
  • Different reduction types (dimensionality, numerosity, discretization)
• Dimensionality: attribute subset selection
  • Example with a decision tree (left branches True, right False)
  • Initial attribute set: {A1, A2, A3, A4, A5, A6}
  • Tree:
      A4?
        True: A1?
          True: Class 1
          False: Class 2
        False: A6?
          True: Class 1
          False: Class 2
  • Reduced attribute set: {A1, A4, A6}
28
Data Preprocessing
3. Data reduction
• Dimensionality: Principal Components Analysis
  • Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that best represent the data
  • Works for numeric data only
  • Used when the number of dimensions is large
• Numerosity: Clustering
  • Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
  • Sparse data leads to many clusters, which makes the reduction ineffective
[Figure: a data set partitioned into 2 clusters.]
29
Data Preprocessing
3. Data reduction
• Numerosity: Sampling
  • Obtain a small sample s to represent the whole data set N
  • Problem: how to select a representative sample
  • Simple random sampling is not always enough: it can miss small classes entirely, so the sample must be kept representative
  • Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
[Figure: random sampling vs. stratified sampling. With random sampling, one region of the data contributes no samples.]
30
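Stratified sampling, as described on the sampling slide, can be sketched by sampling each class separately so that its share in the sample approximates its share in the data. The function name `stratified_sample`, the rule of taking at least one record per class, and the customer data are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Sample roughly `fraction` of each stratum (class) separately."""
    rng = random.Random(seed)          # fixed seed for repeatability
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for members in strata.values():
        # At least one record per stratum, so no class disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data set: 90 "young" customers and 10 "senior" ones.
customers = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
sample = stratified_sample(customers, key=lambda c: c[0], fraction=0.1)
counts = Counter(group for group, _ in sample)
print(counts)
```

A plain `random.sample(customers, 10)` could return no senior customer at all, which is the failure case the slide's figure points at.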
Data Preprocessing
4. Discretization - concept hierarchy
 Three types of attributes
 Nominal — values from an unordered set (color, profession)
 Ordinal — values from an ordered set (military or academic rank)
 Continuous — numbers (integer or real numbers)
 Discretization
 Divide the range of a continuous attribute into intervals
 Reduces data size and its complexity
 Some data mining algorithms do not support continuous types, and in
those cases discretization is mandatory
 Some useful methods:
 Binning, clustering (already presented)
 Entropy-based discretization (no details here)
31
Data Preprocessing
4. Discretization - concept hierarchy
 Concept hierarchy generation
 For categorical data
 Specification of an ordering between attributes (schema level)
• street < city < state < country
 Specification of a hierarchy of values (data level)
• {Urbana, Champaign, Chicago} < Illinois
 Automatic generation using the number of distinct values
• For the set of attributes: {street, city, state, country}
• IF: |street| = 600.000, |city|=3.000, |state|=300, |country|=15
• THEN: street < city < state < country
32
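The automatic generation rule from the concept hierarchy slide (the attribute with the most distinct values sits at the lowest level) amounts to a sort by distinct-value count. A minimal sketch, with the function name `hierarchy_from_distinct_counts` as an assumption and the counts taken from the slide:

```python
def hierarchy_from_distinct_counts(distinct_counts):
    """Order attributes from most to fewest distinct values."""
    ordered = sorted(distinct_counts, key=distinct_counts.get, reverse=True)
    return " < ".join(ordered)

counts = {"country": 15, "street": 600_000, "state": 300, "city": 3_000}
hierarchy = hierarchy_from_distinct_counts(counts)
print(hierarchy)  # street < city < state < country
```

The heuristic is only a starting point: an attribute can have few distinct values without being a higher level of the hierarchy, so the result should be reviewed.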
Outline
 Frequent patterns, association rules
• Support and confidence
• Decision trees
• Bayesian classifiers
• Support Vector Machines
• Lazy learning
 Cluster Analysis
 Summary
33
Data Mining techniques
34
Frequent pattern analysis
 What is it?
 Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
 Frequent pattern analysis: searching for frequent patterns
 Motivation: Finding inherent regularities in data
• Which products are bought together? Yesterday’s wine and spaghetti
example
• What are the subsequent purchases after buying a PC?
• Can we automatically classify web documents?
 Applications
• Basket data analysis
• Cross-marketing
• Catalog design
• Sale campaign analysis
35
Basic Concepts: Frequent Patterns and Association Rules (1)
Transaction-id   Items bought
1                Wine, Bread, Spaghetti
2                Wine, Cocoa, Spaghetti
3                Wine, Spaghetti, Cheese
4                Bread, Cheese, Sugar
5                Bread, Cocoa, Spaghetti, Cheese, Sugar
Each row is an itemset (= a transaction in this example).
Goal: find all rules of type X → Y between items in an itemset with minimum:
• Support s: probability that an itemset contains X ∪ Y
• Confidence c: conditional probability that an itemset containing X also contains Y
36
Support and confidence
That is:
• support, s: probability that a transaction contains A ∪ B (every item of A and every item of B)
  s = P(A ∪ B)
• confidence, c: conditional probability that a transaction containing A also contains B
  c = P(B | A)
• Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.
37
Transaction-id   Items bought
1                Wine, Bread, Spaghetti
2                Wine, Cocoa, Spaghetti
3                Wine, Spaghetti, Cheese
4                Bread, Cheese, Sugar
5                Bread, Cocoa, Spaghetti, Cheese, Sugar
Suppose: support s = 50%, confidence c = 50%
Support is used to define frequent patterns (sets of products in more than s% of the itemsets)
• {Wine} in itemsets 1, 2, 3 (support = 60%)
• {Bread} in itemsets 1, 4, 5 (support = 60%)
• {Spaghetti} in itemsets 1, 2, 3, 5 (support = 80%)
• {Cheese} in itemsets 3, 4, 5 (support = 60%)
• {Wine, Spaghetti} in itemsets 1, 2, 3 (support = 60%)
38
Transaction-id   Items bought
1                Wine, Bread, Spaghetti
2                Wine, Cocoa, Spaghetti
3                Wine, Spaghetti, Cheese
4                Bread, Cheese, Sugar
5                Bread, Cocoa, Spaghetti, Cheese, Sugar
Suppose: support s = 50%, confidence c = 50%
Confidence defines association rules: X → Y rules in frequent patterns whose confidence is greater than c
Suggestion: {Wine, Spaghetti} is the only frequent pattern to be considered. Why?
Association rules:
• Wine → Spaghetti (support = 60%, confidence = 100%)
• Spaghetti → Wine (support = 60%, confidence = 75%)
39
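The support and confidence figures for the wine and spaghetti example can be checked with a few lines of code. The helper names `count`, `support` and `confidence` are assumptions; the transactions are the five rows of the table.

```python
transactions = [
    {"Wine", "Bread", "Spaghetti"},
    {"Wine", "Cocoa", "Spaghetti"},
    {"Wine", "Spaghetti", "Cheese"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(set(itemset) <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of transactions that contain the whole itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(x, y, transactions):
    """Of the transactions containing X, the fraction also containing Y."""
    return count(set(x) | set(y), transactions) / count(x, transactions)

print(support({"Wine", "Spaghetti"}, transactions))       # 0.6
print(confidence({"Wine"}, {"Spaghetti"}, transactions))  # 1.0
print(confidence({"Spaghetti"}, {"Wine"}, transactions))  # 0.75
```

Support is symmetric in X and Y, while confidence depends on the direction of the rule, which is why the two rules share the same support but differ in confidence.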
Advanced concepts in Association Rules discovery
• Algorithms must face scalability problems
  • Apriori: if an itemset is infrequent, none of its supersets should be generated or tested
• Advanced problems
  • Boolean vs. quantitative associations
    age(x, “30..39”) and income(x, “42..48K”) → buys(x, “car”) [s=1%, c=75%]
  • Single-level vs. multiple-level analysis
    What brands of wine are associated with what brands of spaghetti?
Are support and confidence clear?
40
Another example for association rules
Support s = 40%, Confidence c = 70%
Transaction-id   Items bought
1                Margherita, Beer, Coke
2                Margherita, Beer
3                Quattro stagioni, Coke
4                Margherita, Coke
Frequent itemsets:
• {Margherita} = 75%
• {Beer} = 50%
• {Coke} = 75%
• {Margherita, Beer} = 50%
• {Margherita, Coke} = 50%
Association rules:
• Beer → Margherita [s=50%, c=100%]
42
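The pizza example can be worked through with a level-wise search in the spirit of Apriori: candidates of size k+1 are built only from frequent itemsets of size k, and any candidate with an infrequent subset is pruned before it is counted. This is a simplified sketch, not an optimized implementation; the function names `frequent_itemsets` and `strong_rules` are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Level-wise search; returns {itemset: support} for frequent itemsets."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    while current:
        level = {}
        for candidate in current:
            sup = sum(candidate <= t for t in transactions) / n
            if sup >= min_sup:
                level[candidate] = sup
        frequent.update(level)
        keys = list(level)
        k = len(keys[0]) + 1 if keys else 0
        joined = {a | b for a in keys for b in keys if len(a | b) == k}
        # Apriori pruning: every k-1 subset must itself be frequent.
        current = [c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

def strong_rules(frequent, min_conf):
    """Rules X -> Y (with X, Y splitting a frequent itemset) above min_conf."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, size)):
                conf = sup / frequent[lhs]
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs), sup, conf))
    return rules

pizzas = [
    {"Margherita", "Beer", "Coke"},
    {"Margherita", "Beer"},
    {"Quattro stagioni", "Coke"},
    {"Margherita", "Coke"},
]
frequent = frequent_itemsets(pizzas, min_sup=0.4)
rules = strong_rules(frequent, min_conf=0.7)
print(rules)  # [({'Beer'}, {'Margherita'}, 0.5, 1.0)]
```

{Beer, Coke} appears in only one transaction (25%), so the three-item candidate is pruned without being counted, and Beer → Margherita is the only rule that reaches 70% confidence.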
Classification vs. Prediction
 Classification
 Characterizes (describes) a set of items belonging to a training
set; these items are already classified according to a label attribute
 The characterization is a model
 The model can be applied to classify new data (predict the class
they should belong to)
 Prediction
 models continuous-valued functions, i.e., predicts unknown or
missing values
 Applications
 Credit approval, target marketing, fraud detection
43
Classification: the process
1. Model construction
 The class label attribute defines the class each item should belong
to
 The set of items used for model construction is called training set
 The model is represented as classification rules, decision trees, or
mathematical formulae
2. Model usage
 Estimate accuracy of the model
• On the training set
• On a generalization of the training set
 If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known
44
Model construction
[Figure: the training data is fed to a classification algorithm, which produces the classifier (model).]
Training data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no
Classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
45
Model usage
[Figure: the classifier is applied first to the testing data and then to unseen data.]
Classifier: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Testing data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes
Unseen data: (Jeff, Professor, 4). Tenured?
46
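The model-usage step can be made concrete by applying the rule from the slides to the testing data and then to the unseen tuple. The function name `predict_tenured` is an assumption; the rule and the data are the ones shown on the slide.

```python
def predict_tenured(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

correct = sum(predict_tenured(rank, years) == label
              for _, rank, years, label in testing_data)
accuracy = correct / len(testing_data)
print(accuracy)                         # 0.75
print(predict_tenured("Professor", 4))  # yes (Jeff)
```

The model misclassifies Merlisa (7 years, but not tenured), so its accuracy on the testing data is 3 out of 4. If that is judged acceptable, the rule predicts that Jeff is tenured.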
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
  • Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  • New data is classified based on the training set
• Unsupervised learning (clustering)
  • The class labels of the training data are unknown
  • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
47
Evaluating generated models
 Accuracy
 classifier accuracy: predicting class label
 predictor accuracy: guessing value of predicted attributes
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness
 handling noise and missing values
 Scalability
 efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
48
Example of Classification
• Example: Suppose that we have a database of customers on the AllElectronics mailing list. The database describes attributes of the customers, such as their name, age, income, occupation, and credit rating. The customers can be classified as to whether or not they have purchased a computer at AllElectronics.
• Suppose that new customers are added to the database and that you would like to notify these customers of an upcoming computer sale. Sending promotional literature to every new customer in the database can be quite costly. A more cost-efficient method would be to target only those new customers who are likely to purchase a new computer. A classification model can be constructed and used for this purpose.
49
[Figure: a decision tree for the concept buys_computer, indicating whether or not a customer at AllElectronics is likely to purchase a computer. Each internal node represents a test on an attribute. Each leaf node represents a class.]
Assoc. Prof. Dr. D. T. Anh

50
Classification techniques
Decision Trees (1)
[Figure: decision tree for the choice of investment type. The internal nodes test Income > 20K€, Age > 60, and Married?, each with a no and a yes branch; the leaves are Low risk, Mid risk, and High risk.]
51
Decision Trees (2)
• How are the attributes in decision trees selected?
  • Two well-known indexes are used
    • Information gain selects the attribute that is most informative in distinguishing the items between the classes
    • It is biased towards attributes with a large set of values
    • Gain ratio addresses this limitation of information gain
52
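Information gain can be illustrated on the tenure training data from the model-construction slide. The sketch below uses the standard entropy-based definition; the function names and the derived boolean attribute `years_over_6` are assumptions introduced for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, label):
    """Entropy of the labels minus the weighted entropy after the split."""
    base = entropy([r[label] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[label] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Training data from the slides; years_over_6 is derived from YEARS.
raw = [
    ("Assistant Prof", 3, "no"), ("Assistant Prof", 7, "yes"),
    ("Professor", 2, "yes"), ("Associate Prof", 7, "yes"),
    ("Assistant Prof", 6, "no"), ("Associate Prof", 3, "no"),
]
rows = [{"rank": r, "years_over_6": y > 6, "tenured": t} for r, y, t in raw]

gain_rank = information_gain(rows, "rank", "tenured")
gain_years = information_gain(rows, "years_over_6", "tenured")
print(round(gain_rank, 3), round(gain_years, 3))
```

On this data the split on `years_over_6` has the larger gain (about 0.46 bits against about 0.21 for `rank`), so an information-gain-based tree would test it first.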
Bayesian classifiers (2)
• Bayesian classification
  • A statistical classification technique
    • Predicts class membership probabilities
  • Founded on Bayes’ theorem
    $P(H \mid X) = \dfrac{P(X \mid H)\,P(H)}{P(X)}$
    • What if X = “Red and rounded” and H = “Apple”?
  • Performance
    • The simplest implementation (Naïve Bayes) is comparable in performance to decision trees and neural networks
  • Incremental
    • Each training example can increase/decrease the probability that a hypothesis is correct
53
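The apple question from the Bayesian slide can be answered numerically once the three probabilities on the right-hand side of Bayes' theorem are known. The numbers below are purely hypothetical, chosen only to show the arithmetic; the function name `posterior` is also an assumption.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(H | X) = P(X | H) * P(H) / P(X)."""
    return likelihood * prior / evidence

# Hypothetical probabilities, for illustration only.
p_red_round_given_apple = 0.9   # P(X | H): an apple is red and rounded
p_apple = 0.2                   # P(H): a fruit is an apple
p_red_round = 0.3               # P(X): a fruit is red and rounded

p_apple_given_red_round = posterior(p_red_round_given_apple, p_apple, p_red_round)
print(round(p_apple_given_red_round, 2))  # 0.6
```

Under these assumed numbers, seeing a red and rounded fruit raises the probability of "apple" from the prior 0.2 to 0.6.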
Other Classification Methods
 k-nearest neighbor classifier
 case-based reasoning
 Genetic algorithm
 Rough set approach
 Fuzzy set approaches
54
The k-Nearest Neighbor Algorithm
• All instances (samples) correspond to points in the n-dimensional space.
• The nearest neighbors are defined in terms of Euclidean distance. The Euclidean distance between two points X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) is
  $d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Given an unknown sample xq, the k-nearest neighbor classifier searches the space for the k training samples that are closest to xq.
• Once xq’s k nearest neighbors have been obtained using the distance function, the neighbors vote to determine xq’s class: the unknown sample is assigned the most common class among them. When k = 1, the unknown sample is assigned the class of the training sample that is closest to it in the space.
55
Genetic Algorithms
• GA: based on an analogy to biological evolution
• Each rule is represented by a string of bits
  • Example: the rule “IF A1 AND NOT A2 THEN C2” can be encoded as the bit string “100”, where the two leftmost bits represent attributes A1 and A2, respectively, and the rightmost bit represents the class. Similarly, the rule “IF NOT A1 AND NOT A2 THEN C1” can be encoded as “001”.
• An initial population is created consisting of randomly generated rules
• Based on the notion of survival of the fittest, a new population is formed that consists of the fittest rules and their offspring
• The fitness of a rule is represented by its classification accuracy on a set of training examples
• Offspring are generated by crossover and mutation
56
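The genetic operators named on the slide can be sketched on the 3-bit rule encoding it introduces ("100" and "001"). The function names, the single-point crossover, and the way fitness treats examples not covered by a rule are assumptions for illustration; the training examples are hypothetical.

```python
import random

def crossover(parent_a, parent_b, point):
    """Single-point crossover: swap the tails of two bit strings."""
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(rule, rate, rng):
    """Flip each bit independently with probability `rate`."""
    flip = {"0": "1", "1": "0"}
    return "".join(flip[b] if rng.random() < rate else b for b in rule)

def fitness(rule, examples):
    """Accuracy of the rule on the examples whose attribute bits it matches.
    Assumption: a rule that covers no example gets fitness 0."""
    covered = [e for e in examples if e[:2] == rule[:2]]
    if not covered:
        return 0.0
    return sum(e[2] == rule[2] for e in covered) / len(covered)

rng = random.Random(0)
child_a, child_b = crossover("100", "001", point=1)
print(child_a, child_b)            # 101 000
print(mutate("100", 1.0, rng))     # 011 (every bit flipped)

# Hypothetical training examples in the same A1, A2, class encoding.
examples = ["100", "100", "101", "001"]
print(fitness("100", examples))    # 2 of the 3 covered examples agree
```

A full genetic algorithm would repeat selection by fitness, crossover, and mutation over many generations; only the building blocks are shown here.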
5 minutes break!
57
Support Vector Machines
 One of the most advanced classification techniques
 Left figure: a small margin between the classes is found
 Right figure: the largest margin is found
 Support vector machines (SVMs) are able to identify the right figure margin
58
SVMs + Kernel Functions
• Is data always linearly separable?
  • NO!!!
  • Solution: SVMs + Kernel Functions
[Figure: “How to split this?” A data set that is not linearly separable, shown with the result of an SVM and of an SVM + kernel functions.]
59
Lazy learning
• Lazy learning
  • Simply stores the training data (or does only minor processing) and waits until it is given a test tuple
  • Less time in training but more time in predicting
  • Uses a richer hypothesis space (many local linear functions), which can improve accuracy
• Instance-based learning
  • Subcategory of lazy learning
  • Stores the training examples and delays the processing (“lazy evaluation”) until a new instance must be classified
  • An example: the k-nearest neighbor approach
60
k-nearest neighbor
• All instances correspond to points in the n-dimensional space; x is the instance to be classified
• The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
• For a discrete-valued target, k-NN returns the most common value among the k training examples nearest to x
[Figure: which class should the green circle belong to? It depends on k: k=3 → Red, k=5 → Blue.]
61
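The dependence on k shown in the green-circle figure can be reproduced with a minimal k-NN sketch. The coordinates are hypothetical, chosen so that the vote flips between k=3 and k=5 as in the figure; the function name `knn_classify` is an assumption, and `math.dist` requires Python 3.8 or later.

```python
from collections import Counter
from math import dist  # Euclidean distance

def knn_classify(training, query, k):
    """Majority vote among the k training points closest to `query`."""
    neighbors = sorted(training, key=lambda sample: dist(sample[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training points: (coordinates, class).
training = [
    ((1.0, 0.0), "Red"), ((0.0, 1.0), "Red"), ((5.0, 5.0), "Red"),
    ((-1.5, 0.0), "Blue"), ((0.0, -2.0), "Blue"), ((2.0, 2.0), "Blue"),
]
green_circle = (0.0, 0.0)

print(knn_classify(training, green_circle, k=3))  # Red
print(knn_classify(training, green_circle, k=5))  # Blue
```

With k=3 the two closest red points outvote one blue point; with k=5 two more blue points enter the neighborhood and the vote flips. All the work happens at query time, which is what makes the method a lazy learner.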
Prediction techniques
An overview
• Prediction is different from classification
  • Classification predicts categorical class labels
  • Prediction models continuous-valued functions
• Major method for prediction: regression
  • Models the relationship between one or more independent (predictor) variables and a dependent (response) variable
• Regression analysis
  • Linear and multiple regression
  • Non-linear regression
  • Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
  • No details here
62
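As a small illustration of prediction by regression, a straight line can be fitted with ordinary least squares and then used to predict a continuous value. The function name `fit_line` and the data points are hypothetical; only the simplest case (one predictor variable) is shown.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: predictor variable x, response variable y.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept   # predict the response for x = 5
print(slope, intercept, prediction)  # 2.0 1.0 11.0
```

Unlike a classifier, the model returns a number rather than a class label, which is the distinction the slide draws between prediction and classification.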
What is cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
 It belongs to unsupervised learning
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms (day 1 slides)
63
Examples of cluster analysis
 Marketing:
 Help marketers discover distinct groups in their customer bases
 Land use:
 Identification of areas of similar land use in an earth observation
database
 Insurance:
 Identifying groups of motor insurance policy holders with a high
average claim cost
 City-planning:
 Identifying groups of houses according to their house type, value,
and geographical location
64
Good clustering
• A good clustering method will produce high-quality clusters with
  • high intra-class similarity
  • low inter-class similarity
• Dissimilarity/similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric
• The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
• It is hard to define “similar enough” or “good enough”
65
A small example
[Figure: a small set of data points.]
How to cluster this data? This process is not easy in practice. Why?
66
Visualization of the results
 Presentation of the results or knowledge obtained from
data mining in visual forms
 Examples
 Scatter plots
 Association rules
 Decision trees
 Clusters
67
Summary
[Figure: recap of the topics covered: Data Mining and KDD, Why Data Mining?, Data preprocessing, Some scenarios, Classification, Clustering.]
68
