
Data Mining: Business Intelligence


Data Mining

Business Intelligence
Outline
 Data Mining and KDD
 Why Data Mining
 Applications of Data Mining
 Data Preprocessing
 Data Mining techniques
 Visualization of the results
 Summary

2
Data Mining and KDD

3
Looking for knowledge
 The Explosive Growth of Data
 The World Wide Web
 Business: e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation
 Society and everyone: news, digital cameras, YouTube, forums,
blogs, Google & Co
 We are drowning in data, but starving for knowledge!
 Avoid data tombs
 “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets.

4
What is Data Mining?
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge from
huge amounts of data
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.

5
Knowledge Discovery (KDD) Process
 The KDD pipeline (flattened figure): Data sources → Data Cleaning → Data
Integration → Data Warehouse → Data Selection → Task-relevant Data →
Data Mining → Pattern Evaluation
6
Data Mining and Business
Intelligence
 The BI pyramid (flattened figure): the potential to support business
decisions increases toward the top, the quantity of data toward the bottom
 • Decision Making (End User)
 • Data Presentation, Visualization Techniques (Business Analyst)
 • Data Mining, Information Discovery (Data Analyst)
 • Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
 • Data Preprocessing/Integration, Data Warehouses (DBA)
 • Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
7
Data Mining: confluence of multiple
disciplines
 Data mining draws on (figure): Database Technology, Statistics, Machine
Learning, Visualization, Pattern Recognition, Algorithms, and other disciplines
8
Why Data Mining?

9
Why is Data Mining so complex? A
matter of data dimensions
 Tremendous amount of data
 Walmart – Customer buying patterns – a 7.5-terabyte data warehouse
as early as 1995
 VISA – Detecting credit card interoperability issues – 6800 payment
transactions per second
 High-dimensionality of data
 Many dimensions to be combined together
 Data cube example: time, location, product → sales
 High complexity of data
 Time-series data, temporal data, sequence data
 Structured data, graphs, social networks and multi-linked data
 Spatial, spatiotemporal, multimedia, text and Web data

10
What does Data Mining provide me
with? (1)
 Multidimensional concept description: Characterization and
discrimination
 Generalize, summarize, and contrast data characteristics, e.g.,
dry vs. wet regions
 Characterization describes things in the same class,
discrimination describes how to separate different classes
 Frequent patterns, association, correlation vs. causality
 Wine → Spaghetti [0.3% of all baskets; 75% of the baskets in which
tomato sauce is bought]
 Is this correlation or not?

11
What does Data Mining provide me
with? (2)
 Classification and prediction
 Construct models (functions) that describe and distinguish
classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based
on gas mileage
 Predict some unknown or missing numerical values
 Cluster analysis
 Class label is unknown: Group data to form new classes, e.g.,
cluster houses to find distribution patterns
 Maximizing intra-class similarity & minimizing inter-class
similarity

12
What does Data Mining provide me
with? (3)
 Outlier analysis
 Outlier: Data object that does not comply with the general
behavior of the data
 Fraud detection is the main application area
 Noise or exception?
 Trend and evolution analysis
 Trend and deviation: e.g., regression analysis
 Sequential pattern mining: e.g., digital camera → large SD
memory card
 Periodicity analysis
 Similarity-based analysis

13
Applications of Data Mining
Market Analysis and Management
 Data sources:
 credit card transactions, loyalty cards, smart cards, discount
coupons, ...
 Target marketing
 Find clusters of “model” customers who share the same
characteristics:
• Geographics (lives in Rome, lives in Trentino)
• Demographics (married, between 21-35, at least one child, family income
more than 40.000€/year)
• Psychographics (likes new products, consistently uses the Web)
• Behaviors (searches for information on the Internet, always defends her decisions)
 Determine customer purchasing patterns over time

14
Applications of Data Mining
Market Analysis and Management
 Cross-market analysis
 Find associations between product sales, and predict based on
such associations
 Compare sales in the US and in Italy, find associations among
existing products and predict whether new ones will be successful
 Customer profiling
 What types of customers buy what products
 Customers aged 20-30 with income > 20K€ will buy
product A
 Customer requirement analysis
 Identify the best products for different groups of customers
 Predict what factors will attract new customers

15
Applications of Data Mining
Corporate Analysis
 Finance Planning and Asset Evaluation
 Cash flow prediction and analysis
 Cross-sectional and time-series analysis (financial ratio, trend
analysis)
 Resource Planning
 summarize and compare the resources and spending
 Competition
 monitor competitors and market directions
 group customers into classes and apply a class-based pricing procedure
 set pricing strategy in a highly competitive market
 Other examples?

16
What’s next?
 Data Preprocessing
 Why is it needed?
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
 Data Mining techniques
 Frequent patterns, association rules
 Classification and prediction
 Cluster Analysis
 Visualization of the results
Are you sleeping?
 Summary

17
Data Preprocessing

18
Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., occupation=“ ”, birthdate=“31/12/2099”
 noisy: containing errors or outliers
• e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes or names
• e.g., Age=“42” Birthday=“03/07/1997” (we are in 2007!!)
• e.g., Was rating “1,2,3”, now rating “A, B, C”
• e.g., discrepancy between duplicate records. In one copy of the data
customer A has to pay 200.000€, in the second copy of the data A does not
have to pay anything.

19
Why is data dirty?
 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was
collected and when it is analyzed.
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked data)

20
Why Is Data Preprocessing
Important?

21
Data Preprocessing
1. Data cleaning – missing values
“Data cleaning is one of the three biggest problems in data
warehousing”— Ralph Kimball

 Fill in missing values


 Name=“John”, Occupation=“Lawyer”, Age=“28”, Salary=“”
 Ignore the record (is it always feasible?)
 Manually fill in missing attributes
 Automatically insert a constant
 Automatically insert the mean value (relative to the record class)
 Most probable value: make some inference!

22
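A minimal sketch of the filling strategies on the slide above, assuming pandas is available; the DataFrame, its columns, and the use of occupation as the record's class are illustrative assumptions.

```python
import pandas as pd

# Illustrative records; the salary is missing for John (hypothetical data).
df = pd.DataFrame({
    "name": ["John", "Mary", "Bill", "Anne"],
    "occupation": ["Lawyer", "Lawyer", "Engineer", "Engineer"],
    "salary": [None, 90_000, 60_000, 75_000],
})

# Option 1: ignore the record (not always feasible).
dropped = df.dropna(subset=["salary"])

# Option 2: automatically insert a constant.
constant_filled = df["salary"].fillna(0)

# Option 3: insert the mean value relative to the record's class
# (here the occupation plays the role of the class).
class_mean_filled = df["salary"].fillna(
    df.groupby("occupation")["salary"].transform("mean"))
print(class_mean_filled.tolist())  # John gets the mean salary of the Lawyers
```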
Data Preprocessing
1. Data cleaning – binning
 Handle noisy data
 Binning, clustering, regression (no details here; a small binning sketch follows below)
 Binning
1. Sort data by price (€): 4, 8, 9, 15, 21, 21, 24, 25, 26
2. Partition into equal-frequency (equi-depth) bins:
 Bin 1: 4, 8, 9
 Bin 2: 15, 21, 21
 Bin 3: 24, 25, 26
3. Smoothing by bin means:
 Bin 1: 7, 7, 7
 Bin 2: 19, 19, 19
 Bin 3: 25, 25, 25

23
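The small plain-Python sketch of the binning steps above, using the same price values; the three-bin split and rounding to integers mirror the slide.

```python
# Equal-frequency (equi-depth) binning followed by smoothing by bin means.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26]   # already sorted

n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]
# bins -> [[4, 8, 9], [15, 21, 21], [24, 25, 26]]

smoothed = [[round(sum(b) / len(b))] * len(b) for b in bins]
print(smoothed)  # [[7, 7, 7], [19, 19, 19], [25, 25, 25]]
```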
Data Preprocessing
1. Data cleaning – clustering

Figure: a scatter plot of the data; points falling outside all clusters are treated as noise (outliers)

24
Data Preprocessing
2. Integration and transformation
 Data Integration combines data from multiple sources
into a coherent store (figure: D1, D2, D3 merged into D1,2,3)
 Schema integration
 Integrate metadata from different sources
 A.cust-id ≡ B.cust-number
 Entity identification problem:
 Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different
sources are different (e.g., cm vs. inch)

25
Data Preprocessing
2. Integration and transformation
 Data integration can lead to redundant attributes
 Same object (A.house = B.residence)
 Derived attributes (A.annualIncome = B.salary + C.rentalIncome)
 Redundant attributes can be discovered via correlation
analysis
 A mathematical method for detecting the correlation between two
attributes
 Correlation coefficient (Pearson’s product moment coefficient):
the higher it is, the stronger the correlation between attributes
 Χ2 (chi-square) test
 No details on these methods here

26
Data Preprocessing
2. Integration and transformation
 Aggregation:
 Sum the sales of different branches (in different data sources) to
compute the company sales
 Generalization:
 concept hierarchy climbing
 From integer attribute age to classes of age (children, adult, old)
 Normalization: scaled to fall within a small, specified range
 Change the range from [-∞,+ ∞] to [-1,+1]
 {-13, -6, -3, 10, 100}  {-0.13, -0.06, -0.03, 0.1, 1}

27
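A minimal sketch of the normalization example above; it scales by the largest absolute value, which is only one of several possible normalization schemes.

```python
# Scale values into [-1, +1] by dividing by the largest absolute value.
values = [-13, -6, -3, 10, 100]

max_abs = max(abs(v) for v in values)
normalized = [v / max_abs for v in values]
print(normalized)  # [-0.13, -0.06, -0.03, 0.1, 1.0]
```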
Data Preprocessing
3. Data reduction
 Data reduction
 Obtain a reduced representation of the data set that is much
smaller in volume yet produces the same (or almost the
same) analytical results
 Different reduction types (dimensions, numerosity,
discretization)
 Dimensionality: Attribute subset selection
 Example with a decision tree (left branches True, right False):
Initial attribute set: {A1, A2, A3, A4, A5, A6}
Reduced attribute set: {A1, A4, A6}
Figure: the tree tests A4? at the root, then A1? and A6?, and its leaves
are Class 1 / Class 2, so only A1, A4 and A6 are kept


28
Data Preprocessing
3. Data reduction
 Dimensionality: Principal Components Analysis
 Given N data vectors from n-dimensions, find k ≤ n orthogonal
vectors (principal components) that can be best used to represent
data
 Works for numeric data only
 Used when the number of dimensions is large
 Numerosity: Clustering
 Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only

Figure: compact data falls into 2 clusters; sparse data leads to many
clusters, making this reduction ineffective

29
Data Preprocessing
3. Data reduction
 Numerosity: Sampling
 obtaining a small sample s to represent the whole data set N
 Problem: How to select a representative sampling set
 Random sampling is not enough – representative samples should
be preserved
 Stratified sampling: Approximate the percentage of each class (or
subpopulation of interest) in the overall database
Figure: with simple random sampling some regions of the data yield no
samples at all; stratified sampling preserves each subpopulation
30
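A short sketch contrasting random and stratified sampling; the two-class data set and the 10% sampling fraction are illustrative assumptions.

```python
import random
from collections import defaultdict

# Illustrative data set: (record id, class label); class "B" is rare.
data = [(i, "A") for i in range(90)] + [(i, "B") for i in range(90, 100)]

# Simple random sampling: the rare class may end up with no samples at all.
random_sample = random.sample(data, 10)

# Stratified sampling: sample each class in proportion to its size.
by_class = defaultdict(list)
for record in data:
    by_class[record[1]].append(record)

fraction = 0.1
stratified_sample = []
for label, records in by_class.items():
    k = max(1, round(len(records) * fraction))
    stratified_sample.extend(random.sample(records, k))

print(len(stratified_sample))  # 9 records from class A, 1 from class B
```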
Data Preprocessing
4. Discretization - concept hierarchy
 Three types of attributes
 Nominal — values from an unordered set (color, profession)
 Ordinal — values from an ordered set (military or academic rank)
 Continuous — numbers (integer or real numbers)
 Discretization
 Divide the range of a continuous attribute into intervals
 Reduces data size and its complexity
 Some data mining algorithms do not support continuous types, and in
those cases discretization is mandatory
 Some useful methods:
 Binning, clustering (already presented)
 Entropy-based discretization (no details here)

31
Data Preprocessing
4. Discretization - concept hierarchy
 Concept hierarchy generation
 For categorical data
 Specification of an ordering between attributes (schema level)
• street < city < state < country
 Specification of a hierarchy of values (data level)
• {Urbana, Champaign, Chicago} < Illinois
 Automatic generation using the number of distinct values
• For the set of attributes: {street, city, state, country}
• IF: |street| = 600.000, |city|=3.000, |state|=300, |country|=15
• THEN: street < city < state < country

32
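A tiny sketch of the automatic generation rule above: attributes are ordered by their number of distinct values, and the attribute with the most distinct values ends up at the bottom (most specific level) of the hierarchy.

```python
# Distinct-value counts from the slide.
distinct_counts = {"street": 600_000, "city": 3_000, "state": 300, "country": 15}

# More distinct values -> lower (more specific) level in the concept hierarchy.
levels = sorted(distinct_counts, key=distinct_counts.get, reverse=True)
print(" < ".join(levels))  # street < city < state < country
```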
Outline
 Data Mining techniques
 Frequent patterns, association rules
• Support and confidence
 Classification and prediction
• Decision trees
• Bayesian classifiers
• Support Vector Machines
• Lazy learning
 Cluster Analysis
 Visualization of the results
 Summary

33
Data Mining techniques

34
Frequent pattern analysis
 What is it?
 Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
 Frequent pattern analysis: searching for frequent patterns
 Motivation: Finding inherent regularities in data
• Which products are bought together? Yesterday’s wine and spaghetti
example
• What are the subsequent purchases after buying a PC?
• Can we automatically classify web documents?
 Applications
• Basket data analysis
• Cross-marketing
• Catalog design
• Sale campaign analysis

35
Basic Concepts: Frequent Patterns
and Association Rules (1)
Transaction-id | Items bought
1 | Wine, Bread, Spaghetti
2 | Wine, Cocoa, Spaghetti
3 | Wine, Spaghetti, Cheese
4 | Bread, Cheese, Sugar
5 | Bread, Cocoa, Spaghetti, Cheese, Sugar
(Itemsets = transactions in this example)

Goal: find all rules of type X → Y between items in an itemset
with minimum:
Support s – probability that an itemset contains X ∪ Y
Confidence c – conditional probability that an itemset containing X
also contains Y

36
Support and confidence
That is:
support, s: probability that a transaction contains {A ∪ B}
s = P(A ∪ B)
confidence, c: conditional probability that a transaction
having A also contains B
c = P(B|A)
 Rules that satisfy both a minimum support threshold
(min_sup) and a minimum confidence threshold
(min_conf) are called strong.
 A small sketch computing both measures on the example
transactions follows below.

37
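The sketch referred to above: support and confidence computed directly from their definitions on the wine/spaghetti transactions of the next slide (plain Python, itemsets represented as sets).

```python
transactions = [
    {"Wine", "Bread", "Spaghetti"},
    {"Wine", "Cocoa", "Spaghetti"},
    {"Wine", "Spaghetti", "Cheese"},
    {"Bread", "Cheese", "Sugar"},
    {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs | lhs): support of the union divided by support of the left side."""
    return support(lhs | rhs) / support(lhs)

print(support({"Wine", "Spaghetti"}))       # 0.6
print(confidence({"Wine"}, {"Spaghetti"}))  # 1.0
print(confidence({"Spaghetti"}, {"Wine"}))  # 0.75
```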
Basic Concepts: Frequent Patterns
and Association Rules (2)
Transaction-id | Items bought
1 | Wine, Bread, Spaghetti
2 | Wine, Cocoa, Spaghetti
3 | Wine, Spaghetti, Cheese
4 | Bread, Cheese, Sugar
5 | Bread, Cocoa, Spaghetti, Cheese, Sugar

Suppose: minimum support s = 50%, minimum confidence c = 50%

Support is used to define frequent patterns (sets of
products appearing in more than s% of the itemsets)
{Wine} in itemsets 1, 2, 3 (support = 60%)
{Bread} in itemsets 1, 4, 5 (support = 60%)
{Spaghetti} in itemsets 1, 2, 3, 5 (support = 80%)
{Cheese} in itemsets 3, 4, 5 (support = 60%)
{Wine, Spaghetti} in itemsets 1, 2, 3 (support = 60%)
38
Basic Concepts: Frequent Patterns
and Association Rules (3)
Transaction-id | Items bought
1 | Wine, Bread, Spaghetti
2 | Wine, Cocoa, Spaghetti
3 | Wine, Spaghetti, Cheese
4 | Bread, Cheese, Sugar
5 | Bread, Cocoa, Spaghetti, Cheese, Sugar

Suppose: minimum support s = 50%, minimum confidence c = 50%

Confidence defines association rules: X → Y rules in frequent
patterns whose confidence is bigger than c
Suggestion: {Wine, Spaghetti} is the only frequent
pattern to be considered. Why?
Association rules:
Wine → Spaghetti (support=60%, confidence=100%)
Spaghetti → Wine (support=60%, confidence=75%)
39
Advanced concepts in Association
Rules discovery
 Algorithms must face scalability problems
 Apriori: If there is any itemset which is infrequent, its superset
should not be generated/tested!
 Advanced problems
 Boolean vs. quantitative associations
age(x, “30..39”) and income(x, “42..48K”) → buys(x, “car”)
[s=1%, c=75%]
 Single level vs. multiple-level analysis
What brands of wine are associated with what brands of
spaghetti?
Are support and confidence
clear?

40
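A compact sketch of the Apriori idea stated above, reusing the wine/spaghetti transactions: candidates are generated level by level, and any candidate with an infrequent subset is pruned before it is ever counted. This is a simplified illustration, not a tuned implementation.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Level-wise search: a k-itemset becomes a candidate only if all of its
    (k-1)-item subsets were frequent at the previous level (Apriori pruning)."""
    n = len(transactions)
    support = lambda s: sum(s <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]
    while frequent[-1]:
        prev = frequent[-1]
        size = len(next(iter(prev))) + 1
        candidates = {a | b for a in prev for b in prev if len(a | b) == size}
        # Prune candidates that contain any infrequent (size-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev for sub in combinations(c, size - 1))}
        frequent.append({c for c in candidates if support(c) >= min_support})
    return [itemset for level in frequent for itemset in level]

# Reusing the wine/spaghetti transactions from the earlier slides.
transactions = [{"Wine", "Bread", "Spaghetti"}, {"Wine", "Cocoa", "Spaghetti"},
                {"Wine", "Spaghetti", "Cheese"}, {"Bread", "Cheese", "Sugar"},
                {"Bread", "Cocoa", "Spaghetti", "Cheese", "Sugar"}]
print(apriori_frequent_itemsets(transactions, min_support=0.5))
```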
Another example for association
rules
Transaction-id | Items bought
1 | Margherita, Beer, Coke
2 | Margherita, Beer
3 | Quattro stagioni, Coke
4 | Margherita, Coke

Suppose: minimum support s = 40%, minimum confidence c = 70%

41
Another example for association
rules
Transaction-id | Items bought
1 | Margherita, Beer, Coke
2 | Margherita, Beer
3 | Quattro stagioni, Coke
4 | Margherita, Coke

Suppose: minimum support s = 40%, minimum confidence c = 70%

Frequent itemsets:
{Margherita} = 75%
{Beer} = 50%
{Coke} = 75%
{Margherita, Beer} = 50%
{Margherita, Coke} = 50%

Association rules:
Beer → Margherita [s=50%, c=100%]

42
Classification vs. Prediction
 Classification
 Characterizes (describes) a set of items belonging to a training
set; these items are already classified according to a label attribute
 The characterization is a model
 The model can be applied to classify new data (predict the class
they should belong to)
 Prediction
 models continuous-valued functions, i.e., predicts unknown or
missing values
 Applications
 Credit approval, target marketing, fraud detection

43
Classification: the process
1. Model construction
 The class label attribute defines the class each item should belong
to
 The set of items used for model construction is called training set
 The model is represented as classification rules, decision trees, or
mathematical formulae
2. Model usage
 Estimate accuracy of the model
• On the training set (an optimistic estimate)
• On data held out from the training set, to measure generalization
 If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known

44
Classification: the process
Model construction: the training data is fed to a classification
algorithm, which produces the classifier (model); a sketch applying the
resulting rule follows below.

Training Data:
NAME    | RANK           | YEARS | TENURED
Mike    | Assistant Prof | 3     | no
Mary    | Assistant Prof | 7     | yes
Bill    | Professor      | 2     | yes
Jim     | Associate Prof | 7     | yes
Dave    | Assistant Prof | 6     | no
Anne    | Associate Prof | 3     | no

Classifier (Model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

45
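The sketch announced above, applying the learned model; the function name tenured and the tuple layout are hypothetical, but the rule and the training tuples come from the slide.

```python
def tenured(rank, years):
    """The learned model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

training = [("Mike", "Assistant Prof", 3, "no"), ("Mary", "Assistant Prof", 7, "yes"),
            ("Bill", "Professor", 2, "yes"), ("Jim", "Associate Prof", 7, "yes"),
            ("Dave", "Assistant Prof", 6, "no"), ("Anne", "Associate Prof", 3, "no")]

# The rule reproduces every label in the training set ...
assert all(tenured(rank, years) == label for _, rank, years, label in training)
# ... and can then be applied to unseen data (Jeff from the next slide).
print(tenured("Professor", 4))  # -> "yes"
```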
Classification: the process
Model usage: the classifier
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
is checked against the testing data and then applied to unseen data,
e.g., (Jeff, Professor, 4) → Tenured?

Testing Data:
NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes

46
Supervised vs. Unsupervised
Learning
 Supervised learning (classification)
 Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
 New data is classified based on the training set
 Unsupervised learning (clustering)
 The class labels of training data are unknown
 Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data

47
Evaluating generated models
 Accuracy
 classifier accuracy: predicting class label
 predictor accuracy: guessing value of predicted attributes
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness
 handling noise and missing values
 Scalability
 efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model

48
Example of Classification

 Example: Suppose that we have a database of customers on


the AllElectronics mailing list. The database describes
attributes of the customers, such as their name, age, income,
occupation, and credit rating. The customers can be classified
as to whether or not they have purchased a computer at
AllElectronics.
 Suppose that new customers are added to the database and
that you would like to notify these customers of an upcoming
computer sale. Sending out promotional literature to every
new customer in the database can be quite costly. A more
cost-efficient method would be to target only those new
customers who are likely to purchase a new computer. A
classification model can be constructed and used for this
purpose.
49
Each internal node
represents a test on
an attribute. Each
leaf node represents
a class.

A decision tree for the concept buys_computer, indicating whether or


not a customer at AllElectronics is likely to purchase a computer.

(Figure source: Assoc. Prof. Dr. D. T. Anh)
50
Classification techniques
Decision Trees (1)
Investment type choice (decision tree figure, reconstructed below as code):
Income > 20K€?
  yes → Low risk
  no  → Age > 60?
          yes → Mid risk
          no  → Married?
                  yes → Mid risk
                  no  → High risk
51
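The code reconstruction announced above: the investment-risk tree as nested conditionals. The branch orientation (left = no, right = yes) is inferred from the flattened figure, so treat the exact assignments as illustrative.

```python
def investment_risk(income, age, married):
    """Decision tree from the slide, written as nested if/else tests."""
    if income > 20_000:
        return "Low risk"
    if age > 60:
        return "Mid risk"
    return "Mid risk" if married else "High risk"

print(investment_risk(income=35_000, age=40, married=False))  # Low risk
print(investment_risk(income=15_000, age=30, married=False))  # High risk
```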
Classification techniques
Decision Trees (2)
 How are the attributes in decision trees selected?
 Two well-known indexes are used
• Information gain selects the most informative attribute in
distinguishing the items between the classes
• It is biased towards attributes with a large set of values
• Gain ratio faces the information gain limitations

52
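A minimal sketch of how information gain scores an attribute: the entropy of the class labels minus the weighted entropy after splitting on that attribute. The tiny data set is made up purely for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum(p * log2(p)) over the class proportions."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    """Entropy reduction obtained by splitting `rows` on one attribute."""
    total = len(rows)
    remainder = 0.0
    for value in {row[attr_index] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attr_index] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Illustrative toy data: (income_level, age_band) -> buys_computer.
rows = [("high", "young"), ("high", "young"), ("low", "old"), ("low", "young")]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, 0, labels))  # income_level: 1.0 (perfect split)
print(information_gain(rows, 1, labels))  # age_band: ~0.31
```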
Classification techniques
Bayesian classifiers (2)
 Bayesian classification
 A statistical classification technique
• Predicts class membership probabilities
 Founded on the Bayes theorem:
P(H | X) = P(X | H) · P(H) / P(X)
• What if X = “Red and rounded” and H = “Apple”? (a numeric sketch follows below)
 Performance
• The simplest implementation (Naïve Bayes) can be compared to decision trees
and neural networks
 Incremental
• Each training example can increase/decrease the probability that a
hypothesis is correct

53
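The numeric sketch referred to above, for the "red and rounded" / "apple" question; all probabilities are made up purely for illustration.

```python
# Bayes theorem: P(H | X) = P(X | H) * P(H) / P(X)
# H = "the fruit is an apple", X = "the fruit is red and rounded".
p_x_given_h = 0.80   # P(red and rounded | apple)       -- assumed
p_h = 0.30           # P(apple), the prior              -- assumed
p_x = 0.40           # P(red and rounded), the evidence -- assumed

p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)   # 0.6: posterior probability that the fruit is an apple
```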
Other Classification Methods
 k-nearest neighbor classifier
 case-based reasoning
 Genetic algorithm
 Rough set approach
 Fuzzy set approaches

54
The k-Nearest Neighbor Algorithm

 All instances (samples) correspond to points in the n-dimensional
space.
 The nearest neighbors are defined in terms of Euclidean distance.
The Euclidean distance between two points X = (x1, x2, …, xn) and
Y = (y1, y2, …, yn) is
d(X, Y) = √( Σ i=1..n (xi − yi)² )
 When given an unknown sample xq, the k-Nearest Neighbor classifier
searches the space for the k training samples that are closest to it.
The unknown sample is assigned the most common class among its k
nearest neighbors; in other words, the algorithm lets the k nearest
neighbors vote to determine the most common class. When k = 1, the
unknown sample is assigned the class of the training sample that is
closest to it in the space.
 Once we have obtained xq’s k nearest neighbors using the distance
function, it is time for the neighbors to vote in order to determine
xq’s class (a small sketch follows below).
55
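The sketch referred to above: a minimal plain-Python k-Nearest Neighbor classifier using Euclidean distance and a majority vote among the k closest training samples; the 2-D points are illustrative.

```python
from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

def knn_classify(xq, training, k=3):
    """Assign xq the most common class among its k nearest training samples."""
    neighbors = sorted(training, key=lambda item: dist(item[0], xq))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Illustrative training samples: (point, class label).
training = [((1.0, 1.0), "red"), ((1.2, 0.8), "red"), ((0.9, 1.1), "red"),
            ((5.0, 5.0), "blue"), ((5.2, 4.8), "blue")]

print(knn_classify((1.1, 1.0), training, k=3))  # red
print(knn_classify((4.9, 5.1), training, k=3))  # blue
```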
Genetic Algorithms

 GA: based on an analogy to biological evolution


 Each rule is represented by a string of bits.
 Example: The rule “IF A1 and Not A2 then C2“ can be
encoded as the bit string “100”, where the two left bits
represent attributes A1 and A2, respectively, and the rightmost
bit represents the class. Similarly, the rule “IF NOT A1 AND
NOT A2 THEN C1” can be encoded as “001”.
 An initial population is created consisting of randomly generated
rules
 Based on the notion of survival of the fittest, a new population is
formed to consist of the fittest rules and their offspring
 The fitness of a rule is represented by its classification accuracy
on a set of training examples
 Offspring are generated by crossover and mutation.

56
5 minutes break!

57
Classification techniques
Support Vector Machines
 One of the most advanced classification techniques
 Left figure: a decision boundary with a small margin between the classes
 Right figure: the decision boundary with the largest margin
 Support vector machines (SVMs) find the maximum-margin boundary of the right figure

58
Classification techniques
SVMs + Kernel Functions
 Is data always linearly separable?
 NO!!!
 Solution: SVMs + Kernel Functions

Figure (three panels): “How to split this?” (a non-linearly-separable data
set), the boundary found by a plain SVM, and the boundary found by an
SVM + Kernel Functions; a sketch contrasting the two follows below

59
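The sketch announced above, assuming scikit-learn is available: a linear SVM against an SVM with an RBF kernel on concentric-circle data that no straight line can separate; the data set and the quoted accuracies are indicative only.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings of points: the classes are not linearly separable.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
kernel_svm = SVC(kernel="rbf").fit(X, y)   # the kernel handles the non-linearity

print("linear SVM accuracy:", linear_svm.score(X, y))      # around 0.5
print("RBF-kernel SVM accuracy:", kernel_svm.score(X, y))  # close to 1.0
```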
Classification techniques
Lazy learning
 Lazy learning
 Simply stores training data (or only minor processing) and waits
until it is given a test tuple
 Less time in training but more time in predicting
 Uses a richer hypothesis space (many local linear functions),
and hence the accuracy can be higher
 Instance-based learning
 Subcategory of lazy learning
 Store training examples and delay the processing (“lazy
evaluation”) until a new instance must be classified
 An example: k-nearest neighbor approach

60
Classification techniques
k-nearest neighbor
 All instances correspond to points in the n-Dimensional
space – x is the instance to be classified
 The nearest neighbor are defined in terms of Euclidean
distance, dist(X1, X2)
 For discrete-valued, k-NN returns the most common value
among the k training examples nearest to x

 Which class should the green circle belong to? It depends on k!
k = 3 → Red
k = 5 → Blue

61
Prediction techniques
An overview
 Prediction is different from classification
 Classification refers to predict categorical class label
 Prediction models continuous-valued functions
 Major method for prediction: regression
 model the relationship between one or more independent or
predictor variables and a dependent or response variable
 Regression analysis
 Linear and multiple regression
 Non-linear regression
 Other regression methods: generalized linear model, Poisson
regression, log-linear models, regression trees
 No details here

62
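A minimal sketch of linear regression, the major prediction method named above: a least-squares fit of y ≈ w·x + b on illustrative data, in plain Python.

```python
# Simple (one-variable) linear regression via the least-squares formulas.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]    # predictor (independent) variable
ys = [2.1, 4.0, 6.2, 7.9, 10.1]   # response (dependent) variable

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
b = mean_y - w * mean_x

print(f"y ~ {w:.2f} * x + {b:.2f}")  # roughly y ~ 2x
print(w * 6 + b)                     # predict the response at the unseen x = 6
```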
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to the characteristics
found in the data and grouping similar data objects into clusters
 It belongs to unsupervised learning
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms (day 1 slides)

63
Examples of cluster analysis
 Marketing:
 Help marketers discover distinct groups in their customer bases
 Land use:
 Identification of areas of similar land use in an earth observation
database
 Insurance:
 Identifying groups of motor insurance policy holders with a high
average claim cost
 City-planning:
 Identifying groups of houses according to their house type, value,
and geographical location

64
Good clustering
 A good clustering method will produce high quality clusters
with
 high intra-class similarity
 low inter-class similarity
 Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, typically metric: d(i, j)
 The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal,
ratio, and vector variables.
 It is hard to define “similar enough” or “good enough”

65
A small example
How to cluster this data?

This process is not easy in practice. Why?

66
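One concrete way to cluster data like this is k-means; the algorithm is not named on the slides, so the sketch below is purely illustrative (plain Python, 2-D points).

```python
import random
from math import dist

def k_means(points, k, iterations=20):
    """Tiny k-means: assign each point to its nearest centroid, then
    recompute every centroid as the mean of its assigned points."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [tuple(sum(c) / len(cluster) for c in zip(*cluster))
                     if cluster else centroids[i]
                     for i, cluster in enumerate(clusters)]
    return centroids, clusters

# Two visually obvious groups of 2-D points (illustrative data).
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
random.seed(0)
centroids, clusters = k_means(points, k=2)
print(centroids)  # one centroid near (1, 1), the other near (5, 5)
```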
Visualization of the results
 Presentation of the results or knowledge obtained from
data mining in visual forms
 Examples
 Scatter plots
 Association rules
 Decision trees
 Clusters

67
Summary
 Data Mining and KDD
 Why Data Mining?
 Some scenarios
 Data preprocessing
 Classification
 Clustering

68
