


SCSA3001 Data Mining And Data Warehousing

ASSOCIATION RULE MINING


Mining frequent patterns - Associations and correlations - Mining methods -
Finding Frequent itemset using Candidate Generation - Generating Association
Rules from Frequent Itemsets - Mining Frequent itemset without Candidate
Generation-Mining various kinds of association rules - Mining Multi-Level
Association Rule-Mining Multi-Dimensional Association Rule Mining Correlation
analysis - Constraint based association mining.
MINING FREQUENT PATTERNS
 Frequent patterns are patterns (e.g., itemsets, subsequences, or substructures) that appear
frequently in a data set. For example, a set of items, such as milk and bread that appear
frequently together in a transaction data set is a frequent itemset.
 A subsequence, such as buying first a PC, then a digital camera, and then a memory card, if it
occurs frequently in a shopping history database, is a (frequent) sequential pattern.
 A substructure can refer to different structural forms, such as subgraphs, subtrees, or
sublattices, which may be combined with itemsets or subsequences. If a substructure occurs
frequently, it is called a (frequent) structured pattern
Applications
Market Basket Analysis: given a database of customer transactions, where each transaction is a
set of items the goal is to find groups of items which are frequently purchased together.
Telecommunication (each customer is a transaction containing the set of phone calls)
Credit Cards/ Banking Services (each card/account is a transaction containing the set of
customer’s payments)
Medical Treatments (each patient is represented as a transaction containing the ordered set of
diseases)
Basketball-Game Analysis (each game is represented as a transaction containing the ordered set
of ball passes)
Association Rule Definitions
 I={i1, i2, ..., in}: a set of all the items
 Transaction T: a set of items such that T ⊆ I


 Transaction Database D: a set of transactions


 A transaction T ⊆ I contains an itemset A ⊆ I if A ⊆ T
 An Association Rule: is an implication of the form A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
It has two measures
1. Support
2. Confidence
 The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of
transactions in D that contain A ∪ B (i.e., the union of sets A and B, or say, both A and B). This is
taken to be the probability, P(A ∪ B)
 The rule A  B has confidence c in the transaction set D, where c is the percentage of
transactions in D containing A that also contain B. This is taken to be the conditional
probability, P(B|A).
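The two measures above can be computed directly from a transaction database. The sketch below is illustrative; the toy transactions are invented for the example, not taken from the text.

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`, i.e. P(itemset)."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(A, B, transactions):
    """P(B | A): among transactions containing A, the fraction also containing B."""
    return support(A | B, transactions) / support(A, transactions)

db = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread"}, {"milk"}]
print(support({"milk", "bread"}, db))       # 0.5    -> s for the rule milk => bread
print(confidence({"milk"}, {"bread"}, db))  # ~0.667 -> c for the rule milk => bread
```

Note that the rule's support uses the union A ∪ B, while its confidence divides that by the support of the antecedent A alone, matching the definitions above.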

Examples
Rule form: "Body ⇒ Head [support, confidence]".
buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
major(x, "CS") ∧ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]
ASSOCIATIONS AND CORRELATIONS
Association Rule: Basic Concepts
Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer
in a visit)
Find: all rules that correlate the presence of one set of items with that of another set of items
E.g., 98% of people who purchase tires and auto accessories also get automotive services done
Applications
– * ⇒ Maintenance Agreement (what should the store do to boost Maintenance
Agreement sales?)
– Home Electronics ⇒ * (what other products should the store stock up on?)


– Attached mailing in direct marketing


– Detecting "ping-ponging" of patients, faulty "collisions"
Rule Measures: Support and Confidence
Find all the rules X ∧ Y ⇒ Z with minimum confidence and support
– Support, s, probability that a transaction contains {X ∪ Y ∪ Z}
– Confidence, c, conditional probability that a transaction having {X ∪ Y} also contains Z
With minimum support 50% and minimum confidence 50%, we have
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)

Table 3.1
Association Rule Mining: A Road Map
• Boolean vs. quantitative associations (Based on the types of values handled)
– buys(x, "SQLServer") ∧ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
– age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
• Single dimension vs. multiple dimensional associations (see ex. above)
• Single level vs. multiple-level analysis
– What brands of beers are associated with what brands of diapers?
• Various extensions
– Correlation, causality analysis
• Association does not necessarily imply correlation or causality
– Maxpatterns and closed itemsets
– Constraints enforced
• E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?


Market – Basket analysis


A market basket is a collection of items purchased by a customer in a single transaction, which is
a well-defined business activity. For example, a customer's visits to a grocery store or an online
purchase from a virtual store on the Web are typical customer transactions. Retailers accumulate
huge collections of transactions by recording business activities over time. One common analysis
run against a transactions database is to find sets of items, or itemsets, that appear together in
many transactions. A business can use knowledge of these patterns to improve the placement of
these items in the store or the layout of mail-order catalog pages and Web pages. An itemset
containing i items is called an i-itemset. The percentage of transactions that contain an itemset is
called the itemset's support. For an itemset to be interesting, its support must be higher than a
user-specified minimum. Such itemsets are said to be frequent.

Figure 3.1 Market Basket Analysis


computer ⇒ financial_management_software [support = 2%, confidence = 60%]
Rule support and confidence are two measures of rule interestingness. They respectively reflect
the usefulness and certainty of discovered rules. A support of 2% for association Rule means that
2% of all the transactions under analysis show that computer and financial management software
are purchased together. A confidence of 60% means that 60% of the customers who purchased a
computer also bought the software. Typically, association rules are considered interesting if they
satisfy both a minimum support threshold and a minimum confidence threshold.


MINING METHODS
• Mining Frequent Pattern with candidate generation
• Mining Frequent Pattern without candidate generation
MINING FREQUENT PATTERNS WITH CANDIDATE GENERATION
The method that mines the complete set of frequent item sets with candidate generation.
Apriori property & The Apriori Algorithm. Apriori property
• All nonempty subsets of a frequent itemset must also be frequent.
– If an itemset I does not satisfy the minimum support threshold min_sup, then I is not frequent,
i.e., support(I) < min_sup.
– If an item A is added to the itemset I, then the resulting itemset (I ∪ A) cannot occur more
frequently than I.
• Monotonic functions are functions that move in only one direction.
• This property is called anti-monotonic.
• If a set cannot pass a test, all its supersets will fail the same test as well.
• This property is monotonic in failing the test.
The Apriori Algorithm
• Join Step: Ck is generated by joining Lk−1 with itself
• Prune Step: Any (k−1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
Method

1) L1 = find_frequent_1-itemsets(D);
2) for (k = 2; Lk−1 ≠ ∅; k++) {
3) Ck = apriori_gen(Lk−1, min_sup);
4) for each transaction t ∈ D { // scan D for counts
5) Ct = subset(Ck, t); // get the subsets of t that are candidates
6) for each candidate c ∈ Ct
7) c.count++;
8) }
9) Lk = {c ∈ Ck | c.count ≥ min_sup}
10) }
11) return L = ∪k Lk;


Procedure apriori_gen(Lk−1: frequent (k−1)-itemsets; min_sup: minimum support)


1) for each itemset l1 ∈ Lk−1
2) for each itemset l2 ∈ Lk−1
3) if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ … ∧ (l1[k−2] = l2[k−2]) ∧ (l1[k−1] < l2[k−1]) then {
4) c = l1 ⋈ l2; // join step: generate candidates
5) if has_infrequent_subset(c, Lk−1) then
6) delete c; // prune step: remove unfruitful candidate
7) else add c to Ck;
8) }
9) return Ck;
Procedure has_infrequent_subset(c: candidate k-itemset; Lk−1: frequent (k−1)-itemsets); // use
prior knowledge
1) for each (k−1)-subset s of c
2) if s ∉ Lk−1 then
3) return TRUE;
4) return FALSE;
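The pseudocode above can be condensed into a short runnable sketch. This version simplifies the join step (it merges any two frequent k-itemsets that differ in one item rather than using the lexicographic prefix join), which yields the same candidate set after pruning; treating min_sup as an absolute count is an assumption of this sketch.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {k: set of frequent k-itemsets} with support count >= min_sup."""
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(1 for t in transactions if i in t) >= min_sup}
    L, k = {}, 1
    while Lk:
        L[k] = Lk
        # join step: merge frequent k-itemsets into (k+1)-candidates
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # prune step: drop candidates with any infrequent k-subset (Apriori property)
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k))}
        # scan D for counts
        Lk = {c for c in transactions and Ck or Ck
              if sum(1 for t in transactions if c <= t) >= min_sup}
        k += 1
    return L

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(db, 2)[3])  # the only frequent 3-itemset: {B, C, E}
```

With min_sup = 2, the join produces {A,B,C}, {A,C,E} and {B,C,E} as 3-candidates; the first two are pruned because {A,B} and {A,E} are infrequent, leaving {B,C,E}.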

Table 3.1 Mining Frequent Patterns with candidate Generation


MINING FREQUENT ITEM SET WITHOUT CANDIDATE GENERATION


Frequent Pattern Growth Tree Algorithm
It grows long patterns from short ones using local frequent items
• “abc” is a frequent pattern
• Get all transactions having “abc”: DB|abc
• "d" is a local frequent item in DB|abc ⇒ "abcd" is a frequent pattern
Construct FP-tree from a Transaction Database

Table 3.2 Construct FP-tree from a Transaction Database


Find Patterns Having P from P-conditional Database


• Starting at the frequent item header table in the FP-tree
• Traverse the FP-tree by following the link of each frequent item P
• Accumulate all the transformed prefix paths of item P to form P's conditional pattern base
Benefits of the FP-tree Structure
• Completeness:
– Never breaks a long pattern of any transaction
– Preserves complete information for frequent pattern mining
• Compactness
– Reduce irrelevant information—infrequent items are gone
– Frequency descending ordering: more frequent items are more likely to be shared
– Never larger than the original database (not counting node-links and counts)
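The compactness properties above follow directly from how the tree is built. Below is a minimal construction sketch (node-links and the mining phase are omitted for brevity; this illustrates the tree-building step, not the full FP-growth algorithm).

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_sup):
    """Build an FP-tree: infrequent items are dropped, and each transaction's
    frequent items are inserted in frequency-descending order so that shared
    prefixes collapse into shared paths."""
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_sup}  # infrequent items are gone
    root = Node(None, None)
    for t in transactions:
        # frequency-descending ordering: more frequent items are more likely shared
        path = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in path:
            if item in node.children:
                node.children[item].count += 1
            else:
                node.children[item] = Node(item, node)
            node = node.children[item]
    return root
```

On the four-transaction example database used earlier, item D (support 1) never enters the tree, and the three transactions containing B share the single prefix node B with count 3.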
MINING VARIOUS KINDS OF ASSOCIATION RULES

• Mining Multi-level association rule


• Mining Multi dimensional Association Rule
Mining multilevel association rules from transactional databases.

Figure 3.1 Mining multilevel association rules from transactional databases.


• Items often form a hierarchy.


• Items at the lower level are expected to have lower support.
• Rules regarding itemsets at appropriate levels could be quite useful.
• Transaction database can be encoded based on dimensions and levels
• We can explore shared multi-level mining
MINING MULTI-LEVEL ASSOCIATIONS
• A top-down, progressive deepening approach:
– First find high-level strong rules:
milk ⇒ bread [20%, 60%].
– Then find their lower-level "weaker" rules:
2% milk ⇒ wheat bread [6%, 50%].
• Variations at mining multiple-level association rules.
– Level-crossed association rules:
2% milk ⇒ Wonder wheat bread
– Association rules with multiple, alternative hierarchies:
2% milk ⇒ Wonder bread
Multi-level Association: Uniform Support vs. Reduced Support
• Uniform Support: the same minimum support for all levels
– + One minimum support threshold. No need to examine itemsets containing any item whose
ancestors do not have minimum support.
– Lower-level items do not occur as frequently. If the support threshold is
• too high ⇒ miss low-level associations
• too low ⇒ generate too many high-level associations
• Reduced Support: reduced minimum support at lower levels
– There are 4 search strategies:
• Level-by-level independent
• Level-cross filtering by k-itemset
• Level-cross filtering by single item
• Controlled level-cross filtering by single item


Multi-level Association: Redundancy Filtering


• Some rules may be redundant due to "ancestor" relationships between items.
• Example
– milk ⇒ wheat bread [support = 8%, confidence = 70%]
– 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule.
• A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor
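As a worked illustration of this redundancy test (the 25% market share of 2% milk is an assumed figure for the example, not from the text):

```python
# If 2% milk accounts for, say, one quarter of all milk sold (assumed share),
# the expected support of the descendant rule is the ancestor rule's support
# scaled by that share.
ancestor_support = 0.08    # milk => wheat bread
share_2pct_milk = 0.25     # assumed fraction of milk purchases that are 2% milk
expected_support = ancestor_support * share_2pct_milk   # 0.02
observed_support = 0.02    # 2% milk => wheat bread

# observed support matches the expected value, so the descendant rule adds
# no new information beyond its ancestor: it is redundant
is_redundant = abs(observed_support - expected_support) < 0.005
print(is_redundant)  # True
```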
Multi-Level Mining: Progressive Deepening
• A top-down, progressive deepening approach:
 First mine high-level frequent items: milk (15%), bread (10%)
 Then mine their lower-level "weaker" frequent itemsets: 2% milk (5%), wheat bread (4%)
• Different min_support thresholds across multi-levels lead to different algorithms:
 If adopting the same min_support across multi-levels, then toss t if any of t's ancestors is
infrequent.
 If adopting reduced min_support at lower levels, then examine only those descendants whose
ancestor's support is frequent/non-negligible
Mining Multidimensional Association Rules
Mining our AllElectronics database, we may discover the Boolean association rule
buys(X, "digital camera") ⇒ buys(X, "HP printer").
Following the terminology used in multidimensional databases, we call this a single-dimensional
or intradimensional association rule because it contains a single distinct predicate (e.g., buys)
with multiple occurrences (i.e., the predicate occurs more than once within the rule). Such rules
are commonly mined from transactional data.
Considering each database attribute or warehouse dimension as a predicate, we can therefore mine
association rules containing multiple predicates such as
age(X, "20...29") ∧ occupation(X, "student") ⇒ buys(X, "laptop").
Association rules that involve two or more dimensions or predicates can be referred to as
multidimensional association rules. Rule contains three predicates (age, occupation, and buys),
each of which occurs only once in the rule. Hence, we say that it has no repeated predicates.
Multidimensional association rules with no repeated predicates are called inter dimensional
association rules. We can also mine multidimensional association rules with repeated predicates,


which contain multiple occurrences of some predicates. These rules are called hybrid-dimensional
association rules.
An example of such a rule is the following, where the predicate buys is repeated:
age(X, "20...29") ∧ buys(X, "laptop") ⇒ buys(X, "HP printer").
Database attributes can be nominal or quantitative. The values of nominal (or categorical)
attributes are “names of things.” Nominal attributes have a finite number of possible values, with
no ordering among the values (e.g., occupation, brand, color)
Quantitative attributes are numeric and have an implicit ordering among values (e.g., age, income,
price). Techniques for mining multidimensional association rules can be categorized into two
basic approaches regarding the treatment of quantitative attributes. In the first approach,
quantitative attributes are discretized using predefined concept hierarchies. This discretization
occurs before mining. For instance, a concept hierarchy for income may be used to replace the
original numeric values of this attribute by interval labels such as “0..20K,” “21K..30K,”
“31K..40K,” and so on.
Here, discretization is static and predetermined. Chapter 3 on data preprocessing gave several
techniques for discretizing numeric attributes. The discretized numeric attributes, with their
interval labels, can then be treated as nominal attributes (where each interval is considered
a category).
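A minimal sketch of such static, predefined discretization, using the interval labels quoted above (the exact bin edges are an assumption for illustration):

```python
def discretize_income(value):
    """Map a numeric income to an interval label via a predefined concept
    hierarchy; the discretized label can then be treated as a nominal value."""
    bins = [(0, 20_000, "0..20K"),
            (20_001, 30_000, "21K..30K"),
            (30_001, 40_000, "31K..40K")]
    for lo, hi, label in bins:
        if lo <= value <= hi:
            return label
    return ">40K"  # catch-all for incomes beyond the listed intervals

print(discretize_income(25_500))  # 21K..30K
```

Because the labels replace the raw numbers before mining starts, the discretization is static and predetermined, exactly as described above.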
Mining Quantitative Association Rules
• Determine the number of partitions for each quantitative attribute
• Map values/ranges to consecutive integer values such that the order is preserved
• Find the support of each value of the attributes, and combine values when support is less than
MaxSup. Find frequent itemsets whose support is larger than MinSup
• Use the frequent sets to generate association rules
• Prune uninteresting rules
Partial Completeness
• R : rules obtained before partition
• R’: rules obtained after partition
• Partial Completeness measures the maximum distance between a rule in R and its closest
generalization in R’
• 𝑋̂ is a generalization of itemset X if, for every quantitative attribute x in X with interval
⟨x, l, u⟩, 𝑋̂ contains ⟨x, l′, u′⟩ such that l′ ≤ l ≤ u ≤ u′


• The distance is defined by the ratio of support
K-Complete
• C: the set of frequent itemsets
• For any K ≥ 1, P is K-complete w.r.t C if:
1. P ⊆ C
2. For any itemset X (or its subset) in C, there exists a generalization whose support is no more
than K times that of X (or its subset)
• The smaller K is, the less information is lost
CORRELATION ANALYSIS
• Interest (correlation, lift)
– takes both P(A) and P(B) into consideration
– lift(A, B) = P(A ∧ B) / (P(A) × P(B)); if A and B are independent events, P(A ∧ B) = P(A) × P(B)
and the lift is 1
– A and B are negatively correlated if the value is less than 1; otherwise A and B are positively
correlated
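A small sketch of the lift computation (the toy transactions are invented for illustration):

```python
def lift(A, B, transactions):
    """lift(A, B) = P(A and B) / (P(A) * P(B)); 1 means A and B are independent."""
    n = len(transactions)
    pA = sum(1 for t in transactions if A <= t) / n
    pB = sum(1 for t in transactions if B <= t) / n
    pAB = sum(1 for t in transactions if (A | B) <= t) / n
    return pAB / (pA * pB)

db = [{"milk", "bread"}, {"milk", "bread", "butter"}, {"bread"}, {"milk"}]
print(round(lift({"milk"}, {"bread"}, db), 3))  # 0.889 -> < 1: negatively correlated
```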
χ² Correlation
• χ² measures correlation between categorical attributes

Table 3.2 Correlation


 Expected (i,j) = count(row i) * count(column j) / N


 χ² = (4000 − 4500)²/4500 + (3500 − 3000)²/3000 + (2000 − 1500)²/1500 + (500 − 1000)²/1000
= 555.6
 Since χ² > 1 and the observed value of (game, video) is less than its expected value, game and
video are negatively correlated
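The χ² figure above can be reproduced from the observed counts (4000, 3500, 2000 and 500, with N = 10000, as implied by the expected values used in the text):

```python
# 2x2 contingency table: rows = (game, not game), columns = (video, not video)
observed = [[4000, 3500],
            [2000, 500]]

N = sum(map(sum, observed))
row = [sum(r) for r in observed]              # row totals: 7500, 2500
col = [sum(c) for c in zip(*observed)]        # column totals: 6000, 4000

# expected(i, j) = count(row i) * count(column j) / N, then sum (O - E)^2 / E
chi2 = sum((observed[i][j] - row[i] * col[j] / N) ** 2 / (row[i] * col[j] / N)
           for i in range(2) for j in range(2))
print(round(chi2, 1))  # 555.6
```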
Numeric correlation
• Correlation concept in statistics
– Used to study the relationship existing between 2 or more numeric variables
– A correlation is a measure of the linear relationship between variables Ex: number of hours
spent studying in a class with grade received
– Outcomes: the variables may be positively related, not related, or negatively related
– Statistical relationships
• Covariance
• Correlation coefficient
CONSTRAINT-BASED ASSOCIATION MINING

• Interactive, exploratory mining of giga-bytes of data?

– Could it be real? — Making good use of constraints!

• What kinds of constraints can be used in mining?

– Knowledge type constraint: classification, association, etc.

• Find product pairs sold together in Vancouver in Dec.’98.

– Dimension/level constraints:

• in relevance to region, price, brand, customer category.

– Rule constraints

• small sales (price < $10) triggers big sales (sum > $200).

– Interestingness constraints:

• Strong rules (min_support ≥ 3%, min_confidence ≥ 60%).


Rule Constraints in Association Mining

• Two kind of rule constraints:

– Rule form constraints: meta-rule guided mining.

• P(x, y) ∧ Q(x, w) ⇒ takes(x, "database systems").

– Rule (content) constraint: constraint-based query optimization (Ng, et al., SIGMOD’98).

• sum(LHS) < 100 ∧ min(LHS) > 20 ∧ count(LHS) > 3 ∧ sum(RHS) > 1000

• 1-variable vs. 2-variable constraints (Lakshmanan, et al. SIGMOD’99):

– 1-var: A constraint confining only one side (L/R) of the rule, e.g., as shown above.

– 2-var: A constraint confining both sides (L and R).

• sum(LHS) < min(RHS) ∧ max(RHS) < 5 × sum(LHS)

Constraint-Based Association Query


• Database: (1) trans (TID, Itemset ), (2) itemInfo (Item, Type, Price)
• A constrained association query (CAQ) is in the form of {(S1, S2) | C},
– where C is a set of constraints on S1, S2, including a frequency constraint
• A classification of (single-variable) constraints:
– Class constraint: S ⊆ A, e.g., S ⊆ Item
Constrained Association Query Optimization Problem
• Given a CAQ = {(S1, S2) | C}, the algorithm should be:
– Sound: it only finds frequent sets that satisfy the given constraints C
– Complete: all frequent sets that satisfy the given constraints C are found
• A naïve solution:
– Apply Apriori to find all frequent sets, then test them for constraint satisfaction one by one.
• Our approach:
– Comprehensive analysis of the properties of constraints and try to push them as deeply as
possible inside the frequent set computation.


Categories of Constraints
1. Anti-monotone and Monotone Constraints
• A constraint Ca is anti-monotone iff, for any pattern S not satisfying Ca, none of the
super-patterns of S can satisfy Ca
• A constraint Cm is monotone iff, for any pattern S satisfying Cm, every super-pattern of S also
satisfies it
2. Succinct Constraint
• A subset of items Is ⊆ I is a succinct set if it can be expressed as σp(I) for some selection
predicate p, where σ is the selection operator
• SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, …, Ik ⊆ I such
that SP can be expressed in terms of the strict power sets of I1, …, Ik using union and minus
• A constraint Cs is succinct provided SATCs(I) is a succinct power set
3. Convertible Constraint
• Suppose all items in patterns are listed in a total order R
• A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that
each suffix of S w.r.t. R also satisfies C
• A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that each
pattern of which S is a suffix w.r.t. R also satisfies C
Property of Constraints: Anti-Monotone
• Anti-monotonicity: If a set S violates the constraint, any superset of S violates the constraint.
• Examples:
– sum(S.price) ≤ v is anti-monotone
– sum(S.price) ≥ v is not anti-monotone
– sum(S.price) = v is partly anti-monotone
• Application:
– Push "sum(S.price) ≤ 1000" deeply into iterative frequent set computation.
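A sketch of that push: because sum(S.price) ≤ v is anti-monotone, any set violating the budget can be discarded along with all its supersets, without ever counting their support (the price table and candidates here are illustrative):

```python
# Illustrative price table and candidate itemsets (assumed for the example).
price = {"tv": 800, "camera": 300, "pen": 2}

def violates_budget(itemset, v=1000):
    """Anti-monotone test: once sum of prices exceeds v, every superset does too."""
    return sum(price[i] for i in itemset) > v

candidates = [{"pen"}, {"camera"}, {"tv", "camera"}, {"tv", "camera", "pen"}]
# {"tv", "camera"} sums to 1100 > 1000, so it is pruned; its superset fails too,
# which is exactly why the constraint can be pushed inside candidate generation.
survivors = [s for s in candidates if not violates_budget(s)]
print(survivors)  # [{'pen'}, {'camera'}]
```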
Property of Constraints: Succinctness
• Succinctness:
– For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C
– Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1,
i.e., it contains a subset belonging to A1


• Example :
– sum(S.price) ≥ v is not succinct
– min(S.price) ≤ v is succinct
Optimization:
– If C is succinct, then C is pre-counting prunable. The satisfaction of the constraint alone is not
affected by the iterative support counting.

PART-A

Q. No Questions Competence BT Level

1. Define association and correlations. Remember BTL-1

2. List the ways in which interesting patterns should be mined. Remember BTL-1

3. Are all the patterns generated interesting and useful? Give reasons to justify. Understand BTL-2

4. Compare the advantages of the FP-growth algorithm over the Apriori algorithm. Analyze BTL-4

5. How will you apply the FP-growth algorithm in data mining? Apply BTL-3

6. How will you apply pattern mining in multilevel space? Apply BTL-3

7. Analyze constraint-based frequent pattern mining. Analyze BTL-4

8. Evaluate the classification using frequent patterns. Evaluate BTL-5

9. Generalize on mining closed and max patterns. Create BTL-6

10. Define correlation and market basket analysis. Remember BTL-1


PART-B

Q. No Questions Competence BT Level

1. Define Market Basket Analysis. Describe frequent itemsets, closed itemsets and association rules. Remember BTL-1

2. Discuss constraint-based association rule mining with examples and state how association mining to correlation analysis is dealt with. Apply BTL-3

3. Find all frequent itemsets for the given training set using Apriori and FP-growth respectively. Compare the efficiency of the two mining processes. (13) Apply BTL-3
TID ITEMS BOUGHT
T100 {M, O, N, K, E, Y}
T200 {D, O, N, K, E, Y}
T300 {M, A, K, E}
T400 {M, U, C, K, Y}
T500 {C, O, O, K, I, E}

4. i) How would you summarize in detail about mining methods? (6) ii) Summarize in detail about various kinds of association rules. (7) Understand BTL-2

5. i) What is the interestingness of a pattern? (5) ii) Summarize the various classification methods using frequent patterns. (10) Create BTL-6

6. Analyze the various frequent itemset mining methods with examples. Analyze BTL-4

7. Generalize how pattern mining is done in multilevel and multidimensional space with necessary examples. Create BTL-6


TEXT / REFERENCE BOOKS


1. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 2nd Edition,
Elsevier, 2007.
2. Alex Berson and Stephen J. Smith, "Data Warehousing, Data Mining & OLAP", Tata McGraw
Hill, 2007.
3. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, "Introduction to Data Mining", Pearson
Education, 2007.
4. K.P. Soman, Shyam Diwakar and V. Ajay, "Insight into Data Mining: Theory and Practice",
Easter Economy Edition, Prentice Hall of India, 2006.
5. G. K. Gupta, "Introduction to Data Mining with Case Studies", Easter Economy Edition,
Prentice Hall of India, 2006.
6. Daniel T. Larose, "Data Mining Methods and Models", Wiley-Interscience, 2006.
