Data Mining M2
Examples
Rule form: "Body ⇒ Head [support, confidence]".
buys(x, "diapers") ⇒ buys(x, "beers") [0.5%, 60%]
major(x, "CS") ∧ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]
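Both measures can be computed directly from a transaction database. Below is a minimal Python sketch (the transactions and item names are made up for illustration): support is the fraction of transactions containing both body and head, and confidence is the fraction of body-containing transactions that also contain the head.

# Minimal sketch of support/confidence, with made-up transactions.
transactions = [
    {"diapers", "beers", "milk"},
    {"diapers", "beers"},
    {"milk", "bread"},
    {"diapers", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

body, head = {"diapers"}, {"beers"}
sup = support(body | head, transactions)   # P(body and head) = 0.50
conf = sup / support(body, transactions)   # P(head | body) ~= 0.67
print(f"support={sup:.2f}, confidence={conf:.2f}")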
ASSOCIATIONS AND CORRELATIONS
Association Rule: Basic Concepts
Given: (1) a database of transactions; (2) each transaction is a list of items (purchased by a customer in one visit)
Find: all rules that correlate the presence of one set of items with that of another set of items
E.g., 98% of people who purchase tires and auto accessories also have automotive services done
Applications
– Maintenance agreement (what should the store do to boost maintenance agreement sales?)
– Home electronics (what other products should the store stock up on?)
Association Rule Mining: A Road Map
• Boolean vs. quantitative associations (based on the types of values handled)
– buys(x, "SQLServer") ∧ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
– age(x, "30..39") ∧ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]
• Single dimension vs. multiple dimensional associations (see ex. above)
• Single-level vs. multiple-level analysis
– What brands of beers are associated with what brands of diapers?
• Various extensions
– Correlation, causality analysis
• Association does not necessarily imply correlation or causality
– Max-patterns and closed itemsets
– Constraints enforced
• E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?
MINING METHODS
• Mining Frequent Pattern with candidate generation
• Mining Frequent Pattern without candidate generation
MINING FREQUENT PATTERNS WITH CANDIDATE GENERATION
This method mines the complete set of frequent itemsets using candidate generation.
Apriori Property
• All nonempty subsets of a frequent itemset must also be frequent.
– If an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent, i.e., support(I) < min_sup.
– If an item A is added to the itemset I, then the resulting itemset (I ∪ A) cannot occur more frequently than I, so I ∪ A is not frequent either.
• Monotonic functions are functions that move in only one direction.
• This property is called anti-monotone: if a set cannot pass a test, all of its supersets will fail the same test as well.
• The property is monotonic in the sense of failing the test.
The Apriori Algorithm
• Join step: Ck is generated by joining Lk−1 with itself.
• Prune step: any (k−1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
Method
1) L1 = find_frequent_1-itemsets(D);
2) for (k = 2; Lk−1 ≠ ∅; k++) {
3) Ck = apriori_gen(Lk−1, min_sup);
4) for each transaction t ∈ D { // scan D for counts
5) Ct = subset(Ck, t); // get the subsets of t that are candidates
6) for each candidate c ∈ Ct
7) c.count++;
8) }
9) Lk = {c ∈ Ck | c.count ≥ min_sup}
10) }
11) return L = ∪k Lk;
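The pseudocode above translates almost line for line into Python. The sketch below is one possible implementation, with min_sup taken as an absolute support count; the join and prune steps together play the role of apriori_gen (the example transactions are made up):

from itertools import combinations

def apriori(D, min_sup):
    """Apriori over transactions D (a list of item sets);
    min_sup is an absolute support count."""
    # L1: frequent 1-itemsets
    items = {i for t in D for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in D) >= min_sup}
    L, k = [], 2
    while Lk:
        L.append(Lk)
        # Join step: merge pairs of (k-1)-itemsets that differ in one item
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Scan D for candidate counts
        Lk = {c for c in Ck if sum(c <= t for t in D) >= min_sup}
        k += 1
    return L

D = [{"beer", "diapers"}, {"beer", "diapers", "milk"},
     {"beer", "milk"}, {"diapers", "milk"}]
print(apriori(D, min_sup=2))   # frequent 1-itemsets and 2-itemsets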
Rules with repeated predicates, which contain multiple occurrences of some predicate, are called hybrid-dimensional association rules. An example of such a rule is the following, where the predicate buys is repeated:
age(X, "20...29") ∧ buys(X, "laptop") ⇒ buys(X, "HP printer")
Database attributes can be nominal or quantitative. The values of nominal (or categorical) attributes are "names of things." Nominal attributes have a finite number of possible values, with no ordering among the values (e.g., occupation, brand, color).
Quantitative attributes are numeric and have an implicit ordering among values (e.g., age, income,
price). Techniques for mining multidimensional association rules can be categorized into two
basic approaches regarding the treatment of quantitative attributes. In the first approach,
quantitative attributes are discretized using predefined concept hierarchies. This discretization
occurs before mining. For instance, a concept hierarchy for income may be used to replace the
original numeric values of this attribute by interval labels such as “0..20K,” “21K..30K,”
“31K..40K,” and so on.
Here, discretization is static and predetermined. Chapter 3 on data preprocessing gave several
techniques for discretizing numeric attributes. The discretized numeric attributes, with their
interval labels, can then be treated as nominal attributes (where each interval is considered
a category).
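As a concrete example of such static discretization, the sketch below maps numeric income values (in thousands) onto the interval labels quoted above; the exact cut points are an assumption for illustration.

# Sketch: static discretization of income (in thousands) into the
# interval labels quoted above. The cut points are assumed.
def income_label(income_k):
    if income_k <= 20:
        return "0..20K"
    upper = ((income_k - 21) // 10 + 3) * 10   # 30, 40, 50, ...
    return f"{upper - 9}K..{upper}K"

for v in (15, 25, 38):
    print(v, "->", income_label(v))   # 0..20K, 21K..30K, 31K..40K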
Mining Quantitative Association Rules
• Determine the number of partitions for each quantitative attribute.
• Map values/ranges to consecutive integers such that the order is preserved.
• Find the support of each value of the attribute, and combine adjacent values/ranges when their support is less than MaxSup. Then find the frequent itemsets whose support is larger than MinSup (see the sketch after this list).
• Use the frequent itemsets to generate association rules.
• Prune out uninteresting rules.
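A minimal sketch of the partition-and-merge step, assuming equi-width partitions and one simplified reading of the merge rule: a partition whose relative support is below max_sup is folded into its right neighbor, which preserves the value order. Data and thresholds are illustrative.

from collections import Counter

def partition(values, n_parts):
    """Equi-width partitioning; returns a partition index per value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_parts or 1
    return [min(int((v - lo) / width), n_parts - 1) for v in values]

def merge_low_support(indices, max_sup):
    """Fold partitions with relative support < max_sup into the next one."""
    counts = Counter(indices)
    total = len(indices)
    remap, carry, target = {}, 0, 0
    for i in sorted(counts):
        carry += counts[i]
        remap[i] = target
        if carry / total >= max_sup:   # enough support: close this group
            carry, target = 0, target + 1
    return [remap[i] for i in indices]

ages = [22, 25, 31, 33, 34, 35, 36, 44, 47, 60]
print(merge_low_support(partition(ages, 5), max_sup=0.3))
# -> [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]: three sparse partitions merged into one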
Partial Completeness
• R: the set of rules obtained before partitioning
• R′: the set of rules obtained after partitioning
• Partial completeness measures the maximum distance between a rule in R and its closest generalization in R′.
• X̂ is a generalization of itemset X if X̂ has the same attributes as X and the range of each attribute in X̂ contains the range of the corresponding attribute in X.
– Dimension/level constraints
– Rule constraints
• E.g., small sales (price < $10) trigger big sales (sum > $200).
– Interestingness constraints
• sum(LHS) < 100 ∧ min(LHS) > 20 ∧ count(LHS) > 3 ∧ sum(RHS) > 1000
– 1-var: a constraint confining only one side (LHS or RHS) of the rule, e.g., as shown above.
Categories of Constraints
1. Anti-monotone and Monotone Constraints
• A constraint Ca is anti-monotone iff, for any pattern S not satisfying Ca, none of the super-patterns of S can satisfy Ca.
• A constraint Cm is monotone iff, for any pattern S satisfying Cm, every super-pattern of S also satisfies it.
2. Succinct Constraint
• A subset of items Is ⊆ I is a succinct set if it can be expressed as σp(I) for some selection predicate p, where σ is the selection operator.
• SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, …, Ik ⊆ I such that SP can be expressed in terms of the strict power sets of I1, …, Ik using union and minus.
• A constraint Cs is succinct provided SATCs(I) is a succinct power set.
3. Convertible Constraint
• Suppose all items in patterns are listed in a total order R
• A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that
each suffix of S w.r.t. R also satisfies C
• A constraint C is convertible monotone iff a pattern S satisfying the constraint implies that each
pattern of which S is a suffix w.r.t. R also satisfies C
Property of Constraints: Anti-Monotone
• Anti-monotonicity: if a set S violates the constraint, any superset of S violates the constraint.
• Examples:
– sum(S.Price) ≤ v is anti-monotone
– sum(S.Price) ≥ v is not anti-monotone
– sum(S.Price) = v is partly anti-monotone
• Application:
– Push "sum(S.Price) ≤ 1000" deeply into iterative frequent-set computation (see the sketch below).
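A sketch of that push: because sum(S.Price) ≤ v is anti-monotone, a candidate violating it can be discarded before any support counting, and none of its supersets ever need to be generated. Prices and the threshold below are made up.

# Sketch: pushing the anti-monotone constraint sum(S.Price) <= v into
# candidate generation. Prices and v are illustrative; a violating
# candidate is dropped since every superset would violate it too.
price = {"tv": 800, "dvd": 150, "cable": 60, "battery": 10}

def satisfies(itemset, v=1000):
    return sum(price[i] for i in itemset) <= v

candidates = [{"tv", "dvd"}, {"tv", "dvd", "cable"}, {"cable", "battery"}]
pruned = [c for c in candidates if satisfies(c)]
# {"tv", "dvd", "cable"} (sum 1010) is pruned before support counting.
print(pruned)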
Property of Constraints: Succinctness
• Succinctness:
– For any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C.
– Given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1, i.e., it contains a subset belonging to A1.
• Examples:
– sum(S.Price) ≥ v is not succinct
– min(S.Price) ≤ v is succinct
• Optimization:
– If C is succinct, then C is pre-counting prunable: the satisfaction of the constraint alone is not affected by the iterative support counting (see the sketch below).
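A sketch of why min(S.Price) ≤ v is succinct: every satisfying itemset must contain at least one "cheap" item (price ≤ v), so the satisfying sets can be enumerated directly, without generate-and-test, as a nonempty subset of the cheap items unioned with any subset of the rest. Prices and v are illustrative.

# Sketch: direct enumeration under the succinct constraint min(S.Price) <= v.
from itertools import chain, combinations

price = {"tv": 800, "dvd": 150, "cable": 60, "battery": 10}

def nonempty_subsets(items):
    items = list(items)
    return chain.from_iterable(combinations(items, r)
                               for r in range(1, len(items) + 1))

def satisfying(v):
    cheap = [i for i in price if price[i] <= v]   # must contain one of these
    rest = [i for i in price if price[i] > v]
    return [set(c) | set(e)
            for c in nonempty_subsets(cheap)
            for e in chain([()], nonempty_subsets(rest))]

print(len(satisfying(100)))   # 12 = 2**4 - 2**2: all sets with a cheap item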
• Supervised learning (classification)
– New data is classified based on the training set.
• Unsupervised learning (clustering)
– The class labels of the training data are unknown.
– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
PART-A
2. List the ways in which interesting patterns should be mined. (Remember, BTL-1)
3. Are all generated patterns interesting and useful? Give reasons to justify your answer. (Understand, BTL-2)
4. Compare the advantages of the FP-growth algorithm over the Apriori algorithm. (Analyze, BTL-4)
5. How will you apply the FP-growth algorithm in data mining? (Apply, BTL-3)
6. How will you apply pattern mining in multilevel space? (Apply, BTL-3)
PART-B