Algorithms of association rules extraction: State of the Art
AMDOUNI Hamida
GAMMOUDI Mohamed Mohsen
PhD Student,
Member of RIADI Laboratory.
FST, Tunisia.
hamdouni.ecri@gmail.com
Associate Professor,
Member of RIADI Laboratory,
ESSAI of Tunis, Tunisia
mohamed.gammoudi@fst.rnu.tn
Abstract—More than a decade, the task of generating
associative rules has received considerable attention by
researchers because the great need of enterprise deciders to be
assisted by systems taking into account unknown knowledge
extracted from a huge volume of data. In this paper, we
present a survey of the most known algorithms used for
associative rules extraction. We give a comparative study
between them and we show that they could be classified into
some categories.
Keywords-Itemsets; Closed Itemsets;Frequent Itemsets;FCA;
Associatives rules;
I.
INTRODUCTION
The extraction of association rules is one of the
most important techniques of data mining [1]. It consists of
extracting unknown knowledge (pattern) from a large
volume of data in order to help decision makers to take
efficient and profitable decisions.
Early approaches for extracting association rules are based
on the generating the frequent Itemsets [2, 16, 8]. But,
following the considerable computing time of their
extraction, redundancy and irrelevance of rules generated, a
new approach is introduced [12]. It consists of extracting a
subset of generic non-redundant rules and without loss of
information [3] based on the mathematical foundations of
Formal Concept Analysis (FCA) [7].
This approach is based on extracting a subset of itemsets
called closed itemsets, minimal generators and the relation
between frequent itemsets. To our knowledge only
algorithm Prince [9] focuses on presenting the minimal
generators according to the partial order relation. Not taking
this relationship into account involves the generation of a
large number of rules as in the case: A-Close [12], Closet
[13] and Charm [17].
The main goal of this paper is to present a state of the art
of the most known algorithms in the literature that allow the
generation of association rules. Before presenting these
algorithms, we recall the basic concepts necessary for their
understanding.
II.
BASIC NOTIONS
A. Extraction context of an associative rule
Let k = (O, I, R) be a triplet with O and I are
respectively sets of objects (eg. transactions), sets of items
and R O x I is a binary relation between objects and
items.
B. Table of co-occurrences
It indicates for each pair of items the number of cooccurrences in the set of objects.
C. Itemset
It is a nonempty subset of items. An itemset consisting
of k elements is called k-itemset.
D. Support of an itemset
The frequency of simultaneous occurrence of an itemset
(I’) in the set of objects called Supp(I’).
E. Frequent itemset (FI)
FI is a set of items whose support ≥ a user-specified
threshold called minsup. All its subsets are frequent. The set
of all frequent itemsets called SFI.
F. Max Itemset
An itemset which all its subsets are nonfrequents.
G. Associative rule
Any association rule having the following form: A B,
where A and B are disjoint itemsets with A is its premise
(condition) and B is its conclusion.
H. Confidence
The confidence of an association rule A B measures
how often items in B appear in objects that contain A.
Confidence(R) = Supp(A,B)/Supp(A)
(1)
Supp(A,B) : the number of objects that the itemset A
and the itemset B share.
Supp(A) : le number of objects that contain A.
Based on the degree of confidence, association rules can
be classified as follows:
Exact rule: rule which confidence = 1
Approximative rule: rule which confidence < 1
Valid rule: rule which confidence ≥ a user-specified
threshold called minconf
After going through some basic notions, we present
different algorithm of associative rules extraction by a
chronological order while showing their advantages and
disadvantages.
III.
ALGORITHM OF ASSOCIATIVE RULES EXTRACTION
In the literature, we can find two categories of
algorithms, those which extract the associative rules from
frequent Itemsets and those which are based on the Formal
Concept Analysis (FCA) to generate a sub-set of frequent
Itemsets called frequent closed Itemsets.
A. Algorithm based on extraction of frequent itemsets
1) Apriori [2]
Apriori is one of the first algorithms of associative rules
extraction. In what follows, we are going to present it
principle, its advantages and disadvantages.
In fact, its principle consists of two steps, the first one is to
find the set of frequent itemsets (SIF) out of the initial
extraction context and the second one uses this set to
determine the valid associative rules.
a) Frequent itemsets search
To find the set of frequent itemsets (SIF), it proceeds in
repetitive way, as follows:
Find the set of IF size one called L1 and out of this
one generate another set called C2 which includes the
IF candidate size two. Any element of C2 having a
support ≥ minsup becomes part of L2.
In general, any element of Ck+1 is the union of two IF
Lk having k-1 as a commun element. A generated
IF size k+1 is deleted of Ck+1 if at least one of its subset of size k Lk.
Its last process is repeated until Lk is empty. The
result of this phase is the union of the different
determined Lk
b) Associative rules extraction
To extract the associative rules out of the SIF found, it
proceeds iteratively treating each Lk.
In fact, a IF Lk, the rules generated having the following
form: IF – C C (C: set of conclusive items). Any rule
having the confidence: Supp(IF)/Supp(C) minconf is
maintained.
As a summary, the result generated by Apriori is clear
and easy to interpret. But, we can find many problems. First
one is the theoretical complexity O(mn2m) [2] when n =
|O| and m = |I|. The second one is the high number of
database accesses in order to extract the SIF, to count the
different supports and to generate the rules which generally
are redundant and little useful. In fact, to evaluate these
rules, the user’s intervention is required nevertheless it is
expensive.
In order to decrease the treatment time in the process of
frequent itemsets extraction, [2] present a new version of
Apriori, called AprioriTid. It based on the count of the
candidates support indexing the different transactions with
its ID called TID.
We will present, in the following part, its general principle
as long as its advantages and its disadvantages.
2) AprioriTid [2]
It uses the same principle of Apriori to generate the
candidates but it make a different count of their supports.
Firstly, it generates a set of candidates called C1
representing the database. This set includes elements (TID,
{c1}) which {c1} is the list of itemsets of size one in the
transaction TID. When k>1, Ck is made using Ck-1. If one
element in Ck having an empty list of k-itemsets in the TID,
it is deleted. The support of the itemsets in ck is equal to the
number of occurrences of each one in the Ck.
In fact, in the first iterations, the set of candidates’
itemsets can be huge which causes storage problem. In
addition, when k increases, the number of element in Ck
become smaller than the transaction’s.
3) Partition [14]
The objective of this algorithm is to reduce the number
of accesses to the initial context in two ways. It divides in p
partition D1, …, Dp. Each IF partition will be determined
during the first database access. the SFI of the extraction
context is the union of the different IF partitions. In the
second one, it count the support of any element SFI.
As a summary, Partition does only two accesses to the
database, it does not process the candidates Itemsets and the
support is calculated using the TID intersection unlike the
Apriori method.
4) Dic [5]
Dic divides the database in M transactions masses. After
going through one mass to determine the k-Itemsets
candidates and counts their supports, it generate the (k+1)Itemsets out of the frequent k-Itemsets and its start to
determine their supports.
While studying Dic, we observe that the number of
databases accesses is less than Apriori as it processes the
candidates having different sizes simultaneously. But, the
storage of itemsets presents a problem and the cost of
supports count are more important than the necessary run
time for Apriori.
5) Eclat [16]
This algorithm uses the vertical format of the context
extraction. Indeed, each item is associated with a Tidset (set
of all transactions containing this item).
It searches the set of frequent itemsets (SFI) of size one
and two using the horizontal format of the database and then
uses the transversal depth which is based on the concept of
equivalence classes (two k-itemsets belong to the same class
if they have k-1 prefixes common. Each class will be treated
separately).
In summary, the vertical format adopted by Eclat
reduces the calculation f the support of an itemset since it’s
a complete intersection of Tidsets. This allows
automatically reduce the size of the database as transactions
involving only an itemset are used for the intersection.
In addition, this method can be parallelized, since each class
can be treated separately to determine the frequent itemsets.
The problem is that this method is effective for small
databases. However, the representation is not possible in the
case of large databases.
6) FP-Growth [8]
To avoid the repetitive context accesses, [8] suggest an
algorithm called FP-Growth (Frequent Pattern Growth)
allowing the extraction of the SFI without generating
candidates.
It consists of compressing the database in a compact
stucture called FP-tree based on the notion of Trie [10]
which is made of:
A tree with no root, intermediate nodes containing
three pieces of information: the corresponding item,
its frequency and a pointer to the next node in the
tree.
A list called index containing the list of frequent
items. Each item is associated with a pointer to the
first node of the tree containing this item.
The construction of this structure requires two accesses
to the extraction context:
The first to determine the frequent items and save
them in the index list sorted in descending order of
supports.
The second allows building the FP-tree knowing
that the items in a transaction are stored according to
their order in index
A root is created, from what we model branch for
various transactions.
A transaction will be presented with a list of nodes
containing the item, its frequency and a pointer to
the next node (transaction with the same prefix will
be presented by the same branch, same transaction
will be submitted one)
After packing the database, it will be divided into sub
projections called conditional basis. Each of these bases is
associated with a frequent item. The extraction of the
frequent itemsets is done on each of the projections.
Using FP-Growth, the frequent items are sorted in a
decreasing order, implying that the most frequent items are
near the root and are better shared by transaction.
FP-tree is thus a compact representation and interesting. In
addition, if the prefixes are shared by the transaction, the
suffixes are not, that is why the number of nodes in FP-tree
is reducing. But it should be noted that there is a storage
problem if the FP-tree size becomes important [6].
7) Conclusion
After presenting some algorithms for generating
association rules based on extraction of frequent itemsets,
we can conclude that using this method presents some
advantages such as easier understandings of the process of
calculation adopted, clarity and ease of application of result.
But it is also clear that it does not provide satisfactory
results in the case of large volumes of data. Moreover, it is
very difficult to determine the number of items in each rule.
B. Algorithm of associative rules extraction using FCA
approch
This strategy has two steps, the first one consists of
extracting the frequent closed itemsets based on the
mathematical foundations of the formal concept analysis
(FCA) [7] and the second step consists of developing a
generic base including the informative rules in order to
provide a useful and non redundant rules.
In this part, we define some of the basic notions and we
present several algorithms which FCA approach.
1) Basic notions
a) Galois connection: In A formal context K is a triplet
K = (O, I, R), For every set of objects A ⊆ O, the set f(A) of
attributes in relation R with the objects of A is as follow:
f(A) = { i ∈ I | oRi ∀ o ∈ A}
(2)
Dually, for every set of attributes B ⊆ I, the set g(B) of
objects in relation R with the attributes of B is as follow:
g(B) = { o ∈ O | oRi ∀ i ∈ B}
(3)
The two functions f and g defined between objects and
attributes form a Galois connection.
The operators f ° g(B) and g ° f(A) called φ are the closure
operators.
φ verifies the following properties X, Y I (resp. X1,
Y1 O):
Idempotent : φ2(X) = φ(X),
(4)
extensive : X ⊆ φ(X),
(5)
monotone : φ(X) φ(Y).
(6)
b) Frequent Closed Itemset (FCI): An Itemset I’ is
called closed if I’ = φ(I’). In other words, an itemset I’ is
closed if the intersect of the objects to which I’ belongs is
equal to I’ and it is frequent if its support ≥ minsup. SFCI is
the Set of Frequent Closed Itemset.
c) Minimal Generator: An Itemset c ⊆ I is a closed
Itemset generator I’ ⇔ φ(c) = I’. c is a minimal frequent
generator if its support is ≥ minsup.
The set of frequent minimal generators of I’ called GMF)’’
GMF)’ = {c ⊆ I | φ(c) = I’ ∄ c1 ⊂ c as φ(c) = I’}
(7)
d) Negative border (GBd-): the set on no-frequents
minimal generator.
e) Positive border (GBd+) : Let GMFk is the set of all
minimal frequent generators:
GBd+= {c|c ≥ minsup c GMFk c’ c, c’ GMFk} (8)
f) Equivalent classes: The closure operator φ divides
the set of frequents Itemsets into disjoint equivalent classes
including elements having the same support. The largest
element in a given class is an FCI called I’ and smaller ones
are the GMFI’ [12].
g) Comparable equivalent classes: The classes Ci et Cj
are only said comparable if FCI of Ci covers that of Cj.
The five following notions are defined in [7]:
h) Formal concept: a formal concept is a maximal
objects-attributes subset where objects and attributes are in
relation. More formally, it is a pair (A, B) with A O and B
I, which verifies f(A) = B and g(B) = A. A is the extent
of the concept and B is its intent.
i) Partial order relation between concepts≤: The
partial order relation called ≤ is defined as follow: for two
formal concepts (A1, B1) and (A2, B2): (A1, B1) ≤ (A2, B2)
A2 A1 and B1 B2.
j) Meet / Join : for each concepts (A1, B1) and (A2, B2),
it exist a greatest lower bound (resp. a least upper bound)
called Meet (resp. Join) denoted as ((A1, B1) (A2, B2)
(resp. (A1, B1) (A2, B2)) defined by:
(A1, B1) (A2, B2) = (g(B1 ∩ B2), (B1 ∩ B2))
(9)
(A1, B1) (A2, B2) = ((A1 ∩ A2), f(A1 ∩ A2))
(10)
k) Galois lattice: The Galois lattice associated to a
formal context K is a graph composed of a set of formal
concepts equipped with the partial order relation ≤. This
graph is a representation of all the possible maximal
correspondences between a subset of objects O and a subset
of attributes I.
l) Frequent minimal generators lattice: A partial
ordered structure of which each equivalent class includes
the appropriate frequent minimal generators [4].
m) Iceberg lattice: A partial ordered structure of
frequent closed Itemsets having only the join operator. It is
considered a superior semi-lattice [15].
n) Generic base of exact associative rules: It is a base
composed of non-redundant generic rules having a
confidence ratio equal to 1 and called GB [3]. Given a
context (O, I, R), the set of frequent closed itemsets (SFCI)
and the set of minimal generators GMFk :
GB = {R: g (c - g)|c (SFCI) g (GMFk), g c} (11)
o) Informative base of approximative associative rules:
it is called IB and defined as follows:
IB = {R: XY, Y SFCI, ƒ(X)≤Y, confidence(R) ≥
minconf, Supp(Y)≥ minsup}
(12)
2) Extracting frequent closed itemsets: the algorithms
After presenting some basic notions, we introduce a set
of algorthms designed to extract frequent closed itemsets.
a) Close [12]: this algorithm iterate the search space to
extract minimal generators subsequently used to extract
frequent closed itemsets associated.
Each iteration involves two steps. The first is to do a
self-join between the minimal generators found in the
previous iteration to form a noted MGCk (k-Minimal
Generators Candidates), each element consists of triplet
(minimal generator, closed itemsets candidate: CIC, support
of CIC). The second step is to prune the MGCk eliminating
the triplet whose support of closed itemset > minsup. The
valid rules are generated using the frequent itemsets derived
from the frequent closed itemsets selected.
It should be noted that this step is costly and can generate
many redundant and irrelevant rules.
b) A-Close[12]: to remedy the problem addressed in
the algorithm Close. [12] have proposed a new algorithm
called A-Close. It is to generate frequent closed itemsets
from the associated minimal generators.
c) Titanic [15]: the algorithm Titanic makes a breadth
at each level to extract the frequent minimal generators and
then deduct the frequent closed itemsets. To determine the
k-generators, a self-join of (k-1)-generators is carried out
and if one of subsets of a given generator is not frequent,
then it will be deleted.
In addition, to reduce the computing time of the support
of k-generators, estimated support is associated with each of
them and it is equal to the minimum value of the supports of
two (k-1)-generators that form. For the case of the empty
set, its estimated supporter is equal to the cardinality of
context extraction. Every k-generator with a lower support
to the minsup or equal to real support will be removed from
the set. It is noteworthy that in the worst case, Titanic
performs an average number of accesses to the extraction
context equal to the maximum size of lists of candidates’
generators.
d) Closet [13]: this algorithm has two stages, the first is
to use the FP-tree structure to present the search space as a
tree by eliminating non-frequent itemsets. The second step
is to use this structure to extract frequent closed itemsets by
dividing the tree into subtrees, called sub-conditional
contexts in order to make an in-depth exploitation of search
space. This implies that this algorithm requires two accesses
to the context extraction, the first is to extract the list of
frequent 1-itemsets and the second is to build the FP-tree.
Sub-contexts of frequent 1-itemsets will be processed
fist in order of increasing support. Each contains only items
that collocate with 1-itemset in questin, called I’, and have a
support above minsup. The frequent closed itemsets
corresponding concatenation of I’ with all the items having
the same support. The constructing process of the subcontexts continues recursively knowing that an item treaty
will be excluded since all the frequent closed itemset
containing it are already generated. In addition, to reduce
extraction, a sub-context of an itemset will be build only if it
is not covered by any frequent closed itemsets generated.
This algorithm has some disadvantages, the first one is
the processing cost of sorting the initial context with a
decreasing value of the support. The second is to store a
large number of sub-contexts in division recursive initial
context. In addition, the fact of whether a given itemset is
included or not in one part of the list of frequent closed
itemsets found, requires the maintenance of this list in
memory throughout the treatment.
It should be noted that this algorithm does not manage a
list of candidates, but in the case of low support and a
context scattered, the number of itemsets included frequent
closed itemsets generated is very small, which leads to both
under construction sub-contexts that itemset.
e) Charm [17]: in order to reduce the number of
accesses to the extraction context as well as candidates
generated for the extraction of frequent closed itemsets,
Charm using both the set of closed itemsets and identifiers
of transactions to which they belong (Tidset). It uses,
therefore, a structure called the IT-tree (Itemset-Tidset Tree)
where each node represents a pair of the form (frequent
closed itemset candidate, Tidset).
First, the 1-itemsets are added to the structure by
decreasing support. Subsequently, a course will be made
from the root from left to right and depth, to determine the
frequent closed itemsets.
For every two closed itemset candidates I1 and I2 have
the same parents, are four cases to check:
If Tidset(I1) = Tidset(I2) then φ(I1) = φ(I2) = φ(I1I2)
all occurrences of I1 replaced by (I1I2) and I2 will
be removed from the tree since its closure is the
same as (I1I2).
If Tidset(I1) Tidset(I2) then φ(I1) φ(I2) and φ(I1)
= φ(I1I2), all occurrences of I1 are replaced by
(I1I2) but not I2 will be removed from the tree
because it can be a generator of another frequent
closed itemsets.
If Tidset(I1) Tidset(I2) then φ(I1) φ(I2) and φ(I2)
= φ(I1I2), I2 iis removed from the tree and a node of
the form ((I1I2), (Tidset(I1)Tidset(I2))) is added to
the list of descendants (I1).
If Tidset(I1) ≠ Tidset(I2) then φ(I1) φ(I2) φ(I1I2),
the occurrences of I1 and I2 does not change and a
node of the form ((I1I2), (Tidset(I1)Tidset(I2))) is
added to the list of descendants (I1) if (I1I2) is
frequent, otherwise the tree remains the same.
It should be noted that Charm performs a single access to
the context extraction to determine the lists of transactions 1itemsets. In addition, to reduce memory usage, it performs
incremental storage of Tidsets using a data representation
called Diffset as a Tidset of the candidate is the difference
between his intention and that of his immediate parent. In
addition, this reduces the processing time, since the
determination of the intersection of two candidates I1 and I2
is nothing else than the result of the difference between
Tidset(I1) and that of its parent P, one hand, and that between
Tidset(I2) and that of its parent P, on the other. But, despite
this, Charm is considered among the algorithms that
consume lots of memory.
f) Prince [9]: this algorithme can extract the minimal
generators and build a structure partially ordered called
Frequent minimal generators lattice in order to perform a
vertical scan (bottom to top) to find the frequent closed
itemsets and then extract the informative association rules
are fewer non-redundant rules and without loss of
information.
First, Prince extracts minimal generators of the initial
context ( is the first generator to extract, its support is
equal to the cardinality of the initial context) . It determines,
all k-generator candidates and by doing this, at each level k,
a self-join of (k-1)-generators. Subsequently, it eliminates
any candidate of size k if at least one of its subsets is not a
minimal generator or if its support is equal to one of them.
GMFk is the union of all sets of frequent minimal generators
determined in each level, while no-frequents form the GBd-.
Second, GMFk and GBd- be used to form a minimal
generators lattice and this by comparing each minimal
generator g to the list L of immediate successors of its
subsets of size k-1. If L is empty then g is added to this list,
otherwise, four cases are possible for each g1 L knowing
that Cg and Cg1 are the equivalence classes of g and g1:
If (gg1) is a minimal generator then Cg and Cg1 are
no-comparables.
If Supp(g) = Supp(g1) = Supp(gg1) then g and g1
a same class.
If Supp(g) < Supp(g1) = Supp(gg1) then g become
the successor of g1.
If Supp(g) < Supp(g1) Supp(gg1) then Cg and Cg1
are no-comparables.
If (gg1) is not a minimal generator, the calculation of its
support is performed by applying this proposal [15]:
Let GMk = GMFk GBd- (set of all generators), an itemset I’
is no-generator if:
Supp(I’) = min{Supp(gi)| gi GMk gi I’}
(13)
The research process of Supp(gg1) stops whene one of its
subsets has a strictly lower support than that of g and g1
because this implies that Cg and Cg1 are no-comparables.
After constructing the minimal generators lattice, Prince
determines for each equivalence class, starting from C to
the top, the frequent closed itemset and built the Iceberg
lattice by applying this proposal:
Let I1 and I2 are two frequent closed itemsets such as I1
covers I2 by the partial order relation and GMFI1 is the set of
the frequent minimal generator of I1:
I1 = {g | g GMFI1} I2
(14)
The two lattices are used to extract exact and
approximative rules. Note that the rules with confidence = 1
are exact and an implications extracted from each node
(intra-node). Whereas, the approximative rules have
confidence ≥ minconf are implications involving two
comparable equivalence classes. These rules are implications
between nodes.
As proved in [9] all generated rules are no redundant and
guarantee that there is no loss of information. But it should
be noted that the complexity of the first step (the extraction
of frequent minimal generators) is exponential, which
implies an overall processing time high in the case of
scattered contexts.
g) GrGrowth [11]: this algorithm is developed in order
to mine frequent generators and possitive border.
It uses the compact data structure FP-tree and adopts
the pattern growth approach. It constructs a conditional
database for each frequent generator.
The algorithm uses the depth-first-search strategy to explore
the search space, which is much more efficient than the
breadth-first-search strategy adopted by most of the existing
generator mining algorithms.
GrGrowth prunes a positive border during the mining
process to save mining cost. In fact, generator based
representations rely on a negative border to make the
representation lossless. However, the number of itemsets on
a negative border sometimes exceeds the total number of
frequent itemsets and positive border is usually smaller than
it. In addition, a set of frequent generators plus its positive
border is always no larger than the corresponding complete
set of frequent itemsets.
3) Conclusion
This approach has two advantages. Firstly it reduces the
run time and storage space by contributing to the first
approach because the number of frequent closed itemsets
and the processing time is smaller than that of extraction of
all frequent itemsets. Secondly, the reduced number of
association rules extracted without loss of information.
IV.
V.
In this paper we presented a state of the art of the most
knowing algorithms of extraction association rules. We
observe that algorithms based on the set of Frequent
Itemsets generation could be used easily for huge database.
However, they present some limits such as: redundancy and
less useful associative rules. The second kind of algorithms
based on FCA approach could reduce the number of
associative rules with the advantage of their relevance. But,
the cost in memory need and the runtime increases
especially when the formal context is sparse.
REFERENCES
[1]
[2]
[3]
[4]
COMPARATIVE STUDY
In this section, we have identified four characteristics in
order to classify the algorithms already presented. The first
characteristic is the computational complexity. The second
is the kind of itemsets extraction (FI or FCI). Third is the
strategy of association rules extraction from initial context,
knowing that there are three types:
Test-and-build: is to browse the database level,
generating a set of candidates associated with each
level and apply some metrics to reduce the result set
called pruning.
Divide-and-Build: is to divide the database into
subsets and apply to each subset the process of
extracting itemsets in order to reduce the number of
candidates. As already mentioned in the first
category, a pruning step is performed.
Hybrid: is to deal in depth the database but without
division. In addition, to reduce the set of closed
itemsets result, a statistical metric and heuristics are
used.
The latter feature is called ―Data structures‖ which includes
two sub classes: the structure used to represent the initial
context, such as Tidsets, FP-tree and IT-tree, and the one
used for storing the itemsets results such as hash-table,
hash-tree, Trie and sparse matrix. (See Table I in the last
page).
CONCLUSION
[5]
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
R. Agrawal, T. Imielinski and A. N. Swami, Mining association
rules between sets of items in large databases , In Proceedings of
the International Conference on Management of Data, ACM
S)GMOD’9 , Washington, D.C., USA, page 0 -216, May 1993.
R. Agrawal and R.Srikant, Fast algorithms for mining association
rules , In J. B. Bocca, M. Jarke and C. Zaniolo, editors, Proceedings
of the 20th International Conference on Very Large Databases,
Santiago, Chile, p.p. 478-499, June 1994.
Y. Bastide, N. Pasquier, R. Taouil, L. Lakhal and G . Stumme,
Mining minimal non-redundant association rules using frequent
closed itemsets , Proceedings of the )ntl. Conference DOOD’ 000,
LNCS, Springer-verlag, July 2000, p. 972-986.
S. Ben yahia, C. Latiri, G.W. Mineau and A. Jaoua, Découverte des
règles associatives non redondantes – application aux corpus
textuels , In M.S. Hacid, Y. Kodrattof and D. Boulanger, editors
EGC, volume 17 of Revue des Sciences Technologies de
l’)nformation – série RIA ECA, pages 131-144. Hermes Sciences
Publications, 2003.
S. Brin, R. Motwani, J.D. Ullman and S. Tsur, Dynamic itemset
counting and implication rules for market basket data , In :
Proceedings ACM SIGMOD International Conference on
Management of Data, Tucson, Arizona, USA, éd. par Peckham
(Joan). pp. 255-264 - ACM Press, 1997.
W. Cheung, W. Heung and O. Zaiane, “ Incremental Mining of
Frequent Patterns Without Candidate Generation or Support
Constraint‖, Proceedings of the Seventh International Database
Engineering and Applications Symposium (IDEAS 2003), Hong
Kong, China, July 2003.
B. Ganter and R. Wille, Formal Concept Analysis , Mathematical
Foundations, Springer, 1999.
J. Han, J. Pei and Y. Yin, Mining frequent patterns without
candidate generation , CM-SIGMOD Int. Conf. on Management of
Data, pp. 1-12, Mai 2000.
T. Hamrouni, S. Ben Yahia and Y. Slimani, Prince: An algorithm
for generating rule bases without closure computations , In 7th
International Conference on Data Warehousing and Knowledge
discovery DaWaK’0 , pages
-355, Copenhagen, Denmark,
2005. Springer-Verlag, LNCS.
R. L. Kruse and A. J. Ryba, Data structures and program design in
c++ , Prentice Hall, 1999.
G. Liu, J. Li and L. Wong, A new concise representation of
frequent Itemsets using generators and a positive border ,
Knowledge and Information Systems, 17(1) : 35-56, 2008.
N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, Efficient Mining of
Association Rules Using Closed )temset Lattices , Information
Systems Journal, vol. 24, no 1, 1999, p. 25-46.
J. Pei, J. Han, R. Mao, S. Nishio, S. Tang and D. Yang, CLOSET : An
efficient algorithm for mining frequent closed Itemsets ,
Proceedings of the ACM S)GMOD DMKD’00, Dallas, TX, 00 , p.
21-30.
[14] A. Savasere, E. Omiecinsky et S. Navathe, An efficient algorithm
for mining association rules in large databases , 21st Int'l Conf.
on Very Large Databases (VLDB), Septembre 1995.
[15] G. Stumme, R. Taouil, Y. Basride, N. Pasquier and L. Lakhal,
Computing Iceberg Concept Lattices with TITANIC , J. on
Knowledge and Data Engineering (KDE), vol. 2, no 42, 2002, p.
189-222.
[16] M. Zaki, S. Parthasarathy, M. Ogihara and W. Li, New algorithms
for fast discovery of association rules , In : 3rd Intl. Conf. on
TABLE I.
Algorithm
Complexity
Apriori
O(mn2m)
Knowledge Discovery and Data Mining, éd. par Heckerman (D.),
Mannila (H.), Pregibon (D.), Uthurusamy (R.) et Park (M.). pp.
283-296. AAAI Press, 1997.
[17] M. Zaki and C. J. Hsiao, CHARM : An Efficient Algorithm for Closed
Itemset Mining , Proceedings of the 2nd SIAM International
Conference on Data Mining, Arlington, April 2002, p. 34-43.
COMPARATIVE STUDY OF EXTRACTION ASSOCIATIVE RULES ALGORITHM
Itemsets
Extracts
Strategy of association rules extraction
from initial context
Test-andGenerat
Divide-andGenerate
Data Structure
Hybrid
Database
structure
Structure of Itemsets
storage
FI
X
-
hash-tree
m
FI
X
Tidsets
hash-tree
Partition
O(2 P)*
FI
X
-
hash-tree
Dic
O(2mM)**
FI
X
-
Trie
FI
X
Tidsets
Sparse Matrix
X
AprioriTid
O(mn2 )
m
2
Elat
2
O(n m )
3
FP_growth
O(mn + m )
Close
O(2mn2m)
A-Close
FP-tree
hash table
FCI
X
-
Trie
m 2
FCI
X
-
Trie
m 2
X
-
Trie
FP-tree
Trie
IT-Tree
Trie
-
Trie
FP-tree
hash table
O(2 n m)
FI
Titanic
O(2 n m)
FCI
Closet
O(mn + m4)
FCI
m
2
Charm
O(2 +n )
FCI
Prince
m
O(2 )
FCI
Gr-Growth
O(mn+m4)
FCI
X
X
X
* P: number of partition. **M: number of transactions
X