Patricia Mine
Table of Contents
Algorithm Description
Datasets Description
Steps
Experimental Results
Conclusions
Table of Figures
Patricia Trie Representation Example
List of Tables
Datasets Description
Experimental Results
1. Algorithm Description
Patricia Mine is a data mining algorithm that finds all frequent itemsets. The algorithm
is faster than similar mining algorithms such as FP-Growth and OpportuneProject; the
trade-off made to obtain this performance is the assumption that the database
representation fits entirely in main memory.
The algorithm uses the PATRICIA (Practical Algorithm To Retrieve Information Coded In
Alphanumeric) trie data structure. A Patricia trie is a tree representation of
alphanumeric data optimized for size: any node that is an only child is merged with its
parent.
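The merging rule above can be illustrated with a minimal sketch. It is not the implementation evaluated in this report: the names Node, build_trie, compress and count_nodes are hypothetical, items are simply sorted by their numeric value, and the sketch assumes (a condition not stated above) that an only child is merged only when it has the same transaction count as its parent, so that no count is lost.

```python
class Node:
    """Trie node holding a run of items (hypothetical layout for illustration)."""

    def __init__(self, items=()):
        self.items = list(items)  # items stored in this node
        self.count = 0            # number of transactions passing through
        self.children = {}        # first item of a child -> child node


def build_trie(transactions):
    """Build a plain prefix trie with one item per node."""
    root = Node()
    for transaction in transactions:
        node = root
        for item in sorted(set(transaction)):
            child = node.children.get(item)
            if child is None:
                child = Node([item])
                node.children[item] = child
            child.count += 1
            node = child
    return root


def compress(node):
    """Merge each only child into its parent, bottom-up (root is left as is)."""
    for child in node.children.values():
        compress(child)
    while node.items and len(node.children) == 1:
        (only,) = node.children.values()
        if only.count != node.count:  # assumption: merge only on equal counts
            break
        node.items.extend(only.items)
        node.children = only.children
    return node


def count_nodes(node):
    """Number of nodes in the trie, including the root."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


transactions = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 5]]
root = build_trie(transactions)
nodes_before = count_nodes(root)
compress(root)
nodes_after = count_nodes(root)
```

On this toy input the chains 1-2 and 3-4 each collapse into a single node, so the compressed trie has fewer nodes than the plain one while keeping the same counts.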
more compact and contiguous than the initial Patricia trie, which decreases the traversal and
construction costs.
2. Datasets Description
In order to test and analyse the algorithm on different datasets, we used the Frequent
Itemset Mining Dataset Repository (FIMDR), which holds datasets donated by universities
and academics from around the world.
For testing the Patricia Mine algorithm, we used the following datasets:
Chess
Mushroom
Retail
In these datasets, the items of each transaction are numbers. The databases can be
downloaded in text format (.dat), and the data is represented as a two-dimensional vector,
with one row per transaction.
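As a minimal sketch of how such a file could be read, the code below assumes that each line of the .dat file holds one transaction as whitespace-separated integers. The function names parse_transactions, load_dat and median_width are hypothetical and are not part of the program used in this report.

```python
from statistics import median


def parse_transactions(lines):
    """Turn text lines into a list of transactions (lists of item numbers)."""
    transactions = []
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            transactions.append([int(field) for field in fields])
    return transactions


def load_dat(path):
    """Read a .dat file, assuming one transaction per line."""
    with open(path) as handle:
        return parse_transactions(handle)


def median_width(transactions):
    """Median number of items per transaction."""
    return median(len(transaction) for transaction in transactions)


sample = "1 3 5\n2 3\n\n1 2 3 4\n"
parsed = parse_transactions(sample.splitlines())
width = median_width(parsed)
```

The same median_width helper could be applied to the result of load_dat to obtain a transaction width statistic for a whole dataset.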
The chess and mushroom databases are provided by Roberto Bayardo. The chess
database is compiled from game state information, while the mushroom database contains
hypothetical descriptions of different types of mushrooms.
The retail dataset was donated by Tom Brijs and contains the retail market basket
data from an anonymous Belgian retail store.
In the table below, we present the structure of each dataset in terms of number of
transactions, median transaction width and the database type (sparse or dense).
Dataset     Number of transactions   Median transaction width   Database type
chess       3,196                    37                         dense
mushroom    8,124                    23                         dense
retail      213,972                  31                         sparse
In order to analyse the Patricia Mine algorithm, we created a script that runs the
program on each of the datasets described above, using minimum support (minsup) values of
70, 80 and 90 for chess and mushroom and 10, 40, 70 and 100 for retail. For each run,
we measured the running time, the CPU usage and the memory usage.
Dataset     Min_support (%)   Time (s)   CPU usage   Memory usage
chess       70                1024.58    99%         2580
chess       80                825.80     99%         2580
chess       90                667.01     99%         2580
mushroom    70                0.18       89%         2712
mushroom    80                0.14       99%         2708
mushroom    90                0.14       100%        2700
retail      10                0.76       99%         18636
retail      40                0.41       100%        16608
retail      70                0.34       99%         15064
retail      100               0.30       99%         13740
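A measurement script of the kind described above could be sketched as follows. This is an assumption-laden illustration, not the script used for these results: the functions measure and run_all are hypothetical, the command line `binary <dataset> <minsup>` is assumed, and os.wait4 is available only on Unix-like systems.

```python
import os
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd once and return wall time, CPU time, CPU share and peak memory."""
    start = time.perf_counter()
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL)
    # os.wait4 returns the resource usage of this specific child process.
    _, status, usage = os.wait4(proc.pid, 0)
    wall_s = time.perf_counter() - start
    # Record the exit code ourselves, since the child was reaped by os.wait4.
    if os.WIFEXITED(status):
        proc.returncode = os.WEXITSTATUS(status)
    else:
        proc.returncode = -os.WTERMSIG(status)
    cpu_s = usage.ru_utime + usage.ru_stime
    return {
        "exit_code": proc.returncode,
        "wall_s": wall_s,
        "cpu_s": cpu_s,
        "cpu_pct": 100.0 * cpu_s / wall_s if wall_s > 0 else 0.0,
        # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
        "max_rss": usage.ru_maxrss,
    }


def run_all(binary, datasets, minsups):
    """Hypothetical driver: assumes `binary <dataset> <minsup>` as command line."""
    results = []
    for path in datasets:
        for minsup in minsups:
            results.append((path, minsup, measure([binary, path, str(minsup)])))
    return results


# Sanity check on a trivial child process (the Python interpreter itself).
sanity = measure([sys.executable, "-c", "pass"])
```

With a real executable, a call such as run_all("./patricia_mine", ["chess.dat", "mushroom.dat"], [70, 80, 90]) would produce one measurement per dataset and minsup value; the executable name and file names here are placeholders.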
5. Conclusions
In conclusion, Patricia Mine is an algorithm for finding frequent itemsets that
uses a Patricia trie to represent the database, which makes the mining process efficient in
terms of memory and performance.
The experimental results show that the performance of the Patricia Mine algorithm
varies with the minimum support value: the running time decreases as the minimum
support increases. How linear this decrease is depends on factors such as the
transaction width, the type and size of the database, and the TLB misses caused by the
design of the algorithm. Memory usage also decreases as the minimum support increases,
because many nodes of the trie are left out of the initial graph when their support does
not reach the threshold.
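The effect of the minimum support on the size of the trie can be illustrated with a small sketch. It assumes that items below the threshold are dropped before the trie is built and that the remaining items are ordered by their numeric value; both choices, and the names frequent_items and trie_node_count, are hypothetical simplifications rather than the method used in this report.

```python
from collections import Counter


def frequent_items(transactions, minsup_pct):
    """Items contained in at least minsup_pct percent of the transactions."""
    counts = Counter(item for t in transactions for item in set(t))
    threshold = minsup_pct / 100 * len(transactions)
    return {item for item, count in counts.items() if count >= threshold}


def trie_node_count(transactions, minsup_pct):
    """Number of non-root nodes of a plain prefix trie over the frequent items."""
    keep = frequent_items(transactions, minsup_pct)
    prefixes = set()  # each distinct prefix corresponds to one trie node
    for t in transactions:
        items = sorted(item for item in set(t) if item in keep)
        for length in range(1, len(items) + 1):
            prefixes.add(tuple(items[:length]))
    return len(prefixes)


transactions = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1]]
nodes_at_50 = trie_node_count(transactions, 50)
nodes_at_70 = trie_node_count(transactions, 70)
```

On this toy input, raising the threshold from 50% to 70% removes items 2 and 3 and leaves a trie with a single node.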
6. References
Frequent Itemset Mining Dataset Repository, http://fimi.ua.ac.be/data/, Accessed on May
2016
Frequent Itemset Mining Implementations Repository, http://fimi.ua.ac.be/src/, Accessed on
May 2016
Pietracaprina, Andrea. "Mining frequent itemsets using patricia tries." (2003).