Patricia Mine

Mining Frequent Itemsets using Patricia Tries
Florescu Andreea
SSA 2016

Table of Contents
Algorithm Description
Datasets Description
Steps
Experimental Results
Conclusions

Table of Figures
Patricia Trie Representation Example
Dataset Represented as a Patricia Trie
Chess Dataset Experimental Results
Mushroom Database Experimental Results
Retail Database Experimental Results

List of Tables
Datasets Description
Experimental Results

1. Algorithm Description
Patricia Mine is a data mining algorithm that finds all frequent itemsets. The algorithm
is faster than similar mining algorithms such as FP-growth and OpportuneProject, but the
trade-off made to obtain this performance is the assumption that the database
representation fits entirely in main memory.
The algorithm uses the PATRICIA (Practical Algorithm To Retrieve Information Coded In
Alphanumeric) trie data structure. A Patricia trie is a tree representation of
alphanumeric data optimized for size: any node that is an only child is merged with its
parent.
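
As an illustration of the merging rule, the following minimal Python sketch builds an ordinary character trie and then collapses every chain of only children into a single edge. It is not the data structure used by the Patricia Mine implementation, which stores items and counts rather than characters. The function names, the END marker and the sample words are assumptions made for this example. A node that ends a key is never merged away, so no stored key is lost.

```python
END = "$"  # hypothetical end-of-key marker, assumed not to occur in the keys


def build_trie(words):
    """Build a plain (uncompressed) trie as nested dicts, one character per edge."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}  # mark that a key ends at this node
    return root


def compress(node):
    """Merge every only child into its parent edge (path compression)."""
    out = {}
    for label, child in node.items():
        # Follow the chain while the child has exactly one outgoing edge
        # and no key ends at it.
        while label != END and len(child) == 1 and END not in child:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        out[label] = compress(child)
    return out


trie = compress(build_trie(["romane", "romanus", "rubens"]))
print(trie)
```

In the compressed result the shared prefix "r" has two children, "oman" and "ubens", instead of one node per character, which is where the size reduction comes from.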

Fig. 1 - Patricia Trie Representation Example


In order to determine the frequent itemsets, the algorithm performs an iterative depth-first traversal of the nodes in the Frequent ItemSet Tree (FIST). It visits the nodes in
decreasing support order. That way, the algorithm starts with the most frequent items and
finishes with the least frequent ones whose support still satisfies the threshold.
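
The traversal order can be illustrated with a deliberately naive Python sketch. It enumerates the frequent itemsets depth-first with an explicit stack and extends each itemset with the most frequent items first. This is only an assumption-laden illustration of the visiting order: it rescans transaction lists instead of using a Patricia trie, so it does not reproduce the performance or the internal structures of Patricia Mine, and the function name and sample data are hypothetical.

```python
from collections import Counter


def mine_frequent_itemsets(transactions, min_count):
    """Return {itemset: support} in the order the itemsets are visited."""
    tx = [frozenset(t) for t in transactions]
    support = Counter(item for t in tx for item in t)
    # Frequent single items, most frequent first (ties broken by item value).
    items = sorted((i for i in support if support[i] >= min_count),
                   key=lambda i: (-support[i], i))

    result = {}
    # Each stack entry: (itemset, ids of supporting transactions, next item position).
    stack = [((), list(range(len(tx))), 0)]
    while stack:
        itemset, tids, start = stack.pop()
        if itemset:
            result[itemset] = len(tids)
        children = []
        for pos in range(start, len(items)):
            item = items[pos]
            sub = [t for t in tids if item in tx[t]]
            if len(sub) >= min_count:  # prune infrequent extensions
                children.append((itemset + (item,), sub, pos + 1))
        # Push in reverse so the most frequent extension is popped first.
        stack.extend(reversed(children))
    return result


found = mine_frequent_itemsets([[1, 2, 3], [1, 2], [1, 4], [2, 3]], min_count=2)
for itemset, count in found.items():
    print(itemset, count)
```

Using an explicit stack instead of recursion keeps the traversal iterative, as in the description above.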

Fig. 2 - Dataset Represented as a Patricia Trie


Besides performance, another benefit of the Patricia Mine algorithm is its low
memory consumption, which comes from the compressed trie representation. The compressed trie also
allows the algorithm to be efficient for both sparse and dense databases.
This algorithm is commonly used in market basket analysis, in understanding game
data, and in fraud prevention and detection.
An improved version of this algorithm, called Patricia*, offers even better performance
than the original and also outperforms FP-growth* and dEclat. Its trie representation is
more compact and contiguous than the initial Patricia trie, which decreases the traversal and
construction costs.

2. Datasets Description

In order to test and analyse the algorithm on different datasets, we used the Frequent
Itemset Mining Dataset Repository (FIMDR), which holds datasets
donated by universities and academics from around the world.
For testing the Patricia Mine algorithm, we used the following datasets:
Chess
Mushroom
Retail
In these datasets, the items of each transaction are encoded as numbers. The databases can be
downloaded in text format (.dat) and the data is represented as a two-dimensional array.
The chess and mushroom databases were provided by Roberto Bayardo. The chess
database is compiled from game state information, while the mushroom database contains
hypothetical descriptions of different types of mushrooms.
The retail dataset was donated by Tom Brijs and contains the market basket
data of an anonymous Belgian retail store.
In the table below, we present the structure of each dataset in terms of number of
transactions, median transaction width and the database type (sparse or dense).

Dataset     Number of transactions   Median transaction width   Database type
chess       3,196                    37                         dense
mushroom    8,124                    23                         dense
retail      213,972                  31                         sparse

Table 1 - Datasets Description


3. Steps
The algorithm implementation was downloaded from the Frequent Itemset Mining
Implementations Repository and is based on the paper Mining Frequent Itemsets using
Patricia Tries. The project is compiled and installed with the provided Makefile on
Unix-based systems.
The program takes two required command line arguments (datafile
and minsup) and one optional argument (output). The datafile argument is the path of the
transactions file. This file contains one transaction per line, and the only accepted data
representation is integer numbers. The minsup argument is the minimum support expressed
as a percentage; any number greater than one and less than 100 is accepted. If the optional
output file is provided, the frequent itemsets are printed to it, but the I/O operations increase the
average runtime by more than a factor of four.
As the datasets we downloaded from FIMDR are in the required format, no
preprocessing of input was needed.
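
To make the input format and the meaning of minsup concrete, the short Python sketch below parses text in the one-transaction-per-line integer format and converts a percentage into an absolute transaction count. The function names are hypothetical, and rounding the count up is an assumption of this sketch; the downloaded implementation may round differently.

```python
import math


def parse_transactions(text):
    """Parse one transaction per line, items separated by whitespace."""
    return [[int(token) for token in line.split()]
            for line in text.splitlines() if line.strip()]


def min_support_count(percent, n_transactions):
    """Smallest number of transactions that reaches `percent` of the database.

    Rounding up is an assumption made for this example.
    """
    return math.ceil(percent * n_transactions / 100)


sample = "1 2 3\n2 3\n\n7\n"
print(parse_transactions(sample))
print(min_support_count(70, 3196))  # chess has 3,196 transactions
```

With this convention, a minimum support of 70% on the chess database corresponds to 2238 transactions.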
In order to analyse the Patricia Mine algorithm, we created a script that runs the
program on each of the described databases with minsup values of 70, 80 and 90. For each run,
we measured the elapsed time, the CPU usage and the memory usage.

----------------------------------------------------------------------
Project Analysis Script
----------------------------------------------------------------------
#!/bin/bash
# The path of the transactions file is passed as the first argument.
database=$1
thresholds=(70 80 90)
for threshold in "${thresholds[@]}"
do
    echo "$threshold, $database"
    # %e = elapsed seconds, %P = CPU percentage, %M = maximum resident set size (KB)
    /usr/bin/time -f "%e %P %M" ./fim_all "$database" "$threshold"
done
----------------------------------------------------------------------

4. Experimental Results
The experiments in this section were performed on a 2.1 GHz Intel i3 processor
with 4 GB of RAM and a 320 GB hard disk, running Ubuntu 14.04.
Comparing program runs on the same database with different minimum supports,
we observe that the CPU usage and the memory usage stay nearly the same when
the supports differ by no more than 10 units. When the difference between supports is
larger, the difference between memory usages also increases: as we
increase the minimum support, the memory usage decreases. This behavior is due to the
fact that the data structures used by the algorithm are allocated at the beginning and their
sizes depend on the number of frequent itemsets of size one, which shrinks as the minimum
support grows. The best example of this inverse relationship between the minimum support
and the memory usage is the retail database, for which the memory usage decreases from
about 18 MB at a minimum support of 10 to about 13 MB at a minimum support of 100.
In Table 2 we can also observe that the running time for the chess database
is in the order of tens of minutes, while for the other databases it is below one second.
The explanation for this also lies in the structure of the databases. Chess is a
dense database with many frequent items, which implies that the Patricia trie after the
first traversal still has millions of nodes over which the algorithm has to iterate. Also, the chess
database has the largest transaction width (37), which means that the maximum size of the
frequent itemsets also increases.

Dataset     Min_support (%)   Time (s)   CPU usage (%)   Memory usage (KB)
chess       70                1024.58    99              2580
chess       80                825.80     99              2580
chess       90                667.01     99              2580
mushroom    70                0.18       89              2712
mushroom    80                0.14       99              2708
mushroom    90                0.14       100             2700
retail      10                0.76       99              18636
retail      40                0.41       100             16608
retail      70                0.34       99              15064
retail      100               0.30       99              13740

Table 2 - Experimental Results


As detailed in Figure 3, for the chess database, the running time decreases roughly
linearly as the minimum support increases. The time for a threshold of 90 (667.01 s) is
about 35% lower than the time for a threshold of 70 (1024.58 s); the reason lies in the
implementation of the algorithm and in the structure of the chess database. As we described
in Section 1, the algorithm traverses fewer nodes as we increase the minimum support because
fewer items satisfy the condition. Another reason for the large time decrease between runs
with increasing support is the difference between the number of items with a frequency of
70% and the number with a frequency of 80% or 90%.
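
The size of this decrease can be checked directly from the chess rows of Table 2. The following small Python calculation is only a worked example on the measured times; the variable and function names are hypothetical.

```python
# Elapsed times in seconds for the chess database, keyed by minimum support (%),
# copied from Table 2.
chess_times = {70: 1024.58, 80: 825.80, 90: 667.01}


def relative_reduction(before, after):
    """Fraction of the running time saved when going from `before` to `after`."""
    return (before - after) / before


overall = relative_reduction(chess_times[70], chess_times[90])
# Seconds saved by each 10-point increase of the minimum support.
steps = [round(chess_times[a] - chess_times[b], 2) for a, b in ((70, 80), (80, 90))]

print(f"70 -> 90: {overall:.1%} less time")
print("seconds saved per step:", steps)
```

The two steps save roughly 199 and 159 seconds, so the decrease is close to linear, and the run at 90 takes about 35% less time than the run at 70.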

Fig. 3 - Chess Dataset Experimental Results


In Figure 4 we observe that, in the case of the mushroom database, the minimum
support has little influence on the running time. This is due to the equal number
of IL traversals that the algorithm performs for both the 80% and the 90% threshold. We
can also state that 0.14 seconds is the minimum time that can be obtained using Patricia
Mine on this database, as the cost of the trie remains constant in both cases and the number
of IL traversals varies by an imperceptible percentage.

Fig. 4 - Mushroom Database Experimental Results


In the case of the retail database, we decided to use different thresholds because
with minimum supports of 70, 80 and 90 we got inconclusive results, due to the small
time difference between the three runs. As can be seen in Figure 5, there is a large drop
in time between the run with a minimum support of 10 and the one with a minimum
support of 40, but afterwards the decrease is rather linear. We can conclude that even though
fewer nodes are accessed, the time difference is not as big because more TLB misses occur
as we jump between distant memory locations in the trie.

Fig. 5 - Retail Database Experimental Results

5. Conclusions

In conclusion, Patricia Mine is an algorithm for determining the frequent itemsets that
uses a Patricia trie to represent the database, which makes it efficient in
terms of memory and performance.
The experimental results have shown that the performance of the Patricia Mine
algorithm varies with the minimum support value: the time decreases as the minimum
support increases. The linearity of the time decrease is influenced by aspects such as the
transaction width, the type and size of the database, and the TLB misses caused by the
design of the algorithm. Also, the memory usage decreases as the
minimum support increases, because items that do not have sufficient support are not
included in the initial trie.

6. References
Frequent Itemset Mining Dataset Repository, http://fimi.ua.ac.be/data/, accessed May 2016.
Frequent Itemset Mining Implementations Repository, http://fimi.ua.ac.be/src/, accessed May 2016.
Pietracaprina, Andrea. "Mining Frequent Itemsets using Patricia Tries." (2003).