When the number of replicates for each matrix entry is limited, we assume a uniform value distribution by default, since it introduces a small inductive bias compared with a more sophisticated distribution. We denote our prefix-projection (respectively, Apriori)-based mining algorithm by “DFS” (respectively, “Apri”); for pattern frequentness, we denote expected support (respectively, p-frequentness) by “ES” (respectively, “PF”). Thus, we have four algorithm variants: DFS-ES, Apri-ES, DFS-PF, and Apri-PF. In addition, since the FFT algorithm for checking the p-frequentness of a pattern is expensive, we further propose two probability approximation algorithms, using the Poisson and Gaussian distributions, respectively, to check p-frequentness, which improves the cost from \(O(n\log ^2 n)\) to \(O(n)\) ; we denote the Gaussian-based approximation algorithms by DFS-PF-G and Apri-PF-G, and the Poisson-based ones by DFS-PF-P and Apri-PF-P.
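To illustrate why the approximations are cheaper, the sketch below (our own illustration, not the paper's implementation; the exact check in the paper uses FFT-based convolution rather than the quadratic DP shown here) contrasts an exact Poisson-binomial tail computation with the \(O(n)\) Gaussian approximation for deciding whether \(P(\mathrm{support} \ge \tau_{row}) \ge \tau_{prob}\) :

```python
import math

def pfrequent_exact(probs, tau_row, tau_prob):
    # Exact p-frequentness check: the pattern's support is a
    # Poisson-binomial sum of independent Bernoulli(p_i) row indicators.
    # This O(n^2) DP builds the same PMF that the paper's FFT-based
    # method computes in O(n log^2 n).
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, q in enumerate(pmf):
            nxt[k] += q * (1 - p)       # row absent from the support
            nxt[k + 1] += q * p         # row present in the support
        pmf = nxt
    return sum(pmf[tau_row:]) >= tau_prob

def pfrequent_gaussian(probs, tau_row, tau_prob):
    # O(n) Gaussian approximation: support ~ N(mu, sigma^2) with
    # mu = sum p_i and sigma^2 = sum p_i (1 - p_i).
    mu = sum(probs)
    var = sum(p * (1 - p) for p in probs)
    if var == 0.0:
        return (1.0 if mu >= tau_row else 0.0) >= tau_prob
    z = (tau_row - 0.5 - mu) / math.sqrt(var)   # continuity correction
    tail = 0.5 * math.erfc(z / math.sqrt(2))    # P(support >= tau_row)
    return tail >= tau_prob
```

Both functions answer the same yes/no question; the Gaussian variant only touches each row probability once, which is where the \(O(n\log^2 n)\) to \(O(n)\) improvement comes from.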
Whether the mining algorithm is “DFS” or “Apri” does not impact the output. Therefore, when we evaluate result quality, we use ES for both DFS-ES and Apri-ES; PF for both DFS-PF and Apri-PF; PF-P for both DFS-PF-P and Apri-PF-P; and PF-G for both DFS-PF-G and Apri-PF-G.
To indicate whether an algorithm assumes uniform value distributions or Gaussian value distributions, we append each algorithm name with “(U)” (respectively, “(G)”) to indicate that uniform (respectively, Gaussian) distribution is adopted. For example, for DFS-PF, we have DFS-PF(U) and DFS-PF(G). Similarly, we use “(E)” to indicate that exponential distribution is adopted.
7.1 Experimental Setup
Datasets. As Table
2 shows, five publicly available real datasets were used in our experiments, including three microarray gene expression datasets, a movie rating dataset, and an RFID user trace dataset.
Among the three biology datasets, GDS2712 and GDS2002
2 are microarray datasets of the baker’s yeast
Saccharomyces cerevisiae from the
Gene Expression Omnibus (GEO) database [
3], where each matrix entry has three replicates. We also tested some other datasets such as GDS2713, GDS2715, and GDS2003, and we found that the results are similar and hence omitted to avoid redundancy in presentation. For GDS2712, the 7,826 rows shown in Table
2 are obtained from preprocessing, where we removed those unidentified genes and control probes, since they cannot be used to calculate the biological significance
p-value against the database of ground-truth gene functional categories.
The other biology dataset, GAL,
3 is regarding yeast galactose utilization [
32], which was also used by References [
12,
33]. As Table
2 shows, GAL contains 205 gene probes (rows) and 20 experimental conditions (columns) with four replicates for each entry in the matrix.
For the gene expression microarray datasets, we construct a uniform value interval for every matrix entry as
\([min, max]\) , where
\(min\) (respectively,
\(max\) ) is the minimum (respectively, maximum) replicated value of the entry from repeated experiments. As for Gaussian value distribution, given a matrix entry with replicates, we compute their sample mean
\(\mu\) and sample standard deviation
\(\sigma\) and assume that the matrix entry has a value following
\(\mathcal {N}(\mu , \sigma ^2)\) . We fit its PDF with a cubic spline as we described in Section
3.4, and we only consider the six intervals
\([\mu -3\sigma , \mu -2\sigma ]\) ,
\([\mu -2\sigma , \mu -\sigma ]\) ,
\([\mu -\sigma , \mu ]\) ,
\([\mu , \mu +\sigma ]\) ,
\([\mu +\sigma , \mu +2\sigma ]\) ,
\([\mu +2\sigma , \mu +3\sigma ]\) when computing
\(p_g(P)\) . We similarly fit the PDF of exponential value distributions following our discussion in Section
3.4.
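To make the two uncertainty models concrete, the following sketch (our own illustration; the helper names are not from the paper) derives the uniform interval and the Gaussian parameters, including the \(\mu \pm k\sigma\) split points, from an entry's replicates:

```python
import statistics

def uniform_interval(replicates):
    # Uniform model: the entry's value ranges over [min, max]
    # of its replicated measurements.
    return (min(replicates), max(replicates))

def gaussian_params(replicates):
    # Gaussian model: the entry's value follows N(mu, sigma^2) with
    # sample mean mu and sample standard deviation sigma; only the six
    # intervals between the split points mu - 3*sigma, ..., mu + 3*sigma
    # are considered when computing probabilities.
    mu = statistics.mean(replicates)
    sigma = statistics.stdev(replicates)
    splits = [mu + k * sigma for k in (-3, -2, -1, 0, 1, 2, 3)]
    return mu, sigma, splits
```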
Besides the gene expression datasets, we also used a movie rating dataset MovieLens
4 with 100,000 ratings from 943 users on 1,682 movies [
16] to evaluate the algorithms. The movie rating dataset is an incomplete matrix, and we used TensorFlow to conduct factorization-based matrix completion, which provides a complete 943
\(\times\) 1,682 user rating matrix
\(M\) , where
\(M_{ij}\) estimates User
\(i\) ’s rating towards Movie
\(j\) . The loss function of factorization is given by
where
\(M^0\) is the observed rating matrix with missing values,
\(M=UV\) is the estimated rating matrix,
\(U\in \mathbb {R}^{943\times 10}\) ,
\(V\in \mathbb {R}^{10\times 1682}\) ,
\(|X|\) takes the element-wise absolute value, reduce_sum
\((X)\) sums all elements of
\(X\) , and
\(\mathcal {P}_{\Omega }(X)\) projects to a matrix where the
\((i, j)\) -th element equals
\(X_{ij}\) if matrix entry
\((i, j)\) in
\(M^0\) is observed, and 0 otherwise. Put simply, the goal is to minimize the difference between
\(M^0\) and
\(M\) for those elements that are observed in
\(M^0\) . Implementation in TensorFlow is simple, with operators like tf.abs for taking the absolute value, tf.subtract for matrix subtraction, tf.gather to collect elements at observed matrix locations, and tf.reduce_sum to compute the sum of elements. We learn
\(U\) and
\(V\) by initializing them with truncated normal sampling, running stochastic gradient descent with a learning rate of
\(10^{-3}\) and a staircase decay of rate 0.96, and running for 100,000 steps.
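The optimization above can be sketched in plain NumPy as follows (a toy stand-in for the TensorFlow implementation; the sizes, seed, learning rate, and sign-subgradient update for the non-smooth absolute-value loss are our own illustrative choices):

```python
import numpy as np

# Toy observed rating matrix M0 (NaN = unobserved); Omega is the
# set of observed positions, encoded here as a boolean mask.
M0 = np.array([[5.0, np.nan, 3.0],
               [np.nan, 2.0, 4.0]])
mask = ~np.isnan(M0)

rng = np.random.default_rng(42)
rank = 2
U = rng.normal(scale=0.1, size=(2, rank))   # truncated-normal-like init
V = rng.normal(scale=0.1, size=(rank, 3))

def loss(U, V):
    # reduce_sum(|P_Omega(M0 - U V)|): sum of absolute errors over the
    # observed entries only (tf.abs, tf.subtract, tf.gather, and
    # tf.reduce_sum play these roles in the TensorFlow version).
    return np.abs((M0 - U @ V)[mask]).sum()

l0 = loss(U, V)
lr = 0.05
for _ in range(500):
    # Subgradient of the absolute-value loss, zeroed at unobserved entries.
    R = np.where(mask, np.sign(M0 - U @ V), 0.0)
    U, V = U + lr * R @ V.T, V + lr * U.T @ R
```

After the loop, `U @ V` approximates the observed entries of `M0` while also filling in the unobserved ones, which is exactly what the completed matrix \(M\) provides.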
To consider only the popular movies, we select the top-100 movies with the most user ratings, which yields a \(943\times 100\) submatrix \(M_s\) of \(M\) . Since each new rating \(r\) is now a low-rank approximation, we introduce uncertainty to each rating \(r\) . A rating in the original data takes its value from \(\lbrace 1,2,3,4,5\rbrace\) ; after matrix completion, a rating is a real value such as \(r=3.4\) , in which case we assign the interval \([3, 4]\) to the matrix entry, and for OPSMRM, we assign the replicates \(\lbrace 3, 4\rbrace\) . Other reasonable methods of assigning uncertainty to the values can also be used. Note that user ratings are intrinsically uncertain: (1) the discrete rating values \(\lbrace 1,2,3,4,5\rbrace\) are coarse-grained, so preferences among movies with the same rating are not captured; and (2) different users have different biases on the rating scale, with some giving a rating of 5 only to movies they like while others give a 4 to such movies as well. Such uncertainty should be captured by relaxing the strict values in \(\lbrace 1,2,3,4,5\rbrace\) .
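The interval assignment just described can be written as a one-line helper (our own sketch; how to treat an exact-integer rating is one of the “other reasonable methods” left open above, and here it simply degenerates to a point interval):

```python
import math

def rating_interval(r):
    # A completed rating such as r = 3.4 gets the uncertainty
    # interval [3, 4]; for OPSMRM, the corresponding replicate
    # set would be {3, 4}.
    return (math.floor(r), math.ceil(r))
```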
Finally, we also prepared an RFID dataset,
5 which contains the traces of 12 voluntary users in a building with 195 RFID antennas installed. Only 80 antennas are active, i.e., they detect the trace of at least one user at least once. The trace of a user consists of a series of (period, antenna) pairs, each indicating the time period during which the user is continuously detected by the antenna. When a user is detected by the same antenna during different time periods, we split the raw trace into several new traces in which the user is detected by any antenna at most once. We then generate a matrix having 105 rows (i.e., user traces) and 80 columns (i.e., antennas), where the
\((i, j)\) -th element records the potential interval of time that User
\(i\) is near Antenna
\(j\) , and an OPSM thus records a group of users that visits a group of locations (captured by antennas) in the same time order. To run OPSMRM for comparison, the set of timestamps within each time period is enumerated to generate the corresponding set-valued matrix.
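One natural reading of the trace-splitting preprocessing can be sketched as follows (our own illustration; the dataset's actual preprocessing code may differ in details such as where exactly a repeated antenna starts the new trace):

```python
def split_trace(raw_trace):
    # Split a raw trace (a time-ordered list of (period, antenna) pairs)
    # into maximal sub-traces in which each antenna appears at most once,
    # starting a new sub-trace whenever an antenna repeats.
    traces, current, seen = [], [], set()
    for period, antenna in raw_trace:
        if antenna in seen:
            traces.append(current)      # close the current sub-trace
            current, seen = [], set()
        current.append((period, antenna))
        seen.add(antenna)
    if current:
        traces.append(current)
    return traces
```

For example, a raw trace visiting antennas A, B, A, C becomes two sub-traces (A, B) and (A, C), each contributing one row of the resulting matrix.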
Evaluation Metrics. The following metrics are studied to evaluate result quality, time efficiency, and space efficiency, respectively:
(1) Result Quality: For the microarray gene expression datasets, we use the
biological significance of the mined OPSMs to demonstrate the result quality of our proposed method. We adopt a widely used metric,
p-value [
12,
21,
33], which measures the association between OPSMs mined and the known gene functional categories. Specifically, a smaller
p-value indicates a stronger association between an OPSM and gene categories, i.e., biologically more significant. Following Reference [
12], we also consider four exponential-scale
p-value ranges as
significance levels, such as
\([0,10^{{-}40})\) ,
\([10^{{-}40}, 10^{{-}30})\) ,
\([10^{{-}30}, 10^{{-}20})\) , and
\([10^{{-}20}, \infty)\) (the actual ranges depend on the concrete datasets). To compare the result quality of our algorithms with those of the existing algorithms, we use the number and proportion of OPSMs mined at each significance level.
For the movie rating dataset, we define a Kendall tau score (KTS) motivated by the concept of Kendall tau distance: For a mined OPSM with movie order \(t_{1}\lt t_{2}\lt \cdots \lt t_{k}\) , we compute a KTS for each user \(g_i\) in the OPSM, which equals the fraction of all possible \(C_k^2\) movie pairs \((t_i, t_j)\) where the rating order is consistent with that in our \(943\times 100\) submatrix \(M_s\) ; the KTS of the OPSM is then computed as the average KTS over all its users.
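Concretely, the KTS computation can be sketched as follows (our own helper names; we assume the OPSM's movie order corresponds to non-decreasing ratings, so ties count as consistent):

```python
from itertools import combinations

def kendall_tau_score(movie_order, ratings):
    # movie_order: movies t1 < t2 < ... < tk as mined in the OPSM.
    # ratings: one user's ratings in the completed submatrix M_s.
    # KTS = fraction of the C(k, 2) movie pairs whose rating order
    # is consistent with the OPSM's order.
    pairs = list(combinations(movie_order, 2))
    consistent = sum(1 for a, b in pairs if ratings[a] <= ratings[b])
    return consistent / len(pairs)

def opsm_kts(movie_order, users_ratings):
    # The OPSM's KTS is the average KTS over all its users.
    scores = [kendall_tau_score(movie_order, r) for r in users_ratings]
    return sum(scores) / len(scores)
```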
For the RFID dataset, each mined OPSM consists of a set of users sharing a common subroute passing a subset of antennas. We define a measure called the
trace matching score (TMS) as follows: For each mined OPSM with subroute
\(P=(t_{1}\lt t_{2}\lt \cdots \lt t_{k})\) , we compute a TMS for each user
\(g_i\) in the OPSM as follows:
where
\(\mathcal {T}_{g_i}\) denotes the manually annotated ground-truth trace of
\(g_i\) , which is also provided by the RFID dataset. The TMS of the OPSM is then computed as the average TMS over all its users.
(2) Time Efficiency: We evaluate the time efficiency of the algorithms by considering the following two metrics:
•
TR-Time: total running time for mining the OPSMs;
•
SP-Time: the average time for mining a single OPSM.
Here, SP-Time equals TR-Time divided by the number of patterns mined.
(3) Space Efficiency: We report the peak memory consumption of each program run in our experiments.
All experiments were repeated 10 times and the reported metrics were averaged over the 10 runs (although the results observed from different runs are actually quite stable/similar).
For the purpose of succinct presentation, we abuse the term “more than” (similarly for “less than” and “fewer than”): When we say that A is \(k\) times more than B, we mean \(A=k\cdot B\) , not \(A=(1+k)\cdot B\) . The terms “more,” “less,” and “fewer” are meant to indicate the directions of comparison.
As an overview of our findings, our ES and PF algorithms are robust and output higher-quality OPSMs than OPSMRM and POPSM, and they generally find more OPSMs. Our algorithms also consistently outperform OPSMRM and POPSM in terms of SP-Time in the context of uniform value distributions. ES consumes the least amount of memory, and PF consumes more memory than ES but generally less than OPSMRM and POPSM. PF performs the best w.r.t. result quality and quantity. Among other findings, using Gaussian value distributions as the uncertainty model generally outperforms the uniform-distribution scheme, but the running time is two orders of magnitude longer; using exponential value distributions exhibits similar performance. Finally, our parallel algorithm for prefix-projection-based mining scales well with the number of mining threads.
The rest of this section is organized as follows: Section
7.2 reports the results on GDS2712 and GDS2002 in terms of result quality and quantity, running time, and peak memory cost. There, we use GDS2712 to demonstrate that our algorithms are efficient even when the exponential value distribution is adopted. Due to space limitations, we report the results on GAL in Appendix
C. Section
7.3 then uses the GDS2712 dataset to study the effect of parameters
\(\tau _{cut}\) and
\(\tau _{prob}\) , and Section
7.4 compares row selection strategies to justify the use of
\(\tau _{cut}\) . Subsequently, Section
7.5 reports the vertical scalability experiments of our parallel prefix-projection-based mining algorithm, with additional scalability results in Appendix
D. Section
7.6 reports our results on MovieLens, and Section
7.7 replicates the MovieLens data to generate datasets of different sizes and study algorithm scalability. Finally, Section
7.8 reports the results on the RFID user trace dataset.
7.2 Performance Comparison on GDS2712 and GDS2002
Size Thresholds \(\tau _{row}\) and \(\tau _{col}\) for Testing. Both datasets are used in Reference [
33] for evaluation, and following Reference [
33], we fix
\(\tau _{cut} = 0.6\) for them. In fact, we also tested the other GDS datasets in Reference [
33] such as GDS2713, GDS2715, and GDS2003, and the results are very similar and thus omitted. Since the results are similar for various
\(\tau _{prob}\) values, we only show results when
\(\tau _{prob} = 0.5\) to save space (see Section
7.3 for more results on the effect of
\(\tau _{prob}\) ). While many
\((\tau _{row}, \tau _{col})\) pairs are possible, we would like to show a limited number of typical combinations so they can fit in one figure to be readable. For GDS2712, we set
\(\tau _{col} = 4\) and vary
\(\tau _{row}\) among
\(\lbrace 400, 500, 600, 700, 800\rbrace\) to show the effect of
\(\tau _{row}\) . For GDS2002, we vary
\(\tau _{row}\) among
\(\lbrace 200, 300, 400\rbrace\) but also vary
\(\tau _{col}\) among
\(\lbrace 2, 3\rbrace\) to show the effect of
\(\tau _{col}\) .
Result Quality. Figure
8 presents the fraction of mined OPSMs that fall in each
significance level for GDS2712. We see that our algorithms find larger fractions of high-quality OPSMs than POPSM and OPSMRM, as they have taller white bars (representing the highest significance level with p-value
\(\in [0, 10^{-20})\) ). For example, when
\((\tau _{row}, \tau _{col}) = (800, 4)\) , our proposed
ES(U),
PF(U),
PF-G(U), and
PF-P(U) all have
\(90\%\) of patterns falling into the highest
significance level, while
POPSM and
OPSMRM have only
\(50\%\) . Moreover, our proposed
ES(U),
PF(U),
PF-G(U), and
PF-P(U) discovered a significantly larger number of OPSM patterns compared with
POPSM and
OPSMRM as shown in Table
3. We can see that our algorithms using uniform value distributions generate consistently more OPSMs than
POPSM and
OPSMRM (and of higher quality as well). Furthermore, if we look at the numbers of OPSMs falling in the
highest significance level, then when
\((\tau _{row}, \tau _{col}) = (800, 4)\) , our proposed
ES(U),
PF(U),
PF-G(U), and
PF-P(U) have 60, 60, 55, and 60 patterns falling into the highest
significance level, respectively, while
POPSM and
OPSMRM have only 5 and 1, respectively.
As for our algorithms using Gaussian value distributions, while Figure
8 shows that they do not always have a taller white bar than the uniform distribution variants (e.g., much taller when
\((\tau _{row}, \tau _{col}) = (400, 4)\) but shorter when
\((\tau _{row}, \tau _{col}) = (800, 4)\) ), the absolute OPSM counts found in each significance level dominate those of the uniform distribution variants, as we can see from Table
3, showing that using the more expensive Gaussian distribution algorithm variants does pay off in result quality.
Figure
9 presents the fraction of mined OPSMs that fall in each
significance level for GDS2712, similar to Figure
8 except that we replace those algorithms adopting Gaussian value distributions with those adopting exponential value distributions. We can see that using exponential value distributions leads to much shorter white bars than using uniform distributions when
\((\tau _{row}, \tau _{col}) = (800, 4)\) , even though the absolute result counts are consistently higher, as shown in Table
3.
Comparing Figure
8 with Figure
9, we can see that using Gaussian value distributions is a better assumption than using exponential value distributions, since the former has a taller white bar. The same holds true when comparing absolute result counts as shown in Table
3, where using Gaussian value distributions consistently leads to more results for every
\((\tau _{row}, \tau _{col})\) setting.
In the remaining experiments on the other datasets, we consider only uniform and Gaussian value distributions, since using exponential value distributions leads to a lower result quality than using Gaussian value distributions, while both are much more expensive than using uniform value distributions due to the need to apply the expensive cubic spline approach.
As Table
3 shows, among our algorithms,
PF performs better than
ES, and also better than
PF-G and
PF-P. This verifies that considering the PMF gives better results than considering only the expectation. Also note that
PF-G and
PF-P produce results not far behind
PF, and in fact,
PF-P(G) gives the same results as
PF(G), and
PF-P(E) gives the same results as
PF(E), showing that our approximations are accurate.
As for GDS2002, Figure
10 presents the fraction of mined OPSMs that fall in each
significance level, while Table
4 shows the absolute counts. We observe similar results as on GDS2712. Specifically, Figure
10 shows that our algorithms find larger fractions of high-quality OPSMs than POPSM and OPSMRM, as they have taller white bars. The absolute counts in all the significance levels are also much higher, as Table
4 shows, with the absolute counts found by the Gaussian distribution algorithm variants in each significance level dominating those of the uniform distribution variants. This shows that using the more expensive Gaussian distribution algorithm variants does pay off in result quality. The comparative performances of the various algorithms also exhibit the same observations as those on GDS2712.
Time Efficiency. We hereby report the experimental results on time efficiency for GDS2712. The time results on GDS2002 are very similar and thus omitted.
Figures
11 and
12 show the total running time
TR-Time and average single-pattern running time
SP-Time, respectively, of our proposed methods adopting uniform value distributions, as well as the existing algorithms
OPSMRM and
POPSM for different size thresholds
\((\tau _{row}, \tau _{col})\) . From Figure
11, we can see that
POPSM runs faster than our methods, and
OPSMRM has comparable total running time with our methods, but recall that both
POPSM and
OPSMRM mined significantly fewer OPSM patterns than our algorithms, and their results are of poorer quality across all size thresholds
\((\tau _{row}, \tau _{col})\) . For example, as Table
3 shows, when
\((\tau _{row}, \tau _{col}) = (800, 4)\) , our
ES(U) discovered 33.5 times more OPSMs than
OPSMRM and 6.7 times more than
POPSM.
According to Figure
12,
OPSMRM has the worst
SP-Time, and our methods have an average
single-pattern running time comparable to that of
POPSM. Among our methods,
PF-G and
PF-P algorithms are always faster than
PF, and in some cases the
DFS algorithms run faster than our
Apri algorithms: For example,
DFS-ES runs faster than
Apri-ES. This is because the GDS2712 dataset is small with merely 7 columns, so many of our pruning techniques are not effective. As we shall see later, on the GAL dataset with 20 columns and MovieLens dataset with 100 columns, Apriori-based algorithms are faster than DFS-based ones.
Figures
13 and
14 show the total running time
TR-Time and average single-pattern running time
SP-Time, respectively, of our proposed methods adopting Gaussian value distributions. We can see that the comparative performances of our algorithms are very similar to those in Figures
11 and
12 in the context of uniform value distributions. However, algorithms using Gaussian value distributions are two orders of magnitude more expensive.
Figures
15 and
16 show the total running time
TR-Time and average single-pattern running time
SP-Time, respectively, of our proposed methods adopting exponential value distributions. We can see that the performances of our algorithms are very similar to those in Figures
13 and
14 in the context of Gaussian value distributions, actually slightly more expensive but comparable. This shows that our cubic spline approach is efficient and can handle various value distributions.
Memory Efficiency. We hereby report the experimental results on memory efficiency for GDS2712. The memory cost results on GDS2002 are very similar and thus omitted.
Figure
17 shows the peak memory usage of various algorithms using uniform value distributions. We can see that our algorithms have a consistently lower peak memory consumption than
OPSMRM. As an illustration, when
\((\tau _{row}, \tau _{col})=(400, 4)\) , our
Apri-ES algorithm consumes 6.6 times less memory than
OPSMRM. Our lower memory usage does not compromise the number of mined OPSMs: as Table
3 already indicates, our algorithms discovered many times more OPSMs than
OPSMRM and
POPSM. For example, when
\((\tau _{row}, \tau _{col})=(400, 4)\) , our
DFS/Apri-ES algorithms discovered 2.5 times more OPSMs than
OPSMRM and 3.7 times more OPSMs than
POPSM.
Our
DFS/Apri-PF algorithms consume more memory than
DFS/Apri-ES due to the need to process PMF vectors (cf. Section
4.2), but they generate higher-quality OPSMs (in terms of
p-value) than
ES (cf. Figure
8). However, even
DFS/Apri-PF algorithms have a much lower memory consumption compared with
OPSMRM. While
POPSM tends to consume slightly less memory than our algorithms, it outputs far fewer OPSMs.
Figure
18 shows the peak memory usage of our algorithms using Gaussian value distributions, and compared with Figure
17, we can see that their memory cost is one order of magnitude higher. This is reasonable, because each matrix entry now introduces seven split points (cf. Figure
6) rather than two as in the uniform interval case, and the recursive integral computation as specified in Equation (
11) now requires the maintenance of a high-order polynomial for probability computation.
Figure
19 shows the peak memory usage of our algorithms using exponential value distributions, and compared with Figure
18, we can see that the memory cost is very similar.
7.3 Effect of Probability Threshold and Inclusion Threshold
We next report the experiments on the effect of \(\tau _{cut}\) and \(\tau _{prob}\) using the GDS2712 dataset. The results on the other datasets are similar and thus omitted.
Effect of Inclusion Threshold \(\tau _{cut}\) . We fix
\(\tau _{row} = 400\) and
\(\tau _{col}=3\) , since the various methods generate comparable numbers of OPSMs with these parameters. We also fix
\(\tau _{prob} = 0.5\) for
DFS-PF and
Apri-PF, but vary
\(\tau _{cut}\) among
\(\lbrace 0.2,0.3,0.4,0.5,0.6\rbrace\) . Figure
20 shows the fraction of OPSMs in each significance level for every algorithm with different inclusion threshold
\(\tau _{cut}\) . Table
5 presents the number of OPSMs mined with different inclusion threshold
\(\tau _{cut}\) , where we can see that our algorithms consistently find more OPSMs than
POPSM and
OPSMRM at various values of
\(\tau _{cut}\) in all significance levels, and those with Gaussian value distributions are better than those with uniform value distributions. In terms of the distribution of OPSMs among various significance levels, we can see from Figure
20 that our algorithms consistently have a tall white bar (with those using Gaussian distributions being taller), while OPSMRM is only competitive when
\(\tau _{cut}\) is very low (however, its absolute count is much lower, as shown in Table
5).
Effect of Probability Threshold \(\tau _{prob}\) . We study the effect of \(\tau _{prob}\) used in PF by fixing \((\tau _{row} = 600, \tau _{col}=4)\) and \(\tau _{cut}=0.2\) ; and varying \(\tau _{prob}\) from 0.2 to 0.95 with a step length of 0.05.
First consider
PF(U) algorithms. Table
6 shows the number of OPSMs mined at each value of
\(\tau _{prob}\) , and Figure
21 shows their distribution in different significance levels. We can see that our approach is very robust, not sensitive to parameter
\(\tau _{prob}\) , and has consistent performance in terms of result quality. Figure
22 shows the
TR-Time of
PF,
PF-G, and
PF-P algorithms w.r.t. different
\(\tau _{prob}\) . We can see that our DFS-based algorithms are faster than the Apriori-based ones, which is often the case when the number of results is small. However, we found that when the number of results reaches the thousands, our Apriori-based algorithms are actually faster, since their additional pruning power outweighs the pruning cost. Recall, for example, Figures
36 and
37 for GAL. So, both algorithms have their merits.
We also tested
PF(G) algorithms, with Table
7 showing the number of OPSMs mined at each value of
\(\tau _{prob}\) , Figure
23 showing their distribution in different significance levels, and Figure
24 showing the
TR-Time of
PF,
PF-G, and
PF-P algorithms w.r.t. different
\(\tau _{prob}\) . We can observe similar results, except that the
TR-Time is two orders of magnitude slower (compare Figure
24 with Figure
22), and the result quality is higher, as can be seen from the taller white bars in Figure
23 than in Figure
21, and the larger number of results in Table
7 than in Table
6.
7.6 Experiments on the Movie Rating Dataset
Recall from Section
7.1 that we use a completed movie rating matrix to find OPSMs where users share the same preference order over a set of movies, and that a
Kendall tau score (KTS) is defined to judge how well an OPSM pattern is reflected in its users’ movie ratings. We now report the results on the movie rating dataset where we added uncertain intervals to the discrete scale ratings.
Time Efficiency. Table
8 shows the
TR-Time of various algorithms with different
\((\tau _{row}, \tau _{col})\) (without loss of generality, we fix
\(\tau _{cut} = 0.3\) and
\(\tau _{prob} = 0.6\) ), where
OOM means out of memory. Note that when
\(\tau _{row} = 200\) ,
POPSM always runs out of memory after 10 hours.
OPSMRM has the shortest
TR-Time but it failed to produce any OPSMs that satisfy the
\(\tau _{col}\) threshold. We see that our methods are tens of times faster than
POPSM, and ES methods are faster than
PF, as they do not process PMF vectors. Also, our Apriori-based algorithms are faster than their
DFS-based counterparts due to the effective pattern pruning. Finally, approximation algorithms
PF-P and
PF-G provide reasonable speedup to
PF.
Effectiveness. Without loss of generality, we consider \(\tau _{cut} = 0.3\) , \(\tau _{prob} = 0.6\) , \(\tau _{row} = 300\) , and \(\tau _{col}=5\) , and visualize the top-10 OPSMs with the highest KTS mined by DFS/Apri-ES, DFS/Apri-PF, and POPSM. We remark that in this set of experiments, our DFS-ES consumes 2,582 \(\times\) less memory than POPSM and 3,294 \(\times\) less memory than OPSMRM.
Table
9(a) lists the top-10 OPSM patterns with the highest KTS produced by
ES, Table
9(b) by
PF, Table
9(c) by
PF-P, Table
9(d) by
PF-G, and Table
9(e) by
POPSM. To be space-efficient, we use abbreviations for movie names, and their meanings are listed in Table
10. We can see that the OPSMs mined by our methods generally have a higher KTS.
Memory Efficiency. Table
11 shows the peak memory usage of our algorithms,
POPSM and
OPSMRM w.r.t. size thresholds
\((\tau _{row}, \tau _{col})\) , where
OOM represents
Out of Memory. When
\((\tau _{row}, \tau _{col}) = (200, 4)\) or
\((200, 5)\) ,
POPSM ran out of memory after running for 10 hours (or more accurately, 603 minutes). It is clear that our
Expected Support (ES)-based mining algorithms, namely,
Apri-ES and
DFS-ES, have a consistently much lower peak memory consumption than all the other algorithms. As an illustration, when
\((\tau _{row}, \tau _{col})=(300, 5)\) ,
DFS-ES consumes 2,582 times less memory than
POPSM and 3,294 times less memory than
OPSMRM, and our
Apri-ES consumes 720 times less memory than
POPSM and 918 times less memory than
OPSMRM. Our PMF-approximation-based PF algorithms use memory comparable to (or more accurately, slightly larger than) that of the ES counterparts, and their memory usage is three orders of magnitude less than that of their exact PF counterparts. This shows that our PMF approximation allows PF-based mining algorithms to scale to much larger datasets under the same memory budget, even though the speedup ratio is not as significant.
7.7 Scalability Experiments
To examine how well our algorithms scale to the number of rows and columns of a data matrix \(D\) , we duplicate the rows and columns of our \(943\times 100\) movie rating submatrix \(M_s\) to generate larger datasets for running scalability experiments.
To test row scalability, we duplicated the rows of
\(M_s\) for 100, 200, 300, 400, and 500 times, and ran the various algorithms on them. Without loss of generality, we set
\(\tau _{row} = 0.6 * n\) (where
\(n\) is the row number) and fix
\(\tau _{col} = 5\) and
\(\tau _{prob} = 0.6\) . Unfortunately,
OPSMRM runs out of memory even on the smallest data with
\(M_s\) ’s rows duplicated for 100 times, so we cannot report its result. The results for the other algorithms are shown in Figure
29, where we observe that POPSM is much slower than our algorithms, and that
Apri-ES,
Apri-PF-P, and
Apri-PF-G are much faster than the other algorithms. Overall, the running times of all algorithms scale linearly with the row number
\(n\) , which matches our derived time complexity
\(O(Cn(m^2+\log ^2 n))\) in Section
5.2 (i.e., almost linear to
\(n\) and quadratic to
\(m\) ).
To test column scalability, we duplicated the columns of
\(M_s\) for 2, 4, 8, and 16 times and ran the various algorithms on them. Here, we set
\(\tau _{row} = 700, \tau _{col} = 5\) , and
\(\tau _{prob} = 0.6\) . The results are shown in Figure
30, where we observe that the running times of the various algorithms increase quickly with column number
\(m\) , which aligns with our analysis. Although
OPSMRM has a running time similar to our algorithms, its memory consumption rises quickly with more columns. Also,
OPSMRM runs out of memory when the columns are duplicated for four times and thus, we only plot two points for it in Figure
30. Finally, POPSM is not only slower than all our algorithms, but the slope of its curve is also steeper.
7.8 Experiments on RFID User Trace Dataset
Recall from Section
7.1 that we use an RFID user trace dataset to find OPSM patterns that reveal a group of users who travel to a sequence of locations in the same time order, and that a
trace matching score (TMS) is defined to judge how well an OPSM pattern is reflected in its users’ actual travel trajectories that are manually annotated. We now report the results on the RFID user trace dataset where uniform time intervals are imposed at each visit location (i.e., antenna). We also compare our methods with POPSM and OPSMRM. However, OPSMRM always runs out of memory in this set of experiments and is thus not reported.
In this set of experiments, we use
\(\tau _{prob}=0.8\) and
\(\tau _{cut}=0.6\) for PF algorithms. Figure
31 shows the fraction of OPSMs in each significance level for our algorithms and
POPSM with different
\((\tau _{row}, \tau _{col})\) , and Table
12 presents the number of OPSMs mined. We can see that
POPSM mines very few patterns compared with our algorithms, and all of its OPSMs have TMS in the range
\([0.8, 0.9)\) , while our algorithms are able to find OPSMs with TMS
\(\gt 0.9\) (i.e., black bars). Moreover,
PF-G still achieves performance close to that of
PF even when
PF-P fails to get close. This verifies that PF-G is more robust than PF-P, thanks to its use of both the expected support and the standard deviation (i.e.,
\(\mu\) and
\(\sigma\) ), while PF-P only uses the expected support, as we have indicated at the end of Section
4.4.2.
Figures
32 and
33 show the total running time
TR-Time and average single-pattern running time
SP-Time, respectively, of our algorithms and
POPSM for different size thresholds
\((\tau _{row}, \tau _{col})\) . From Figure
32, we can see that
POPSM runs faster than our methods, but recall that
POPSM mined significantly fewer OPSM patterns than our algorithms, and its results are of poorer quality. The
PF-P algorithms are also fast, but recall that they find far fewer OPSMs than our other algorithms. According to Figure
33,
POPSM has the worst
SP-Time, followed by
PF-P algorithms, due to their small number of OPSMs found.
Figure
34 shows the peak memory usage of our algorithms and
POPSM. We can see that the memory cost of
PF algorithms is much higher than that of the other algorithms. In this set of experiments,
ES and
PF-G have similar result quantity and quality, showing that
ES is effective in certain datasets, and
PF-G is a good approximation method that reduces the cost significantly while retaining results similar to those of
PF.