Mining Compressing Sequential Patterns

Published: 01 February 2014

Abstract

Pattern mining based on data compression has been successfully applied to many data mining tasks. For itemset data, the Krimp algorithm, based on the minimum description length (MDL) principle, was shown to be very effective at solving the redundancy issue in descriptive pattern mining. For sequence data, however, the redundancy of the set of frequent sequential patterns has not been fully addressed in the literature. In this article, we study MDL-based algorithms for mining non-redundant sets of sequential patterns from a sequence database. First, we propose an encoding scheme for compressing sequence data with sequential patterns. Second, we formulate the problem of mining the most compressing sequential patterns from a sequence database. We show that this problem is intractable and belongs to the class of inapproximable problems. Therefore, we propose two heuristic algorithms. The first uses a two-phase approach similar to that of Krimp for itemset data. To overcome performance issues in candidate generation, we also propose GoKrimp, an algorithm that mines compressing patterns directly by greedily extending a pattern until adding the extension to the dictionary yields no additional compression benefit. Because checking an extension for additional compression benefit is computationally expensive, we propose a dependency test that chooses only related events for extending a given pattern. This technique significantly improves the efficiency of GoKrimp while preserving the quality of the pattern set. We conduct an empirical study on eight datasets to show the effectiveness of our approach compared with state-of-the-art algorithms in terms of interpretability of the extracted patterns, run time, compression ratio, and classification accuracy when the discovered patterns are used as features for different classifiers. © 2013 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2013
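
The greedy idea in the abstract (extend a pattern only while the extension still improves compression) can be illustrated with a toy sketch. This is not the article's encoding scheme or the GoKrimp implementation: it assumes unit cost per symbol, matches only contiguous occurrences, and omits the dependency test. The names `encode`, `description_length`, `greedy_extend`, and the sample database `db` are all hypothetical.

```python
from typing import List, Sequence, Tuple

Event = str
Pattern = Tuple[Event, ...]


def encode(seq: Sequence[Event], pattern: Pattern) -> int:
    """Count the symbols needed for seq when every contiguous,
    non-overlapping occurrence of pattern (scanned left to right)
    is replaced by one pointer symbol. Unit costs are an assumption."""
    n, m = len(seq), len(pattern)
    i = cost = 0
    while i < n:
        if m > 0 and tuple(seq[i:i + m]) == pattern:
            i += m  # the whole occurrence becomes a single pointer
        else:
            i += 1  # the event is written out literally
        cost += 1
    return cost


def description_length(db: List[Sequence[Event]], pattern: Pattern) -> int:
    """Toy two-part cost: dictionary size plus encoded data size."""
    return len(pattern) + sum(encode(seq, pattern) for seq in db)


def greedy_extend(db: List[Sequence[Event]], seed: Event) -> Pattern:
    """Append one event at a time, keeping the extension that lowers the
    toy description length most; stop when no extension lowers it."""
    alphabet = sorted({event for seq in db for event in seq})
    pattern: Pattern = (seed,)
    best = description_length(db, pattern)
    while True:
        cost, event = min(
            (description_length(db, pattern + (e,)), e) for e in alphabet
        )
        if cost >= best:  # no additional compression benefit
            return pattern
        pattern, best = pattern + (event,), cost


# Hypothetical sample database of three event sequences.
db = [list("abcxabc"), list("yabcabc"), list("abczz")]

print(description_length(db, ()))               # 19: no pattern in the dictionary
print(greedy_extend(db, "a"))                   # ('a', 'b', 'c')
print(description_length(db, ("a", "b", "c")))  # 12
```

On this sample the extension stops at `('a', 'b', 'c')` because every four-event candidate occurs at most once, so the longer dictionary entry costs more than it saves.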

Published In

Statistical Analysis and Data Mining, Volume 7, Issue 1
February 2014
92 pages
ISSN: 1932-1864
EISSN: 1932-1872

Publisher

John Wiley & Sons, Inc.

United States

Author Tags

  1. complexity
  2. compressing patterns mining
  3. compression-based pattern mining
  4. minimum description length
  5. sequence data

Qualifiers

  • Article

Cited By

  • (2024) TaSPM: Targeted Sequential Pattern Mining. ACM Transactions on Knowledge Discovery from Data 18:5 (1-18). DOI: 10.1145/3639827. Online publication date: 28-Feb-2024
  • (2024) Breadth-First Search Approach for Mining Serial Episodes with Simultaneous Events. Proceedings of the 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD) (36-44). DOI: 10.1145/3632410.3632445. Online publication date: 4-Jan-2024
  • (2024) SWoTTeD: an extension of tensor decomposition to temporal phenotyping. Machine Learning 113:9 (5939-5980). DOI: 10.1007/s10994-024-06545-8. Online publication date: 1-Sep-2024
  • (2023) Efficient Depth-First Search Approach for Mining Injective General Episodes. Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD) (1-9). DOI: 10.1145/3570991.3571012. Online publication date: 4-Jan-2023
  • (2023) Methods for Analyzing Medical-Order Sequence Variants in Sequential Pattern Mining for Electronic Medical Record Systems. ACM Transactions on Computing for Healthcare 4:1 (1-28). DOI: 10.1145/3561825. Online publication date: 30-Mar-2023
  • (2023) Process Discovery on Deviant Traces and Other Stranger Things. IEEE Transactions on Knowledge and Data Engineering 35:11 (11784-11800). DOI: 10.1109/TKDE.2022.3232207. Online publication date: 1-Nov-2023
  • (2023) Clustering customer orders in a smart factory using sequential pattern mining. The Journal of Supercomputing 79:16 (18970-18992). DOI: 10.1007/s11227-023-05351-8. Online publication date: 22-May-2023
  • (2022) Online summarizing alerts through semantic and behavior information. Proceedings of the 44th International Conference on Software Engineering (1646-1657). DOI: 10.1145/3510003.3510055. Online publication date: 21-May-2022
  • (2022) OWSP-Miner: Self-adaptive One-off Weak-gap Strong Pattern Mining. ACM Transactions on Management Information Systems 13:3 (1-23). DOI: 10.1145/3476247. Online publication date: 4-Feb-2022
  • (2022) On the Feasibility of Anomaly Detection with Fine-Grained Program Tracing Events. Journal of Network and Systems Management 30:2. DOI: 10.1007/s10922-021-09635-3. Online publication date: 1-Apr-2022
