Mining statistically important equivalence classes and delta-discriminative emerging patterns

Published: 12 August 2007
DOI: 10.1145/1281192.1281240

Abstract

The support-confidence framework is the most widely used measure in itemset mining algorithms because its antimonotonicity effectively prunes the search lattice. As many previous studies have observed, this computational convenience introduces both quality and statistical flaws into the results. In this paper, we introduce a novel algorithm that produces itemsets ranked by statistical merit under sophisticated test statistics such as chi-square, risk ratio, and odds ratio. Our algorithm is based on the concept of equivalence classes. An equivalence class is a set of frequent itemsets that always occur together in the same set of transactions. Therefore, all itemsets within an equivalence class share the same level of statistical significance, whichever test statistic is used. Because an equivalence class can be uniquely determined and concisely represented by a closed pattern and a set of generators, we mine only closed patterns and generators, using a simultaneous depth-first search scheme. This parallel approach has not been exploited in any prior work. We evaluate our algorithm in two respects. In general, we compare it with LCM and FPclose, which are the best algorithms tailored for mining only closed patterns. In particular, we compare it with epMiner, which is the most recent algorithm for mining a type of relative risk pattern known as minimal emerging patterns. Experimental results show that our algorithm is faster than all of them, sometimes by multiple orders of magnitude. These statistically ranked patterns, together with the algorithm's efficiency, have high potential for real-life applications, especially in biomedical and financial fields, where classical test statistics are of dominant interest.
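
To make the equivalence-class idea concrete, the following is a minimal brute-force sketch in Python. It is not the simultaneous depth-first algorithm described in the abstract: it enumerates every itemset, groups itemsets by the exact set of transactions that contain them, and reports each group's closed pattern (its unique maximal member) and its generators (its minimal members). The function names `equivalence_classes` and `class_statistics`, the toy transactions, and the "case"/"control" labels are all assumptions made for illustration. The statistics are the standard 2x2 contingency-table formulas for chi-square (without continuity correction), risk ratio, and odds ratio.

```python
import math
from collections import defaultdict
from itertools import combinations


def equivalence_classes(transactions, min_support=1):
    """Brute force: map each transaction-id set to (closed pattern, generators)."""
    items = sorted(set().union(*transactions))
    members_by_tids = defaultdict(list)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            # Ids of the transactions that contain every item of the itemset.
            tids = frozenset(
                i for i, t in enumerate(transactions) if itemset <= t
            )
            if len(tids) >= min_support:
                members_by_tids[tids].append(itemset)
    classes = {}
    for tids, members in members_by_tids.items():
        # The union of all members occurs in the same transactions,
        # so it is the unique maximal member: the closed pattern.
        closed = frozenset().union(*members)
        # Generators are the members with no proper subset in the class.
        generators = [m for m in members if not any(o < m for o in members)]
        classes[tids] = (closed, generators)
    return classes


def _ratio(num, den):
    """Divide, returning inf (or nan for 0/0) instead of raising."""
    if den == 0:
        return math.inf if num > 0 else math.nan
    return num / den


def class_statistics(tids, labels, positive):
    """2x2-table statistics that depend only on the class's transaction ids."""
    n = len(labels)
    a = sum(1 for i in tids if labels[i] == positive)    # present, positive
    b = len(tids) - a                                    # present, negative
    c = sum(1 for lab in labels if lab == positive) - a  # absent, positive
    d = n - a - b - c                                    # absent, negative
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    return {
        "chi_square": _ratio(n * (a * d - b * c) ** 2, margins),
        "risk_ratio": _ratio(a * (c + d), c * (a + b)),
        "odds_ratio": _ratio(a * d, b * c),
    }


# Hypothetical toy data: five transactions with one class label each.
transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c"}, {"b"}]
labels = ["case", "case", "control", "case", "control"]

classes = equivalence_classes(transactions, min_support=2)
for tids, (closed, generators) in classes.items():
    stats = class_statistics(tids, labels, positive="case")
    print(sorted(closed), [sorted(g) for g in generators], stats)
```

In this toy data the itemsets {a, c}, {b, c}, and {a, b, c} occur in exactly the same two transactions, so they form one class with closed pattern {a, b, c} and generators {a, c} and {b, c}. Because `class_statistics` takes only the transaction ids as input, every member of a class necessarily receives the same value under each statistic, which is the property the abstract relies on. Exhaustive enumeration is exponential in the number of items, so this sketch is suitable only for very small examples.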

    Published In

    KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2007
    1080 pages
    ISBN:9781595936097
    DOI:10.1145/1281192

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. equivalence classes
    2. itemsets with ranked statistical merit

    Qualifiers

    • Article

    Conference

    KDD07

    Acceptance Rates

    KDD '07 Paper Acceptance Rate 111 of 573 submissions, 19%;
    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Cited By

    • (2024) Identifying Coordinated Activities on Online Social Networks Using Contrast Pattern Mining. 2024 International Joint Conference on Neural Networks (IJCNN), pages 1-9. DOI: 10.1109/IJCNN60899.2024.10651552. Online publication date: 30-Jun-2024
    • (2023) Fast Frequent Patterns Mining by Multiple Sampling With Tight Guarantee Under Bayesian Statistics. IEEE Transactions on Cybernetics, 53(5):2993-3006. DOI: 10.1109/TCYB.2021.3125196. Online publication date: May-2023
    • (2023) Theory and rationale of interpretable all-in-one pattern discovery and disentanglement system. npj Digital Medicine, 6(1). DOI: 10.1038/s41746-023-00816-9. Online publication date: 22-May-2023
    • (2023) Mining Discriminative Itemsets Over Data Streams Using Efficient Sliding Window. SN Computer Science, 4(5). DOI: 10.1007/s42979-023-01887-x. Online publication date: 27-Jun-2023
    • (2023) DAC: Discriminative Associative Classification. SN Computer Science, 4(4). DOI: 10.1007/s42979-023-01819-9. Online publication date: 17-May-2023
    • (2023) Mining frequent generators and closures in data streams with FGC-Stream. Knowledge and Information Systems, 65(8):3295-3335. DOI: 10.1007/s10115-023-01852-3. Online publication date: 3-Apr-2023
    • (2022) RHPTree—Risk Hierarchical Pattern Tree for Scalable Long Pattern Mining. ACM Transactions on Knowledge Discovery from Data, 16(4):1-33. DOI: 10.1145/3488380. Online publication date: 8-Jan-2022
    • (2022) Effective Mining of Contrast Hybrid Patterns from Nominal-numerical Mixed Data. Advanced Data Mining and Applications, pages 352-367. DOI: 10.1007/978-3-031-22064-7_26. Online publication date: 24-Nov-2022
    • (2021) DISSparse: Efficient Mining of Discriminative Itemsets. Journal of Information & Knowledge Management. DOI: 10.1142/S0219649222500095. Online publication date: 3-Dec-2021
    • (2021) FGC-Stream: A novel joint miner for frequent generators and closed itemsets in data streams. 2021 IEEE International Conference on Data Mining (ICDM), pages 419-428. DOI: 10.1109/ICDM51629.2021.00053. Online publication date: Dec-2021
