Article

Mining statistically important equivalence classes and delta-discriminative emerging patterns

Authors:

Limsoon Wong

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 430 - 439

https://doi.org/10.1145/1281192.1281240

Published: 12 August 2007

Abstract

The support-confidence framework is the most common measure used in itemset mining algorithms, because its antimonotonicity effectively simplifies the search lattice. As many previous studies have observed, this computational convenience introduces both quality and statistical flaws into the results. In this paper, we introduce a novel algorithm that produces itemsets ranked by statistical merit under sophisticated test statistics such as chi-square, risk ratio, and odds ratio. Our algorithm is based on the concept of equivalence classes. An equivalence class is a set of frequent itemsets that always occur together in the same set of transactions. Therefore, all itemsets within an equivalence class share the same level of statistical significance, whichever test statistic is used. Because an equivalence class can be uniquely determined and concisely represented by a closed pattern and a set of generators, we mine only closed patterns and generators, using a simultaneous depth-first search scheme. This parallel approach has not been exploited by any prior work. We evaluate our algorithm in two respects. In general, we compare it with LCM and FPclose, which are the best algorithms tailored for mining only closed patterns. In particular, we compare it with epMiner, which is the most recent algorithm for mining a type of relative risk pattern known as minimal emerging patterns. Experimental results show that our algorithm is faster than all of them, sometimes by multiple orders of magnitude. These statistically ranked patterns, together with the efficiency of the algorithm, have high potential for real-life applications, especially in biomedical and financial fields where classical test statistics are of dominant interest.
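The notion of an equivalence class in the abstract can be illustrated with a small brute-force sketch. This is not the paper's depth-first algorithm; it simply enumerates every itemset of a toy database, groups itemsets that occur in exactly the same transactions, and reads off the closed pattern (the unique maximal member) and the generators (the minimal members) of each group. The transactions, class labels, and all function names below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy database and class labels (not taken from the paper).
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"c", "d"},
]
LABELS = ["pos", "pos", "neg", "neg"]


def support_set(itemset, transactions):
    """Ids of the transactions that contain every item of `itemset`."""
    return frozenset(i for i, t in enumerate(transactions) if itemset <= t)


def equivalence_classes(transactions, min_support=1):
    """Group frequent itemsets by identical support set (brute force)."""
    items = sorted(set().union(*transactions))
    classes = defaultdict(list)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            tids = support_set(itemset, transactions)
            if len(tids) >= min_support:
                classes[tids].append(itemset)
    return classes


def closed_and_generators(members):
    """Closed pattern = union of all members; generators = minimal members."""
    closed = frozenset().union(*members)
    generators = [g for g in members if not any(m < g for m in members)]
    return closed, generators


def contingency(itemset, transactions, labels, positive="pos"):
    """2x2 counts: (pos with, pos without, neg with, neg without) the itemset."""
    tids = support_set(itemset, transactions)
    pos_with = sum(1 for i in tids if labels[i] == positive)
    neg_with = len(tids) - pos_with
    pos_total = sum(1 for lab in labels if lab == positive)
    neg_total = len(labels) - pos_total
    return pos_with, pos_total - pos_with, neg_with, neg_total - neg_with


if __name__ == "__main__":
    for tids, members in equivalence_classes(TRANSACTIONS).items():
        closed, gens = closed_and_generators(members)
        print(sorted(tids), "closed:", sorted(closed),
              "generators:", [sorted(g) for g in gens])
```

Because the contingency table is computed from the support set alone, every itemset in one class yields the same table, and so the same value for any statistic derived from it, such as chi-square, risk ratio, or odds ratio. The enumeration here is exponential in the number of items and is meant only to make the definitions concrete.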

Index Terms

Mining statistically important equivalence classes and delta-discriminative emerging patterns
1. Information systems
  1. Information systems applications
    1. Data mining

Published In

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining

August 2007

1080 pages

ISBN:9781595936097

DOI:10.1145/1281192

General Chair: Pavel Berkhin (Yahoo!, USA)

Program Chairs: Rich Caruana (Cornell University, USA), Xindong Wu (University of Vermont, USA)

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD07: The 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 12 - 15, 2007

San Jose, California, USA

Acceptance Rates

KDD '07 Paper Acceptance Rate 111 of 573 submissions, 19%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
