Article

Efficient pattern mining on shared memory systems: implications for chip multiprocessor architectures

Authors: Gregory Buehrer, Yen-Kuang Chen, Srinivasan Parthasarathy, Anthony Nguyen, Daehyun Kim

MSPC '06: Proceedings of the 2006 workshop on Memory system performance and correctness

Pages 31 - 40

https://doi.org/10.1145/1178597.1178603

Published: 22 October 2006

Abstract

Frequent pattern mining is a fundamental data mining process which has practical applications ranging from market basket data analysis to web link analysis. In this work, we show that state-of-the-art frequent pattern mining applications are inefficient when executing on a shared memory multiprocessor system, due primarily to poor utilization of the memory hierarchy. To improve the efficiency of these applications, we explore memory performance improvements, task partitioning strategies, and task queuing models designed to maximize the scalability of pattern mining on SMP systems. Empirically, we show that the proposed strategies afford significantly improved performance. We also discuss implications of this work in light of recent trends in micro-architecture design, particularly chip multiprocessors (CMPs).
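For readers unfamiliar with the problem the abstract refers to, the following is a minimal sketch of frequent itemset counting on market basket data. It is a brute-force illustration only and is not the authors' algorithm or any of the optimizations the paper proposes; the function name `frequent_itemsets`, the `min_support` and `max_size` parameters, and the sample baskets are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset of up to max_size items and keep those that
    occur in at least min_support transactions (brute-force enumeration)."""
    counts = Counter()
    for basket in transactions:
        # Deduplicate and sort so each itemset has one canonical tuple form.
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical market basket data, one list of items per transaction.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]

result = frequent_itemsets(baskets, min_support=2)
for itemset, count in sorted(result.items()):
    print(itemset, count)
```

Every transaction updates the same shared table of counts, which hints at why memory behavior and the way work is divided among threads matter once this kind of computation is run on a shared memory multiprocessor.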

Cited By

Ravi V, Agrawal G (2009). Performance Issues in Parallelizing Data-Intensive Applications on a Multi-core Cluster. Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, pp. 308-315. DOI: 10.1109/CCGRID.2009.83. Online publication date: 18-May-2009
https://dl.acm.org/doi/10.1109/CCGRID.2009.83

Index Terms

Efficient pattern mining on shared memory systems: implications for chip multiprocessor architectures
1. Information systems
  1. Information systems applications
    1. Data mining

Information

Published In

MSPC '06: Proceedings of the 2006 workshop on Memory system performance and correctness

October 2006

114 pages

ISBN:1595935789

DOI:10.1145/1178597

General Chair: Antony Hosking (Purdue U)

Program Chair: Ali-Reza Adl-Tabatabai (Intel)

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

MSPC '06: ACM SIGPLAN Workshop on Memory Systems Performance and Correctness 2006

Sponsor: SIGPLAN

October 22, 2006

San Jose, California

Acceptance Rates

Overall Acceptance Rate 6 of 20 submissions, 30%

Bibliometrics

Total Citations: 1

Total Downloads: 403

Downloads (last 12 months): 0

Downloads (last 6 weeks): 0

Reflects downloads up to 30 Aug 2024
