DOI: 10.1145/1178597.1178603
Article

Efficient pattern mining on shared memory systems: implications for chip multiprocessor architectures

Published: 22 October 2006

Abstract

Frequent pattern mining is a fundamental data mining process which has practical applications ranging from market basket data analysis to web link analysis. In this work, we show that state-of-the-art frequent pattern mining applications are inefficient when executing on a shared memory multiprocessor system, due primarily to poor utilization of the memory hierarchy. To improve the efficiency of these applications, we explore memory performance improvements, task partitioning strategies, and task queuing models designed to maximize the scalability of pattern mining on SMP systems. Empirically, we show that the proposed strategies afford significantly improved performance. We also discuss implications of this work in light of recent trends in micro-architecture design, particularly chip multiprocessors (CMPs).
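
To make the term concrete, the following is a minimal sketch of what "frequent pattern mining" means for market basket data: it counts how many transactions contain each itemset and keeps those that meet a support threshold. It is a brute-force illustration only, not the algorithms or optimizations studied in the paper; the function name `frequent_itemsets`, its parameters, and the sample baskets are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: count} for itemsets of up to max_size items that
    occur in at least min_support transactions (brute-force enumeration)."""
    counts = Counter()
    for basket in transactions:
        # Deduplicate and sort so each itemset has one canonical tuple form.
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical market basket data, one list of items per transaction.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]

result = frequent_itemsets(baskets, min_support=2)
for itemset, count in sorted(result.items()):
    print(itemset, count)
```

Enumerating every candidate itemset per transaction grows combinatorially with basket size, which is one reason practical miners rely on pruning and carefully laid-out data structures rather than this direct approach.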


Cited By

  • (2009) Performance Issues in Parallelizing Data-Intensive Applications on a Multi-core Cluster. Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, pages 308-315. DOI: 10.1109/CCGRID.2009.83. Online publication date: 18-May-2009

    Published In

    MSPC '06: Proceedings of the 2006 workshop on Memory system performance and correctness
    October 2006
    114 pages
    ISBN:1595935789
    DOI:10.1145/1178597
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    Overall Acceptance Rate 6 of 20 submissions, 30%
