
Mining tree-structured data on multicore systems

Published: 01 August 2009
    Abstract

    Mining frequent subtrees in a database of rooted and labeled trees is an important problem in many domains, ranging from phylogenetic analysis to biochemistry and from linguistic parsing to XML data analysis. In this work we revisit this problem and develop an architecture-conscious solution targeting emerging multicore systems. Specifically, we identify a sequence of memory-related optimizations that significantly improve the spatial and temporal locality of a state-of-the-art sequential algorithm, alleviating the effects of memory latency. Additionally, these optimizations are shown to reduce the pressure on the front-side bus, an important consideration in the context of large-scale multicore architectures. We then demonstrate that these optimizations, while necessary, are not sufficient for efficient parallelization on multicores, primarily due to parametric and data-driven factors that make load balancing a significant challenge. To address this challenge, we present a methodology that adaptively and automatically modulates the type and granularity of the work being shared among different cores. The resulting algorithm achieves near-perfect parallel efficiency on up to 16 processors on challenging real-world applications. The optimizations we present have general-purpose utility, and a key outcome is the development of a general-purpose scheduling service for moldable task scheduling on emerging multicore systems.
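
To make the problem statement concrete, the following is a minimal sketch of what counting the support of one candidate subtree could look like. It is not the algorithm from the paper. It assumes ordered, induced subtrees (the abstract does not specify the subtree type), represents each tree as a hypothetical `(label, children)` tuple, and uses illustrative function names (`matches_at`, `contains`, `support`) that do not come from the source.

```python
# Illustrative only: a tree is (label, children), where children is a tuple of trees.
# Assumption: "subtree" means an ordered, induced subtree (parent-child edges
# are preserved and sibling order is respected).

def matches_at(node, pattern):
    """Return True if pattern embeds with its root mapped onto node."""
    label, children = node
    p_label, p_children = pattern
    if label != p_label:
        return False
    i = 0
    # Greedily map each pattern child to the leftmost unused matching child.
    for p_child in p_children:
        while i < len(children) and not matches_at(children[i], p_child):
            i += 1
        if i == len(children):
            return False
        i += 1  # this child is consumed; later pattern children go to its right
    return True

def contains(tree, pattern):
    """Return True if pattern occurs at the root or anywhere below it."""
    if matches_at(tree, pattern):
        return True
    return any(contains(child, pattern) for child in tree[1])

def support(database, pattern):
    """Count the database trees that contain the pattern at least once."""
    return sum(1 for tree in database if contains(tree, pattern))

# A toy database of three rooted, labeled trees.
database = [
    ("a", (("b", ()), ("c", ()))),
    ("a", (("c", ()), ("b", ()))),
    ("x", (("a", (("b", ()), ("d", ()), ("c", ()))),)),
]
pattern = ("a", (("b", ()), ("c", ())))

print(support(database, pattern))  # prints 2
```

A pattern counts as frequent when its support meets a user-chosen threshold. The second tree does not count here because its children appear in the opposite order, which matters only under the ordered-subtree assumption.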


    Published In

    Proceedings of the VLDB Endowment, Volume 2, Issue 1
    August 2009
    1293 pages

    Publisher

    VLDB Endowment

    Publication History

    Published: 01 August 2009
    Published in PVLDB Volume 2, Issue 1

    Qualifiers

    • Research-article


    Cited By

    • (2019) Efficient Subgraph Matching. Proceedings of the 2019 International Conference on Management of Data, pages 1429-1446. DOI: 10.1145/3299869.3319880. Online publication date: 25-Jun-2019
    • (2019) Short and Long-term Pattern Discovery Over Large-Scale Geo-Spatiotemporal Data. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2905-2913. DOI: 10.1145/3292500.3330755. Online publication date: 25-Jul-2019
    • (2017) Frequent subtree mining on the automata processor. Proceedings of the International Conference on Supercomputing, pages 1-11. DOI: 10.1145/3079079.3079084. Online publication date: 14-Jun-2017
    • (2016) Mining rooted ordered trees under subtree homeomorphism. Data Mining and Knowledge Discovery, 30(5):1249-1272. DOI: 10.1007/s10618-015-0439-5. Online publication date: 1-Sep-2016
    • (2016) Transactional Tree Mining. European Conference on Machine Learning and Knowledge Discovery in Databases - Volume 9851, pages 182-198. DOI: 10.1007/978-3-319-46128-1_12. Online publication date: 19-Sep-2016
    • (2015) Density-based data partitioning strategy to approximate large-scale subgraph mining. Information Systems, 48(C):213-223. DOI: 10.1016/j.is.2013.08.005. Online publication date: 1-Mar-2015
    • (2011) Posting list intersection on multicore architectures. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 963-972. DOI: 10.1145/2009916.2010045. Online publication date: 24-Jul-2011
    • (2011) Parallel skyline computation on multicore architectures. Information Systems, 36(4):808-823. DOI: 10.1016/j.is.2010.10.005. Online publication date: 1-Jun-2011
    • (2010) Ten thousand SQLs. Proceedings of the VLDB Endowment, 3(1-2):58-69. DOI: 10.14778/1920841.1920854. Online publication date: 1-Sep-2010
    • (2010) PGP-mc. Proceedings of the 15th international conference on Database Systems for Advanced Applications - Volume Part I, pages 78-84. DOI: 10.1007/978-3-642-12026-8_8. Online publication date: 1-Apr-2010
