
Mining tree-structured data on multicore systems

Authors: Shirish Tatikonda and Srinivasan Parthasarathy

Proceedings of the VLDB Endowment, Volume 2, Issue 1

Pages 694 - 705

https://doi.org/10.14778/1687627.1687706

Published: 01 August 2009

Abstract

Mining frequent subtrees in a database of rooted and labeled trees is an important problem in many domains, ranging from phylogenetic analysis to biochemistry and from linguistic parsing to XML data analysis. In this work we revisit this problem and develop an architecture-conscious solution targeting emerging multicore systems. Specifically, we identify a sequence of memory-related optimizations that significantly improve the spatial and temporal locality of a state-of-the-art sequential algorithm, alleviating the effects of memory latency. Additionally, these optimizations are shown to reduce the pressure on the front-side bus, an important consideration in the context of large-scale multicore architectures. We then demonstrate that these optimizations, while necessary, are not sufficient for efficient parallelization on multicores, primarily due to parametric and data-driven factors that make load balancing a significant challenge. To address this challenge, we present a methodology that adaptively and automatically modulates the type and granularity of the work being shared among different cores. The resulting algorithm achieves near-perfect parallel efficiency on up to 16 processors on challenging real-world applications. The optimizations we present have general-purpose utility, and a key outcome is the development of a general-purpose scheduling service for moldable task scheduling on emerging multicore systems.
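The idea of adapting work granularity to the current load can be illustrated with a small sketch. This is not the paper's algorithm or scheduling service; it is a hypothetical illustration in which workers explore a deliberately skewed search tree and hand child subtrees to a shared queue only when that queue is running low. The names `children`, `count_sequential`, `mine_parallel`, and the `low_water` threshold are assumptions made for this example, and the sketch varies only the granularity of shared work, not its type.

```python
import queue
import threading


def children(node):
    """Hypothetical search space: node = (depth, width).

    Wider nodes spawn more children, so subtrees are uneven on purpose,
    which is what makes static partitioning balance poorly.
    """
    depth, width = node
    if depth == 0:
        return []
    return [(depth - 1, w) for w in range(width + 1)]


def count_sequential(node):
    """Reference answer: number of nodes in the subtree rooted at node."""
    return 1 + sum(count_sequential(c) for c in children(node))


def mine_parallel(root, num_workers=4, low_water=2):
    """Count nodes with workers that share finer-grained work on demand."""
    tasks = queue.Queue()
    tasks.put(root)
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        while True:
            node = tasks.get()
            if node is None:  # sentinel: no more work
                tasks.task_done()
                return
            local = 0
            stack = [node]
            while stack:
                cur = stack.pop()
                local += 1
                for child in children(cur):
                    # qsize() is only approximate, which is fine for a heuristic.
                    if tasks.qsize() < low_water:
                        tasks.put(child)     # queue is low: share this subtree
                    else:
                        stack.append(child)  # queue is full: keep work local
            with lock:
                total += local
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    tasks.join()  # returns once every queued subtree has been processed
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    root = (6, 3)
    print(mine_parallel(root) == count_sequential(root))
```

Because a worker enqueues children before it marks its own task done, the queue's unfinished-task count cannot reach zero while work remains, so `tasks.join()` serves as a simple termination check in this sketch.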


Published In

Proceedings of the VLDB Endowment, Volume 2, Issue 1, August 2009, 1293 pages

ISSN: 2150-8097

Publisher: VLDB Endowment
