Optimized Substructure Discovery for Semi-structured Data

Abe, Kenji; Kawasoe, Shinji; Asai, Tatsuya; Arimura, Hiroki; Arikawa, Setsuo

doi:10.1007/3-540-45681-3_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2431)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery

Abstract

In this paper, we consider the problem of discovering interesting substructures from a large collection of semi-structured data in the framework of optimized pattern discovery. We model semi-structured data and patterns with labeled ordered trees, and present an efficient algorithm that discovers the best labeled ordered trees that optimize a given statistical measure, such as information entropy or classification accuracy, in a collection of semi-structured data. We give theoretical analyses of the computational complexity of the algorithm for patterns with bounded and unbounded size. Experiments show that the algorithm performs well and discovers interesting patterns on real datasets.
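
To make the objective concrete, the following is a minimal sketch of how a single labeled ordered tree pattern could be scored against a collection of positive and negative example trees using an entropy measure. It is an illustration only and not the algorithm of the paper: the tree encoding as `(label, children)` tuples, the matching semantics (pattern children must map, in order, onto a subsequence of a node's children), and the names `matches_at`, `occurs`, and `split_entropy` are all assumptions introduced here. A real optimized pattern miner would additionally enumerate candidate patterns and prune the search, which this sketch does not attempt.

```python
import math

# Assumed encoding: a tree is (label, [child, child, ...]); child order matters.


def matches_at(pattern, node):
    """Return True if `pattern` embeds with its root mapped to `node`.

    Assumed semantics: labels must agree, and the pattern's children must
    match, left to right, a subsequence of the node's children.
    """
    p_label, p_kids = pattern
    n_label, n_kids = node
    if p_label != n_label:
        return False
    i = 0
    for p_kid in p_kids:
        # Greedily take the earliest data child that this pattern child matches.
        while i < len(n_kids) and not matches_at(p_kid, n_kids[i]):
            i += 1
        if i == len(n_kids):
            return False
        i += 1
    return True


def occurs(pattern, tree):
    """Return True if `pattern` matches at the root or at any descendant."""
    if matches_at(pattern, tree):
        return True
    return any(occurs(pattern, kid) for kid in tree[1])


def binary_entropy(p):
    """Entropy in bits of a two-class distribution with positive rate p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def split_entropy(pattern, positives, negatives):
    """Weighted class entropy after splitting the data by pattern occurrence.

    Lower is better: 0.0 means the pattern separates the two classes perfectly.
    """
    hit_pos = sum(occurs(pattern, t) for t in positives)
    hit_neg = sum(occurs(pattern, t) for t in negatives)
    total = len(positives) + len(negatives)
    hits = hit_pos + hit_neg
    misses = total - hits
    score = 0.0
    if hits:
        score += hits / total * binary_entropy(hit_pos / hits)
    if misses:
        score += misses / total * binary_entropy((len(positives) - hit_pos) / misses)
    return score


# Hypothetical toy data: the two trees differ only in sibling order.
positives = [("a", [("b", []), ("c", [])])]
negatives = [("a", [("c", []), ("b", [])])]
ordered_pattern = ("a", [("b", []), ("c", [])])
root_only_pattern = ("a", [])
```

On this toy data, `ordered_pattern` occurs only in the positive tree, so its split entropy is zero, while `root_only_pattern` occurs in both trees and leaves the classes fully mixed. This is the sense in which sibling order can carry discriminative information in an ordered tree model.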

Author information

Authors and Affiliations

Department of Informatics, Kyushu University, 6-10-1 Hakozaki Higashi-ku, 812-8581, Fukuoka, Japan
Kenji Abe, Shinji Kawasoe, Tatsuya Asai, Hiroki Arimura & Setsuo Arikawa
PRESTO, JST, Japan
Hiroki Arimura

Editor information

Editors and Affiliations

Department of Computer Science, University of Helsinki, P.O. Box 26, 00014, Helsinki, Finland
Tapio Elomaa, Heikki Mannila & Hannu Toivonen

About this paper

Cite this paper

Abe, K., Kawasoe, S., Asai, T., Arimura, H., Arikawa, S. (2002). Optimized Substructure Discovery for Semi-structured Data. In: Elomaa, T., Mannila, H., Toivonen, H. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2002. Lecture Notes in Computer Science, vol 2431. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45681-3_1

DOI: https://doi.org/10.1007/3-540-45681-3_1
Published: 18 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44037-6
Online ISBN: 978-3-540-45681-0
eBook Packages: Springer Book Archive
