Abstract
In this paper, we consider the problem of discovering interesting substructures from a large collection of semi-structured data in the framework of optimized pattern discovery. We model semi-structured data and patterns as labeled ordered trees, and present an efficient algorithm that discovers the best labeled ordered trees, that is, those that optimize a given statistical measure, such as information entropy or classification accuracy, over a collection of semi-structured data. We give theoretical analyses of the computational complexity of the algorithm for patterns of bounded and unbounded size. Experiments show that the algorithm performs well and discovers interesting patterns in real datasets.
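As a rough illustration of what optimizing a statistical measure can mean here, the following Python sketch scores a single candidate pattern by the weighted information entropy of the class labels after splitting a labeled collection into documents that match the pattern and documents that do not. This is only an assumed formulation of the entropy measure for a two-class setting, not the algorithm of the paper; the function names `entropy` and `split_entropy` and the count-based interface are hypothetical.

```python
import math


def entropy(p: float) -> float:
    """Binary entropy in bits; taken to be 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def split_entropy(matched_pos: int, matched_neg: int,
                  total_pos: int, total_neg: int) -> float:
    """Weighted entropy of the labels after splitting by pattern match.

    Hypothetical interface: the caller supplies how many positive and
    negative documents the pattern matches, plus the class totals.
    Lower values mean the pattern separates the two classes better.
    """
    n = total_pos + total_neg
    m = matched_pos + matched_neg          # documents matching the pattern
    u = n - m                              # documents not matching it
    unmatched_pos = total_pos - matched_pos
    h_matched = entropy(matched_pos / m) if m else 0.0
    h_unmatched = entropy(unmatched_pos / u) if u else 0.0
    return (m / n) * h_matched + (u / n) * h_unmatched
```

Under this formulation, a pattern that matches exactly the positive documents scores 0, while a pattern that matches everything or nothing in a balanced collection scores 1 bit. An optimized pattern discovery procedure would search the space of labeled ordered trees for the pattern that minimizes such a score.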
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Abe, K., Kawasoe, S., Asai, T., Arimura, H., Arikawa, S. (2002). Optimized Substructure Discovery for Semi-structured Data. In: Elomaa, T., Mannila, H., Toivonen, H. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2002. Lecture Notes in Computer Science, vol 2431. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45681-3_1
DOI: https://doi.org/10.1007/3-540-45681-3_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44037-6
Online ISBN: 978-3-540-45681-0
eBook Packages: Springer Book Archive