Abstract
XML schemas are useful in various applications. However, many XML documents in practice are not accompanied by a schema or by a valid schema. Therefore, it is essential to design efficient algorithms for schema learning. Each element in XML schema has its content model defined by a regular expression. Schema learning can be reduced to the inference of restricted regular expressions. In this paper, we focus on learning restricted regular expressions with interleaving from a set of XML documents. The new subclass is named as CHAin Regular Expression with Interleaving (ICHARE). Then based on single occurrence automaton (SOA) and maximum independent set (MIS), we introduce an inference algorithm GenICHARE. The algorithm is proved to infer a descriptive ICHARE from a set of given sample. At last, based on the data set crawled from the Web, we compare the coverage proportion of ICHARE compared with other existing subclasses. Besides, we analyze the conciseness of regular expressions inferred by GenICHARE based on DBLP. Experimental results show that ICHARE is more concise and useful in practice, and the inference algorithm is promising and effective.
H. Chen—Work supported by the National Natural Science Foundation of China under Grant Nos. 61472405.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Abiteboul, S., Buneman, P., Suciu, D.: Data on the Web: from Relations to Semistructured Data and XML. Morgan Kaufmann, San Francisco (2000)
Bex, G.J., Gelade, W., Neven, F., Vansummeren, S.: Learning deterministic regular expressions for the inference of schemas from XML data. ACM Trans. Web 4(4), 1–32 (2010)
Bex, G.J., Neven, F., Bussche, J.V.D.: DTDs versus XML schema: a practical study. In: International Workshop on the Web and Databases, pp. 79–84 (2004)
Bex, G.J., Neven, F., Schwentick, T., Tuyls, K.: Inference of concise DTDs from XML data. In: International Conference on Very Large Data Bases, Seoul, Korea, September, pp. 115–126 (2006)
Bex, G.J., Neven, F., Schwentick, T., Vansummeren, S.: Inference of concise regular expressions and DTDs. ACM Trans. Database Syst. 35(2), 1–47 (2010)
Bex, G.J., Neven, F., Vansummeren, S.: Inferring XML schema definitions from XML data. In: International Conference on Very Large Data Bases, University of Vienna, Austria, September, pp. 998–1009 (2007)
Boneva, I., Ciucanu, R., Staworko, S.: Simple schemas for unordered XML. In: International Workshop on the Web and Databases (2015)
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn., pp. 1297–1305 (2001)
Feng, X.Q., Zheng, L.X., Chen, H.M.: Inference Algorithm for a Restricted Class of Regular Expressions, vol. 41. Computer Science (2014)
Freydenberger, D.D., Kötzing, T.: Fast learning of restricted regular expressions and DTDs. Theory Comput. Syst. 57(4), 1114–1158 (2015)
Garcia, P., Vidal, E.: Inference of K-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 12(9), 920–925 (2002)
Garofalakis, M., Gionis, A., Shim, K.: XTRACT: learning document type descriptors from XML document collections. Data Min. Knowl. Discov. 7(1), 23–56 (2003)
Ghelli, G., Colazzo, D., Sartiani, C.: Efficient inclusion for a class of XML types with interleaving and counting. Inf. Syst. 34(7), 643–656 (2009)
Gold, E.M.: Language Identification in the limit. Inf. Control 10(5), 447–474 (1967)
Grijzenhout, S., Marx, M.: The quality of the XML web. Web Semant. Sci. Serv. Agents World Wide Web 19, 59–68 (2013)
Koch, C., Scherzinger, S., Schweikardt, N., Stegmaier, B.: Schema-based scheduling of event processors and buffer minimization for queries on structured data streams. In: Proceedings of the Thirtieth International Conference on Very Large Data Bases, vol. 30, pp. 228–239. VLDB Endowment (2004)
Li, Y., Zhang, X., Peng, F., Chen, H.: Practical study of subclasses of regular expressions in DTD and XML schema. In: Li, F., Shim, K., Zheng, K., Liu, G. (eds.) APWeb 2016. LNCS, vol. 9932, pp. 368–382. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45817-5_29
Manolescu, I., Florescu, D., Kossmann, D.: Answering XML queries on heterogeneous data sources. VLDB 1, 241–250 (2001)
Martens, W., Neven, F.: Typechecking top-down uniform unranked tree transducers. In: Calvanese, D., Lenzerini, M., Motwani, R. (eds.) ICDT 2003. LNCS, vol. 2572, pp. 64–78. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36285-1_5
Martens, W., Neven, F., Schwentick, T.: Complexity of decision problems for XML schemas and chain regular expressions. Siam J. Comput. 39(4), 1486–1530 (2009)
Peng, F., Chen, H.: Discovering restricted regular expressions with interleaving. In: Cheng, R., Cui, B., Zhang, Z., Cai, R., Xu, J. (eds.) APWeb 2015. LNCS, vol. 9313, pp. 104–115. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25255-1_9
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Zhang, X., Li, Y., Cui, F., Dong, C., Chen, H. (2018). Inference of a Concise Regular Expression Considering Interleaving from XML Documents. In: Phung, D., Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2018. Lecture Notes in Computer Science(), vol 10938. Springer, Cham. https://doi.org/10.1007/978-3-319-93037-4_31
Download citation
DOI: https://doi.org/10.1007/978-3-319-93037-4_31
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93036-7
Online ISBN: 978-3-319-93037-4
eBook Packages: Computer ScienceComputer Science (R0)