Abstract
Because the lack of valid schemas is a critical problem for XML, and existing research on interleaving in XML is insufficient, this paper focuses on inferring XML schemas with interleaving. Previous research has shown that the essential task in schema learning is inferring regular expressions from a set of given samples. At present, the most powerful model for learning XML schemas is the class of k-occurrence regular expressions (k-OREs for short). However, no existing algorithm can learn k-OREs with interleaving. We therefore propose a complete framework that supports both k-OREs and interleaving. To the best of our knowledge, our work is the first to address these two inference problems at the same time. We first define a new subclass of regular expressions named k-OIREs, and develop an inference algorithm, iKOIRE, that learns k-OIREs based on a genetic algorithm and maximum independent set (MIS). We then conduct a series of experiments on large-scale real datasets and evaluate the effectiveness of our work against both current learning algorithms from academia and industrial tools used in practice. The results show the high practicality and strong performance of our approach, and indicate its promise for real applications.
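The two notions the abstract combines can be made concrete with a small sketch. This is not the iKOIRE algorithm. It only illustrates, under simplifying assumptions, what interleaving of two words means and how the k in "k-occurrence" is counted. The function names `in_interleaving` and `max_occurrence` are hypothetical, and element names are assumed to be single characters.

```python
from collections import Counter
from functools import lru_cache


def in_interleaving(u, v, w):
    """Return True if w can be formed by interleaving u and v,
    keeping the relative order of symbols within u and within v."""
    if len(u) + len(v) != len(w):
        return False

    @lru_cache(maxsize=None)
    def go(i, j):
        # i symbols of u and j symbols of v have been consumed so far.
        if i == len(u) and j == len(v):
            return True
        k = i + j
        if i < len(u) and u[i] == w[k] and go(i + 1, j):
            return True
        if j < len(v) and v[j] == w[k] and go(i, j + 1):
            return True
        return False

    return go(0, 0)


def max_occurrence(symbols):
    """Given the alphabet symbols of an expression in order of appearance,
    return the smallest k such that every symbol occurs at most k times."""
    counts = Counter(symbols)
    return max(counts.values(), default=0)


# "acbd" interleaves "ab" and "cd"; "badc" does not (b precedes a).
print(in_interleaving("ab", "cd", "acbd"))
print(in_interleaving("ab", "cd", "badc"))
# An expression whose symbols are a, b, a, c uses "a" twice, so k = 2.
print(max_occurrence(["a", "b", "a", "c"]))
```

The memoised recursion keeps the membership check polynomial in the word lengths for two fixed words. Learning an expression with interleaving from many samples is a different and harder problem, which is what the paper addresses.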
Work supported by the National Natural Science Foundation of China under Grant Nos. 61872339 and 61472405.
© 2019 Springer Nature Switzerland AG
Cite this paper
Li, Y., Zhang, X., Cao, J., Chen, H., Gao, C. (2019). Learning k-Occurrence Regular Expressions with Interleaving. In: Li, G., Yang, J., Gama, J., Natwichai, J., Tong, Y. (eds) Database Systems for Advanced Applications. DASFAA 2019. Lecture Notes in Computer Science, vol 11447. Springer, Cham. https://doi.org/10.1007/978-3-030-18579-4_5
DOI: https://doi.org/10.1007/978-3-030-18579-4_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-18578-7
Online ISBN: 978-3-030-18579-4
eBook Packages: Computer Science, Computer Science (R0)