XML tree structure compression using RePair

Published: 01 November 2013

Abstract

XML tree structures can conveniently be represented using ordered unranked trees. Due to the repetitiveness of XML markup, these trees can be compressed effectively using dictionary-based methods, such as minimal directed acyclic graphs (DAGs) or straight-line context-free (SLCF) tree grammars. While minimal SLCF tree grammars are in general smaller than minimal DAGs, they cannot be computed in polynomial time unless P=NP. Here, we present a new linear-time algorithm for computing small SLCF tree grammars, called TreeRePair, and show that it greatly outperforms BPLEX, the best previously known algorithm. TreeRePair generalizes Larsson and Moffat's RePair string compression algorithm to trees. SLCF tree grammars can be used as efficient memory representations of trees. Using TreeRePair, we are able to produce the smallest queryable memory representation of ordered trees that we are aware of. Our investigations over a large corpus of commonly used XML documents show that tree traversals over TreeRePair grammars are 14 times slower than over pointer structures and 5 times slower than over succinct trees, while memory consumption is only 1/43 and 1/6, respectively. With respect to file compression, we are able to show that a Huffman-based coding of TreeRePair grammars gives compression ratios comparable to those of the best known XML file compressors.
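
To make the idea behind RePair concrete, the following is a minimal sketch of the string version only: it repeatedly replaces the most frequent pair of adjacent symbols with a fresh nonterminal until no pair occurs twice. It is not the article's TreeRePair, it works on strings rather than trees, and it is a naive quadratic-time loop rather than a linear-time implementation. The names `pair_counts`, `repair`, and `expand` are hypothetical and chosen for illustration.

```python
from collections import Counter


def pair_counts(seq):
    """Count non-overlapping occurrences of each adjacent pair, left to right."""
    counts = Counter()
    last_counted = {}  # pair -> index where it was last counted
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # In a run like "aaa" the pair ("a", "a") overlaps itself; skip the overlap.
        if pair[0] == pair[1] and last_counted.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_counted[pair] = i
    return counts


def repair(text):
    """Naive RePair sketch: returns (start sequence, rules).

    Terminals are one-character strings; nonterminals are ints (an assumption
    of this sketch). Each rule maps a nonterminal to a pair of symbols.
    """
    seq = list(text)
    rules = {}
    while True:
        counts = pair_counts(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # replacing a pair that occurs once would not save space
        nonterminal = len(rules)
        rules[nonterminal] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nonterminal)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(seq, rules):
    """Decompress by expanding nonterminals with an explicit stack."""
    out = []
    stack = list(reversed(seq))
    while stack:
        symbol = stack.pop()
        if isinstance(symbol, int):
            left, right = rules[symbol]
            stack.append(right)
            stack.append(left)
        else:
            out.append(symbol)
    return "".join(out)


if __name__ == "__main__":
    start, grammar = repair("abababab")
    print(start, grammar)
    print(expand(start, grammar))
```

The resulting rules form a straight-line grammar: every nonterminal has exactly one rule and only refers to symbols created earlier, so the grammar derives exactly one string. The tree setting described in the abstract replaces pairs of adjacent symbols with repeated parent-child patterns in the tree, which this sketch does not attempt.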

Published In

Information Systems, Volume 38, Issue 8, November 2013, 278 pages

Publisher

Elsevier Science Ltd., United Kingdom

Author Tags

1. Memory representation
2. Tree structure compression
3. XML
