
XML tree structure compression


To appear in XANTEC'08, DEXA Workshops, Proceedings, IEEE Computer Society, 2008.

XML Tree Structure Compression

Sebastian Maneth, Nikolay Mihaylov, Sherif Sakr
NICTA and Univ. of New South Wales
Sydney, Australia
first.last@nicta.com.au

Abstract

In an XML document a considerable fraction consists of markup, that is, begin- and end-element tags describing the document's tree structure. XML compression tools such as XMill separate the tree structure from the data content and compress each separately. The main focus of these tools is how to group similar data content together prior to applying a standard data compressor such as gzip, bzip2, or ppm. In contrast, the focus of this paper is on compressing the tree structure part of an XML document. We use a known algorithm to derive a grammar representation of the tree structure which factors out the repetition of tree patterns. We then investigate several succinct binary encodings of these grammars. Our experiments show that we are consistently smaller than the tree structure compression carried out by XMill, using the same backend compressors as XMill on our encodings. The most surprising result, however, is that our own Huffman-like encoding of the grammars (without any backend compressor whatsoever) consistently outperforms XMill with a gzip backend. This is of particular interest because our Huffman-like encoding can be queried without prior decompression. To the best of our knowledge this is the smallest queriable XML tree structure representation currently available.

1 Introduction

XML is a W3C (World Wide Web Consortium) standard that has become the main data exchange format of the web. One of the criticisms of XML is its verbosity: data becomes blurred in a thick soup of bulky markup information. Moreover, the markup tends to be highly repetitive, resulting in large document files with great redundancy.
To overcome these problems there are efforts to establish a standard binary format for XML; this has been the charter of the W3C working group "XML Binary Characterization", now continued by the "W3C Efficient XML Interchange Working Group". However, their work is still in progress and no standard has been agreed upon yet. An alternative to a standard binary format is to compress XML documents using data compression tools. A considerable part of an XML document consists of its markup, mainly in the form of nested element tags; these element tags induce the document's tree structure. Most XML compression tools, such as XMill [13], separate the document's tree structure from its data part and compress them separately. In XMill, data can additionally be stored in separate containers prior to running a backend compressor such as gzip, bzip2, or ppm over all containers (tree structure and data). In this way, similar data items can be grouped in one container, which improves backend compression. While much work has dealt with better selection of containers and with applying more sophisticated backend compressors, we are not aware of any work that has tried to improve over XMill's tree structure compression. In this paper we address exactly this issue: how can we effectively compress the tree structure of an XML document? Our starting point comes from pattern-based compression: the BPLEX algorithm [5] takes a tree as input and searches bottom-up for repeated patterns (that is, connected subgraphs) in the tree. It then represents repeated patterns only once, resulting in a compressed pointer-based representation called an SLT grammar. Compared to DAGs, a well-known data structure for efficient pointer-based tree representation, BPLEX's SLT grammars require less than 50% of the pointers needed for the minimal unique DAG of a given tree. As far as we know, BPLEX generates the smallest pointer-based tree representations currently available.
Our next idea was then: can we use these small pointer-based representations for file compression as well? Our initial efforts in simply applying gzip or bzip2 to the textual SLT grammars were disappointing: XMill's tree structure compression, which merely fixes a symbol table and then outputs one symbol per opening tag and a fixed symbol per closing tag (prior to backend compression), outperformed our zipped grammars by far. We then started to investigate different ways of coding SLT grammars and DAGs into succinct bit representations. Related succinct codings exist: in path-based codings, the codeword for each node is prefixed by its parent's codeword, so that two nodes in the XML document tree share the same codeword if they have the same path; finally, there are Succinct DOM [7] and ISX [16], which use a balanced-parentheses encoding scheme to store the XML tree structure. We have run extensive experiments with a plethora of different codings and compression runs. Our experience can be summarized as follows:

– For file compression, DAGs and SLT grammars offer only limited improvement over XMill: DAGs are on average 73% of the size of XMill (using gzip as backend compressor for both) and SLT grammars produced by BPLEX are on average 66% of the size of XMill.

– As a queriable format our succinct SLT grammars produce excellent results: we are able to produce a queriable representation of an XML tree structure that is on average 68% of the size of using XMill with gzip! This is by far the most space-efficient queriable format for XML tree structures. Queriable DAGs are on average 131% of the size of XMill with gzip.

These numbers show that for file compression DAGs are useful: they improve over XMill and are fairly cheap to obtain (by one run over the tree, keeping a hash table of all subtrees). Running BPLEX is more expensive and therefore better suited for producing queriable in-memory representations.
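The cheap DAG construction mentioned above (one run over the tree, keeping a hash table of all subtrees) can be sketched as follows. This is our own illustration, not the authors' implementation; the tuple-based node representation and the function name are our choices.

```python
# Sketch (our illustration): computing the minimal DAG (mu-DAG) of a
# binary tree in one bottom-up pass, keeping a hash table that maps each
# distinct subtree to its reference number, as described in the text.

def min_dag(tree):
    """tree: None for a NIL-leaf, or a (label, left, right) tuple.
    Returns (root_ref, entries), where entries[i] is None for the NIL node
    or (label, left_ref, right_ref); equal subtrees share one entry."""
    table, entries = {}, []

    def walk(t):
        # The key identifies the subtree via the references of its children,
        # so two structurally equal subtrees produce the same key.
        key = None if t is None else (t[0], walk(t[1]), walk(t[2]))
        if key not in table:              # first occurrence: new entry
            table[key] = len(entries)
            entries.append(key)
        return table[key]                 # reoccurrence: reuse the reference

    return walk(tree), entries
```

On the tree a(a(NIL, NIL), a(NIL, NIL)), written as nested tuples, this yields the reference list [None, ("a", 0, 0), ("a", 1, 1)], matching the bottom-up DAG notation (0 : NIL, 1 : a[0, 0], 2 : a[1, 1]) used later in the paper.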
Related Work

XML compressors can be classified into two main groups: 1) Non-queriable (archival) compressors. This group focuses on achieving the highest possible compression ratio [6, 13, 11]. 2) Queriable XML compressors. These allow queries to be evaluated on their compressed formats [12, 14, 15]. Their compression ratio is usually worse than that of the archival compressors, and rightly so: their main objective is to avoid full document decompression during query execution. The primary innovation in XML compression is XMill's idea of separating structure from data; most later XML compressors follow this idea. The structural encoding scheme of XMill [13] assigns each distinct element and attribute name an integer code, which serves as the key into the element and attribute name dictionaries, before passing the result to a backend general-purpose text compressor. The XMLPPM compressor [6] encodes SAX events before passing them to a PPM compressor. The AXECHOP compressor [11] uses a byte tokenization scheme that preserves the original structure of the document and then uses the MPM compression algorithm [9] to generate a context-free grammar, which is then passed through an adaptive arithmetic coder. The first queriable compressor, XGrind [15], retains the original structure of the XML document and encodes element and attribute names using a dictionary; character data is compressed by semi-adaptive Huffman coding. XSeq [14] is a grammar-based queriable XML compressor; it separates structure and data as XMill does, and then separately applies the well-known grammar-based string compression algorithm Sequitur. In TREECHOP [12], SAX events are binary-coded and written sequentially to the compression stream.

2 Binary Tree Encodings

The tree structure induced by the nesting of element tags in an XML document naturally corresponds to an unranked (ordered) tree. In such a tree, each node can have an arbitrary number of children.
As an example, imagine a tree that consists of a root node labeled "book", which has 3 children labeled "chapter", which in turn have 0, 3, and 2 children labeled "section", respectively. It is well known that any unranked tree can be conveniently represented by a binary tree. One common such representation is the "first-child/next-sibling" encoding: the first child of each node of the unranked tree becomes the left child of the corresponding node of the binary tree, and the next sibling of each unranked node becomes the right child of the binary node. Any "missing" left or right children in the binary tree are filled with NIL-leaves (for which we will use the label " "). The encoding of the aforementioned unranked tree is the tree (written as an expression):

TEncode = book(chapter( ,chapter(section( ,section( ,section( , ))), chapter(section( ,section( , )), ))), )

Succinct Bit Representations

We now want to store a binary tree such as the one shown above in a succinct bitwise representation. We assume one a priori fixed leaf symbol, which represents the NIL-tree; every other symbol is binary.

Fixed-Length Encoding

The most direct way of bit-encoding the sequence of labels that determines our binary tree is to use a symbol table and to assign to each symbol in the table a binary code of fixed length. The sequence of tree node labels is then encoded by simply concatenating the codes of the symbols. Our special NIL symbol is coded by the fixed-length expansion of the number zero, the first symbol in the symbol table by the number one in fixed-length binary, and so on. The symbol table itself is simply a sequence of 0-byte-terminated strings, followed by an additional 0-byte to indicate the end of the symbol table:

B = book0chapter0section00

Thus, the final codewords of the symbol table are assigned as follows: NIL=00, book=01, chapter=10, section=11.
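The first-child/next-sibling encoding described above can be sketched in a few lines. This is a minimal illustration under our own node representation (an unranked node is a (label, children) pair; "_" is our rendering of the NIL label); the paper does not prescribe an implementation.

```python
# Sketch (our illustration): first-child/next-sibling encoding of an
# unranked tree. An unranked node is (label, [children]); the resulting
# binary tree is (label, left, right) with None as the NIL-leaf.

def fcns(nodes):
    """Encode a list of unranked sibling nodes as one binary tree: the
    first sibling's children become its left subtree, and the remaining
    siblings become its right subtree."""
    if not nodes:
        return None                      # "missing" child -> NIL-leaf
    (label, children), rest = nodes[0], nodes[1:]
    return (label, fcns(children), fcns(rest))

def show(t):
    """Write the binary tree as an expression, using _ for NIL."""
    if t is None:
        return "_"
    return "%s(%s,%s)" % (t[0], show(t[1]), show(t[2]))
```

Applied to the book/chapter/section example, show(fcns([book])) reproduces the expression TEncode above, with "_" in place of the blank NIL label.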
If we apply this coding to the tree encoding TEncode, we obtain: 01100010110011001100001011001100000000.

Avoiding NIL-Leaves

It seems quite wasteful, at first sight, to use that many NIL-nodes in our binary tree; after all, the original (unranked) XML tree only has 9 nodes, while our binary tree for it has 19 nodes: 9 internal nodes and 10 NIL-leaves. We can avoid the NIL-leaves by introducing 4 different versions of each symbol: (1) one for a binary node (denoted by the 2-bit prefix "11"), (2) one for a (unary) node representing a binary node whose left child is NIL (denoted by the prefix "01"), (3) one for a (unary) node representing a binary node whose right child is NIL (denoted by the prefix "10"), and (4) one for a leaf node representing a binary node whose left and right children are both NIL (denoted by the prefix "00"). If we apply this to our example, then the bit sequence after the symbol table starts with 1001 0110 ...: the root node has right child NIL, so we start with 10 followed by the code 01 for book; the first chapter node has left child NIL, giving 01 followed by the code 10 for chapter, and so on. The length of the final codeword in this example is almost the same as before (using the NIL symbols): 19 · 2 versus 9 · (2 + 2) bits. However, assume we had one more symbol and node; then with NIL-leaves we would need (after the symbol table) 21 · 3 bits, while with the "no-NIL" variant we would only need 10 · (3 + 2) bits for the node sequence.

Variable-Length Codings

We consider two variants of variable-length codings: 1) var-coding, where we reserve the first bit of a byte to indicate whether the following byte belongs to the current symbol, and 2) Huffman coding. While it is known that arithmetic coding can generate shorter codewords than Huffman coding, we do not use it because its output cannot serve as a queriable format.

Byte Alignment

Using encodings that are not aligned at byte boundaries can defeat byte-oriented compressors.
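The fixed-length coding just described can be sketched as follows. This is our own illustration; the symbol table is simplified to a plain label list, and the code assignment (NIL = 0, then the table entries in order) follows the text above.

```python
# Sketch (our illustration): fixed-length coding of a binary tree's
# preorder label sequence. Code 0 is reserved for the NIL symbol and
# code i+1 for the i-th entry of the symbol table; all codes share the
# same bit width.

def fixed_length_bits(tree, symbols):
    """tree: None for a NIL-leaf, or (label, left, right); symbols: the
    symbol table as a list of labels. Returns the concatenated
    fixed-length codes as a bit string."""
    # Bits needed to represent codes 0 .. len(symbols).
    width = max(1, len(symbols).bit_length())
    code = {None: 0}
    code.update({s: i + 1 for i, s in enumerate(symbols)})

    def walk(t):
        if t is None:
            return format(0, "0{}b".format(width))
        label, left, right = t
        return format(code[label], "0{}b".format(width)) + walk(left) + walk(right)

    return walk(tree)
```

Running this on the binary tree TEncode with the table (book, chapter, section) reproduces the 38-bit codeword 01100010110011001100001011001100000000 given above.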
There might be identical subsequences which are hidden by the encoding, because it makes them start at different bit positions within a byte and thus gives rise to different byte sequences. To improve subsequent compression we can pad the bit representation of every symbol with zeros so that it occupies a whole number of bytes.

3 Binary DAGs

A (top-down directed) tree can be conveniently represented by a directed acyclic graph (DAG). The difference between the DAG and the tree is that a node in the DAG may have multiple incoming edges. In the limit, the numbers of nodes and edges in a DAG can be exponentially smaller than in the tree it represents. It is folklore that the unique minimal DAG of a tree (called mu-DAG, or µ-DAG) can be computed in one run through the tree, keeping a hash table of subtrees/DAGs. For real XML data in use, the number of pointers (edges) of the µ-DAG is on average approximately 10% of the number of edges of the original (unranked) XML tree structure [4]. In contrast to those works, we work on DAGs of the binary tree representations of XML tree structures. We represent our DAGs using references: intuitively, every repeated subtree gets a name (or "reference") by which it is referred to when it reoccurs. For instance, the µ-DAG of the binary tree a(a( , ), a( , )) is written as (0 : , 1 : a[0, 0], 2 : a[1, 1]). This bottom-up graph-like notation means that there is a NIL-node "0", an a-node "1" which has node "0" as left and right child, and an a-node "2" with children "1" and "1".

Succinct Bit Representations

In the case of fixed-length coding we store, in addition to the symbol table, the number of bits used per symbol. Coming back to our initial example, the µ-DAG of the binary tree (with NIL-leaves) is (0 : , 1 : section[0, 0], 2 : section[0, 1], 3 : book[chapter[0, chapter[section[0, 2], chapter[2, 0]]], 0]). Here we need (4 − 1) references (the root node is never referenced) and 3 symbols (NIL is now a reference), which means that we need 3 bits per symbol in the fixed-length coding. To store the number 3, we use our var-byte encoding (alternatively, a half-byte var coding might be appropriate, where the first of 4 bits determines whether 4 more bits belong to the symbol, etc.). Apart from the symbol table and the half-byte or byte storing the number of bits per symbol, we need 51 (17 · 3) bits to represent the DAG. Clearly, this uses more bits than the corresponding codeword for the binary tree (51 vs. 38 bits). But for the other tree, which has one more symbol and node, we obtain 51 + 6 bits, which is smaller than the 21 · 3 bits needed for the corresponding binary tree.

The Sharing Threshold

Especially when applying backend compression, we found that µ-DAGs do not give rise to optimal numbers. The reason is that a subtree reference is too expensive if the subtree is small (such as a single node), and that compressors like gzip are more effective at storing short string repetitions. We therefore introduce the weight of a pattern as (frequency of the pattern) · (size of the pattern), and introduce a new pattern only if its weight is larger than a threshold. The same threshold will be used for the BPLEX outputs of the next section. In our experiments we found that even with no backend compression it is not optimal to share everything; a threshold of 14 turned out to be optimal for our test set (using non-aligned Huffman, no-NIL). For gzip backend compression the optimal value was 1000 (with aligned Huffman, no-NIL), while for bzip2 and ppm it was 3000, with var-coding (with NIL) and aligned Huffman (no-NIL), respectively.

4 Binary SLT Grammars

A µ-DAG avoids storing repeated subtrees. Sometimes, however, internal parts of a tree, so-called "tree patterns", are repeated many times; this redundancy cannot be removed by a DAG. As an example, consider two large subtrees which are identical except at one leaf position where they differ. In a DAG, these subtrees cannot be shared because they are different. Sharing graphs [10] are a generalization of DAGs which allow sharing of arbitrary connected subgraphs of a tree (tree patterns). Such sharing graphs can be conveniently represented by "straight-line tree (SLT) grammars" [5]. Just as for DAGs, each shared component gets a reference. A tree pattern is denoted by its internal nodes, plus additional leaf nodes labeled y1, y2, . . . which are filled in from left to right in order to obtain a well-balanced tree. As an example, consider the tree a(b(c), a(b(d), a(b(e), f))). In this tree no subtree is repeated, i.e., it equals its µ-DAG. However, the tree pattern consisting of an a-node with left child b occurs three times. This pattern is denoted by the tree a(b(y1), y2) in our SLT grammar notation. The complete SLT grammar for the tree is (1 : a[b[y1], y2], 2 : 1[c, 1[d, 1[e, f]]]). Notice that references now appear at internal nodes, i.e., a pattern can be instantiated with many different values for its parameter placeholders y1, y2, . . . . For common XML documents, the BPLEX algorithm of [5] produces SLT grammars which require, when represented as pointer structures, approximately one half of the number of pointers of the µ-DAG. As our experiments will show, succinct storage of SLT grammars further improves the compression ratio with respect to DAGs.

Succinct Bit Representations

In addition to what is needed for DAGs, the tree patterns of an SLT grammar contain parameter leaves labeled y1, y2, . . . . However, we know that in a pattern the parameters appear from left to right, each of them exactly once; thus, we only need one new special symbol for parameters. The only other difference to our DAGs from before is that pattern trees are not strictly binary trees anymore, because a reference may have an arbitrary (but fixed) number of children, corresponding to the number of parameters in its definition. Since we encode the grammar bottom-up, starting with a pattern tree that contains no references, followed by one that may only reference the previous pattern, etc., we do not need to explicitly store the number of arguments of the references. In the small example above, using fixed-length coding, we have 6 symbols, in binary 000–101, for the labels a through f, plus the symbol 110 denoting a parameter, plus the symbol 111 for the reference "1". Thus, the SLT grammar (1 : a[b[y1], y2], 2 : 1[c, 1[d, 1[e, f]]]) becomes the bit sequence 000001110110111010111011111100101. Compare these 33 required bits to the 50 bits required for a fixed-length encoding of the tree, using no-NIL coding. Note that [8] already introduced a simple bit coding of SLT grammars. In our experiments we found that a threshold of 14 was optimal on our data set for both the uncompressed and the gzipped version of the BPLEX output, both with non-aligned Huffman (no-NIL). For bzip2 and ppm the optimal thresholds were 10,000 and 30,000, respectively, using aligned Huffman with and without NIL, respectively.

5 Experiments

We compared our results with the following two XML compressors: 1) TREECHOP [12] as a representative of queriable XML compressors (in fact, the only queriable compressor that we could get to run), and 2) XMill [13] as a representative of archival compressors. As far as we know, all queriable formats have worse compression ratios than XMillGzip, and all non-queriable compressors are worse than XMill when compressing tree structures (comparing, of course, XMLPPM with XMillPPM, for instance). XMill supports three alternative backend compressors: gzip, bzip2, and PPM. In our experiments we compared our compression ratios using these three backends independently. We refer to the corresponding versions of XMill as XMillGzip, XMillBzip2, and XMillPPM, respectively, and similarly for BPLEX and DAG. In our experiments we used 26 XML files which represent real-life XML data sets that are commonly used for testing the efficiency of XML compression techniques [1, 2, 3] and cover a wide range of sizes and structures. Our versions of the documents are obtained by removing all data values and keeping only the element nodes (additionally, we replace each text node by a placeholder element node). Table 1 gives the detailed characteristics of the data sets.

Table 1. Characteristics of XML data sets.

Document             Size (KB)   Tags    # Nodes     Depth
1998statistics.xml         717     47       54,581       7
Catalog-01.xml           6,624     51      372,459       9
Catalog-02.xml          65,875     51    3,705,071       9
Dictionary-01.xml        3,481     25      513,574       9
Dictionary-02.xml       34,311     25    5,077,549       9
EnWikiNew.xml            7,834     21      665,825       6
EnWikiQuote.xml          5,034     21      437,682       6
EnWikiSource.xml        21,849     21    1,902,189       6
EnWikiVersity.xml        9,530     21      828,229       6
EnWikTionary.xml       160,373     21   14,520,656       6
EXI-Array.xml            7,156     48      226,524      10
EXI-Factbook.xml         2,087    200       86,581       6
EXI-Invoice.xml            457     53       26,130       8
EXI-Telecomp.xml         5,402     39      177,634       7
EXI-Weblog.xml           2,216     13      178,375       4
JST_gene.xml             7,932     27      388,029       8
JST_snp.xml             24,667     43    1,169,686       9
Lineitem.xml            30,270     19    1,985,776       4
Medline.xml             80,248     79    5,394,921       8
Mondial.xml                409     23       22,423       5
Nasa.xml                 9,958     62      792,467       9
NCBI_gene.xml           13,042     51      645,917       8
NCBI_snp.xml           135,853     16    6,879,757       5
Sprot.xml              206,993     49   21,634,330       7
Treebank.xml            31,450    252    3,843,775      38
USHouse.xml                144     44       11,889      17

Additionally, we tested more than 60 different combinations of the options of our approach, but for the sake of fairness, we chose only one setting for the reported results. Figures 1 and 2 show the results of our experiments, where the values of the compression ratios are normalized with respect to the compression ratio of XMillGzip.
Note that on 7 occasions the TREECHOP ratio is out of the scale of the figure (indicated by an upward line). Our two main comments are: (1) The results in Figure 1 show that the average compression ratio achieved by our approach is consistently better than the compression ratio of XMill (the best known compressor for the structure parts of XML documents) using any of the three alternative backends (gzip, bzip2, ppm). (2) Figures 1 and 2 show that our queriable (uncompressed) BPLEX representation achieves a significant improvement in compression ratio over the compared compressors. In particular, the compression ratio of a queriable BPLEX representation is on average 68% of the size of XMillGzip. Moreover, it is on average around 3 times smaller than the compression ratio of the queriable compressor TREECHOP, and more than 200 times smaller than Succinct DOM [7] (a similar factor is expected for ISX [16], as it uses the same idea of tree structure coding as Succinct DOM).

To summarize our outcomes: DAGs are useful for file compression of XML tree structures, while BPLEX is useful for producing queriable tree structure representations. To the best of our knowledge, our approach yields the smallest queriable XML tree structure representation currently available. Based on all results in the literature, our approach is the first queriable representation which reaches the compression ratio of XMill. We expect that the running time of tree traversal operations will be slower than that of codings such as Succinct DOM. We are currently implementing a DOM interface on top of our queriable BPLEX representation so that we can assess the running times of tree access operations. In parallel, we are working on an XPath evaluator which exploits the repetition of tree patterns in BPLEX outputs and is therefore expected to run fast.

Figure 1. Average compression ratios (TREECHOP, DAG, XMillGzip, BPLEX, DAGGzip, BPLEXGzip, XMillBzip2, DAGBzip2, BPLEXBzip2, XMillPPM, DAGPPM, BPLEXPPM).

Figure 2. Detailed compression ratios per data set (XMillGzip, BPLEXGzip, XMillBzip2, BPLEXBzip2, XMillPPM, BPLEXPPM, BPLEX, TREECHOP).

References

[1] http://www.w3.org/XML/EXI/.
[2] http://www.cs.washington.edu/research/xmldatasets/.
[3] http://download.wikipedia.org/backup-index.html.
[4] P. Buneman, M. Grohe, and C. Koch. Path queries on compressed XML. In VLDB, 2003.
[5] G. Busatto, M. Lohrey, and S. Maneth. Efficient memory representation of XML document trees. To appear in Information Systems.
[6] J. Cheney. Compressing XML with multiplexed hierarchical PPM models. In DCC, 2001.
[7] O. Delpratt, R. Raman, and N. Rahman. Engineering succinct DOM. In EDBT, 2008.
[8] D. K. Fisher and S. Maneth. Structural selectivity estimation for XML documents. In ICDE, 2007.
[9] J. Kieffer, E. Yang, G. Nelson, and P. Cosman. Universal lossless compression via multilevel pattern matching. IEEE Trans. Inform. Theory, 46, 2000.
[10] J. Lamping. An algorithm for optimal lambda calculus reductions. In POPL, 1990.
[11] G. Leighton, J. Diamond, and T. Muldner. AXECHOP: a grammar-based compressor for XML. In DCC, 2005.
[12] G. Leighton, T. Muldner, and J. Diamond. TREECHOP: a tree-based query-able compressor for XML. Technical report, Acadia University, 2005.
[13] H. Liefke and D. Suciu. XMill: an efficient compressor for XML data. In SIGMOD, 2000.
[14] Y. Lin, Y. Zhang, Q. Li, and J. Yang. Supporting efficient query processing on compressed XML files. In SAC, 2005.
[15] P. Tolani and J. Haritsa. XGRIND: a query-friendly XML compressor. In ICDE, 2002.
[16] R. K. Wong, F. Lam, and W. M. Shui. Querying and maintaining a compact XML storage. In WWW, 2007.