To appear in XANTEC'08, DEXA Workshops, Proceedings, IEEE Computer Society, 2008.
XML Tree Structure Compression
Sebastian Maneth, Nikolay Mihaylov, Sherif Sakr
NICTA and Univ. of New South Wales
Sydney, Australia
first.last@nicta.com.au
Abstract
In an XML document a considerable fraction consists
of markup, that is, begin and end-element tags describing
the document’s tree structure. XML compression tools such
as XMill separate the tree structure from the data content
and compress each separately. The main focus in these
compression tools is how to group similar data content together prior to performing standard data compression such
as gzip, bzip2, or ppm. In contrast, the focus of this paper is on compressing the tree structure part of an XML
document. We use a known algorithm to derive a grammar representation of the tree structure which factors out
the repetition of tree patterns. We then investigate several
succinct binary encodings of these grammars. Our experiments show that we can be consistently smaller than the tree
structure compression carried out by XMill, using the same
backend compressors as XMill on our encodings. However,
the most surprising result is that our own Huffman-like encoding of the grammars (without any backend compressor
whatsoever) consistently outperforms XMill with gzip backend. This is of particular interest because our Huffman-like encoding can be queried without prior decompression.
To the best of our knowledge this offers the smallest queriable XML tree structure representation currently available.
1 Introduction
XML is a W3C (World Wide Web Consortium) standard
that has become the main data exchange format of the web.
One of the criticisms of XML is its verbosity: data becomes blurred in a thick soup of bulky markup information. Moreover, the markup tends to be highly repetitive, resulting in
large document files with great redundancy. To overcome
these problems there are efforts to establish a standard binary format for XML; this has been the charter of the W3C
working group “XML Binary Characterization”, now continued by the “W3C Efficient XML Interchange Working
Group”. However, their work is still in progress and no
standard has been agreed upon yet. An alternative to using
a standard binary format is to compress XML documents
using data compression tools.
A considerable part of an XML document consists of its
markup, mainly in the form of nested element tags; these tags induce the document's tree structure. Most XML
compression tools, such as XMill [13], separate the document’s tree structure from its data part, and compress them
separately. In XMill, data can additionally be stored in separate containers, prior to running a backend compressor such
as gzip, bzip2, or ppm, over all containers (tree structure
and data). In this way, similar data items can be grouped
in one container which will improve backend compression.
While much work has dealt with better selection of containers and applying more sophisticated backend compressors,
we are not aware of any work that has tried to improve over
XMill’s tree structure compression. In this paper we want
to deal with this issue: how can we effectively compress the
tree structure of an XML document?
Our starting point comes from pattern-based compression: the BPLEX algorithm [5] takes a tree as input and
searches bottom-up for repeated patterns (that is, connected
subgraphs) in the tree. It then represents repeated patterns
only once, resulting in a compressed pointer-based representation called an SLT grammar. Compared to DAGs, a well-known data structure for efficient pointer-based tree representation, BPLEX's SLT grammars require less than 50% of
the pointers needed for the minimal unique DAG of a given
tree. As far as we know, BPLEX generates the smallest
pointer-based tree representations currently available. Our
next idea was then: can we use these small pointer-based
representations also for file compression? Our initial efforts
in simply applying gzip or bzip2 to the textual SLT grammars were disillusioning: XMill's tree structure compression, which only consists of fixing a symbol table and then outputting one symbol per opening tag and a fixed symbol per closing tag (prior to backend compression), outperformed our zipped grammars by far.
We then started to investigate different ways of coding SLT grammars and DAGs into succinct bit-representations.
We have run extensive experiments with a plethora of different codings and compression runs. Our experience can
be summarized as follows:
– For file compression, DAGs and SLT grammars offer
only limited improvement over XMill: DAGs are on average 73% of the size of XMill (using gzip as backend compressor for both) and SLT grammars produced by BPLEX
are on average 66% of the size of XMill.
– As a queriable format our succinct SLT grammars produce excellent results: we are able to produce a queriable representation of an XML tree structure that is on average 68% of the size of XMill with gzip. This is by far the most space-efficient queriable format for XML tree structures. Queriable DAGs are on average 131% of the size of XMill with gzip. These numbers show that for file compression DAGs are useful, because they improve over XMill and are fairly cheap to obtain (by one run over the tree, keeping a hash table of all subtrees). Running BPLEX is more expensive and therefore better suited for producing queriable in-memory representations.
Related Work XML compressors can be classified
into two main groups: 1) Non-Queriable (Archival) Compressors. This group of compressors focuses on achieving
the highest compression ratio [6, 13, 11]. 2) Queriable XML
Processors. They allow queries to be evaluated on their
compressed formats [12, 14, 15]. Their compression ratio
is usually worse than that of archival XML compressors. Rightly so, since their main objective is to avoid full document decompression during query execution.
The primary innovation in XML compression is XMill's idea of separating structure from data;
most later XML compressors follow this idea. The structural encoding scheme of XMill [13] assigns each distinct
element and attribute name an integer code, which serves
as the key into the element and attribute name dictionaries before passing it to a back-end general text compression scheme. The XMLPPM compressor [6] encodes SAX
events before passing them to a PPM compressor. The AXECHOP compressor [11] uses a byte tokenization scheme
that preserves the original structure of the document and
then uses the MPM compression algorithm [9] to generate a context-free grammar which is then passed through
an adaptive arithmetic coder.
The first queriable compressor, XGrind [15], retains the
original structure of the XML document and encodes element and attribute names using a dictionary; character data is compressed by semi-adaptive Huffman coding.
XSeq [14] is a grammar-based queriable XML compressor; it separates structure and data as XMill does, and then applies the grammar-based string compression algorithm Sequitur to each part separately.
In TREECHOP [12], SAX events are binary coded and written to the compression stream in depth-first order; the codeword for each node is prefixed by its parent's codeword, and two nodes in the XML document tree share the same codeword if they have the same path. Finally, there are Succinct DOM [7] and ISX [16], which use a balanced parentheses encoding scheme to store the XML tree structure.

2 Binary Tree Encodings
The tree structure induced by the nesting of element tags
in an XML document naturally corresponds to an unranked
(ordered) tree. In such a tree, each node can have an arbitrary number of children nodes. As an example, imagine a
tree that consists of a root node labeled “book”, which has
3 children nodes labeled “chapter”, which in turn have 0, 3,
and 2 children nodes labeled “section”, respectively.
It is well-known that any unranked tree can be conveniently represented by a binary tree. One common such representation is the “first-child/next-sibling”-encoding: the
first child of each node of the unranked tree becomes the left
child of the corresponding node of the binary tree, and the
next-sibling of each unranked node becomes the right child
of the binary node. Any “missing” left or right children
in the binary tree are filled with NIL-leaves (for which we will use the label "_"). The encoding of the aforementioned unranked tree is the tree (written as an expression): TEncode = book(chapter(_, chapter(section(_, section(_, section(_, _))), chapter(section(_, section(_, _)), _))), _)
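To make this encoding concrete, here is a small sketch in Python (our own illustration, not code from the paper; the tuple representation is an assumption). It builds the first-child/next-sibling encoding and prints it in the expression notation used above, writing "_" for NIL-leaves.

# Sketch only: an unranked node is (label, [children]); a binary node is
# (label, left, right) with None standing for a NIL-leaf.

def fcns(siblings):
    """Encode a list of sibling nodes: left = first child, right = next sibling."""
    if not siblings:
        return None
    (label, children), rest = siblings[0], siblings[1:]
    return (label, fcns(children), fcns(rest))

def show(node):
    """Write the binary tree as an expression, using '_' for NIL-leaves."""
    if node is None:
        return "_"
    label, left, right = node
    return "{}({},{})".format(label, show(left), show(right))

# the example: a "book" root whose three "chapter" children have 0, 3 and 2 "section" children
book = ("book", [("chapter", []),
                 ("chapter", [("section", [])] * 3),
                 ("chapter", [("section", [])] * 2)])

print(show(fcns([book])))
# book(chapter(_,chapter(section(_,section(_,section(_,_))),chapter(section(_,section(_,_)),_))),_)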
Succinct Bit Representations  We now want to store a binary tree such as the one shown before in a succinct bitwise representation. We will assume one a priori fixed leaf symbol, denoted "_", which represents the NIL-tree. Every other symbol is binary.
Fixed-Length Encoding  The most direct way of bit-encoding the sequence of labels that determines our binary
tree is to use a symbol table and then to assign to each
symbol in the table a binary code of fixed length. The
sequence of tree node labels is then encoded by simply
concatenating the codes of the symbols. Our special symbol "_" is coded by the fixed-length expansion of the number zero, the first symbol in the symbol table by the number one in fixed binary, and so on. The symbol table itself is simply a sequence of 0-byte terminated strings, followed by an additional 0-byte to indicate the end of the symbol table: B = book0chapter0section00. Thus, the final codewords are assigned as follows: _=00, book=01, chapter=10, section=11. If we apply this coding to the tree encoding TEncode, we obtain: 01100010110011001100001011001100000000.
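As an illustration of this layout (a sketch under our own assumptions about representation; only the preorder traversal and the code assignment are taken from the text), the following reproduces both the symbol table bytes and the 38-bit code for TEncode.

# Sketch only: a binary tree is a (label, left, right) tuple; None is the NIL-leaf.
T_ENCODE = ("book",
            ("chapter", None,
             ("chapter",
              ("section", None, ("section", None, ("section", None, None))),
              ("chapter", ("section", None, ("section", None, None)), None))),
            None)

def symbol_table(labels):
    # 0-byte terminated strings plus a final 0-byte: b'book\x00chapter\x00section\x00\x00'
    return b"".join(l.encode() + b"\0" for l in labels) + b"\0"

def fixed_bits(node, code, width):
    # preorder concatenation of fixed-length codes; the NIL-leaf gets code 0
    if node is None:
        return format(0, "0{}b".format(width))
    label, left, right = node
    return (format(code[label], "0{}b".format(width))
            + fixed_bits(left, code, width) + fixed_bits(right, code, width))

print(symbol_table(["book", "chapter", "section"]))
print(fixed_bits(T_ENCODE, {"book": 1, "chapter": 2, "section": 3}, 2))
# 01100010110011001100001011001100000000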
Avoiding NIL-Leaves It seems quite wasteful, at first
sight, to use that many NIL-nodes in our binary tree; after all, the original (unranked) XML tree only has 9 nodes,
while our binary tree for it has 19 nodes: 9 internal nodes
and 10 leaves labeled by "_". We can avoid the NIL-leaves by introducing 4 different versions of each symbol: (1) one for a binary node (let us denote it by the 2-bit prefix "11"), (2) one for a (unary) node representing a binary node with left child NIL (denoted by prefix "01"), (3) one for a (unary) node representing a binary node with right child NIL (denoted by prefix "10"), and (4) one for a leaf node representing a binary node with left and right child NIL (denoted by prefix "00"). If we apply this to our example, then the bit-sequence after the symbol table starts with 10010110... because the root node has right child NIL and thus we start with 10, followed by the code 01 for book, etc. The length of the final codeword in this example is almost the same as that of the one before (using the NIL symbols): 19 · 2 versus 9 · (2 + 2) bits.
However, assume we have one more symbol and node; then
with NIL-leaves we need (after the symbol table) 21 · 3 bits
while with the “no-NIL” variant we only need 10 · (3 + 2)
bits for the node sequence.
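Under the prefix conventions just listed, a small sketch (ours, reusing the tuple representation from the previous sketch) emits the node-type prefix followed by the symbol code for every internal node in preorder.

# Sketch only: prefixes are 11 = both children present, 01 = left child NIL,
# 10 = right child NIL, 00 = both children NIL.
T_ENCODE = ("book",
            ("chapter", None,
             ("chapter",
              ("section", None, ("section", None, ("section", None, None))),
              ("chapter", ("section", None, ("section", None, None)), None))),
            None)

def nonil_bits(node, code, width):
    if node is None:
        return ""                       # NIL-leaves are not emitted at all
    label, left, right = node
    prefix = ("1" if left is not None else "0") + ("1" if right is not None else "0")
    return (prefix + format(code[label], "0{}b".format(width))
            + nonil_bits(left, code, width) + nonil_bits(right, code, width))

bits = nonil_bits(T_ENCODE, {"book": 1, "chapter": 2, "section": 3}, 2)
print(bits[:8], len(bits))              # 10010110 36  (9 nodes at 2 + 2 bits each)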
Variable-Length Codings We consider two variants
of variable-length codings: 1) var-coding, where we reserve the first bit of each byte to indicate whether the following byte still belongs to the current symbol, and 2) Huffman coding. While it is known that arithmetic coding can generate shorter codewords than Huffman coding, we do not wish to use it because its output cannot be used as a queriable format.
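A minimal sketch of the var-coding idea follows; the exact bit convention (the high bit of a byte acting as the continuation flag, with 7-bit groups stored low-to-high) is our assumption.

def var_encode(n):
    """Encode a non-negative symbol number; a set high bit means 'another byte follows'."""
    groups = []
    while True:
        groups.append(n & 0x7F)
        n >>= 7
        if n == 0:
            break
    return bytes(g | 0x80 for g in groups[:-1]) + bytes([groups[-1]])

def var_decode(data):
    """Decode a stream of var-coded symbol numbers."""
    numbers, current, shift = [], 0, 0
    for b in data:
        current |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            numbers.append(current)
            current, shift = 0, 0
    return numbers

print(var_encode(5), var_encode(300))                 # b'\x05' b'\xac\x02'
print(var_decode(var_encode(5) + var_encode(300)))    # [5, 300]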
Byte Alignment  Using encodings that are not aligned at byte boundaries can hamper byte-oriented compressors: identical subsequences may be hidden by the encoding, because they start at different bit positions within a byte and thus give rise to different byte sequences. To improve the subsequent compression we can pad the bit representation of every symbol with zeros so that it occupies a whole number of bytes.
3 Binary DAGs

A (top-down directed) tree can be conveniently represented by a directed acyclic graph (DAG). The difference between the DAG and the tree is that a node in the DAG may have multiple incoming edges. In the extreme case, the numbers of nodes and edges in a DAG can be exponentially smaller than in the tree it represents. It is folklore that the unique minimal DAG of a tree (called mu-DAG, or µ-DAG) can be computed in one run through the tree, keeping a hash table of subtrees/DAGs. For real XML data in practical use, the number of pointers (edges) of the µ-DAG is on average approximately 10% of the number of edges of the original (unranked) XML tree structure [4]. In contrast to this, we work on DAGs of the binary tree representation of the XML tree structure. We represent our DAGs using references: intuitively, every repeated subtree gets a name (or "reference") by which it is referred to when it reoccurs. For instance, the µ-DAG of the binary tree a(a(_, _), a(_, _)) is written as (0 : _, 1 : a[0, 0], 2 : a[1, 1]). This bottom-up graph-like notation means that there is a node "0" labeled by "_", an a-node "1" which has the node "0" as left and right child, and an a-node "2" with children "1" and "1".

Succinct Bit Representations  In the case of fixed-length coding we store, in addition to the symbol table, the number of bits used per symbol. Coming back to our initial example, the µ-DAG of the binary tree (with NIL-leaves) is (0 : _, 1 : section[0, 0], 2 : section[0, 1], 3 : book[chapter[0, chapter[section[0, 2], chapter[2, 0]]], 0]). Here we need 4 − 1 references (the root node is never referenced) and 3 symbols ("_" is now a reference), which means that we need 3 bits per symbol in the fixed-length coding. To store the number 3, we use our var-byte encoding (alternatively, a half-byte-var coding might be appropriate, where the first of 4 bits determines whether 4 more bits belong to the symbol, etc.). Apart from the symbol table and the half-byte or byte storing the number of bits per symbol, we need 51 (17 · 3) bits to represent the DAG. Clearly, this uses more bits than the corresponding codeword for the binary tree (51 vs. 38 bits). But for the other tree, which has one more symbol and node, we obtain 51 + 6 bits, which is smaller than the number of bits needed for the corresponding binary tree (21 · 3 = 63 bits).
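The folklore hash-table construction mentioned above is easy to sketch (our own illustration; the tuple representation and reference numbering are assumptions). Every distinct subtree of the binary tree is stored once and receives a numeric reference, with reference 0 reserved for the NIL-leaf.

def mu_dag(node, table=None, entries=None):
    """Return (reference, entries) where entries[i] = (label, left_ref, right_ref)."""
    if table is None:
        table, entries = {}, [("_", None, None)]   # entry 0 stands for the NIL-leaf
    if node is None:
        return 0, entries
    label, left, right = node
    l, _ = mu_dag(left, table, entries)
    r, _ = mu_dag(right, table, entries)
    key = (label, l, r)
    if key not in table:                            # first occurrence: new entry
        table[key] = len(entries)
        entries.append(key)
    return table[key], entries                      # repeated subtrees reuse the entry

root, entries = mu_dag(("a", ("a", None, None), ("a", None, None)))
print(root, entries)
# 2 [('_', None, None), ('a', 0, 0), ('a', 1, 1)]   i.e. (0: _, 1: a[0,0], 2: a[1,1])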
The Sharing Threshold  Especially when applying backend compression, we found that µ-DAGs do not give rise to optimal numbers. The reason is that a subtree reference is too expensive if the subtree is small (such as a single node), and compressors like gzip are more effective at storing short string repetitions. We therefore introduce the weight of a pattern as (frequency of the pattern) · (size of the pattern), and introduce a new pattern only if its weight is larger than a threshold. The same threshold is used for the BPLEX outputs of the next section.

In our experiments we found that even with no backend compression it is not optimal to share everything; a threshold of 14 turned out to be optimal for our test set (using non-aligned Huffman, no-NIL). For gzip backend compression the optimal value was 1000 (with aligned Huffman, no-NIL), while for bzip2 and ppm it was 3000, with var-coding (with NIL) and aligned Huffman (no-NIL), respectively.
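The decision rule itself is a one-liner; the following sketch (with made-up example numbers) states it explicitly.

def worth_sharing(frequency, size, threshold):
    """Introduce a new shared pattern only if frequency * size exceeds the threshold."""
    return frequency * size > threshold

# a pattern of 3 nodes occurring 6 times is shared at threshold 14, but not at 1000
print(worth_sharing(6, 3, 14), worth_sharing(6, 3, 1000))   # True False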
4 Binary SLT Grammars
A µ-DAG avoids storing repeated subtrees. Sometimes, however, internal parts of a tree, so-called "tree patterns", are repeated many times; this redundancy cannot be removed by a DAG. As an example, consider two large subtrees which are identical except at one leaf position where they differ. In a DAG, these subtrees cannot be shared because they are different.
Table 1. Characteristics of XML data sets.

Document              Size (KB)   Tags     # Nodes   Depth
1998statistics.xml          717     47      54,581       7
Catalog-01.xml            6,624     51     372,459       9
Catalog-02.xml           65,875     51   3,705,071       9
Dictionary-01.xml         3,481     25     513,574       9
Dictionary-02.xml        34,311     25   5,077,549       9
EnWikiNew.xml             7,834     21     665,825       6
EnWikiQuote.xml           5,034     21     437,682       6
EnWikiSource.xml         21,849     21   1,902,189       6
EnWikiVersity.xml         9,530     21     828,229       6
EnWikTionary.xml        160,373     21  14,520,656       6
EXI-Array.xml             7,156     48     226,524      10
EXI-Factbook.xml          2,087    200      86,581       6
EXI-Invoice.xml             457     53      26,130       8
EXI-Telecomp.xml          5,402     39     177,634       7
EXI-Weblog.xml            2,216     13     178,375       4
JST_gene.xml              7,932     27     388,029       8
JST_snp.xml              24,667     43   1,169,686       9
Lineitem.xml             30,270     19   1,985,776       4
Medline.xml              80,248     79   5,394,921       8
Mondial.xml                 409     23      22,423       5
Nasa.xml                  9,958     62     792,467       9
NCBI_gene.xml            13,042     51     645,917       8
NCBI_snp.xml            135,853     16   6,879,757       5
Sprot.xml               206,993     49  21,634,330       7
Treebank.xml             31,450    252   3,843,775      38
USHouse.xml                 144     44      11,889      17
Sharing graphs [10] are a generalization of DAGs which allows sharing of arbitrary connected subgraphs of a tree (tree patterns). Such sharing graphs can be conveniently represented by "straight-line tree (SLT) grammars" [5]. Just as for DAGs, each shared component gets a reference. A tree pattern is denoted by its internal nodes, plus additional leaf nodes labeled y1, y2, . . ., which are filled in from left to right in order to obtain a well-balanced tree. As an example, consider the tree a(b(c), a(b(d), a(b(e), f))). In this tree no subtree is repeated, i.e., the tree equals its µ-DAG. However, the tree pattern consisting of an a-node with left child b occurs three times. This pattern is denoted by the tree a(b(y1), y2) in our SLT grammar notation. The complete SLT grammar for the tree is (1 : a[b[y1], y2], 2 : 1[c, 1[d, 1[e, f]]]). Notice that references now appear at internal nodes, i.e., a pattern can be instantiated with many different values for its parameter placeholders y1, y2, . . .

For common XML documents, the BPLEX algorithm of [5] produces SLT grammars which require, when represented as pointer structures, approximately half of the number of pointers of the µ-DAG. As our experiments will show, succinct storage of SLT grammars further reduces the compression ratio compared to DAGs.

Succinct Bit Representations  In addition to what DAGs contain, the tree patterns of an SLT grammar contain parameter leaves labeled y1, y2, . . . . However, we know that in a pattern the parameters appear from left to right, and each of them exactly once; thus, we only need one new special symbol for parameters. The only other difference to our DAGs from before is that pattern trees are no longer strictly binary trees, because a reference may have an arbitrary (but fixed) number of children, corresponding to the number of parameters in its definition. Since we encode the grammar bottom-up, starting with a pattern tree that contains no references, followed by one that may only reference the previous pattern, etc., we do not need to explicitly store the number of arguments of the references. In the small example above, using fixed-length coding, we have 6 symbols, in binary 000–101, for the labels a through f, plus the symbol 110 denoting a parameter, plus the symbol 111 for the reference "1". Thus, the SLT grammar (1 : a[b[y1], y2], 2 : 1[c, 1[d, 1[e, f]]]) becomes the bit-sequence 000001110110111010111011111100101. Compare these 33 bits to the 50 bits required for a fixed-length encoding of the tree using no-NIL coding. Note that [8] already introduced a simple bit coding of SLT grammars.

In our experiments we found that a threshold of 14 was optimal on our data set for the uncompressed and gzipped versions of the BPLEX output, both with non-aligned Huffman (no-NIL). For gzip and ppm the optimal thresholds were 10,000 and 30,000, respectively, using aligned Huffman with and without NIL, respectively.
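To make the bit layout concrete, the following sketch (our own illustration; only the preorder listing and the code assignment, 000 through 101 for the labels a through f, 110 for parameters, and 111 for the reference, are taken from the example above) reproduces the 33-bit sequence.

# Sketch only: a pattern-tree node is (symbol, [children]); symbols are labels,
# "y" for a parameter leaf, or "ref" for the reference to pattern 1.
def slt_bits(patterns, code, width):
    def enc(node):
        symbol, children = node
        return format(code[symbol], "0{}b".format(width)) + "".join(enc(c) for c in children)
    return "".join(enc(p) for p in patterns)        # patterns are listed bottom-up

p1 = ("a", [("b", [("y", [])]), ("y", [])])                 # a(b(y1), y2)
p2 = ("ref", [("c", []), ("ref", [("d", []),
      ("ref", [("e", []), ("f", [])])])])                   # 1(c, 1(d, 1(e, f)))

code = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4, "f": 5, "y": 6, "ref": 7}
print(slt_bits([p1, p2], code, 3))
# 000001110110111010111011111100101  (33 bits)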
5 Experiments
We compared our results with the following two XML compressors: 1) TREECHOP [12], as a representative of queriable XML compressors (in fact, the only queriable compressor that we could get to run), and 2) XMill [13], as a representative of archival compressors. As far as we know, all queriable formats have worse compression ratios than XMillGzip, and all non-queriable compressors are worse than XMill when compressing tree structures (comparing, of course, XMLPPM with XMillPPM, for instance). XMill supports three alternative backend compressors: gzip, bzip2, and PPM. In our experiments we compared our compression ratios using these three backends independently. We refer to the corresponding versions of XMill as XMillGzip, XMillBzip2, and XMillPPM, respectively, and similarly for BPLEX and DAG.

In our experiments we used 26 XML files which represent real-life XML data sets that are commonly used for testing the efficiency of XML compression techniques [1, 2, 3] and cover a wide range of sizes and structures. Our versions of the documents are obtained by removing all data values and keeping only the element nodes (additionally, we replace each text node by a placeholder element node). Table 1 lists the characteristics of the data sets.

Additionally, we tested more than 60 different combinations of the options of our approach, but for the sake of fairness we chose only one setting for the reported results. Figures 1 and 2 show the results of our experiments, where the values of the compression ratios are normalized with respect to the compression ratio of XMillGzip. Note that on 7 occasions the TREECHOP ratio is out of the scale of the figure (indicated by an upgoing line). Our two main observations are: (1) The results in Figure 1 show that the average compression ratio achieved by our approach is consistently better than that of XMill (the best known compressor for the structure parts of XML documents) using any of the three alternative backends (gzip, bzip2, ppm). (2) Figures 1 and 2 show that our queriable (uncompressed) BPLEX representation achieves a significant improvement in compression ratio over the compared compressors. In particular, the compression ratio of a queriable BPLEX representation is on average 68% of that of XMillGzip. Moreover, it is on average around 3 times smaller than the compression ratio of the TREECHOP queriable compressor, and more than 200 times smaller than that of Succinct DOM [7] (a similar factor is expected for ISX [16], as it uses the same idea of tree structure coding as Succinct DOM).

To summarize our outcomes, DAGs are useful for file compression of XML tree structures, while BPLEX is useful for producing queriable tree structure representations. To the best of our knowledge, our approach yields the smallest queriable XML tree structure representation currently available. Based on all results in the literature, our approach is the first queriable representation which matches the compression ratio of XMill. We expect that the running time of tree traversal operations will be slower than that of codings such as Succinct DOM. We are currently implementing a DOM interface to our queriable BPLEX representation so that we can assess the running times of tree access operations. In parallel, we are working on an XPath evaluator which makes use of the repetition of tree patterns in BPLEX outputs and is therefore expected to run fast.
Figure 1. Average compression ratios, normalized to XMillGzip, for TREECHOP, DAG, XMillGzip, BPLEX, DAGGzip, BPLEXGzip, XMillBzip2, DAGBzip2, BPLEXBzip2, XMillPPM, DAGPPM, and BPLEXPPM.
Figure 2. Detailed compression ratios, normalized to XMillGzip, for XMillGzip, BPLEXGzip, XMillBzip2, BPLEXBzip2, XMillPPM, BPLEXPPM, BPLEX, and TREECHOP on each of the 26 XML data sets of Table 1.
References
[1] http://www.w3.org/XML/EXI/.
[2] http://www.cs.washington.edu/research/xmldatasets/.
[3] http://download.wikipedia.org/backup-index.html.
[4] P. Buneman, M. Grohe, and C. Koch. Path queries on compressed XML. In VLDB, 2003.
[5] G. Busatto, M. Lohrey, and S. Maneth. Efficient memory representation of XML document trees. To appear in Information Systems.
[6] J. Cheney. Compressing XML with multiplexed hierarchical PPM models. In DCC, 2001.
[7] O. Delpratt, R. Raman, and N. Rahman. Engineering succinct DOM. In EDBT, 2008.
[8] D. K. Fisher and S. Maneth. Structural selectivity estimation for XML documents. In ICDE, 2007.
[9] J. Kieffer, E. Yang, G. Nelson, and P. Cosman. Universal lossless compression via multilevel pattern matching. IEEE Trans. Inform. Theory, 46, 2000.
[10] J. Lamping. An algorithm for optimal lambda calculus reductions. In POPL, 1990.
[11] G. Leighton, J. Diamond, and T. Muldner. AXECHOP: A
Grammar-based Compressor for XML. In DCC, 2005.
[12] G. Leighton, T. Muldner, and J. Diamond. TREECHOP: a
tree-based query-able compressor for XML. Technical report, Acadia University, 2005.
[13] H. Liefke and D. Suciu. XMill: An efficient compressor for
XML data. In SIGMOD, 2000.
[14] Y. Lin, Y. Zhang, Q. Li, and J. Yang. Supporting efficient
query processing on compressed XML files. In SAC, 2005.
[15] P. Tolani and J. Haritsa. XGRIND: A Query-Friendly XML
Compressor. In ICDE, 2002.
[16] R. K. Wong, F. Lam, and W. M. Shui. Querying and maintaining a compact XML storage. In WWW, 2007.