DOI: 10.1145/1135777.1135891
Article

Compressing and searching XML data via two zips

Published: 23 May 2006

Abstract

XML is fast becoming the standard format for storing, exchanging and publishing data over the web, and is increasingly embedded in applications. Two challenges in handling XML are its size (the XML representation of a document is significantly larger than its native state) and the complexity of its search (XML search involves path and content searches on labeled tree structures). We address the basic problems of compression, navigation and searching of XML documents. In particular, we adopt recently proposed theoretical algorithms [11] for succinct tree representations to design and implement a compressed index for XML, called XBZIPiNDEX, in which the XML document is maintained in a highly compressed format, and both navigation and searching can be done by uncompressing only a tiny fraction of the data. This solution relies on compressing and indexing two arrays derived from the XML data. With detailed experiments we compare it with other compressed XML indexing and searching engines and show that XBZIPiNDEX achieves a compression ratio up to 35% better than those tools, and that its time performance on some path and content search operations is orders of magnitude faster: a few milliseconds over hundreds of MBs of XML files versus tens of seconds, on standard XML data sources.
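
The abstract does not say how the two arrays are derived. As a rough illustration only, the sketch below assumes they correspond to the two arrays of the XBW transform described in [11]: one holding node labels and one flagging whether a node is the last child of its parent, both listed in the stable sorted order of each node's upward path (the labels from its parent up to the root). The names `Node` and `xbw_arrays`, and the choice to flag the root as "last", are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A labeled ordered-tree node (hypothetical helper type)."""
    label: str
    children: list = field(default_factory=list)


def xbw_arrays(root):
    """Return (s_last, s_alpha) for the labeled tree rooted at `root`.

    Sketch of an XBW-style transform, assuming:
      - rows are collected in preorder,
      - rows are stably sorted by the upward path (parent label first),
      - the root is flagged as a last child (an arbitrary convention here).
    """
    rows = []  # each row: (upward_path, is_last_child, label)

    def visit(node, upward_path, is_last):
        rows.append((upward_path, is_last, node.label))
        # Children see this node's label prepended to the upward path.
        child_path = (node.label,) + upward_path
        last_index = len(node.children) - 1
        for i, child in enumerate(node.children):
            visit(child, child_path, i == last_index)

    visit(root, (), True)

    # Python's sort is stable, so nodes with equal paths keep preorder order.
    rows.sort(key=lambda row: row[0])

    s_last = [int(is_last) for _, is_last, _ in rows]
    s_alpha = [label for _, _, label in rows]
    return s_last, s_alpha


if __name__ == "__main__":
    tree = Node("A", [Node("B", [Node("D"), Node("E")]), Node("C")])
    s_last, s_alpha = xbw_arrays(tree)
    print(s_last)
    print(s_alpha)
```

Grouping nodes by upward path places the children of similarly labeled ancestors next to each other, which is what makes such arrays amenable to compression; the actual compression and indexing steps of XBZIPiNDEX are not shown here.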

References

[1] http://xml.coverpages.org/xml.html.
[2] J. Adiego, P. de la Fuente, and G. Navarro. Lempel-Ziv compression of structured text. In IEEE Data Compression Conference, 2004.
[3] J. Adiego, P. de la Fuente, and G. Navarro. Merging prediction by partial matching with structural contexts model. In IEEE Data Compression Conference, page 522, 2004.
[4] A. Arion, A. Bonifati, G. Costa, S. D'Aguanno, I. Manolescu, and A. Pugliese. XQueC: pushing queries to compressed XML data. In VLDB, 2003.
[5] D. Benoit, E. Demaine, I. Munro, R. Raman, V. Raman, and S. Rao. Representing trees of higher degree. Algorithmica, 2005.
[6] B. Catania, A. Maddalena, and A. Vakali. XML document indexes: a classification. In IEEE Internet Computing, pages 64-71, September-October 2005.
[7] T. Chen, J. Lu, and T. W. Lin. On boosting holism in XML twig pattern matching using structural indexing techniques. In ACM SIGMOD, pages 455-466, 2005.
[8] J. Cheney. Compressing XML with multiplexed hierarchical PPM models. In IEEE Data Compression Conference, pages 163-172, 2001.
[9] J. Cheney. An empirical evaluation of simple DTD-conscious compression techniques. In WebDB, 2005.
[10] J. Cheng and W. Ng. XQzip: Querying compressed XML using structural indexing. In International Conference on Extending Database Technology, pages 219-236, 2004.
[11] P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In IEEE FOCS, pages 184-193, 2005.
[12] P. Ferragina and G. Manzini. An experimental study of a compressed index. Information Sciences, 135:13-28, 2001.
[13] P. Ferragina and G. Manzini. Indexing compressed text. Journal of the ACM, 52(4):552-581, 2005.
[14] R. F. Geary, R. Raman, and V. Raman. Succinct ordinal trees with level-ancestor queries. In ACM-SIAM SODA, 2004.
[15] D. Geer. Will binary XML speed network traffic? IEEE Computer, pages 16-18, April 2005.
[16] R. Goldman and J. Widom. Dataguides: enabling query formulation and optimization in semistructured databases. In VLDB, pages 436-445, 1997.
[17] A. Golinsky, I. Munro, and S. Rao. Rank/Select operations on large alphabets: a tool for text indexing. In ACM-SIAM SODA, 2006.
[18] R. Kaushik, R. Krishnamurthy, J. Naughton, and R. Ramakrishnan. On the integration of structure indexes and inverted lists. In ACM SIGMOD, pages 779-790, 2004.
[19] R. Kaushik, R. Krishnamurthy, J. F. Naughton, and R. Ramakrishnan. On the integration of structure indexes and inverted lists. In ACM SIGMOD, 2004.
[20] W. Y. Lam, W. Ng, P. T. Wood, and M. Levene. XCQ: XML compression and querying system. In WWW, 2003.
[21] H. Liefke and D. Suciu. XMILL: An efficient compressor for XML data. In ACM SIGMOD, pages 153-164, 2000.
[22] T. Milo and D. Suciu. Index structures for path expressions. In ICDT, pages 277-295, 1999.
[23] Jun-Ki Min, Myung-Jae Park, and Chin-Wan Chung. Xpress: A queriable compression for XML data. In ACM SIGMOD, pages 122-133, 2003.
[24] E. Moura, G. Navarro, N. Ziviani, and R. Baeza-Yates. Fast and flexible word searching on compressed text. ACM Transactions on Information Systems, 18(2):113-139, 2000.
[25] P. R. Raw and B. Moon. PRIX: Indexing and querying XML using Prüfer sequences. In ICDE, pages 288-300, 2004.
[26] D. Shkarin. PPM: One step to practicality. In IEEE Data Compression Conference, pages 202-211, 2002.
[27] P. M. Tolani and J. R. Haritsa. XGRIND: A query-friendly XML compressor. In ICDE, pages 225-234, 2002.
[28] H. Wang, S. Park, W. Fan, and P. S. Yu. ViST: a dynamic index method for querying XML data by tree structures. In ACM SIGMOD, pages 110-121, 2003.
[29] W. Wang, H. Wang, H. Lu, H. Jang, X. Lin, and J. Li. Efficient processing of XML path queries using the disk-based F&B index. In VLDB, pages 145-156, 2005.

Published In

WWW '06: Proceedings of the 15th international conference on World Wide Web
May 2006
1102 pages
ISBN:1595933239
DOI:10.1145/1135777
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. XML compression and indexing
  2. labeled trees

Qualifiers

  • Article

Conference

WWW06
Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
