Research article
DOI: 10.1145/2063576.2063790

Semi-indexing semi-structured data in tiny space

Published: 24 October 2011

Abstract

Semi-structured textual formats are gaining increasing popularity for the storage of document collections and rich logs. Their flexibility comes at the cost of having to load and parse a document entirely even if just a small part of it needs to be accessed. For instance, in data analytics massive collections are usually scanned sequentially, selecting a small number of attributes from each document. We propose a technique to attach to a raw, unparsed document (even in compressed form) a "semi-index": a succinct data structure that supports operations on the document tree at speed comparable with an in-memory deserialized object, thus bridging textual formats with binary formats. After describing the general technique, we focus on the JSON format: our experiments show that avoiding the full loading and parsing step can give speedups of up to 12 times for on-disk documents using a small space overhead.
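
The idea of navigating a raw JSON document through a small side structure, rather than parsing it in full, can be illustrated with a minimal sketch. It is not the paper's encoding: the abstract describes a succinct data structure, whereas this sketch stores bracket matches in an ordinary Python dict for readability. The function names (build_semi_index, get_field) and the sample document are hypothetical and chosen only for this example.

```python
import json

def build_semi_index(text):
    """One pass over the raw text: map each '{' or '[' outside a string
    literal to the offset of its matching closing bracket."""
    close_of, stack = {}, []
    in_string = escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(i)
        elif ch in "}]":
            close_of[stack.pop()] = i
    return close_of

def _skip_ws(text, i):
    while text[i] in " \t\r\n":
        i += 1
    return i

def _string_end(text, i):
    """text[i] is an opening quote; return the offset just past the closing quote."""
    i += 1
    while text[i] != '"':
        i += 2 if text[i] == "\\" else 1  # step over escaped characters
    return i + 1

def _value_end(text, i, close_of):
    """Return the offset just past the JSON value that starts at text[i]."""
    if text[i] in "{[":
        return close_of[i] + 1            # skip the whole subtree in one jump
    if text[i] == '"':
        return _string_end(text, i)
    while text[i] not in ",}] \t\r\n":    # number, true, false or null
        i += 1
    return i

def get_field(text, close_of, obj_start, key):
    """Return the (start, end) offsets of the value stored under `key` in the
    object that opens at obj_start, or None if the key is absent."""
    assert text[obj_start] == "{"
    i = _skip_ws(text, obj_start + 1)
    while text[i] != "}":
        key_end = _string_end(text, i)
        this_key = json.loads(text[i:key_end])
        i = _skip_ws(text, key_end)       # now at ':'
        i = _skip_ws(text, i + 1)         # now at the first character of the value
        end = _value_end(text, i, close_of)
        if this_key == key:
            return i, end
        i = _skip_ws(text, end)
        if text[i] == ",":
            i = _skip_ws(text, i + 1)
    return None

# Hypothetical document: only the selected attribute is ever deserialized.
doc = '{"id": 7, "tags": ["a", "b{"], "user": {"name": "Ann", "age": 30}}'
close_of = build_semi_index(doc)
user_start, _ = get_field(doc, close_of, 0, "user")
age_start, age_end = get_field(doc, close_of, user_start, "age")
print(json.loads(doc[age_start:age_end]))  # prints 30
```

In this sketch the index is built once per document and can be kept alongside the raw text. Looking up a field then skips nested arrays and objects in a single jump, and only the selected slice is passed to a full parser.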

Published In

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
October 2011
2712 pages
ISBN: 9781450307178
DOI: 10.1145/2063576
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. semi-index
  2. semi-structured data
  3. succinct data structures

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2019) Optimizing partitioning strategies for faster inverted index compression. Frontiers of Computer Science: Selected Publications from Chinese Universities, 13:2 (343-356). DOI: 10.1007/s11704-016-6252-5. Online publication date: 17-May-2019
  • (2016) Efficient Querying Distributed Big-XML Data using MapReduce. International Journal of Grid and High Performance Computing, 8:3 (70-79). DOI: 10.4018/IJGHPC.2016070105. Online publication date: 1-Jul-2016
  • (2016) SJSON: A succinct representation for JavaScript object notation documents. 2016 Eleventh International Conference on Digital Information Management (ICDIM) (173-178). DOI: 10.1109/ICDIM.2016.7829787. Online publication date: Sep-2016
  • (2016) High-performance XML modeling of parallel queries based on MapReduce framework. Cluster Computing, 19:4 (1975-1986). DOI: 10.1007/s10586-016-0628-z. Online publication date: 1-Dec-2016
  • (2013) Performance evaluation of unstructured NoSQL data over distributed framework. 2013 International Conference on Advances in Computing, Communications and Informatics (ICACCI) (1623-1627). DOI: 10.1109/ICACCI.2013.6637424. Online publication date: Aug-2013
  • (2013) Design of Practical Succinct Data Structures for Large Data Collections. Experimental Algorithms (5-17). DOI: 10.1007/978-3-642-38527-8_3. Online publication date: 2013
