Studying the XML Web: Gathering Statistics from an XML Sample

Authors:

Denilson Barbosa,

Laurent Mignet,

Pierangelo Veltri

World Wide Web, Volume 8, Issue 4

Pages 413 - 438

https://doi.org/10.1007/s11280-005-1544-y

Published: 01 December 2005

Abstract

XML has emerged as the language for exchanging data on the web and has attracted considerable interest both in industry and in academia. Nevertheless, to date, little is known about the XML documents published on the web. This paper presents a comprehensive analysis of a sample of about 200,000 XML documents on the web, and is the first study of its kind. We study the distribution of XML documents across the web in several ways; moreover, we provide a detailed characterization of the structure of real XML documents. Our results provide valuable input to the design of algorithms, tools and systems that use XML in one form or another.

Reviews

Reviewer: George R. Mayforth

According to the authors, "little is known about the XML documents published on the Web." This provides the motivation for the analysis presented in this paper, which examines "the distribution of XML documents across the Web in several ways," and presents "a detailed characterization of the structure of real XML documents. [The authors'] results provide valuable input to the design of algorithms, tools and systems that use XML in one form or another." One cannot argue with this characterization of the paper, but, unlike Hypertext Markup Language (HTML), the use of which is well known and understood, Extensible Markup Language (XML) represents an infrastructure element of the Web, which can describe almost any type of data. This adds the need for context, leading one to conclude that an analysis of XML would be greatly enhanced by relating the use of the documents to their structure. The authors state that their analysis is the first of its kind, which leaves us reluctant to be critical of their goals for the work, but reading the paper raises other such questions. For example, consider the universe of documents the authors studied: they analyzed about 200,000 XML documents, from a total population that has no reliable size estimate, because obtaining such an estimate is "intrinsically difficult." What does this mean with regard to the generality of the analysis? How representative are these documents of the whole universe of them? The authors "hope that [their] results will provide valuable insight for guiding the development of algorithms, tools and systems that process XML in one form or another. In particular, [they] believe [their] results have direct application in the development of meaningful benchmarks for XML applications." I agree with this statement, and an earlier one (not quoted here) that this paper represents a starting point.

Online Computing Reviews Service

Published In

World Wide Web Volume 8, Issue 4

December 2005

141 pages

ISSN:1386-145X

Publisher

Kluwer Academic Publishers

United States

Cited By

Pawlik M, Augsten N, d'Aquin M, Dietze S, Hauff C, Curry E, Cudre Mauroux P (2020). Minimal Edit-Based Diffs for Large Trees. Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 1225-1234. DOI: 10.1145/3340531.3412026. Online publication date: 19-Oct-2020.
https://dl.acm.org/doi/10.1145/3340531.3412026

Li Y, Chen H, Zhang X, Zhang L, Desai B, Anagnostopoulos D, Manolopoulos Y, Nikolaidou M (2019). An effective algorithm for learning single occurrence regular expressions with interleaving. Proceedings of the 23rd International Database Applications & Engineering Symposium, pp. 1-10. DOI: 10.1145/3331076.3331100. Online publication date: 10-Jun-2019.
https://dl.acm.org/doi/10.1145/3331076.3331100

Fraigniaud P, Korman A (2016). An Optimal Ancestry Labeling Scheme with Applications to XML Trees and Universal Posets. Journal of the ACM 63:1, pp. 1-31. DOI: 10.1145/2794076. Online publication date: 12-Feb-2016.
https://dl.acm.org/doi/10.1145/2794076

Helmer S, Augsten N, Böhlen M (2012). Measuring structural similarity of semistructured data based on information-theoretic approaches. The VLDB Journal 21:5, pp. 677-702. DOI: 10.1007/s00778-012-0263-0. Online publication date: 1-Oct-2012.
https://dl.acm.org/doi/10.1007/s00778-012-0263-0

Grijzenhout S, Marx M (2011). The quality of the XML web. Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 1719-1724. DOI: 10.1145/2063576.2063824. Online publication date: 24-Oct-2011.
https://dl.acm.org/doi/10.1145/2063576.2063824

Tamayo A, Granell C, Huerta J, Liao L (2011). Analysing complexity of XML schemas in geospatial web services. Proceedings of the 2nd International Conference on Computing for Geospatial Research & Applications, pp. 1-9. DOI: 10.1145/1999320.1999337. Online publication date: 23-May-2011.
https://dl.acm.org/doi/10.1145/1999320.1999337

Picalausa F, Servais F, Zimányi E, Chu W, Wong W, Palakal M, Hung C (2011). XEvolve. Proceedings of the 2011 ACM Symposium on Applied Computing, pp. 1645-1650. DOI: 10.1145/1982185.1982530. Online publication date: 21-Mar-2011.
https://dl.acm.org/doi/10.1145/1982185.1982530

Fraigniaud P, Korman A, Charikar M (2010). Compact ancestry labeling schemes for XML trees. Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 458-466. DOI: 10.5555/1873601.1873639. Online publication date: 17-Jan-2010.
https://dl.acm.org/doi/10.5555/1873601.1873639

Bex G, Gelade W, Neven F, Vansummeren S (2010). Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data. ACM Transactions on the Web 4:4, pp. 1-32. DOI: 10.1145/1841909.1841911. Online publication date: 1-Sep-2010.
https://dl.acm.org/doi/10.1145/1841909.1841911

Gelade W, Idziaszek T, Martens W, Neven F, Paredaens J, Van Gucht D (2010). Simplifying XML schema. Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 251-260. DOI: 10.1145/1807085.1807118. Online publication date: 6-Jun-2010.
https://dl.acm.org/doi/10.1145/1807085.1807118