DOI: 10.1145/2361354.2361374

A first approach to the automatic recognition of structural patterns in XML documents

Published: 04 September 2012

Abstract

XML is among the preferred formats for storing the structure of documents such as scientific articles, manuals, documentation, literary works, etc. Sometimes publishers adopt established and well-known vocabularies such as DocBook and TEI; at other times they create partially or entirely new ones that better deal with the particular requirements of their documents. The (explicit and implicit) requirements of use in these vocabularies often follow well-established patterns, creating meta-structures (the block, the container, the inline element, etc.) that persist across vocabularies and authors and that describe a truer and more general conceptualization of the documents' building blocks. Addressing such meta-structures not only gives better insight into what documents are really composed of, but also provides abstract and more general mechanisms to work on documents regardless of the availability of specific schemas, tools and presentation stylesheets. In this paper we introduce a schema-independent theory based on eleven structural patterns. We provide a definition of such patterns and show how they synthesize characteristics emerging from real markup documents. Additionally, we propose an algorithm that allows us to identify the pattern of each element in a set of homogeneous markup documents.
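
To give a rough feel for what schema-independent classification of elements can look like, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes that a coarse category can be inferred purely from how each element name is used across a set of documents: whether it ever holds text and whether it ever holds child elements. The function names (`content_profiles`, `coarse_category`, `classify`) and the four category labels are hypothetical and are far coarser than the eleven patterns the abstract refers to.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def content_profiles(xml_strings):
    """Map each element name to [ever_has_text, ever_has_child_elements]."""
    profiles = defaultdict(lambda: [False, False])
    for source in xml_strings:
        root = ET.fromstring(source)
        for elem in root.iter():
            children = list(elem)
            # Text can sit before the first child (elem.text)
            # or after any child (child.tail).
            has_text = bool((elem.text or "").strip()) or any(
                (child.tail or "").strip() for child in children
            )
            profile = profiles[elem.tag]
            profile[0] = profile[0] or has_text
            profile[1] = profile[1] or bool(children)
    return profiles


def coarse_category(has_text, has_children):
    """Hypothetical four-way label; NOT the paper's eleven patterns."""
    if has_text and has_children:
        return "mixed"
    if has_text:
        return "text-only"
    if has_children:
        return "element-only"
    return "empty"


def classify(xml_strings):
    """Assign one coarse category per element name across all documents."""
    profiles = content_profiles(xml_strings)
    return {tag: coarse_category(*flags) for tag, flags in profiles.items()}


# Two small, made-up documents using the same vocabulary.
docs = [
    "<sec><title>Intro</title><p>Some <em>inline</em> text.</p><br/></sec>",
    "<sec><p>Plain paragraph.</p></sec>",
]
result = classify(docs)
for tag, category in sorted(result.items()):
    print(tag, category)
```

Aggregating usage over the whole document set, rather than judging each occurrence in isolation, is the design choice worth noting here: `p` is labelled "mixed" because at least one occurrence combines text and child elements, even though another occurrence contains only text.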



Published In

DocEng '12: Proceedings of the 2012 ACM symposium on Document engineering
September 2012
256 pages
ISBN: 9781450311168
DOI: 10.1145/2361354
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. XML
  2. descriptive markup
  3. document visualisation
  4. pattern recognition
  5. structural patterns

Qualifiers

  • Research-article

Conference

DocEng '12
DocEng '12: ACM Symposium on Document Engineering
September 4 - 7, 2012
Paris, France

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%

Cited By

  • (2023) Combining offline and on-the-fly disambiguation to perform semantic-aware XML querying. Computer Science and Information Systems 20:1 (423-457). DOI: 10.2298/CSIS220228063T. Online publication date: 2023
  • (2021) Almost Linear Semantic XML Keyword Search. Proceedings of the 13th International Conference on Management of Digital EcoSystems (129-138). DOI: 10.1145/3444757.3485079. Online publication date: 1-Nov-2021
  • (2021) Lexdatafication: Italian Legal Knowledge Modelling in Akoma Ntoso. AI Approaches to the Complexity of Legal Systems XI-XII (31-47). DOI: 10.1007/978-3-030-89811-3_3. Online publication date: 27-Nov-2021
  • (2021) Semantic Text Segment Classification of Structured Technical Content. Natural Language Processing and Information Systems (165-177). DOI: 10.1007/978-3-030-80599-9_15. Online publication date: 20-Jun-2021
  • (2019) Akoma Ntoso. Proceedings of the Symposium on Markup Vocabulary Customization. DOI: 10.4242/BalisageVol24.Palmirani01. Online publication date: 2019
  • (2018) Automated classification of content components in technical communication. Computational Intelligence 34:1 (30-48). DOI: 10.1111/coin.12157. Online publication date: 5-Jan-2018
  • (2017) Research Articles in Simplified HTML: a Web-first format for HTML-based scholarly articles. PeerJ Computer Science 3 (e132). DOI: 10.7717/peerj-cs.132. Online publication date: 2-Oct-2017
  • (2016) The Document Components Ontology (DoCO). Semantic Web 7:2 (167-181). DOI: 10.3233/SW-150177. Online publication date: 12-Feb-2016
  • (2016) Automated Intrinsic Text Classification for Component Content Management Applications in Technical Communication. Proceedings of the 2016 ACM Symposium on Document Engineering (95-98). DOI: 10.1145/2960811.2967153. Online publication date: 13-Sep-2016
  • (2014) Semantic Lenses as Exploration Method for Scholarly Articles. Bridging Between Cultural Heritage Institutions (118-129). DOI: 10.1007/978-3-642-54347-0_13. Online publication date: 2014
