A reverse engineering approach for automatic annotation of Web pages

De Virgilio, Roberto; Frasincar, Flavius; Hop, Walter; Lachner, Stephan

doi:10.1007/s11042-011-0852-8

Published: 20 August 2011

Volume 64, pages 119–140, (2013)

Multimedia Tools and Applications

Abstract

The Semantic Web is attracting increasing interest as a way to meet the need for sharing, retrieving, and reusing information. Since Web pages are designed to be read by people, not machines, searching and reusing information on the Web is difficult without human participation. To this end, adding semantics (i.e., meaning) to a Web page would help machines understand Web content and better support the Web search process. One of the latest developments in this field is Google’s Rich Snippets, a service that lets Web site owners add semantics to their Web pages. In this paper we provide a structured approach to automatically annotate a Web page with Rich Snippets RDFa tags. By combining a data reverse engineering method with several heuristics and a named entity recognition technique, our method can recognize and annotate a subset of the Rich Snippets vocabulary, namely all the attributes of its Review concept and the names of the Person and Organization concepts. We implemented tools and services and evaluated the accuracy of the approach on real e-commerce Web sites.
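
To make the target of the annotation concrete, the sketch below shows what a Review marked up with Rich Snippets RDFa might look like once its fields are known. It is not the authors' method: the paper locates the fields through reverse engineering, heuristics, and named entity recognition, whereas this example assumes they have already been extracted. The function name `annotate_review`, the namespace URI, and the property names are assumptions based on Google's 2009 Rich Snippets documentation, not details given in the abstract.

```python
from html import escape

# Assumption: the namespace of Google's 2009 Rich Snippets RDFa vocabulary.
RICH_SNIPPETS_NS = "http://rdf.data-vocabulary.org/#"

# Assumption: Review attributes as named in that vocabulary, in output order.
REVIEW_PROPERTIES = (
    "itemreviewed",
    "reviewer",
    "dtreviewed",
    "rating",
    "summary",
    "description",
)


def annotate_review(fields: dict[str, str]) -> str:
    """Wrap already-extracted review fields in RDFa markup (hypothetical helper)."""
    lines = [f'<div xmlns:v="{RICH_SNIPPETS_NS}" typeof="v:Review">']
    for name in REVIEW_PROPERTIES:
        value = fields.get(name)
        if value is None:
            continue  # leave out attributes that were not recognized
        # Escape the text so extracted content cannot break the markup.
        lines.append(f'  <span property="v:{name}">{escape(value)}</span>')
    lines.append("</div>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(annotate_review({
        "itemreviewed": "Example Camera X100",
        "reviewer": "Jane Doe",
        "rating": "4",
        "summary": "Sharp pictures, short battery life",
    }))
```

Attributes that were not found are simply left out, which mirrors the idea of annotating only the subset of the vocabulary that can be recognized.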

Author information

Authors and Affiliations

Dipartimento di Informatica e Automazione, Università Roma Tre, Rome, Italy
Roberto De Virgilio
Erasmus School of Economics, Erasmus University Rotterdam, PO Box 1738, 3000 DR, Rotterdam, The Netherlands
Flavius Frasincar, Walter Hop & Stephan Lachner

Corresponding author

Correspondence to Roberto De Virgilio.

About this article

Cite this article

De Virgilio, R., Frasincar, F., Hop, W. et al. A reverse engineering approach for automatic annotation of Web pages. Multimed Tools Appl 64, 119–140 (2013). https://doi.org/10.1007/s11042-011-0852-8

Issue Date: May 2013
