WISDOM: Web Intrapage Informative Structure Mining Based on Document Object Model

Authors:

Ming-Syan Chen

IEEE Transactions on Knowledge and Data Engineering, Volume 17, Issue 5

Pages 614 - 627

https://doi.org/10.1109/TKDE.2005.84

Published: 01 May 2005

Abstract

To increase the commercial value and accessibility of their pages, most content sites publish pages that carry intrasite redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information increases the index size of general search engines and causes page topics to drift. In this paper, we study the problem of mining the intrapage informative structure of pages in news Web sites in order to find and eliminate redundant information. The intrapage informative structure is a subset of the original Web page and is composed of a set of fine-grained, informative blocks. For pages in a news Web site, it contains only anchors linking to news pages or the bodies of news articles. We propose an intrapage informative structure mining system called WISDOM (Web Intrapage Informative Structure Mining based on the Document Object Model), which applies information theory to DOM tree knowledge in order to build the structure. WISDOM splits a DOM tree into many small subtrees and applies a top-down informative block searching algorithm to select a set of candidate informative blocks. The structure is then built by expanding this set using the proposed merging methods. Experiments on several real news Web sites show high precision and recall rates, which validate WISDOM's practical applicability.
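
The abstract does not give the scoring formula or the search procedure, so the following Python sketch is only one plausible reading of "information theory applied to a DOM tree with a top-down block search", not the authors' algorithm. Everything in it is an assumption made for illustration: the simplified `Node` class, the normalized term entropy over the pages of a site, the mean-entropy block score, and the `0.5` threshold. The merging step mentioned in the abstract is not covered.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical, simplified DOM node: a tag, its own terms, child nodes."""
    tag: str
    terms: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def all_terms(self):
        """Collect the terms of this node and of every descendant."""
        out = list(self.terms)
        for child in self.children:
            out.extend(child.all_terms())
        return out


def term_entropy(pages):
    """Normalized entropy (0..1) of each term's distribution over the pages.

    Assumption: a term spread evenly over all pages (e.g. a menu label)
    scores near 1; a term confined to a single page scores 0.
    """
    n = len(pages)
    if n < 2:
        raise ValueError("need at least two pages to normalize the entropy")
    counts = [Counter(page.all_terms()) for page in pages]
    entropy = {}
    for term in set().union(*counts):
        total = sum(c[term] for c in counts)
        h = 0.0
        for c in counts:
            if c[term]:
                p = c[term] / total
                h -= p * math.log(p, n)  # log base n keeps h within [0, 1]
        entropy[term] = h
    return entropy


def block_entropy(node, entropy):
    """Mean term entropy of a subtree; empty subtrees count as redundant."""
    terms = node.all_terms()
    if not terms:
        return 1.0
    return sum(entropy.get(t, 1.0) for t in terms) / len(terms)


def find_informative(node, entropy, threshold=0.5):
    """Top-down search: keep a low-entropy subtree whole, otherwise descend."""
    if block_entropy(node, entropy) <= threshold:
        return [node]
    found = []
    for child in node.children:
        found.extend(find_informative(child, entropy, threshold))
    return found


def make_page(article_terms):
    """Toy page: a shared navigation block plus a page-specific article block."""
    nav = Node("div", terms=["home", "news", "sports"])
    body = Node("div", terms=list(article_terms))
    return Node("body", children=[nav, body])


pages = [
    make_page(["quake", "rescue"]),
    make_page(["election", "vote"]),
    make_page(["merger", "stock"]),
]
entropy = term_entropy(pages)
blocks = find_informative(pages[0], entropy)
```

In this toy site the navigation terms appear on every page and score close to 1, so the search descends past the page root, skips the navigation block, and keeps only the article block of the first page.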


Published In

IEEE Transactions on Knowledge and Data Engineering, Volume 17, Issue 5

May 2005

143 pages

ISSN: 1041-4347

Copyright © 2005.

Publisher

IEEE Educational Activities Department, United States
