research-article

WISDOM: Web Intrapage Informative Structure Mining Based on Document Object Model

Published: 01 May 2005

Abstract

To increase the commercial value and accessibility of their pages, most content sites publish pages that carry intrasite redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information increases the index size of general search engines and causes page topics to drift. In this paper, we study the problem of mining the intrapage informative structure of pages in news Web sites in order to find and eliminate redundant information. An intrapage informative structure is a subset of the original Web page, composed of a set of fine-grained, informative blocks. For pages in a news Web site, the intrapage informative structure contains only anchors linking to news pages or the bodies of news articles. We propose an intrapage informative structure mining system called WISDOM (Web Intrapage Informative Structure Mining based on the Document Object Model), which applies Information Theory to DOM tree knowledge in order to build the structure. WISDOM splits a DOM tree into many small subtrees and applies a top-down informative block searching algorithm to select a set of candidate informative blocks. The structure is then built by expanding this set using the proposed merging methods. Experiments on several real news Web sites show high precision and recall rates, which validate WISDOM's practical applicability.
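The idea of scoring DOM subtrees with entropy and searching them top-down can be illustrated with a small sketch. This is not WISDOM's actual algorithm, which the abstract does not specify in detail. It assumes a toy `Node` class in place of a real DOM, a per-term entropy computed over the pages of one site, a mean-entropy score per block, and an arbitrary threshold of 0.5. All of these names and values are hypothetical. The merging step that expands the candidate set is not shown.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a DOM node (hypothetical, not a real DOM API)."""
    tag: str
    terms: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def all_terms(self):
        """Return the terms of this node and of its whole subtree."""
        out = list(self.terms)
        for child in self.children:
            out.extend(child.all_terms())
        return out


def term_entropy(pages):
    """Normalized Shannon entropy of each term's distribution over pages.

    Near 1: the term is spread evenly over all pages (template-like).
    Near 0: the term is concentrated in few pages (content-like).
    """
    n = len(pages)
    if n < 2:
        raise ValueError("need at least two pages to compare")
    counts = [Counter(page.all_terms()) for page in pages]
    entropy = {}
    for term in set().union(*counts):
        total = sum(c[term] for c in counts)
        h = 0.0
        for c in counts:
            if c[term]:
                p = c[term] / total
                h -= p * math.log(p, n)  # log base n keeps h in [0, 1]
        entropy[term] = h
    return entropy


def block_entropy(node, entropy):
    """Mean term entropy of a subtree (an assumed scoring rule)."""
    terms = node.all_terms()
    if not terms:
        return 1.0  # assumption: an empty block is treated as uninformative
    return sum(entropy[t] for t in terms) / len(terms)


def informative_blocks(node, entropy, threshold=0.5):
    """Top-down search: keep the highest subtrees whose score is low."""
    if block_entropy(node, entropy) <= threshold:
        return [node]
    found = []
    for child in node.children:
        found.extend(informative_blocks(child, entropy, threshold))
    return found


def make_page(body_terms):
    """Build a toy page: a shared navigation block plus an article block."""
    nav = Node("nav", terms=["home", "news", "sports"])
    body = Node("article", terms=list(body_terms))
    return Node("html", children=[nav, body])


if __name__ == "__main__":
    pages = [make_page(["quake", "rescue"]), make_page(["election", "vote"])]
    scores = term_entropy(pages)
    print([b.tag for b in informative_blocks(pages[0], scores)])
```

In this toy data the navigation terms occur on every page, so they score high and the search descends past them, while the article terms are page-specific and their block is kept as a candidate.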

Published In

IEEE Transactions on Knowledge and Data Engineering, Volume 17, Issue 5
May 2005
143 pages

Publisher

IEEE Educational Activities Department

United States

Author Tags

  1. DOM
  2. Intrapage informative structure
  3. entropy
  4. information extraction

Qualifiers

  • Research-article

Cited By

  • (2015) FOREST. Proceedings of the 18th International Workshop on Web and Databases, pp. 55-61. doi:10.1145/2767109.2767112. Online publication date: 31-May-2015
  • (2014) Image understanding and the web. Journal of Intelligent Information Systems 43:2, pp. 271-306. doi:10.1007/s10844-014-0323-6. Online publication date: 1-Oct-2014
  • (2014) An FAR-SW based approach for webpage information extraction. Information Systems Frontiers 16:5, pp. 771-785. doi:10.1007/s10796-013-9412-2. Online publication date: 1-Nov-2014
  • (2011) Page segmentation by web content clustering. Proceedings of the International Conference on Web Intelligence, Mining and Semantics, pp. 1-9. doi:10.1145/1988688.1988717. Online publication date: 25-May-2011
  • (2011) VisHue. Proceedings of the 7th international conference on Databases in Networked Information Systems, pp. 89-108. doi:10.1007/978-3-642-25731-5_9. Online publication date: 12-Dec-2011
  • (2010) Boilerplate detection using shallow text features. Proceedings of the third ACM international conference on Web search and data mining, pp. 441-450. doi:10.1145/1718487.1718542. Online publication date: 4-Feb-2010
  • (2009) Web page DOM node characterization and its application to page segmentation. Proceedings of the 3rd IEEE international conference on Internet multimedia services architecture and applications, pp. 325-330. doi:10.5555/1812598.1812659. Online publication date: 9-Dec-2009
  • (2009) Webpage segmentation for extracting images and their surrounding contextual information. Proceedings of the 17th ACM international conference on Multimedia, pp. 649-652. doi:10.1145/1631272.1631379. Online publication date: 23-Oct-2009
  • (2008) A densitometric approach to web page segmentation. Proceedings of the 17th ACM conference on Information and knowledge management, pp. 1173-1182. doi:10.1145/1458082.1458237. Online publication date: 26-Oct-2008
  • (2008) A graph-theoretic approach to webpage segmentation. Proceedings of the 17th international conference on World Wide Web, pp. 377-386. doi:10.1145/1367497.1367549. Online publication date: 21-Apr-2008