Abstract
With the explosive growth of news information, automatically identifying and extracting the main text of a news page has become a critical task in many web content analysis applications. However, the body content is usually surrounded by presentation elements such as dynamic flashing logos, navigational menus and a multitude of ad blocks. In this paper, we propose a function word (FW) based approach that incorporates the concept of DOM tree structure similarity (DTSS). Function words are words that carry little lexical meaning of their own and instead serve a grammatical or functional role. Our statistics show that function words occur frequently in main text, whereas in presentation elements they are absent or appear only once or twice. The approach has three stages. Stage 1 is a learning stage. In stage 2, the function words in each paragraph are counted and the paragraph with the most function words is chosen as the sample. In stage 3, all body paragraphs are extracted according to the similarity of their DOM tree structure to that of the sample paragraph. Experimental results on real-world data show that the FW-DTSS based approach compares favorably in efficiency and accuracy with statistics-based and vision-based approaches.
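Stages 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the learned function-word list or the DTSS formula, so `FUNCTION_WORDS` (a small English list), `path_similarity` (position-wise tag agreement between root-to-node paths) and the `threshold` parameter are all assumptions made for illustration. Each page block is assumed to be already available as a `(tag_path, text)` pair.

```python
import re

# Hypothetical function-word list; in the paper this would come from stage 1.
FUNCTION_WORDS = {
    "the", "of", "and", "to", "in", "a", "that", "is", "for", "on", "with", "but",
}


def count_function_words(text):
    """Count the tokens of `text` that appear in FUNCTION_WORDS."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for token in tokens if token in FUNCTION_WORDS)


def path_similarity(path_a, path_b):
    """Assumed stand-in for DTSS: share of positions holding the same tag,
    measured against the longer of the two root-to-node tag paths."""
    if not path_a or not path_b:
        return 0.0
    matches = sum(1 for a, b in zip(path_a, path_b) if a == b)
    return matches / max(len(path_a), len(path_b))


def extract_body(blocks, threshold=1.0):
    """Stage 2: pick the block with the most function words as the sample.
    Stage 3: keep every block whose tag path is similar enough to the sample's."""
    if not blocks:
        return []
    sample_path, _ = max(blocks, key=lambda block: count_function_words(block[1]))
    return [
        text
        for path, text in blocks
        if path_similarity(path, sample_path) >= threshold
    ]


# Toy page: a menu item, two body paragraphs and an ad label.
blocks = [
    (("html", "body", "div", "ul", "li"), "Home"),
    (("html", "body", "div", "div", "p"),
     "The mayor said that the budget for the city is ready."),
    (("html", "body", "div", "div", "p"),
     "Critics argue the plan is short on detail."),
    (("html", "body", "div", "span"), "Advertisement"),
]

for paragraph in extract_body(blocks):
    print(paragraph)
```

On this toy page the first body paragraph has the most function words and becomes the sample; the second body paragraph is then kept because it shares the sample's tag path, even though it contains fewer function words. Lowering `threshold` would admit blocks whose paths only partly match.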
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Ma, L., Xia, Z. (2016). An FW-DTSS Based Approach for News Page Information Extraction. In: Tan, Y., Shi, Y. (eds) Data Mining and Big Data. DMBD 2016. Lecture Notes in Computer Science, vol 9714. Springer, Cham. https://doi.org/10.1007/978-3-319-40973-3_22
DOI: https://doi.org/10.1007/978-3-319-40973-3_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40972-6
Online ISBN: 978-3-319-40973-3
eBook Packages: Computer Science, Computer Science (R0)