
Web Page Segmentation: A DOM-Structural Cohesion Analysis Approach

  • Conference paper
Web Information Systems Engineering – WISE 2023 (WISE 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14306)

Abstract

Web page segmentation is a fundamental technique in information retrieval systems, where it supports web crawling and information extraction. Its goals are to gain deeper insight into crawl results and to extract the main content of a web page while disregarding irrelevant regions. Many solutions have been proposed for the segmentation problem, using different approaches and learning strategies. Among the cues they rely on, the structural cue, a characteristic of the DOM tree, is widely used as a primary factor in segmentation models. In this paper, we propose a novel technique for web page segmentation based on DOM-structural cohesion analysis. Our approach generates blocks, each representing a group of DOM subtrees with similar tag structures. By analyzing the cohesion within each generated block and comparing detailed information such as the types, attributes, and visual cues of web page elements, the approach can maintain or reconstruct the segmentation layout. We additionally employ the Canny algorithm to reduce redundant space in the segmentation result, which yields a more accurate segmentation. We evaluate the approach on a dataset of 1,969 web pages, where it achieves an \(F_{B^{3}}\) score of 64%, surpassing existing state-of-the-art methods. DOM-structural cohesion analysis therefore has the potential to improve web page segmentation and its applications.
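As a rough illustration of what "groups of DOM subtrees with similar tag structures" could look like in practice, the following minimal Python sketch groups adjacent sibling subtrees whose tag structure is identical. It is not the authors' method: the abstract does not state the similarity measure, so exact equality of a tag-only signature is assumed here, and the names `tag_signature` and `group_similar_siblings` as well as the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def tag_signature(node):
    """Describe a subtree by its nested tag names only (text and attributes ignored)."""
    children = "".join(tag_signature(child) for child in node)
    return f"{node.tag}({children})"


def group_similar_siblings(parent):
    """Group runs of adjacent children that share the same tag signature.

    Assumption: "similar" is simplified to "identical signature".
    """
    return [
        (signature, list(run))
        for signature, run in groupby(parent, key=tag_signature)
    ]


# Hypothetical, well-formed sample page used only for this sketch.
SAMPLE = """
<body>
  <ul>
    <li><a>Home</a></li>
    <li><a>News</a></li>
    <li><a>About</a></li>
  </ul>
  <div><h2>Title</h2><p>Text</p></div>
  <div><h2>Title</h2><p>Text</p></div>
  <p>Footer</p>
</body>
""".strip()

root = ET.fromstring(SAMPLE)
blocks = group_similar_siblings(root)
for signature, run in blocks:
    print(len(run), signature)
```

Here the two `div` elements fall into one candidate block because their tag structures match, while the `ul` and the trailing `p` each form their own. A real segmenter would need a tolerant similarity measure and the attribute and visual comparisons the abstract mentions, which this sketch leaves out.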

M.-H. Huynh and Q.-T. Le contributed equally to this work.

Author information

Corresponding author

Correspondence to Vu Nguyen.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Huynh, MH., Le, QT., Nguyen, V., Nguyen, T. (2023). Web Page Segmentation: A DOM-Structural Cohesion Analysis Approach. In: Zhang, F., Wang, H., Barhamgi, M., Chen, L., Zhou, R. (eds) Web Information Systems Engineering – WISE 2023. WISE 2023. Lecture Notes in Computer Science, vol 14306. Springer, Singapore. https://doi.org/10.1007/978-981-99-7254-8_25

  • DOI: https://doi.org/10.1007/978-981-99-7254-8_25

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-7253-1

  • Online ISBN: 978-981-99-7254-8

  • eBook Packages: Computer Science, Computer Science (R0)
