Abstract
Web page segmentation is a fundamental technique in information retrieval systems, where it supports web crawling and information extraction. Its objectives are to gain deeper insight from crawling results and to extract the main content of a web page while disregarding irrelevant regions. Several solutions to the segmentation problem have been proposed over time, using different approaches and learning strategies. Among the cues they rely on, the structural cue, a characteristic of the DOM tree, is widely used as a primary factor in segmentation models. In this paper, we propose a novel technique for web page segmentation based on DOM-structural cohesion analysis. Our approach generates blocks that represent groups of DOM subtrees with similar tag structures. By analyzing the cohesion within each generated block and comparing detailed information about web page elements, such as their types, attributes, and visual cues, the approach can either maintain or reconstruct the segmentation layout. Additionally, we employ the Canny edge detection algorithm to refine the segmentation result by reducing redundant space, which yields a more accurate segmentation. We evaluate the approach on a dataset of 1,969 web pages. It achieves an \(F_{B^{3}}\) score of 64%, surpassing existing state-of-the-art methods. The proposed DOM-structural cohesion analysis has the potential to improve web page segmentation and its various applications.
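The block-generation step described above, grouping DOM subtrees with similar tag structures, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, so the sketch assumes that "similar" means an identical nested tag signature and that only adjacent siblings are grouped. The names `tag_signature`, `group_similar_siblings`, and the sample `PAGE` markup are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def tag_signature(node):
    """Return a nested tuple describing the tag structure of a subtree."""
    return (node.tag, tuple(tag_signature(child) for child in node))


def group_similar_siblings(parent):
    """Group runs of adjacent children whose tag signatures are identical.

    Assumption: exact equality stands in for the similarity measure,
    which the abstract does not define.
    """
    return [list(run) for _, run in groupby(parent, key=tag_signature)]


# Hypothetical, well-formed sample page used only for illustration.
PAGE = """
<body>
  <ul>
    <li><a>Home</a></li>
    <li><a>News</a></li>
    <li><a>About</a></li>
  </ul>
  <div><h2>Title</h2><p>Text</p></div>
  <div><h2>Title</h2><p>Text</p></div>
  <footer><p>Contact</p></footer>
</body>
"""

root = ET.fromstring(PAGE.strip())
blocks = group_similar_siblings(root)
block_tags = [[node.tag for node in block] for block in blocks]
print(block_tags)  # [['ul'], ['div', 'div'], ['footer']]
```

Here the two `div` subtrees share one signature and fall into a single candidate block, while the navigation list and the footer remain separate. A tolerant measure, for example a tree edit distance threshold, would be needed to group subtrees that differ slightly.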
M.-H. Huynh and Q.-T. Le contributed equally to this work.
References
Alcic, S., Conrad, S.: Page segmentation by web content clustering. In: Proceedings of the WIMS, pp. 1–9 (2011)
Bar-Yossef, Z., Rajagopalan, S.: Template detection via data mining and its applications. In: Proceedings of the 11th WWW, pp. 580–591 (2002)
Cai, D., He, X., Li, Z., Ma, W.Y., Wen, J.R.: Hierarchical clustering of www image search results using visual (2004)
Cai, D., Yu, S., Wen, J.-R., Ma, W.-Y.: Extracting content structure for web pages based on visual representation. In: Zhou, X., Orlowska, M.E., Zhang, Y. (eds.) APWeb 2003. LNCS, vol. 2642, pp. 406–417. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36901-5_42
Cai, D., Yu, S., Wen, J.R., Ma, W.Y.: VIPS: a vision-based page segmentation algorithm (2003)
Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 8(6), 679–698 (1986)
Chakrabarti, D., Kumar, R., Punera, K.: A graph-theoretic approach to webpage segmentation. In: Proceedings of the 17th WWW, pp. 377–386 (2008)
Chen, K., et al.: Hybrid task cascade for instance segmentation. In: CVPR, pp. 4974–4983 (2019)
Chen, K., et al.: MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
Cormer, M., Mann, R., Moffatt, K., Cohen, R.: Towards an improved vision-based web page segmentation algorithm. In: 2017 14th CRV, pp. 345–352. IEEE (2017)
Jayashree, S.R., Dias, G., Andrew, J.J., Saha, S., Maurel, F., Ferrari, S.: Multimodal web page segmentation using self-organized multi-objective clustering. ACM Trans. Inf. Syst. 40(3) (2022). https://doi.org/10.1145/3480966
Jiang, Z., Yin, H., Wu, Y., Lyu, Y., Min, G., Zhang, X.: Constructing novel block layouts for webpage analysis. TOIT 19(3), 1–18 (2019)
Kiesel, J., Kneist, F., Meyer, L., Komlossy, K., Stein, B., Potthast, M.: Web page segmentation revisited: evaluation framework and dataset. In: Proceedings of the 29th ACM CIKM, CIKM 2020, pp. 3047–3054. Association for Computing Machinery, New York (2020). https://doi.org/10.1145/3340531.3412782
Kiesel, J., Meyer, L., Kneist, F., Stein, B., Potthast, M.: An empirical comparison of web page segmentation algorithms. In: Hiemstra, D., Moens, M.-F., Mothe, J., Perego, R., Potthast, M., Sebastiani, F. (eds.) ECIR 2021. LNCS, vol. 12657, pp. 62–74. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72240-1_5
Kohlschütter, C., Nejdl, W.: A densitometric approach to web page segmentation. In: Proceedings of the 17th ACM CIKM, pp. 1173–1182 (2008)
Lu, C., Bing, L., Lam, W.: Structured positional entity language model for enterprise entity retrieval. In: 22nd ACM CIKM, pp. 129–138 (2013)
Manabe, T., Tajima, K.: Extracting logical hierarchical structure of html documents based on headings. Proc. VLDB Endow. 8(12), 1606–1617 (2015). https://doi.org/10.14778/2824032.2824058
Manku, G.S., Jain, A., Das Sarma, A.: Detecting near-duplicates for web crawling. In: Proceedings of the 16th WWW, pp. 141–150 (2007)
Meier, B., Stadelmann, T., Stampfli, J., Arnold, M., Cieliebak, M.: Fully convolutional neural networks for newspaper article segmentation. In: 2017 14th ICDAR, vol. 1, pp. 414–419. IEEE (2017)
Narayana, V., Premchand, P., Govardhan, A.: A novel and efficient approach for near duplicate page detection in web crawling. In: 2009 IACC, pp. 1492–1496. IEEE (2009)
Pawlik, M., Augsten, N.: Tree edit distance: Robust and memory-efficient. Inf. Syst. 56, 157–173 (2016). https://doi.org/10.1016/j.is.2015.08.004
Sanoja, A., Gançarski, S.: Block-o-matic: a web page segmentation framework. In: 2014 ICMCS, pp. 595–600 (2014). https://doi.org/10.1109/ICMCS.2014.6911249
Velloso, R.P., Dorneles, C.F.: Automatic web page segmentation and noise removal for structured extraction using tag path sequences. JIDM 4(3), 173 (2013)
Vieira, K., Da Silva, A.S., Pinto, N., De Moura, E.S., Cavalcanti, J.M., Freire, J.: A fast and robust method for web page template detection and removal. In: 15th ACM CIKM, pp. 258–267 (2006)
Xiang, P., Yang, X., Shi, Y.: Web page segmentation based on gestalt theory. In: 2007 IEEE ICME, pp. 2253–2256. IEEE (2007)
Xie, X., Miao, G., Song, R., Wen, J.R., Ma, W.Y.: Efficient browsing of web search results on mobile devices based on block importance model. In: 3rd IEEE PerCom, pp. 17–26. IEEE (2005)
Yandrapally, R.K., Mesbah, A.: Fragment-based test generation for web apps. IEEE Trans. Softw. Eng. 49(3), 1086–1101 (2023). https://doi.org/10.1109/TSE.2022.3171295
Yi, L., Liu, B., Li, X.: Eliminating noisy information in web pages for data mining. In: 9th KDD, pp. 296–305 (2003)
Yin, X., Lee, W.S.: Understanding the function of web elements for mobile content delivery using random walk models. In: Special interest tracks and posters of the 14th WWW, pp. 1150–1151 (2005)
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Huynh, M.-H., Le, Q.-T., Nguyen, V., Nguyen, T. (2023). Web Page Segmentation: A DOM-Structural Cohesion Analysis Approach. In: Zhang, F., Wang, H., Barhamgi, M., Chen, L., Zhou, R. (eds) Web Information Systems Engineering – WISE 2023. WISE 2023. Lecture Notes in Computer Science, vol 14306. Springer, Singapore. https://doi.org/10.1007/978-981-99-7254-8_25
DOI: https://doi.org/10.1007/978-981-99-7254-8_25
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-7253-1
Online ISBN: 978-981-99-7254-8
eBook Packages: Computer Science, Computer Science (R0)