Focused crawling strategies based on ontologies and simulated annealing methods for rainstorm disaster domain knowledge

Liu, Jingfa; Li, Fan; Ding, Ruoyao; Liu, Zi’ang

doi:10.1631/FITEE.2100360

Research Article
Published: 24 August 2022

Volume 23, pages 1189–1204, (2022)

Frontiers of Information Technology & Electronic Engineering


Abstract

Focused crawlers are a crucial means of obtaining effective domain knowledge from massive heterogeneous networks. Most current focused crawling technologies, however, struggle to return high-quality results. The main difficulties are establishing a topic benchmark model, assessing the topic relevance of hyperlinks, and designing the crawling strategy. In this paper, we use a domain ontology to build a topic benchmark model for a specific topic, and propose a novel multiple-filtering strategy based on local ontology and global ontology (MFSLG). A comprehensive priority evaluation method (CPEM) based on web text and link structure is introduced to improve the precision with which the topic relevance of unvisited hyperlinks is computed, and a simulated annealing (SA) method is used to keep the focused crawler from becoming trapped in local optima of the search. By incorporating SA into a focused crawler with MFSLG and CPEM for the first time, we propose two novel focused crawler strategies based on ontology and SA (FCOSA): FCOSA with only global ontology (FCOSA_G) and FCOSA with both local and global ontology (FCOSA_LG). Both are designed to obtain topic-relevant webpages about rainstorm disasters from the network. Experimental results show that the proposed crawlers outperform the other focused crawling strategies on different performance metrics.
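The abstract gives no formulas, so the following is only a minimal sketch of the general idea: a link priority that mixes a text-based score with a link-structure score, and a simulated annealing acceptance rule that occasionally follows a lower-priority link so the crawl is less likely to stall in a local optimum. The function names (`priority`, `sa_accept`, `select_next`), the weighted sum, the 0.5 weight, the 0.9 cooling factor, and the toy frontier are all assumptions made for illustration; they are not the paper's CPEM or FCOSA definitions.

```python
import math
import random


def priority(text_score, link_score, weight=0.5):
    """Hypothetical combined priority: weighted mix of two scores in [0, 1]."""
    return weight * text_score + (1.0 - weight) * link_score


def sa_accept(current, candidate, temperature, rng):
    """Metropolis rule: always accept a link at least as good as the
    current one; accept a worse link with probability exp(delta / T)."""
    if candidate >= current:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp((candidate - current) / temperature)


def select_next(frontier, current_priority, temperature, rng):
    """Draw a random unvisited link; if SA rejects it, fall back to the
    highest-priority link in the frontier (greedy choice)."""
    url = rng.choice(sorted(frontier))
    if sa_accept(current_priority, frontier[url], temperature, rng):
        return url
    return max(frontier, key=frontier.get)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy frontier: URL -> combined priority (assumed scores, not real data).
    frontier = {
        "http://example.org/a": priority(0.9, 0.7),
        "http://example.org/b": priority(0.2, 0.4),
        "http://example.org/c": priority(0.6, 0.1),
    }
    temperature, current = 1.0, 0.0
    while frontier:
        url = select_next(frontier, current, temperature, rng)
        current = frontier.pop(url)
        print(f"T={temperature:.2f} visit {url} (priority {current:.2f})")
        temperature *= 0.9  # geometric cooling, an assumed schedule
```

At high temperature the sketch behaves almost like a random walk over the frontier; as the temperature falls it approaches a best-first crawl, which is the usual trade-off SA is meant to provide.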

Author information

Authors and Affiliations

Guangzhou Key Laboratory of Multilingual Intelligent Processing, Guangdong University of Foreign Studies, Guangzhou, 510006, China
Jingfa Liu (刘景发) & Ruoyao Ding (丁若尧)
School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, 510006, China
Jingfa Liu (刘景发) & Ruoyao Ding (丁若尧)
School of Computer and Software, Nanjing University of Information Science & Technology, Nanjing, 210044, China
Fan Li (李帆)
Faculty of Science, University of Alberta, Edmonton, T6G2H6, Canada
Zi’ang Liu (刘子昂)

Contributions

Jingfa LIU designed the research. Fan LI drafted the paper, implemented the software, and performed the experiments. Ruoyao DING and Zi’ang LIU revised and finalized the paper.

Corresponding authors

Correspondence to Jingfa Liu (刘景发) or Fan Li (李帆).

Ethics declarations

Jingfa LIU, Fan LI, Ruoyao DING, and Zi’ang LIU declare that they have no conflict of interest.

Additional information

Project supported by the Special Foundation of Guangzhou Key Laboratory of Multilingual Intelligent Processing, China (No. 201905010008), the Program of Science and Technology of Guangzhou, China (No. 202002030238), and the Guangdong Basic and Applied Basic Research Foundation, China (No. 2021A1515011974).

List of supplementary materials

Fig. S1 A global ontology structure about the topic of rainstorm disaster

Table S1 Seed URLs

Supplementary materials for