DOI: 10.1145/3321408.3323081
research-article

VRPSOFC: a framework for focused crawler using mutation improving particle swarm optimization algorithm

Published: 17 May 2019

    Abstract

    The focused crawler is a key technology of search engines. It filters webpages with relevance algorithms until a stopping condition is met. Current focused crawlers are prone to topic drift and low precision while crawling. This paper therefore proposes VRPSOFC, a focused crawler framework based on a mutation-improved particle swarm optimization algorithm. First, for each topic, VRPSOFC selects three types of seed pages, chosen by page click rate in Google search, that tend to lead to large aggregations of web pages: an official website, a Wikipedia page, and a forum or video page. VRPSOFC then crawls webpages with the mutation-improved particle swarm optimization algorithm proposed in this paper, using each seed page as an initial page. Finally, the framework is evaluated in a real web environment and the results are analyzed. Compared with the traditional vector space model (VSM) and other methods, VRPSOFC obtains more accurate URL priorities and crawls higher-quality web pages, which indicates that the proposed focused crawler framework is effective.
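
The abstract does not give the update rules of the mutation-improved algorithm, so the following is only a minimal sketch of a textbook particle swarm optimizer with a random-reset mutation step, not the authors' method. Everything named here is an assumption for illustration: the function `mutated_pso`, the toy objective `toy_priority_fitness`, the parameter values, and the idea that a particle encodes a weight vector used to score URL priority.

```python
import random


def mutated_pso(fitness, dim, n_particles=20, iters=100,
                w=0.7, c1=1.5, c2=1.5, mutation_rate=0.1,
                lo=0.0, hi=1.0, seed=0):
    """Maximise `fitness` over [lo, hi]^dim with PSO plus random-reset mutation.

    Hypothetical sketch: the coefficients and the mutation scheme are
    assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]

    # Personal bests start at the initial positions.
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]

    # The global best is the best personal best seen so far.
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Standard velocity update: inertia + cognitive + social terms.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move the particle and clamp it to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
                # Mutation: occasionally reset a coordinate at random so the
                # swarm can escape a local optimum.
                if rng.random() < mutation_rate:
                    pos[i][d] = rng.uniform(lo, hi)
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit


def toy_priority_fitness(weights):
    """Toy objective: negative squared distance to a made-up target vector."""
    target = [0.6, 0.3, 0.1]  # hypothetical "ideal" feature weights
    return -sum((a - b) ** 2 for a, b in zip(weights, target))


if __name__ == "__main__":
    best_weights, best_score = mutated_pso(toy_priority_fitness, dim=3)
    print(best_weights, best_score)
```

In a crawler setting, the fitness function would presumably measure how well a candidate weight vector ranks known relevant pages; the toy objective above only stands in for that so the sketch can run on its own.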

      Published In

      ACM TURC '19: Proceedings of the ACM Turing Celebration Conference - China
      May 2019
      963 pages
      ISBN:9781450371582
      DOI:10.1145/3321408
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. focused crawler
      2. mutation
      3. particle swarm algorithm
      4. precision
      5. topic-drift

      Conference

      ACM TURC 2019
