Go Web Crawler

This is a concurrent web crawler implemented in Go. It allows you to crawl websites, extract links, and scrape specific data from the visited pages.

Features

  • Crawls web pages concurrently using goroutines (a minimal sketch of this pattern follows this list)
  • Extracts links from the visited pages
  • Scrapes data such as page title, meta description, meta keywords, headings, paragraphs, image URLs, external links, and table data from the visited pages
  • Supports configurable crawling depth
  • Handles relative and absolute URLs
  • Tracks visited URLs to avoid duplicate crawling
  • Provides timing information for the crawling process
  • Saves the extracted data to a CSV file (crawl_results.csv)
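
The concurrent crawl loop described above can be pictured with a minimal sketch. The type and function names here (visited, crawl, fetchLinks) are illustrative only and are not the ones used in this repository:

    package main

    import (
        "fmt"
        "sync"
    )

    // visited tracks URLs that have already been crawled so that no page
    // is fetched twice.
    type visited struct {
        mu   sync.Mutex
        seen map[string]bool
    }

    // add records a URL and reports whether it was new.
    func (v *visited) add(url string) bool {
        v.mu.Lock()
        defer v.mu.Unlock()
        if v.seen[url] {
            return false
        }
        v.seen[url] = true
        return true
    }

    // crawl visits a URL and then crawls each extracted link in its own
    // goroutine until the depth limit is reached. fetchLinks stands in for
    // the real HTTP fetch and link extraction.
    func crawl(url string, depth int, v *visited, wg *sync.WaitGroup, fetchLinks func(string) []string) {
        defer wg.Done()
        if depth <= 0 || !v.add(url) {
            return
        }
        for _, link := range fetchLinks(url) {
            wg.Add(1)
            go crawl(link, depth-1, v, wg, fetchLinks)
        }
    }

    func main() {
        v := &visited{seen: make(map[string]bool)}
        var wg sync.WaitGroup

        // Stand-in link extractor; the real crawler parses the fetched HTML.
        fetchLinks := func(url string) []string {
            fmt.Println("visiting", url)
            return nil
        }

        wg.Add(1)
        go crawl("https://example.com", 2, v, &wg, fetchLinks)
        wg.Wait()
    }

Each page is handled in its own goroutine, the shared visited set prevents duplicate fetches, and the depth parameter bounds the recursion, which mirrors the features listed above.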

Installation

  1. Make sure you have Go installed on your system. You can download and install Go from the official website: https://golang.org
  2. Clone this repository to your local machine:
    git clone https://github.com/sieep-coding/web-crawler.git
  3. Navigate to the project directory:
    cd web-crawler
  4. Install the required dependencies:
    go mod download

Usage

  1. Open a terminal and navigate to the project directory.
  2. Run the following command to start the web crawler:
    go run main.go <url>
    Replace <url> with the URL you want to crawl (a concrete example follows this list).
  3. Wait for the crawling process to complete. The crawler will display the progress and timing information in the terminal.
  4. Once the crawling is finished, the extracted data will be saved in a CSV file named crawl_results.csv in the project directory.
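
For example, to crawl https://example.com you would run:

    go run main.go https://example.com

The resulting crawl_results.csv then appears in the project directory, as described in step 4.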

Customization

You can customize the web crawler according to your needs:

  • Modify the processPage function in crawler/page.go to extract additional data from the visited pages using the goquery package (see the sketch after this list).
  • Extend the Crawler struct in crawler/crawler.go to include more fields for storing extracted data.
  • Customize the CSV file generation in main.go to match your desired format.
  • Implement rate limiting to avoid overloading the target website.
  • Add support for handling robots.txt and respecting crawling restrictions.
  • Integrate the crawler with a database or file storage to persist the extracted data.
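
As an illustration of the first point, a new extraction step built with goquery could look like the sketch below. extractHeadings is a hypothetical helper written for this example; the actual processPage function in crawler/page.go has its own structure and signature:

    package main

    import (
        "fmt"
        "log"
        "net/http"

        "github.com/PuerkitoBio/goquery"
    )

    // extractHeadings fetches a page and collects its <h2> headings with
    // goquery. It shows the kind of logic you could add to processPage;
    // it is not code taken from this repository.
    func extractHeadings(url string) ([]string, error) {
        resp, err := http.Get(url)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()

        doc, err := goquery.NewDocumentFromReader(resp.Body)
        if err != nil {
            return nil, err
        }

        var headings []string
        doc.Find("h2").Each(func(_ int, s *goquery.Selection) {
            headings = append(headings, s.Text())
        })
        return headings, nil
    }

    func main() {
        headings, err := extractHeadings("https://example.com")
        if err != nil {
            log.Fatal(err)
        }
        for _, h := range headings {
            fmt.Println(h)
        }
    }

The same Find/Each pattern works for any other selector you want to collect, such as tables or image tags.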

License

This project is licensed under the UNLICENSE.
