Go Web Crawler

This is a concurrent web crawler implemented in Go. It allows you to crawl websites, extract links, and scrape specific data from the visited pages.

Features

  • Crawls web pages concurrently using goroutines (a minimal sketch of this pattern follows this list)
  • Extracts links from the visited pages
  • Scrapes data such as page title, meta description, meta keywords, headings, paragraphs, image URLs, external links, and table data from the visited pages
  • Supports configurable crawling depth
  • Handles relative and absolute URLs
  • Tracks visited URLs to avoid duplicate crawling
  • Provides timing information for the crawling process
  • Saves the extracted data to a CSV file (crawl_results.csv)
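
The concurrent crawl loop described above can be pictured with a minimal sketch. The type and function names here (visited, crawl, fetchLinks) are illustrative only and are not the ones used in this repository:

    package main

    import (
        "fmt"
        "sync"
    )

    // visited tracks URLs that have already been crawled so that no page
    // is fetched twice.
    type visited struct {
        mu   sync.Mutex
        seen map[string]bool
    }

    // add records a URL and reports whether it was new.
    func (v *visited) add(url string) bool {
        v.mu.Lock()
        defer v.mu.Unlock()
        if v.seen[url] {
            return false
        }
        v.seen[url] = true
        return true
    }

    // crawl visits a URL and then crawls each extracted link in its own
    // goroutine until the depth limit is reached. fetchLinks stands in for
    // the real HTTP fetch and link extraction.
    func crawl(url string, depth int, v *visited, wg *sync.WaitGroup, fetchLinks func(string) []string) {
        defer wg.Done()
        if depth <= 0 || !v.add(url) {
            return
        }
        for _, link := range fetchLinks(url) {
            wg.Add(1)
            go crawl(link, depth-1, v, wg, fetchLinks)
        }
    }

    func main() {
        v := &visited{seen: make(map[string]bool)}
        var wg sync.WaitGroup

        // Stand-in link extractor; the real crawler parses the fetched HTML.
        fetchLinks := func(url string) []string {
            fmt.Println("visiting", url)
            return nil
        }

        wg.Add(1)
        go crawl("https://example.com", 2, v, &wg, fetchLinks)
        wg.Wait()
    }

Each page is handled in its own goroutine, the shared visited set prevents duplicate fetches, and the depth parameter bounds the recursion, which mirrors the features listed above.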

Installation

  1. Make sure you have Go installed on your system. You can download and install Go from the official website: https://golang.org
  2. Clone this repository to your local machine:
    git clone https://github.com/sieep-coding/web-crawler.git
  3. Navigate to the project directory:
    cd web-crawler
  4. Install the required dependencies:
    go mod download

Usage

  1. Open a terminal and navigate to the project directory.
  2. Run the following command to start the web crawler:
    go run main.go <url>
    Replace <url> with the URL you want to crawl (a concrete example follows this list).
  3. Wait for the crawling process to complete. The crawler will display the progress and timing information in the terminal.
  4. Once the crawling is finished, the extracted data will be saved in a CSV file named crawl_results.csv in the project directory.
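
For example, to crawl https://example.com you would run:

    go run main.go https://example.com

The resulting crawl_results.csv then appears in the project directory, as described in step 4.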

Customization

You can customize the web crawler according to your needs:

  • Modify the processPage function in crawler/page.go to extract additional data from the visited pages using the goquery package (see the sketch after this list).
  • Extend the Crawler struct in crawler/crawler.go to include more fields for storing extracted data.
  • Customize the CSV file generation in main.go to match your desired format.
  • Implement rate limiting to avoid overloading the target website.
  • Add support for handling robots.txt and respecting crawling restrictions.
  • Integrate the crawler with a database or file storage to persist the extracted data.
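
As an illustration of the first point, a new extraction step built with goquery could look like the sketch below. extractHeadings is a hypothetical helper written for this example; the actual processPage function in crawler/page.go has its own structure and signature:

    package main

    import (
        "fmt"
        "log"
        "net/http"

        "github.com/PuerkitoBio/goquery"
    )

    // extractHeadings fetches a page and collects its <h2> headings with
    // goquery. It shows the kind of logic you could add to processPage;
    // it is not code taken from this repository.
    func extractHeadings(url string) ([]string, error) {
        resp, err := http.Get(url)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()

        doc, err := goquery.NewDocumentFromReader(resp.Body)
        if err != nil {
            return nil, err
        }

        var headings []string
        doc.Find("h2").Each(func(_ int, s *goquery.Selection) {
            headings = append(headings, s.Text())
        })
        return headings, nil
    }

    func main() {
        headings, err := extractHeadings("https://example.com")
        if err != nil {
            log.Fatal(err)
        }
        for _, h := range headings {
            fmt.Println(h)
        }
    }

The same Find/Each pattern works for any other selector you want to collect, such as tables or image tags.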

License

This project is licensed under the UNLICENSE.
