Summary

Scraperr is a self-hosted web application that allows users to scrape data from web pages by specifying elements via XPath. Users can submit URLs and the corresponding elements to be scraped, and the results will be displayed in a table.

From the table, users can download an Excel sheet of the job's results or rerun the job.

Features

Submitting URLs for Scraping

Submit/Queue URLs for web scraping
Add and manage elements to scrape using XPath
Scrape all pages within the same domain
Add custom JSON headers to send in requests to URLs
Display results of scraped data
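
As an illustration of the custom JSON headers feature, the sketch below checks a headers string before it is submitted. It is not Scraperr's own validation code; the function name `parse_custom_headers` and the rule that values must be strings or numbers are assumptions made for this example.

```python
import json


def parse_custom_headers(raw: str) -> dict[str, str]:
    """Parse a JSON object of HTTP headers into a flat str -> str mapping.

    Hypothetical helper: the accepted value types are an assumption,
    not a documented Scraperr rule.
    """
    if not raw.strip():
        return {}  # no custom headers supplied

    parsed = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(parsed, dict):
        raise ValueError("headers must be a JSON object, e.g. {\"Name\": \"value\"}")

    headers: dict[str, str] = {}
    for name, value in parsed.items():
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, (str, int, float)):
            raise ValueError(f"header {name!r} must be a string or number")
        headers[name] = str(value)
    return headers


if __name__ == "__main__":
    print(parse_custom_headers('{"User-Agent": "my-bot/1.0", "Accept-Language": "en"}'))
```

Validating the string locally makes a malformed header object fail early, before a job is queued.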

Managing Previous Jobs

Download a CSV file containing the results
Rerun jobs
View status of queued jobs
Favorite and view favorited jobs

User Management

User login/signup to organize jobs

Log Viewing

View app logs inside the web UI

Statistics View

View summary statistics for jobs that have been run

Installation

Clone the repository:

git clone https://github.com/jaypyles/scraperr.git

Set environment variables and labels in docker-compose.yml:

scraperr:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.scraperr.rule=Host(`localhost`)" # change this to your domain, if not running on localhost
      - "traefik.http.routers.scraperr.entrypoints=web" # websecure if using https
      - "traefik.http.services.scraperr.loadbalancer.server.port=3000"

scraperr_api:
    environment:
      - LOG_LEVEL=INFO
      - MONGODB_URI=mongodb://root:example@webscrape-mongo:27017 # used to access MongoDB
      - SECRET_KEY=your_secret_key # used to encode authentication tokens (can be a random string)
      - ALGORITHM=HS256 # authentication encoding algorithm
      - ACCESS_TOKEN_EXPIRE_MINUTES=600 # access token expire minutes
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.scraperr_api.rule=Host(`localhost`) && PathPrefix(`/api`)" # change this to your domain, if not running on localhost
      - "traefik.http.routers.scraperr_api.entrypoints=web" # websecure if using https
      - "traefik.http.middlewares.api-stripprefix.stripprefix.prefixes=/api"
      - "traefik.http.routers.scraperr_api.middlewares=api-stripprefix"
      - "traefik.http.services.scraperr_api.loadbalancer.server.port=8000"

mongo:
    environment:
      MONGO_INITDB_ROOT_USERNAME: root
      MONGO_INITDB_ROOT_PASSWORD: example
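
The SECRET_KEY, ALGORITHM, and ACCESS_TOKEN_EXPIRE_MINUTES values control how authentication tokens are signed. The sketch below uses only the Python standard library to generate a random key and to show what HS256 signing with an expiry looks like in principle. It is not the application's own token code; the helper names (`generate_secret_key`, `sign_hs256`, `verify_hs256`) and the payload fields are assumptions made for this example.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time


def generate_secret_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random string that can be used as SECRET_KEY."""
    return secrets.token_urlsafe(num_bytes)


def _b64url(data: bytes) -> str:
    # JWT-style tokens use URL-safe base64 without padding.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def sign_hs256(payload: dict, secret_key: str) -> str:
    """Build a compact token of the form header.payload.signature."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(
        secret_key.encode("utf-8"), signing_input.encode("utf-8"), hashlib.sha256
    ).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_hs256(token: str, secret_key: str) -> bool:
    """Check the signature only; expiry checking is omitted in this sketch."""
    signing_input, _, signature = token.rpartition(".")
    expected = hmac.new(
        secret_key.encode("utf-8"), signing_input.encode("utf-8"), hashlib.sha256
    ).digest()
    return hmac.compare_digest(_b64url(expected), signature)


if __name__ == "__main__":
    key = generate_secret_key()
    expire_minutes = 600  # same value as ACCESS_TOKEN_EXPIRE_MINUTES above
    payload = {"sub": "demo-user", "exp": int(time.time()) + expire_minutes * 60}
    token = sign_hs256(payload, key)
    print("SECRET_KEY=" + key)
    print("signature valid:", verify_hs256(token, key))
```

Because HS256 is symmetric, anyone who knows SECRET_KEY can mint valid tokens, so the placeholder value should be replaced before deploying.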

Don't want to use Traefik? The same configuration works behind other reverse proxies, as long as the API is proxied to /api of the frontend container. Running without a reverse proxy is not currently supported, because of limitations in how Next.js handles client-side environment variables at runtime.

Deploy

make up

The app ships with its own Traefik configuration and can run standalone, but it can also sit behind any other reverse proxy you already use.

Usage

Open the application in your browser at http://localhost.
Enter the URL you want to scrape in the URL field.
Add elements to scrape by specifying a name and the corresponding XPath.
Click the "Submit" button to queue the URL to be scraped.
View the queue in the "Previous Jobs" section.
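
To make the name-and-XPath pairing in the steps above concrete, here is a minimal sketch that applies a mapping of element names to XPath expressions against a small, well-formed sample page. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup, so it is an illustration of the idea rather than how Scraperr itself evaluates XPath. The function name `extract_elements` and the sample markup are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; real pages are rarely well-formed XML.
SAMPLE_PAGE = """<html><body>
<h1>Example Store</h1>
<ul>
<li class="price">10.00</li>
<li class="price">12.50</li>
</ul>
</body></html>"""


def extract_elements(markup: str, elements: dict[str, str]) -> dict[str, list[str]]:
    """Return the text of every node matched by each named XPath expression."""
    root = ET.fromstring(markup)
    results: dict[str, list[str]] = {}
    for name, xpath in elements.items():
        # ElementTree needs relative paths, hence the leading ".//".
        results[name] = [
            "".join(node.itertext()).strip() for node in root.findall(xpath)
        ]
    return results


if __name__ == "__main__":
    wanted = {"title": ".//h1", "prices": ".//li[@class='price']"}
    print(extract_elements(SAMPLE_PAGE, wanted))
```

Each name becomes a column-like key and each XPath can match several nodes, which is why the values are lists.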

API Endpoints

Scraperr can also be used as an API for your own projects. Because the backend is built with FastAPI, an interactive docs page for the API is available at /docs.
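
FastAPI applications also serve a machine-readable OpenAPI schema (at /openapi.json by default, unless the deployment disables or moves it), which can be used to discover endpoints from a script. The sketch below lists method and path pairs from such a schema. The base URL `http://localhost/api` is an assumption based on the Traefik rule in the Installation section, and the paths in the demo schema are hypothetical placeholders, not Scraperr's real routes.

```python
import json
import urllib.request

HTTP_METHODS = {"get", "post", "put", "patch", "delete", "head", "options"}


def list_endpoints(schema: dict) -> list[tuple[str, str]]:
    """Return sorted (METHOD, path) pairs found in an OpenAPI schema dict."""
    endpoints = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Path items can also hold non-method keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                endpoints.append((method.upper(), path))
    return sorted(endpoints, key=lambda item: (item[1], item[0]))


def fetch_schema(base_url: str) -> dict:
    """Download the OpenAPI schema; assumes it is exposed at /openapi.json."""
    with urllib.request.urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)


if __name__ == "__main__":
    # Against a running instance (assumed URL): fetch_schema("http://localhost/api")
    demo_schema = {
        "paths": {
            "/hypothetical/items": {"get": {}, "post": {}},
            "/hypothetical/items/{id}": {"delete": {}, "parameters": []},
        }
    }
    for method, path in list_endpoints(demo_schema):
        print(method, path)
```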

Troubleshooting

Q: When running Scraperr, I'm met with "404 Page not found".
A: This is probably a MongoDB issue caused by running Scraperr in a VM. You should see something like this in the output of make logs:

WARNING: MongoDB 5.0+ requires a CPU with AVX support, and your current system does not appear to have that!

To resolve this, set the VM's CPU type to host. In Proxmox, this setting is under the VM settings > Processor. Related issue.
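
To confirm whether the guest actually sees AVX, the CPU flags can be inspected from inside the VM. The sketch below parses the text of /proc/cpuinfo, so it assumes an x86 Linux guest; the helper names `cpu_flags` and `has_avx` are made up for this example.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def has_avx(cpuinfo_text: str) -> bool:
    """True if the exact 'avx' flag is present (avx2 alone does not count)."""
    return "avx" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print("AVX available:", has_avx(cpuinfo.read_text()))
    else:
        print("/proc/cpuinfo not found; this check only works on Linux")
```

If this reports no AVX inside the VM while the host CPU supports it, the CPU type setting described above is the likely cause.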

Legal and Ethical Considerations

When using Scraperr, please ensure that you:

Check Robots.txt: Verify allowed pages by reviewing the robots.txt file of the target website.
Compliance: Always comply with the website's Terms of Service (ToS) regarding web scraping.
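
The robots.txt check can be automated before queuing a job. The following sketch uses Python's standard `urllib.robotparser` on robots.txt content that has already been downloaded; the function name `is_allowed`, the sample rules, and the example.com URLs are placeholders for illustration. It does not replace reading the site's Terms of Service.

```python
from urllib.robotparser import RobotFileParser

# Placeholder rules for demonstration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for candidate in (
        "https://example.com/public/page",
        "https://example.com/private/data",
    ):
        print(candidate, "allowed:", is_allowed(SAMPLE_ROBOTS_TXT, candidate))
```

To check a live site instead, `RobotFileParser.set_url()` followed by `read()` fetches the file directly.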

Disclaimer: This tool is intended for use only on websites that permit scraping. The author is not responsible for any misuse of this tool.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributions

Development is simplified by building on the webapp template. See the documentation for more information.

Start development server:

make deps build up-dev
