Micro Web Crawler in PHP & Manticore
Yo! is a super-thin client-server crawler based on Manticore full-text search.
It is compatible with different networks and includes flexible settings, history snaps, CLI tools and a UI for the Gemini Protocol.
To use the HTTP version, please check out the main branch!
- MIME-based crawler with flexible filter settings: regular expressions, selectors, external links, etc.
- Page snap history with local and remote mirrors support (including FTP protocol)
- CLI tools for index administration and crontab tasks
- Gemini Protocol UI (coming soon)
- Manticore Server
- PHP library for Manticore
- PHP library for Gemini Protocol
- PHP library for Network operations
- FTP client for snap mirrors
wget https://repo.manticoresearch.com/manticore-repo.noarch.deb
dpkg -i manticore-repo.noarch.deb
apt update
apt install git composer manticore manticore-extra memcached php-fpm php-mbstring php-memcached
The Yo search engine uses Manticore as its primary database. If your server is sensitive to power loss,
change the default binlog flush strategy to binlog_flush = 1
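
To confirm the setting took effect, the sketch below reads the active `binlog_flush` value from a Manticore config file. It assumes the option lives in the `searchd` section, where a value of 1 flushes and syncs the binlog on every transaction. The function name and the default config path are assumptions for illustration (the path is the usual one for Debian packages), so pass your own path if it differs.

```shell
#!/bin/sh
# Hypothetical helper: print the binlog_flush value set in a Manticore config.
# The default path is an assumption; pass the real path as the first argument.
binlog_flush_value() {
    conf="${1:-/etc/manticoresearch/manticore.conf}"
    # Match uncommented "binlog_flush = N" lines and print N from the last one.
    sed -n 's/^[[:space:]]*binlog_flush[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$conf" | tail -n 1
}
```

An empty result means the option is not set explicitly and the server default applies. Restart Manticore after editing the config so the new value is picked up.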
git clone https://github.com/YGGverse/Yo.git
cd Yo
git checkout gemini
composer update

git clone https://github.com/YGGverse/Yo.git
cd Yo
git checkout gemini
git checkout -b pr-branch
git commit -m 'new fix'
git push

cd Yo
git pull
composer update

cp example/config.json config.json
php src/cli/index/init.php

php src/cli/document/add.php URL
php src/cli/document/crawl.php
php src/cli/document/search.php '*'
Coming soon..
Create initial index
php src/cli/index/init.php [reset]
reset - optional, reset existing index
Change existing index
php src/cli/index/alter.php {operation} {column} {type}
operation - operation name, supported values: add|drop
column - target column name
type - target column type, supported values: text|integer
php src/cli/document/add.php URL
URL - new URL to add to the crawl queue
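
Since add.php takes one URL per call, seeding the queue from a list needs a small loop. The sketch below is a hypothetical helper (the function name and the seeds file are not part of Yo): it passes every non-empty, non-comment line of a file to whatever command follows the file name.

```shell
#!/bin/sh
# Hypothetical helper: run a command once per URL listed in a text file.
# Usage: add_urls_from_file FILE COMMAND [ARGS...]
add_urls_from_file() {
    list="$1"; shift
    # The "|| [ -n ... ]" part keeps a last line that has no trailing newline.
    while IFS= read -r url || [ -n "$url" ]; do
        case "$url" in
            ''|'#'*) continue ;;  # skip blank lines and comments
        esac
        # Redirect stdin so the command cannot consume the rest of the list.
        "$@" "$url" < /dev/null
    done < "$list"
}

# Example (seeds.txt is an assumed file name):
# add_urls_from_file seeds.txt php src/cli/document/add.php
```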
php src/cli/document/crawl.php
Optimize the index and apply new configuration rules
php src/cli/document/clean.php [limit]
limit - integer, number of documents per queue
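
The crawl and clean tools are natural candidates for crontab, where a slow run can overlap with the next scheduled one. A minimal sketch of a guard against that, assuming a lock directory is acceptable on your system; the function name, the lock path and the exit code 75 are illustrative choices, not part of Yo.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command only if no earlier run holds the lock.
# Usage: run_locked LOCK_DIR COMMAND [ARGS...]
run_locked() {
    lock_dir="$1"; shift
    # mkdir is atomic: it fails if the directory already exists.
    if ! mkdir "$lock_dir" 2>/dev/null; then
        echo "previous run still active, skipping" >&2
        return 75
    fi
    "$@"
    status=$?
    rmdir "$lock_dir"
    return "$status"
}

# Example (both paths are assumptions):
# run_locked /tmp/yo-crawl.lock php /path/to/Yo/src/cli/document/crawl.php
```

If the wrapped command is killed before it returns, the lock directory stays behind and has to be removed by hand.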
php src/cli/document/search.php '@title "*"' [limit]
query - required
limit - optional search results limit
SQL text dumps can be useful for public index distribution, but require more computing resources.
Better suited for infrastructure administration; includes the original data binaries.
Coming soon..