
Fix spider #294

Closed
wants to merge 3 commits into from

jayvdb
Contributor

@jayvdb jayvdb commented Jul 28, 2016

No description provided.

jayvdb added 3 commits July 28, 2016 07:28
spider.py used both Python 2-only (md5) and Python 3-only (urllib) imports.
Also, it didn't use a namespace when searching for links to spider,
and did not read the robots.txt, which prevented any spidering from occurring.
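
For illustration only, here is a minimal sketch of the three kinds of fix this commit message describes. It is not the code from the pull request: the helper names (`content_digest`, `find_links`, `is_allowed`) are hypothetical, the Python 3-only import is assumed to be a `urllib` submodule such as `urllib.robotparser`, and the parsed tree is assumed to use XHTML-namespaced element names.

```python
import hashlib
import xml.etree.ElementTree as ET

try:  # Python 3 location
    from urllib.robotparser import RobotFileParser
except ImportError:  # Python 2 location
    from robotparser import RobotFileParser

# Assumption: elements in the parsed tree carry the XHTML namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def content_digest(data):
    """Hash page bytes with hashlib, which exists on both Python 2 and 3."""
    return hashlib.md5(data).hexdigest()


def find_links(root):
    """Collect href values from namespaced <a> elements."""
    return [a.get("href") for a in root.iter(XHTML_NS + "a") if a.get("href")]


def is_allowed(robots_lines, user_agent, url):
    """Check a URL against robots.txt rules given as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# A tiny namespaced document standing in for a fetched page.
page = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<a href="/about">About</a></body></html>'
)
```

In this sketch, searching for the bare tag name `a` finds nothing, while the namespaced name finds the link, which is the kind of mismatch that would leave a spider with no links to follow.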

Fix exception occurring when robots processing removed items from
the list toVisit while iterating over that list.
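
As a hedged illustration of the general problem (the commit message does not say which exception was raised, and the function names below are hypothetical), this sketch shows why removing items from a collection while iterating over it is unsafe, and one common way around it. With a Python list the loop silently skips elements; with a set the same pattern raises `RuntimeError`.

```python
def filter_in_place_buggy(to_visit, is_allowed):
    """Remove disallowed URLs while iterating: skips the element after each removal."""
    for url in to_visit:
        if not is_allowed(url):
            to_visit.remove(url)
    return to_visit


def filter_safely(to_visit, is_allowed):
    """Build a new list instead of mutating the one being iterated."""
    return [url for url in to_visit if is_allowed(url)]


def not_blocked(url):
    """Stand-in for a robots.txt check: URLs starting with 'x' are disallowed."""
    return not url.startswith("x")


buggy_result = filter_in_place_buggy(["a", "x1", "x2", "b"], not_blocked)
safe_result = filter_safely(["a", "x1", "x2", "b"], not_blocked)
```

Here `buggy_result` still contains `"x2"`, because removing `"x1"` shifted it into a position the loop had already passed, while `safe_result` contains only the allowed URLs.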

Add more output on stderr, and a main() that spiders yahoo.com
@jayvdb
Contributor Author

jayvdb commented Jul 28, 2016

We could probably replace most of this with an existing library.

@jayvdb jayvdb mentioned this pull request Jul 28, 2016
@Ms2ger
Contributor

Ms2ger commented Aug 5, 2016

Can we use six for the urllib imports?

@jayvdb
Contributor Author

jayvdb commented Aug 5, 2016

Using six here looks good. I checked, and it has provided robotparser since 2013 (https://bitbucket.org/gutworth/six/pull-requests/5/create-sixmovesurllib/diff / https://bitbucket.org/gutworth/six/commits/1f2c7f5d14be9027d180aa00138a1d29c8f48a6f), released in six 1.4.0.
Should I find a minimum version for six support? Currently it doesn't have one.
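
For reference, a minimal sketch of what the six-based imports might look like. This is an assumption about the shape of the change, not code from the pull request; the fallback branch exists only so the snippet still runs where six is not installed.

```python
try:
    # six exposes the reorganised urllib modules under six.moves.urllib
    from six.moves.urllib import parse as urllib_parse
    from six.moves.urllib import robotparser
except ImportError:
    # six is not installed: fall back to the Python 3 standard library
    from urllib import parse as urllib_parse
    from urllib import robotparser

# The same names then work regardless of which branch was taken.
absolute = urllib_parse.urljoin("http://example.com/a/", "b")
parser = robotparser.RobotFileParser()
```

With this approach the rest of the module refers only to `urllib_parse` and `robotparser`, so no further version checks are needed.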

@gsnedders
Member

As far as I'm aware, nobody's actually run it for years, hence it being badly broken. We should probably just kill it at this point, as it likely has no real use.

@willkg willkg mentioned this pull request Oct 3, 2017
@willkg
Contributor

willkg commented Oct 3, 2017

Given the previous comment, I'm going to close this out. I opened #349 to cover removing utils/spider.py.

@willkg willkg closed this Oct 3, 2017
4 participants