Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content

WIP More general fix for #127 with addinfourl #136

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

SimonSapin
Copy link
Contributor

Do not merge yet, this is incomplete.

As discussed in #134 I changed this to avoid .read(0) entirely and pass a first chunk to HTMLUnicodeInputStream and HTMLBinaryInputStream, but then I get lost in their implementation and I don’t know what to do with that first chunk.

Surprisingly, most tests still passed because .seek(0) is used in some places. I added a failing test with a non-seekable input.

@hoppipolla-critic-bot
Copy link

Critic review: https://critic.hoppipolla.co.uk/r/498

This is an external review system which you may optionally use for the code review of your pull request.

In order to help critic track your changes, please do not make in-place history rewrites (e.g. via git rebase -i or git commit --amend) when updating this pull request.

@gsnedders
Copy link
Member

Urgh, looking at this again I must've been thinking of my somewhat-working-written-a-long-time-ago Python 3-only rewrite of the input stream. Dealing with firstChunk is going to be hideously complex.

Alternative idea for fixing this (comments welcome, don't take this as me saying what the right approach is — I don't know what the right approach is!): make BufferedStream generic and work for both unicode and bytes file-like objects (literally the only thing specific is the b"".join at the end!), make sure it hides bugs where read is broken (so it guarantees len(s.read(n)) <= n), and then wrap all non-seekable streams with it in HTMLInputStream, which then means read(0) is safe for all non-seekable streams.

@gsnedders
Copy link
Member

The other option is to try and finish my long-abandoned Python 3-only rewrite of inputstream, create something with the same API (but slower when changing character encoding) for Python 2. https://github.com/gsnedders/html5lib-python/tree/new-inputstream is what there is. Hopefully shows the vague idea.

@SimonSapin
Copy link
Contributor Author

At the moment BufferedStream is only used when the input stream is non-seekable, so it doesn’t cover all cases.

@gsnedders
Copy link
Member

Hmm, looking at addinfourl, I wonder if we should just detect the bug with if isinstance(source, http_client.HTTPResponse) or source.read == http_client.HTTPResponse. This would handle both any cases of people subclassing HTTPResponse as well as handling addinfourl. Do we actually want to work around more general cases of people breaking the file-like object API contract? I'd argue not… This is only really special in that it is stdlib code breaking it.

@SimonSapin
Copy link
Contributor Author

At one point I had if isinstance(six.get_method_self(source.read), http_client.HTTPResponse), but that broke on objects where source.read was not a method.

@gsnedders
Copy link
Member

Is there any reason not to just go for the equality check?

@SimonSapin
Copy link
Contributor Author

Do you mean if source.read == http_client.HTTPResponse.read? That doesn’t work, source.read is a bound method.

@gsnedders
Copy link
Member

So, what, get_method_self would raise AttributeError, no? Can we not just check for that?

@SimonSapin
Copy link
Contributor Author

We can, it’s just one more corner case piling up…

@SimonSapin
Copy link
Contributor Author

How about this?

# Work around Python bug #20007: read(0) closes the connection.
if (isinstance(source, http_client.HTTPResponse) or 
    # Also check for addinfourl wrapping HTTPResponse
    isinstance(getattr(source, 'fp', None), http_client.HTTPResponse)):

It should never raise, and cover every case where we’ve seen 20007 so far.

@gsnedders
Copy link
Member

I guess it is as good as anything. As ugly as it is.

@gsnedders
Copy link
Member

I'm so confused as to what's happened here. Why did this get dropped? Was I expecting you to update PR with that? I don't know. Looking over this again, get_method_self seems sanest, along with moving it within the following if. Closing this to replace it with #186.

@gsnedders gsnedders closed this Apr 26, 2015
@gsnedders
Copy link
Member

Oh, no, I'm being stupid. I'm thinking about JS semantics and not Python ones!

gsnedders added a commit to gsnedders/html5lib-python that referenced this pull request Apr 26, 2015
gsnedders added a commit to gsnedders/html5lib-python that referenced this pull request Dec 12, 2015
gsnedders added a commit to gsnedders/html5lib-python that referenced this pull request Dec 12, 2015
gsnedders added a commit to gsnedders/html5lib-python that referenced this pull request Dec 12, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants