This tracker is in read-only mode. To report bugs, please use the appropriate GitHub repository.

 

Issue 103

Title: repoze.bitblt removes doctype
Priority: bug
Status: resolved
Superseder:
Nosy List: dbaty, jinty
Assigned To: dbaty
Topics: repoze.bitblt

Created on 2009-11-03.18:37:10 by dbaty, last changed 2010-01-28.05:19:10 by jinty.

Files
File name Uploaded Type
keep_html_and_xhtml_doctypes.patch jinty, 2009-11-13.05:13:36 application/octet-stream
tests.py.patch dbaty, 2009-11-03.18:37:45 text/x-patch
transform.py.patch dbaty, 2009-11-03.18:37:09 text/x-patch
Messages
msg344 Author: jinty Date: 2010-01-28.05:19:09
Fixed in revision 8095 by using regexes instead of lxml to parse img tags.
msg343 Author: jinty Date: 2010-01-27.11:32:34
I've just discovered yet another way in which lxml is mangling my HTML. I'm fed up with fixing 
around the edges.

So in the next week or so, I will try to re-implement finding and replacing the <img> tags 
with regular expressions. Given that malthe seems to think it's a reasonable idea, I'll do it 
inside repoze.bitblt on a branch first.

So, any ideas on how to robustly find <img> tags in HTML via regexes? Any example 
implementations? 

Also, what kind of backwards compatibility does repoze.bitblt require? Should I replace the 
existing rewrite_image_tags or implement a new one alongside and implement some kind of 
configurability into ImageTransformationMiddleware?
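
One possible shape for a regex-based rewrite, as a minimal sketch only. This is not the code that landed in the repository; the names rewrite_img_src and new_src and the "/scaled/" prefix are made up for illustration. The tag pattern skips over quoted attribute values so that a ">" inside alt="..." does not end the match, and everything outside the matched <img> tags (doctype included) is left untouched.

```python
import re

# An <img ...> tag: any run of non-quote characters or complete quoted
# strings, up to the first ">" that is outside quotes.
IMG_TAG = re.compile(r"""<img\b(?:[^>"']|"[^"]*"|'[^']*')*>""", re.IGNORECASE)

# A quoted src attribute preceded by whitespace (so data-src is not matched).
SRC_ATTR = re.compile(r"""(\ssrc\s*=\s*)("[^"]*"|'[^']*')""", re.IGNORECASE)


def rewrite_img_src(html, new_src):
    """Return html with the src of every <img> tag passed through new_src.

    new_src is a hypothetical callback: old URL in, new URL out.
    """
    def fix_tag(tag_match):
        def fix_src(m):
            quote = m.group(2)[0]
            old_url = m.group(2)[1:-1]
            return m.group(1) + quote + new_src(old_url) + quote
        # Only the first src attribute in the tag is rewritten.
        return SRC_ATTR.sub(fix_src, tag_match.group(0), count=1)

    return IMG_TAG.sub(fix_tag, html)


page = '<!DOCTYPE html>\n<p>x</p><IMG alt="a > b" src="a.png" width="10">'
rewritten = rewrite_img_src(page, lambda url: "/scaled/" + url)
```

Known limits of this sketch: unquoted src values are ignored, and <img> text inside comments or <script> blocks would be rewritten as well.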
msg322 Author: jinty Date: 2009-11-13.05:40:54
You're right on the slowness of beautifulsoup compared to lxml: 
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
msg321 Author: malthe Date: 2009-11-13.05:29:38
Speed matters; ``lxml`` is very fast and does not incur any significant overhead. I feel that 
BeautifulSoup would (this might not be correct).

Regular expressions are very fast, and sometimes brittle. Everything's a compromise. 
However, image tags are usually the most well-formed elements in an HTML page.

I think it'll work.
msg320 Author: jinty Date: 2009-11-13.05:20:08
As an alternative to regexes, there's always BeautifulSoup http://www.crummy.com/software/BeautifulSoup/documentation.html
msg319 Author: jinty Date: 2009-11-13.05:13:36
I was also bitten by this. Attached is the patch I am using; it includes and expands on the 
originally posted patches, using dbaty's "more complex than it should be" method to keep the 
doctype out of HTML that didn't already have it.

Using regexes does start to seem like a nice idea after looking at all the gymnastics one has to 
go through with lxml.
msg301 Author: malthe Date: 2009-11-11.05:48:25
Perhaps we can use ``lxml`` to extract the locations (string start
and end offsets) of the ``<img>`` tags and then simply use regex
matching on those.

This way, the original document isn't changed, but we don't have the
pitfalls of heuristics.
msg300 Author: dbaty Date: 2009-11-11.05:39:44
Malthe, I think you have replied to the wrong ticket. The patch I described has
not been applied (and regular expressions, well, we can use them everywhere, of
course, but... ;) )

(Note: I plan to commit the patch next Friday when I have time.)
msg299 Author: malthe Date: 2009-11-11.04:06:46
I see this patch has already been applied.

Perhaps we should consider using regular expressions to do this. Chances are
that it'll be a) faster, b) less intrusive.
msg291 Author: dbaty Date: 2009-11-03.18:37:09
When rewriting image tags, repoze.bitblt removes the doctype of any (X)HTML 
content (cf. attached test). It should not.

I have found a fix for XHTML code (cf. attached patch) by changing how the 
content is parsed. However, the bug persists for HTML content (when 'try_html' 
is not enforced). I tried to use the same technique as for XHTML (using 
lxml.etree.parse() instead of lxml.html.document_fromstring()) but the transformed 
content then always includes a doctype. Perhaps we could then remove it when 
it was not present in the original content, but it starts to be a bit more 
complicated than it should... (I admit that I did not dig too much in lxml...)

In a nutshell, the attached patch will keep the doctype for XHTML content. For 
HTML content, the current (bogus) behaviour is kept (and the doctype is 
removed). Malthe (or anyone who uses this package), if you do not object, I'll 
commit the patch.
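
For illustration, the "remove the doctype when it was not present in the original content" idea could look roughly like the sketch below. It is not taken from the attached patches; match_doctype is a made-up helper that works on plain strings after serialization, whichever parser produced them. It assumes the doctype has no internal subset (no "[ ... ]" containing ">").

```python
import re

# Optional XML declaration, then a doctype declaration at the very start.
DOCTYPE = re.compile(
    r"\A\s*(?:<\?xml[^>]*\?>\s*)?(<!DOCTYPE[^>]*>)", re.IGNORECASE
)


def match_doctype(original, transformed):
    """Make transformed carry a doctype if and only if original did."""
    orig = DOCTYPE.match(original)
    new = DOCTYPE.match(transformed)
    if orig and not new:
        # The serializer dropped the doctype: put the original one back.
        return orig.group(1) + "\n" + transformed
    if new and not orig:
        # The serializer invented a doctype: strip it again.
        return transformed[:new.start(1)] + transformed[new.end(1):].lstrip()
    return transformed


added = match_doctype(
    '<html><body><img src="a.png"></body></html>',
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">\n'
    '<html><body><img src="b.png"></body></html>',
)
dropped = match_doctype(
    '<!DOCTYPE html>\n<html><body></body></html>',
    '<html><body></body></html>',
)
```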
History
Date User Action Args
2010-01-28 05:19:10 jinty set status: chatting -> resolved; messages: + msg344
2010-01-27 11:32:35 jinty set messages: + msg343
2009-11-13 05:40:54 jinty set messages: + msg322
2009-11-13 05:29:38 malthe set messages: + msg321
2009-11-13 05:20:09 jinty set nosy: + jinty; messages: + msg320
2009-11-13 05:13:36 jinty set files: + keep_html_and_xhtml_doctypes.patch; messages: + msg319
2009-11-11 05:48:26 malthe set messages: + msg301
2009-11-11 05:39:45 dbaty set topic: + repoze.bitblt; messages: + msg300
2009-11-11 04:06:47 malthe set status: unread -> chatting; messages: + msg299
2009-11-03 18:37:45 dbaty set files: + tests.py.patch
2009-11-03 18:37:10 dbaty create