weixin_39918248
2020-12-02 13:26

bleach linkify unescapes escaped HTML sometimes

Thanks for your library; we have used it for years :+1:

Recently we discovered some strange behaviour in linkify. Please take a look at the example below. As you can see, it sometimes unescapes a correctly escaped tag:


In [1]: import bleach

In [2]: bleach.__version__
Out[2]: u'1.4'

In [3]: bleach.linkify('&lt;br&gt;')  # everything is okay
Out[3]: u'&lt;br&gt;'

In [4]: bleach.linkify('&lt;br&gt; http://example.com')  # why is <br> unescaped?
Out[4]: u'<br> <a href="http://example.com" rel="nofollow">http://example.com</a>'

In [5]: bleach.linkify('&lt;br&gt; <br> http://example.com')  # ... and here it's not again ;-)
Out[5]: u'&lt;br&gt; <br> <a href="http://example.com" rel="nofollow">http://example.com</a>'

Frankly, I have no idea how it should be fixed. Thanks for any input.

This question comes from the open-source project: mozilla/bleach

13 answers

  • weixin_39575775 5 months ago

    Uh, hmm. What version of html5lib are you using?

  • weixin_39918248 5 months ago

    In [6]: html5lib.__version__
    Out[6]: u'1.0b3'
    
  • weixin_39777404 5 months ago

    I can confirm this with bleach==1.4 and html5lib==0.999.

  • weixin_39628380 5 months ago

    I can also confirm this.

    
    [marca-mac2 pypi]$ pip freeze | egrep 'bleach|html5lib'
    bleach==1.4.1
    html5lib==0.999
    
    
    In [1]: import bleach
    
    In [2]: bleach.__version__
    Out[2]: u'1.4.1'
    
    In [3]: bleach.linkify('&lt;br&gt;')  # everything is okay
    Out[3]: u'&lt;br&gt;'
    
    In [4]: bleach.linkify('&lt;br&gt; http://example.com')  # why is <br> unescaped?
    Out[4]: u'<br> <a href="http://example.com" rel="nofollow">http://example.com</a>'
    
    In [5]: bleach.linkify('&lt;br&gt; <br> http://example.com')  # ... and here it's not again ;-)
    Out[5]: u'&lt;br&gt; <br> <a href="http://example.com" rel="nofollow">http://example.com</a>'
    
    
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;br&gt; http://example.com'))"
    <br> <a href="http://example.com" rel="nofollow">http://example.com</a>
    
  • weixin_39628380 5 months ago

    It seems to have something to do with the escaped text sitting next to a URL that gets linkified. See:

    
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;br&gt;'))"
    &lt;br&gt;
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;br&gt; '))"
    &lt;br&gt;
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;br&gt; h'))"
    &lt;br&gt; h
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;br&gt; ht'))"
    &lt;br&gt; ht
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;br&gt; htt'))"
    &lt;br&gt; htt
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;br&gt; http'))"
    &lt;br&gt; http
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;br&gt; http:'))"
    &lt;br&gt; http:
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;br&gt; http:/'))"
    &lt;br&gt; http:/
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;br&gt; http://'))"
    &lt;br&gt; http://
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;br&gt; http://www'))"
    &lt;br&gt; http://www
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;br&gt; http://www.'))"
    &lt;br&gt; http://www.
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;br&gt; http://www.google'))"
    &lt;br&gt; http://www.google
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;br&gt; http://www.google.com'))"
    <br> <a href="http://www.google.com" rel="nofollow">http://www.google.com</a>
    
  • weixin_39628380 5 months ago

    It also only seems to happen if the characters between &lt; and &gt; form one of a number of HTML tags (maybe the set of HTML tags considered "safe"):

    
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;foo&gt; http://www.google.com'))"
    &lt;foo&gt; <a href="http://www.google.com" rel="nofollow">http://www.google.com</a>
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;html&gt; http://www.google.com'))"
    &lt;html&gt; <a href="http://www.google.com" rel="nofollow">http://www.google.com</a>
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;body&gt; http://www.google.com'))"
    &lt;body&gt; <a href="http://www.google.com" rel="nofollow">http://www.google.com</a>
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;div&gt; http://www.google.com'))"
    <div> <a href="http://www.google.com" rel="nofollow">http://www.google.com</a></div>
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;span&gt; http://www.google.com'))"
    <span> <a href="http://www.google.com" rel="nofollow">http://www.google.com</a></span>
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;b&gt; http://www.google.com'))"
    <b> <a href="http://www.google.com" rel="nofollow">http://www.google.com</a></b>
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;head&gt; http://www.google.com'))"
    &lt;head&gt; <a href="http://www.google.com" rel="nofollow">http://www.google.com</a>
    

    For some tags something even stranger happens: the escaped tag disappears completely:

    
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;tr&gt; http://www.google.com'))"
     <a href="http://www.google.com" rel="nofollow">http://www.google.com</a>
    [marca-mac2 pypi]$ python -c "import bleach; print(bleach.linkify('&lt;td&gt; http://www.google.com'))"
     <a href="http://www.google.com" rel="nofollow">http://www.google.com</a>
    
  • weixin_39575775 5 months ago

    Thanks for all the research here, folks. I'm digging into this today and tomorrow. Follow #143 for more frequent updates.

  • weixin_39575775 5 months ago

    I had a bit of an epiphany yesterday and I think this is related to how text nodes get split up and replaced when there's a link in them.

    >>> bleach.linkify('&lt;br&gt;<br> example.com')
    u'&lt;br&gt;<br> <a href="http://example.com" rel="nofollow">example.com</a>'
    

    The actual <br> tag means that there are 3 nodes: text, br, text, and we don't mess with the first text node. But without it,

    >>> bleach.linkify('&lt;br&gt; example.com')
    u'<br> <a href="http://example.com" rel="nofollow">example.com</a>'
    

    There's only one text node, which gets split up into 3 nodes, which should be text, a, text. But the two new text nodes are getting replaced with something other than text nodes.

  • weixin_39575775 5 months ago

    replace_nodes is broken-as-designed right now; it's going to take minor surgery to fix.

  • weixin_39758956 5 months ago

    When HTMLTokenizer tokenizes the text given to it, it unescapes all HTML entities and then constructs nodes.

    https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py#L253

    These unescaped entities are still rendered correctly as escaped because HTMLSerializer escapes them again when token["type"] is either "Characters" or "SpaceCharacters".

    https://github.com/html5lib/html5lib-python/blob/master/html5lib/serializer/htmlserializer.py#L223
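
The round trip described above can be sketched with the standard library alone. This is only an illustration: html.unescape and html.escape stand in for the html5lib tokenizer and serializer, which is an assumption made for the example and not how html5lib is implemented.

```python
import html

source = "&lt;br&gt; example.com"

# Tokenizing: character references are decoded into the token's data.
token_data = html.unescape(source)

# Serializing a character token: the data is escaped again on the way out.
serialized = html.escape(token_data, quote=False)

print(token_data)   # <br> example.com
print(serialized)   # &lt;br&gt; example.com
```

As long as the decoded data stays inside a character token, the escaping is restored on output.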

    But in the bleach code, replace_nodes reconstructs an Element from the node's text.

    https://github.com/mozilla/bleach/blob/master/bleach/__init__.py#L149

    At that point we no longer check whether the token containing <br> has type "Characters" or not.

    
    original data
    &lt;br&gt; example.com
    
    after tokenizing
    | <br> example.com |
    | type="Characters"|
    
    after replace_nodes
    |      <br>     |<a href="http://example.com" rel="nofollow">|  example.com  |     </a>    |
    |type="EmptyTag"|               type="StartTag"              |type=Characters|type="EndTag"|
    
    

    So yeah, it seems like replace_nodes is broken-as-designed.
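
The effect of feeding a text node's data back into a parser can be reproduced with Python's built-in HTMLParser. This is a rough sketch, not bleach or html5lib code; EventCollector and parse are hypothetical helpers written for the illustration.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Records (kind, value) pairs for the tags and text the parser sees."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("tag", tag))

    def handle_data(self, data):
        # Merge adjacent text chunks so one run of text is one event.
        if self.events and self.events[-1][0] == "text":
            self.events[-1] = ("text", self.events[-1][1] + data)
        else:
            self.events.append(("text", data))


def parse(markup):
    collector = EventCollector()
    collector.feed(markup)
    collector.close()
    return collector.events


# First pass: the escaped input is a single text node with decoded data.
first_pass = parse("&lt;br&gt; example.com")

# Second pass: re-parsing that decoded data turns the text into a real tag.
second_pass = parse(first_pass[0][1])

print(first_pass)
print(second_pass)
```

The second pass is where the escaping is lost: the parser cannot tell that "<br>" used to be escaped text.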

    We should not construct a new Element by passing the whole node.text to parseFragment; we should deal only with the URL part of the node.

    But I don't fully understand html5lib's tree structure, so I don't think I can write a patch for that.

    After replace_nodes, the tree should look like this:

    
    |       <br>      |<a href="http://example.com" rel="nofollow">|  example.com  |     </a>    |
    |type="Characters"|               type="StartTag"              |type=Characters|type="EndTag"|
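
A minimal sketch of that idea, assuming a deliberately naive URL pattern and hypothetical helper names (linkify_text_node, serialize); it is not the patch that went into bleach. The text node is split around each URL, only the URL becomes an anchor, and the surrounding pieces stay character tokens that are escaped on output.

```python
import html
import re

# Deliberately naive URL pattern, for illustration only.
URL_RE = re.compile(r"(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/\S*)?", re.IGNORECASE)


def linkify_text_node(text):
    """Split one text node's data into (token_type, data) pairs."""
    tokens = []
    pos = 0
    for match in URL_RE.finditer(text):
        if match.start() > pos:
            tokens.append(("Characters", text[pos:match.start()]))
        url = match.group(0)
        href = url if "://" in url else "http://" + url
        tokens.append(("StartTag", {"href": href, "rel": "nofollow"}))
        tokens.append(("Characters", url))
        tokens.append(("EndTag", "a"))
        pos = match.end()
    if pos < len(text):
        tokens.append(("Characters", text[pos:]))
    return tokens


def serialize(tokens):
    """Escape character tokens; emit anchor tags as markup."""
    out = []
    for kind, data in tokens:
        if kind == "Characters":
            out.append(html.escape(data, quote=False))
        elif kind == "StartTag":
            attrs = "".join(' %s="%s"' % (k, html.escape(v)) for k, v in data.items())
            out.append("<a%s>" % attrs)
        else:
            out.append("</a>")
    return "".join(out)


# The decoded text of the node, as the tokenizer would hand it over.
result = serialize(linkify_text_node("<br> example.com"))
print(result)
```

Because the non-URL text is never re-parsed, "<br>" stays a character token and comes out escaped.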
    
  • weixin_39758956 5 months ago

    This is also why the following happens.

    
    bleach.linkify('&lt;br&gt;<br> example.com')
    u'&lt;br&gt;<br> <a href="http://example.com" rel="nofollow">example.com</a>'
    
    
    original data
    &lt;br&gt;<br> example.com
    
    after tokenizing
    |<br>|   <br>  |example.com|
    |char|empty tag|    char   |
    
    after replace_nodes
    |<br>|   <br>  |<a href="http://example.com" rel="nofollow">|example.com|  </a> |
    |char|empty tag|                  start tag                 |    char   |end tag|
    

    The real <br> tag in the middle of the original text splits it into separate tokens.

  • weixin_39966922 5 months ago

    Has anyone thought about workarounds in the meantime? I want to use linkify and clean to get sanitized HTML and user-submitted links marked rel="nofollow", which we consider part of sanitization, to avoid spammers exploiting us.

    linkify will clean (by which I mean mark) both URLs and <a> tags so I assumed it was an extended optional part of bleaching:

    
    >>> bleach.linkify('<a href="http://thegoogle.com">____</a>') 
    u'<a href="http://thegoogle.com" rel="nofollow">____</a>'
    >>> bleach.linkify('http://thegoogle.com')                   
    u'<a href="http://thegoogle.com" rel="nofollow">http://thegoogle.com</a>'
    

    Is my assumption that I can use linkify to stop SEO spam a mistake?

    Like I said above, intuitively the two functions should commute, but they don't. bleach.linkify(bleach.clean(html)) produces the rel="nofollow"s, but hits this bug, which undoes the cleaning, meaning that our bleaching is unsafe. Conversely, bleach.clean(bleach.linkify(html)) is safely bleached, but it bleaches too much: rel gets stripped, making our links unsafe in a weaker way.

    I could work around this by saying

    
    >>> bleach.ALLOWED_ATTRIBUTES[u'a'].append(u'rel')
    

    Which seems to make the two commute:

    
    >>> s = '<p><a href="//evil.com" rel="author">Cute Hopping Innocent Bunnies</a></p>'
    >>> bleach.clean(bleach.linkify(s))                               
    u'<p><a href="//evil.com" rel="author nofollow">Cute Hopping Innocent Bunnies</a></p>'
    >>> bleach.linkify(bleach.clean(s))
    u'<p><a href="//evil.com" rel="author nofollow">Cute Hopping Innocent Bunnies</a></p>'
    >>> bleach.linkify(bleach.clean(s)) == bleach.clean(bleach.linkify(s))
    True
    

    but notice how it lets a client-chosen rel value through. There are many values rel could take, and while I can't think of an attack off the top of my head, that opens up a scary potential attack surface.

    Actually I know at least one (minor) attack: if someone has a blog with comments active and they can post <a href="https://github.com/attacker" rel="me">AaronPK</a>, then suddenly that attacker account can convince anyone who uses https://indieauth.com/ that they are the blogger.
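
One way to narrow that surface is to filter the rel value against an allow-list instead of accepting whatever the client sent. The sketch below is plain Python and not part of bleach; sanitize_rel and ALLOWED_REL_TOKENS are hypothetical names, and which tokens to allow is a policy choice for each site.

```python
# Hypothetical allow-list; which rel tokens are safe is a site policy decision.
ALLOWED_REL_TOKENS = {"nofollow", "noopener", "noreferrer"}


def sanitize_rel(value):
    """Keep only allow-listed rel tokens and always include nofollow."""
    candidates = (value or "").lower().split()
    # dict.fromkeys removes duplicates while keeping the original order.
    tokens = list(dict.fromkeys(t for t in candidates if t in ALLOWED_REL_TOKENS))
    if "nofollow" not in tokens:
        tokens.append("nofollow")
    return " ".join(tokens)


print(sanitize_rel("me"))
print(sanitize_rel("noopener author"))
```

With this filter, rel="me" and rel="author" are dropped and every link still ends up with nofollow.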

    Is the right order clean then link, or link then clean? I think it shouldn't matter, but of course the real world bites us with this nofollow thing.

  • weixin_39971172 5 months ago

    I would clean, then linkify. clean will sanitize the HTML and then linkify can convert links into <a href...> things with rel="nofollow" and you can use the callback mechanism to drop links to places you're not excited about.
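
A callback for dropping unwanted links might look like the sketch below. It assumes the documented linkify callback contract (called with attrs and new, returning None drops the link) and that attribute keys are plain names in bleach 1.x and (namespace, name) tuples in later releases; BLOCKED_HOSTS and drop_blocked_hosts are hypothetical names.

```python
from urllib.parse import urlparse

# Hypothetical block list for illustration.
BLOCKED_HOSTS = {"evil.com"}


def drop_blocked_hosts(attrs, new=False):
    """linkify callback: return None to drop the link, attrs to keep it."""
    # Assumption: bleach 1.x keys attributes by plain name ("href"),
    # later releases by (namespace, name) tuples such as (None, "href").
    href = attrs.get((None, "href")) or attrs.get("href") or ""
    host = (urlparse(href).hostname or "").lower()
    blocked_suffixes = tuple("." + h for h in BLOCKED_HOSTS)
    if host in BLOCKED_HOSTS or host.endswith(blocked_suffixes):
        return None
    return attrs


# Intended use (not run here, requires bleach to be installed):
#   bleach.linkify(bleach.clean(user_html),
#                  callbacks=[bleach.callbacks.nofollow, drop_blocked_hosts])
```

Passing callbacks replaces the default list, so the nofollow callback has to be listed explicitly if it is still wanted.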

    I spent the afternoon rewriting linkify as an html5lib Filter. That takes a lot of the complexity out of what it was doing previously and fixes the unescaping issue as demonstrated by examples in this issue.

    The thing I haven't done, yet, is get things to work so that you can do clean and linkify in one pass. It's technically doable, but I'm trying to figure out the API. Getting there.
