dongyu9667 2014-04-03 14:30
Views: 55
Accepted

php - Detect HTML in a string and wrap it with code tags

I'm having trouble handling HTML inside text content. I'm looking for a method that detects HTML tags and wraps each run of consecutive tags inside code tags.

Don't wrap me<p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span> Don't wrap me <h1>End</h1>.

//expected result

Don't wrap me<code><p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span></code>Don't wrap me <code><h1>End</h1></code>.

Is this possible?

3 answers

  • dongzhouzhang8696 2014-04-03 15:42

    DOMDocument is hard to use in this specific case, since it automatically wraps text nodes in <p> tags (and adds doctype, head, html). One alternative is to build a pattern that works like a lexer, using the (?(DEFINE)...) feature and named subpatterns:

    $html = <<<EOD
    Don't wrap me<p>Hello</p><div class="text">wrap me please!</div><span class="title">wrap me either!</span> Don't wrap me <h1>End</h1>
    EOD;
    
    $pattern = <<<'EOD'
    ~
    (?(DEFINE)
        (?<self>    < [^\W_]++ [^>]* > )
        (?<comment> <!-- (?>[^-]++|-(?!->))* -->)
        (?<cdata>   \Q<![CDATA[\E (?>[^]]++|](?!]>))* ]]> )
        (?<text>    [^<]++ )
        (?<tag>
            < ([^\W_]++) [^>]* >
            (?> \g<text> | \g<tag> | \g<self> | \g<comment> | \g<cdata> )*
            </ \g{-1} >
        )
    )
    # main pattern
    (?: \g<tag> | \g<self> | \g<comment> | \g<cdata> )+
    ~x
    EOD;
    
    $html = preg_replace($pattern, '<code>$0</code>', $html);
    
    echo htmlspecialchars($html);
    

    The (?(DEFINE)...) feature lets you put a definition section inside a regex pattern. The definition section and the named subpatterns inside it don't match anything by themselves; they are there to be used later in the main pattern.
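
    As a minimal sketch of the DEFINE mechanism on its own, here is an unrelated toy pattern for dates. The subpattern names (year, month, day) and the sample string are made up for illustration and are not part of the answer's pattern:

```php
<?php
// A DEFINE block only declares named subpatterns;
// the main pattern calls them with \g<name>.
$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<year>  \d{4} )
    (?<month> 0[1-9] | 1[0-2] )
    (?<day>   0[1-9] | [12]\d | 3[01] )
)
# main pattern: reuse the definitions above
\b \g<year> - \g<month> - \g<day> \b
~x
EOD;

$found = preg_match($pattern, 'posted on 2014-04-03 at 14:30', $m);

echo $m[0], PHP_EOL; // 2014-04-03
```

    Without the last line of the pattern nothing would be searched for: the definitions alone consume no input.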

    (?<abcd> ...) defines a subpattern you can reuse later with \g<abcd>. In the pattern above, the subpatterns defined this way are:

    • self: describes a self-closing tag
    • comment: for HTML comments
    • cdata: for CDATA sections
    • text: for text (anything that is not a tag, a comment, or CDATA)
    • tag: for HTML tags that are not self-closed

    self:
    [^\W_] is a trick to obtain \w without the underscore. [^\W_]++ represents the tag name and is also used in the tag subpattern.
    [^>]* matches any character that is not a >, zero or more times.
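
    A small sketch of the difference between \w and [^\W_], using made-up names ($names and the sample strings are assumptions for illustration):

```php
<?php
// \w matches letters, digits and the underscore;
// [^\W_] is the same set with the underscore removed.
$names = ['div', 'h1', 'my_tag', '_x'];

$wordOnly     = array_values(preg_grep('~^\w+$~', $names));      // underscore allowed
$noUnderscore = array_values(preg_grep('~^[^\W_]+$~', $names));  // underscore rejected

print_r($wordOnly);      // div, h1, my_tag, _x
print_r($noUnderscore);  // div, h1

// The "self" subpattern: a tag name, then everything up to the next >
preg_match('~<[^\W_]++[^>]*>~', 'a <div class="text"> b', $m);
echo $m[0], PHP_EOL;     // <div class="text">
```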

    comment:
    (?>[^-]++|-(?!->))* describes all the possible content inside an html comment:

    (?>          # open an atomic group
        [^-]++   # all that is not a literal -, one or more times (possessive)
      |          # OR
        -        # a literal -
        (?!->)   # not followed by -> (negative lookahead)
    )*           # close and repeat the group zero or more times 
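
    To see the comment subpattern in isolation, the sketch below runs it (without the x flag, so without spaces) on a made-up string; single and double hyphens inside a comment are accepted, and only --> ends it:

```php
<?php
// Same comment subpattern as in the answer, written on one line.
$comment = '~<!--(?>[^-]++|-(?!->))*-->~';

$input = 'a <!-- one - two -- three --> b <!-- four --> c';
preg_match_all($comment, $input, $m);

print_r($m[0]);
// [0] => <!-- one - two -- three -->
// [1] => <!-- four -->
```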
    

    cdata:
    All characters between \Q and \E are treated as literal characters, so special characters like [ don't need to be escaped. (This is only a trick to make the pattern more readable.)
    The content allowed in CDATA is described in the same way as the content of HTML comments.
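
    The following sketch checks, on a made-up input, that the \Q...\E form behaves like the manually escaped form, and that the cdata subpattern stops only at ]]>:

```php
<?php
$input = 'x <![CDATA[ a ] b ]] c ]]> y';

// \Q...\E quoting versus escaping each [ by hand: both find the same text.
preg_match('~\Q<![CDATA[\E~', $input, $quoted);
preg_match('~<!\[CDATA\[~', $input, $escaped);

// Same cdata subpattern as in the answer, written on one line.
$cdata = '~\Q<![CDATA[\E(?>[^]]++|](?!]>))*]]>~';
preg_match($cdata, $input, $m);

echo $m[0], PHP_EOL; // <![CDATA[ a ] b ]] c ]]>
```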

    text:
    [^<]++ matches all characters up to the next opening angle bracket or the end of the string.

    tag:
    This is the most interesting subpattern. Lines 1 and 3 of the subpattern are the opening and the closing tag. Note that, in line 1, the tag name is captured with a capturing group. In line 3, \g{-1} refers to the content matched by the last defined capturing group ("-1" means "one to the left").
    Line 2 describes the possible content between an opening and a closing tag. Notice that this description uses not only the subpatterns defined before but also the current subpattern itself, which allows nested tags.
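
    A reduced sketch of the recursion, keeping only the text and tag subpatterns (so it ignores standalone tags, comments and CDATA, unlike the full pattern). The two sample strings are made up:

```php
<?php
$pattern = <<<'EOD'
~
(?(DEFINE)
    (?<text> [^<]++ )
    (?<tag>
        < ([^\W_]++) [^>]* >        # line 1: opening tag, name captured
        (?> \g<text> | \g<tag> )*   # line 2: text or nested tags (recursion)
        </ \g{-1} >                 # line 3: closing tag with the same name
    )
)
\g<tag>
~x
EOD;

// Properly nested elements are matched as one unit.
preg_match($pattern, 'x <div><p>Hi <b>there</b></p></div> y', $m);
echo $m[0], PHP_EOL; // <div><p>Hi <b>there</b></p></div>

// A closing tag that does not match its opening tag makes the match fail.
$mismatch = preg_match($pattern, '<div><p>oops</div>');
var_dump($mismatch); // int(0)
```

    The second case shows why the back-reference matters: with a plain tag-name pattern in line 3, the mismatched closing tag would be accepted.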

    Once all items have been set and the definition section closed, you can easily write the main pattern.
