douju1365 2013-08-11 05:28

Remove unclosed HTML elements from the end of text

I want to remove all elements that are not properly closed at the end of the content. For example, in the text below:

commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea
voluptate velit esse quam nihil molestiae consequatur, 
vel illum qui dolorem eum fugiat quo voluptas nulla 
pariatur? <a rel="nofollow" class="underline"

I want to remove

<a rel="nofollow" class="underline"

or elements without a closing tag, such as

<h2>sample text

or any other HTML element that is not properly closed at the end.


1 answer

  • duanpao9781 2013-08-11 07:42

    I have written a function that should do what you want. The idea is to first mask every valid tag sequence (an opening tag with its matching closing tag, or a self-closing tag) by overwriting it with # characters. A regular expression then removes everything from the first remaining < to the end of the string. Finally, the original content of the valid tag sequences is copied back into the buffer, except for sequences that were cut off because they came after an unclosed tag.
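
    The three steps can be seen in isolation in the minimal sketch below. It is a simplified illustration, not the function from this answer: the helper name maskStripRestore is hypothetical, and its pattern only recognises flat pairs such as <b>text</b> with no tags nested inside them.

    ```php
    <?php
    // Hypothetical helper for illustration only: handles flat, non-nested pairs.
    function maskStripRestore(string $input): string
    {
        // Step 1: find complete <tag>text</tag> pairs and remember their offsets.
        preg_match_all('/<(\w+)[^>]*>[^<]*<\/\1\s*>/', $input, $matches, PREG_OFFSET_CAPTURE);
        $buffer = $input;
        foreach ($matches[0] as [$text, $offset]) {
            // Overwrite the pair with '#' so the next step cannot see its '<'.
            $buffer = substr_replace($buffer, str_repeat('#', strlen($text)), $offset, strlen($text));
        }

        // Step 2: cut from the first remaining '<' to the end of the string.
        $cut = strpos($buffer, '<');
        if ($cut !== false) {
            $buffer = substr($buffer, 0, $cut);
        }

        // Step 3: restore the masked pairs that survived the cut.
        foreach ($matches[0] as [$text, $offset]) {
            if ($offset + strlen($text) <= strlen($buffer)) {
                $buffer = substr_replace($buffer, $text, $offset, strlen($text));
            }
        }
        return $buffer;
    }

    echo maskStripRestore('Hello <b>world</b> and <a rel="nofollow" class="underline"');
    ```

    Here the complete <b>world</b> pair is masked and later restored, while the truncated <a ...> tag is cut off together with everything after it.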

    Unfortunately I can't provide a codepad link, because recursive regular expressions do not seem to be supported by the PHP version that codepad uses. I tested this with PHP 5.3.5.
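
    For readers who have not used recursive patterns before, here is a small sketch of what (?R) does in PCRE. It uses balanced parentheses rather than HTML tags to keep the pattern short, and the helper name findBalancedGroups is made up for this example.

    ```php
    <?php
    // Hypothetical helper: returns every balanced parenthesised group in $text.
    function findBalancedGroups(string $text): array
    {
        // \(           an opening parenthesis
        // (?: ... )*   any number of: a run of non-parenthesis characters,
        //              or (?R), which re-runs the whole pattern for a nested group
        // \)           the matching closing parenthesis
        preg_match_all('/\((?:[^()]++|(?R))*\)/', $text, $matches);
        return $matches[0];
    }

    print_r(findBalancedGroups('a (b (c) d) e (f'));
    ```

    The only match is "(b (c) d)", nested group included. The trailing "(f" is never closed, so it is not matched, which is the same property the tag pattern relies on.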

    PHP

    function StripUnclosedTags($input) {
        // Close <br> tags
        $buffer = str_ireplace("<br>", "<br/>", $input);
        // Find all matching open/close HTML tags (using recursion)
        $pattern = "/<([\w]+)([^>]*?) (([\s]*\/>)| (>((([^<]*?|<\!\-\-.*?\-\->)| (?R))*)<\/\\1[\s]*>))/ixsm";
        preg_match_all($pattern, $buffer, $matches, PREG_OFFSET_CAPTURE);
        // Mask matching open/close tag sequences in the buffer
        foreach ($matches[0] as $match) {
            $ofs = $match[1];
            for ($i = 0; $i < strlen($match[0]); $i++, $ofs++)
                $buffer[$ofs] = "#";
        }
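        // Remove unclosed tags: drop everything from the first unmasked "<" to the end
        // (the "s" modifier lets "." match newlines, so multi-line input is handled too)
        $buffer = preg_replace("/<.*$/s", "", $buffer);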
        // Put back content of matching open/close tag sequences to the buffer
        foreach ($matches[0] as $match) {
            $ofs = $match[1];
            for ($i = 0; $i < strlen($match[0]) && $ofs < strlen($buffer); $i++, $ofs++)
                $buffer[$ofs] = $match[0][$i];
        }
        return $buffer;
    }
    
    $str = 'commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate '
          .'velit esse<br> quam nihil molestiae consequatur,  vel illum qui dolorem eum '
          .'fugiat quo voluptas nulla  pariatur? '
          .'<a href="test">test<p></p></a><span>test<p></p>bla';
    
    var_dump(StripUnclosedTags($str));
    

    Output

    string 'commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea
    voluptate velit esse<br/> quam nihil molestiae consequatur, 
    vel illum qui dolorem eum fugiat quo voluptas nulla 
    pariatur? <a href="test">test<p></p></a>' (length=226)
    