douzepao0281 2011-08-28 08:37

PHP DOMDocument: getting textContent while ignoring script tags and comments

I use DOMDocument to load HTML from a database like this:

$doc = new DOMDocument();
@$doc->loadHTML($data);
$doc->encoding = 'utf-8';
$doc->saveHTML();

Then I get the body text like this:

$bodyNodes = $doc->getElementsByTagName("body");
$words = htmlspecialchars($bodyNodes->item(0)->textContent);

The text I get back includes everything inside <body>, including the contents of <script> elements. How do I remove those and keep only the real text content?

2 answers

  • Anonymous user 2011-08-28 09:04

    You have to visit every node and collect its text. If a node contains child nodes, visit those too.

    This can be done with this basic recursive algorithm:

    extractNode:
        if the node is a text node or a CDATA node, return its text
        if it is an element node, a document node or a document fragment node:
            if it's a script element, return an empty string
            otherwise return the concatenation of the results of calling extractNode on each child node
        for everything else (comments, processing instructions, ...) return an empty string
    

    Implementation:

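    function extractText($node) {
        // Text and CDATA nodes carry the actual text.
        if (XML_TEXT_NODE === $node->nodeType || XML_CDATA_SECTION_NODE === $node->nodeType) {
            return $node->nodeValue;
        // Container nodes; a document created by loadHTML() reports XML_HTML_DOCUMENT_NODE.
        } else if (XML_ELEMENT_NODE === $node->nodeType || XML_DOCUMENT_NODE === $node->nodeType || XML_HTML_DOCUMENT_NODE === $node->nodeType || XML_DOCUMENT_FRAG_NODE === $node->nodeType) {
            // Skip <script> elements and everything inside them.
            if ('script' === strtolower($node->nodeName)) return '';

            $text = '';
            foreach ($node->childNodes as $childNode) {
                $text .= extractText($childNode);
            }
            return $text;
        }
        // Comments, processing instructions, etc. contribute no text.
        return '';
    }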
    

    This will return the textContent of the given $node, ignoring script tags and comments.

    $words = htmlspecialchars(extractText($bodyNodes->item(0)));
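
For reference, here is a minimal end-to-end sketch of the approach above, assuming PHP 7 or later with the DOM extension enabled. The sample markup and the variable names ($html, $clean) are made up for illustration, and the function is repeated (with type declarations added) only so the block runs on its own:

```php
<?php
// Same traversal as described in the answer, restated so this block is self-contained.
function extractText(DOMNode $node): string
{
    switch ($node->nodeType) {
        case XML_TEXT_NODE:
        case XML_CDATA_SECTION_NODE:
            // Leaf nodes that hold actual text.
            return (string) $node->nodeValue;

        case XML_ELEMENT_NODE:
        case XML_DOCUMENT_NODE:
        case XML_HTML_DOCUMENT_NODE:
        case XML_DOCUMENT_FRAG_NODE:
            // Drop <script> elements together with their contents.
            if (strtolower($node->nodeName) === 'script') {
                return '';
            }
            $text = '';
            foreach ($node->childNodes as $child) {
                $text .= extractText($child);
            }
            return $text;

        default:
            // Comments, processing instructions, etc.
            return '';
    }
}

// Hypothetical input standing in for the HTML loaded from the database.
$html = '<html><body><p>Hello <b>world</b></p>'
      . '<script>var x = "hidden";</script>'
      . '<!-- a comment --><p>Bye</p></body></html>';

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$body  = $doc->getElementsByTagName('body')->item(0);
$clean = extractText($body);

// Escape only at output time, as in the question.
echo htmlspecialchars($clean), PHP_EOL;
```

Note that no separator is inserted between block elements, so text from adjacent paragraphs is joined directly. If you need spaces or line breaks there, you would have to add them yourself while concatenating.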
    

    Try it here: http://codepad.org/CS3nMp7U

    This answer was selected by the asker as the best answer.