douzepao0281 2011-08-28 08:37
Accepted

PHP DOMDocument: get textContent while ignoring script tags and comments

I use DOMDocument to load HTML from a database like this:

$doc = new DOMDocument();
@$doc->loadHTML($data);
$doc->encoding = 'utf-8';
$doc->saveHTML();

Then I get the body text like this:

$bodyNodes = $doc->getElementsByTagName("body");
$words = htmlspecialchars($bodyNodes->item(0)->textContent);

The text I get back includes everything inside <body>, including the contents of <script> tags. How do I remove those and keep only the real text content?


2 answers

  • Anonymous user 2011-08-28 09:04

    You have to visit every node and collect its text. If a node contains child nodes, visit those too.

    This can be done with this basic recursive algorithm:

    extractNode:
        if the node is a text node or a CDATA node, return its text
        if it is an element node, a document node, or a document fragment node:
            if it is a script element, return an empty string
            otherwise return the concatenation of calling extractNode on each child node
        for everything else (comments, for example), return an empty string
    

    Implementation:

    function extractText($node) {
        // Text and CDATA nodes carry the actual text.
        if (XML_TEXT_NODE === $node->nodeType || XML_CDATA_SECTION_NODE === $node->nodeType) {
            return $node->nodeValue;
        }
        // Container nodes: recurse into the children, but skip <script> elements entirely.
        if (XML_ELEMENT_NODE === $node->nodeType || XML_DOCUMENT_NODE === $node->nodeType || XML_DOCUMENT_FRAG_NODE === $node->nodeType) {
            if ('script' === $node->nodeName) return '';

            $text = '';
            foreach ($node->childNodes as $childNode) {
                $text .= extractText($childNode);
            }
            return $text;
        }
        // Comments and any other node types contribute nothing.
        return '';
    }
    

    This will return the textContent of the given $node, ignoring script tags and comments.

    $words = htmlspecialchars(extractText($bodyNodes->item(0)));
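
For a quick check of the behaviour described above, here is a minimal self-contained sketch. The sample markup in `$data` is a made-up input, and the function is repeated (with type hints added) only so the snippet runs on its own.

```php
<?php
// Same recursive walk as extractText() above, repeated so this snippet is self-contained.
function extractText(DOMNode $node): string
{
    if ($node->nodeType === XML_TEXT_NODE || $node->nodeType === XML_CDATA_SECTION_NODE) {
        return $node->nodeValue;
    }
    if ($node->nodeType === XML_ELEMENT_NODE
        || $node->nodeType === XML_DOCUMENT_NODE
        || $node->nodeType === XML_DOCUMENT_FRAG_NODE) {
        if ($node->nodeName === 'script') {
            return ''; // drop the script element and everything inside it
        }
        $text = '';
        foreach ($node->childNodes as $child) {
            $text .= extractText($child);
        }
        return $text;
    }
    return ''; // comments and other node types contribute nothing
}

// Hypothetical sample input standing in for the HTML loaded from the database.
$data = '<html><body><p>Hello</p><script>var x = 1;</script><!-- note --><p>World</p></body></html>';

$doc = new DOMDocument();
@$doc->loadHTML($data);
$body = $doc->getElementsByTagName('body')->item(0);

$plain = extractText($body);
echo $plain, "\n"; // prints "HelloWorld"
```

Only the two paragraphs survive; the script body and the comment are both skipped.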
    

    Try it here: http://codepad.org/CS3nMp7U
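
A different route, not the one taken in the answer above, is to delete the script elements from the document first and then read `textContent` as usual. The sketch below assumes it is acceptable to modify the loaded document; the helper name `textWithoutScripts` is made up for illustration.

```php
<?php
// Hypothetical helper: removes <script> elements, then returns the body's textContent.
function textWithoutScripts(string $html): string
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from imperfect markup, as in the question

    $xpath = new DOMXPath($doc);
    // Copy the matches into an array first so removal does not disturb the iteration.
    $scripts = iterator_to_array($xpath->query('//script'));
    foreach ($scripts as $script) {
        $script->parentNode->removeChild($script);
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? '' : $body->textContent;
}

echo textWithoutScripts(
    '<html><body><p>Hello</p><script>var x = 1;</script><p>World</p></body></html>'
), "\n";
```

Comments need no special handling here because `textContent` on an element already leaves them out. The recursive version is the better fit when the document must stay unchanged.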

    This answer was selected by the asker as the best answer.