douyan9417 2012-05-24 04:26

PHP DOM - Remove all elements except...?

I am attempting to use PHP to edit the DOM document tree. However, I am stuck. After loading the HTML, I want to remove every element except a select few that I specify (<p> and <b>, for example). How can I do this? Is it even possible?

Below is my current code:

<?php
$url = 'http://en.wikipedia.org/w/index.php?title=Elephant&action=render';
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');

$html = '<html>' . curl_exec($curl) . '</html>';
echo $html;

$document = new DOMDocument;
$document->loadHTML($html);

$allowed_elements = array(
    'a',
    'b',
    'i',
    'p',
);

$parent = $document->getElementsByTagName('html')->item(0);

foreach ($parent->getElementsByTagName('*') as $element)
{
    $node = strtolower((string)$element->nodeName);
    if (!in_array($node, $allowed_elements))
    {
        $element->parentNode->removeChild($element);
    }
}

echo $document->saveHTML();

curl_close($curl);
?>

My tinkering has shown me that it is possible to loop through the DOM tree, so I assume I could just loop through it. However, my code still isn't working! I'm trying to get the plaintext Wikipedia article ultimately--if someone knows an alternate tool that I don't have to write myself, that'll be an acceptable answer.

Thanks!! :)

1 answer

  • dorkahemp972157683 2012-05-24 08:10

    Try this:

    <?php
    $url = 'http://en.wikipedia.org/w/index.php?title=Elephant&action=render';
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
    
    $html = '<html>' . curl_exec($curl) . '</html>';
    curl_close($curl);
    
    $document = new DOMDocument('1.0');
    $document->loadHTML($html);
    
    $allowed_elements = array('a','b','i','p');
    $elems = array();
    
    $parent = $document->getElementsByTagName('html')->item(0);
    
    foreach ($parent->getElementsByTagName('*') as $element)
    {
        $node = (string)$element->nodeName;
        if(strtolower($node) == 'body'){
            continue;
        }
    
        $elems[] = $node;
    }
    
    // Unique tag names that occur in the document but are not in the allowed list.
    $elems = array_values( array_unique( array_diff( $elems, $allowed_elements ) ) );
    sort($elems);
    
    foreach( $elems as $elem ) {
        $parent1 = $parent->getElementsByTagName($elem); // live list of the matching elements
        $length = $parent1->length;
    
        for($i=0;$i<$length;$i++) {
            $el = $parent1->item(0); // 0 is the index because after each `removeChild`, the next element shifts 1 position back.
            if( $el ) {
                $el->parentNode->removeChild($el);
            }
        }
    }
    
    echo $document->saveHTML();
    ?>
    

    $allowed_elements - The array containing the tag names of the elements that are not to be deleted.
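
    The reason for always reading `item(0)` is that the list returned by `getElementsByTagName` is live, so removing a node while looping over it shifts the remaining entries. As a rough alternative sketch (not the code above), the list can first be copied into a plain array and the copy looped over instead. The helper name `remove_disallowed` and the sample markup are made up for illustration, and keeping `html` and `body` unconditionally is an assumption of this sketch. As in the code above, a removed element takes its children with it.

    ```php
    <?php
    // Hypothetical helper, not part of the answer above: removes every element
    // whose tag name is not allowed, together with everything inside it.
    function remove_disallowed(DOMDocument $document, array $allowed): void
    {
        // Assumption: html and body are always kept so the document skeleton survives.
        $keep = array_merge($allowed, ['html', 'body']);

        // Copy the live DOMNodeList into a plain array, so that removing nodes
        // cannot shift the entries that still have to be visited.
        $elements = iterator_to_array($document->getElementsByTagName('*'), false);

        foreach ($elements as $element) {
            if (in_array(strtolower($element->nodeName), $keep, true)) {
                continue;
            }
            if ($element->parentNode !== null) {
                $element->parentNode->removeChild($element);
            }
        }
    }

    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $document = new DOMDocument();
    $document->loadHTML('<html><body><div>gone</div><p>kept <b>bold</b> <span>gone too</span></p></body></html>');
    remove_disallowed($document, ['a', 'b', 'i', 'p']);
    echo $document->saveHTML();
    ```

    In this sample the `div` and the `span` are removed, while the `p` and the `b` inside it stay.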

    Hope this helps.
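
    As for a tool you don't have to write yourself: PHP's built-in `strip_tags` accepts a list of tags to keep, which may be enough for getting near-plaintext. A minimal sketch with a made-up input string follows. Note that it behaves differently from the DOM code above: it removes only the tags themselves, so the text that was inside a removed element stays in the output.

    ```php
    <?php
    // Sample input invented for illustration.
    $html = '<div><p>An <b>elephant</b> is <span>a large</span> mammal.</p></div>';

    // The second argument lists the tags to keep; all other tags are dropped,
    // but the text between them is left in place.
    $clean = strip_tags($html, '<a><b><i><p>');

    echo $clean;
    ```

    Calling `strip_tags($html)` without the second argument drops every tag.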

    This answer was selected by the asker as the best answer.
