dsc80135 2016-01-05 19:48

Extracting DOM elements from HTML with PHP Simple HTML DOM Parser

I'm trying to extract the links to the articles, together with their text, from this site using PHP Simple HTML DOM Parser.

I want to extract all h2 tags of the articles on the main page, and I'm trying to do it this way:

    $html = file_get_html('http://www.winbeta.org');
    $articles = $html->getElementsByTagName('article');
    $a = null;

    foreach ($articles->find('h2') as $header) {
                $a[] = $header;
    }

    print_r($a);

According to the manual, this should first get all the content inside the article tags, then extract the h2 from each article and store it in an array. Instead, it gives me:

[screenshot of the output]


1 answer

  • douzhi1924 2016-01-05 20:32

    There are several problems:

    • getElementsByTagName apparently returns a single node rather than an array, so it will not work when the page contains more than one article tag. Use find instead, which does return an array;
    • Once you make that switch, you cannot call find on the result of find, because that result is an array and not a node. Call it on each matched article tag individually or, better, pass a combined selector to find;
    • Main issue: you must retrieve the text content of a node explicitly with ->plaintext; otherwise you get the object representation of the node, with all its attributes and internals;
    • Some of the text contains HTML entities such as &rsquo;. These can be decoded with html_entity_decode.
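
To illustrate the last point in isolation, here is a minimal sketch of entity decoding that uses only PHP's built-in html_entity_decode. The sample string is made up and merely stands in for what ->plaintext might return. Passing ENT_QUOTES and the charset explicitly is a suggestion, not something the page requires.

```php
<?php
// Hypothetical sample text, standing in for a value read from $h2->plaintext.
$raw = 'Microsoft&rsquo;s new build &amp; what&#039;s next';

// ENT_QUOTES makes sure single-quote entities such as &#039; are decoded too,
// regardless of the default flags of the PHP version in use.
// ENT_HTML5 selects the HTML5 entity table; 'UTF-8' is the output charset.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $decoded, PHP_EOL;
```

The named entity, the ampersand and the numeric entity should all come out as literal characters.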

    So this code should work:

    $a = array();
    foreach ($html->find('article h2') as $h2) { // any h2 within article
        $a[] = html_entity_decode($h2->plaintext);
    }
    

    Using array_map, you could also do it like this:

    $a = array_map(function ($h2) { return html_entity_decode($h2->plaintext); }, 
                   $html->find('article h2'));
    

    If you need to retrieve other tags within articles as well, to store their texts in different arrays, then you could do as follows:

    $a = array();
    $b = array();
    foreach ($html->find('article') as $article) {
        foreach ($article->find('h2') as $h2) {
            $a[] = html_entity_decode($h2->plaintext);
        }
        foreach ($article->find('h3') as $h3) {
            $b[] = html_entity_decode($h3->plaintext);
        }
    }
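
Since the question asks for the article links as well as their text, the same selector approach could be extended as sketched below. This assumes each h2 wraps an a element, which has not been checked against the real page. The helper name extract_article_links, the library path and the sample markup are all hypothetical. It uses str_get_html on an inline fragment so that it runs without network access.

```php
<?php
// Assumed location of the Simple HTML DOM library file; adjust to your setup.
require_once 'simple_html_dom.php';

// Hypothetical helper: collect text/href pairs for links inside article headings.
function extract_article_links($markup)
{
    $html = str_get_html($markup);
    if (!$html) {
        // str_get_html returns false for empty or oversized input.
        return array();
    }

    $links = array();
    // Assumption: every article heading contains an anchor element.
    foreach ($html->find('article h2 a') as $anchor) {
        $links[] = array(
            'text' => html_entity_decode(trim($anchor->plaintext), ENT_QUOTES, 'UTF-8'),
            'href' => $anchor->href, // attributes are exposed as properties
        );
    }

    $html->clear(); // release the parser's internal references
    return $links;
}

// Made-up fragment standing in for the real page.
$sample = '<article><h2><a href="/news/one">First &amp; foremost</a></h2></article>'
        . '<article><h2><a href="/news/two">Second story</a></h2></article>';

$links = extract_article_links($sample);
print_r($links);
```

For the live site, the string passed in could come from file_get_contents, or the body of the helper could be adapted to work on the object returned by file_get_html, after checking that it is not false.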
    
    This answer was accepted by the asker as the best answer.
