dsc80135 2016-01-05 19:48
Views: 216

Extracting DOM elements from HTML using PHP Simple HTML DOM Parser

I'm trying to extract the links to the articles, including their text, from this site using PHP Simple HTML DOM Parser.


I want to extract all the h2 tags of the articles on the main page, and I'm trying to do it this way:

    $html = file_get_html('http://www.winbeta.org');
    $articles = $html->getElementsByTagName('article');
    $a = null;

    foreach ($articles->find('h2') as $header) {
        $a[] = $header;
    }


According to the manual, this should first get all the content inside the article tags, then extract the h2 from each article and save it in an array. Instead, it gives me this:

[image]

EDIT: [image]


1 answer

  • douzhi1924 2016-01-05 20:32

    There are several problems:

    • getElementsByTagName apparently returns a single node, not an array, so it will not work when the page has more than one article tag. Use find instead, which does return an array;
    • Once you make that switch, you cannot call find on the result of find, because that result is an array rather than a node. Call it on each matched article tag individually or, better, pass a combined selector to find;
    • Main issue: you must retrieve the node's text content explicitly with ->plaintext; otherwise you get the object representation of the node, with all its attributes and internals;
    • Some of the text contains HTML entities, such as an encoded ’ (right single quotation mark). These can be decoded with html_entity_decode.
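
The last point can be tried in isolation. The following is a minimal sketch, not taken from the answer: the headline strings are made up to stand in for what ->plaintext might return, decode_headline is a hypothetical helper name, and the explicit flags and charset are an assumption chosen so the result does not depend on the PHP version's defaults.

```php
<?php
// Made-up headline strings standing in for values read from ->plaintext.
$raw = [
    'Microsoft&rsquo;s new build',
    'Tips &amp; tricks',
    'What&#8217;s next',
];

// Hypothetical helper: decode named and numeric entities into UTF-8 text.
// ENT_QUOTES also decodes quote entities; the flags are passed explicitly
// because the default flags differ between PHP versions.
function decode_headline(string $text): string
{
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$decoded = array_map('decode_headline', $raw);
print_r($decoded);
```

Both &rsquo; and &#8217; decode to the same character, so it does not matter which form the page uses.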

    So this code should work:

    $a = array();
    foreach ($html->find('article h2') as $h2) { // any h2 within article
        $a[] = html_entity_decode($h2->plaintext);
    }

    Using array_map, you could also do it like this:

    $a = array_map(function ($h2) { return html_entity_decode($h2->plaintext); }, 
                   $html->find('article h2'));
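
For readers who want to see the "any h2 within article" idea without installing the library, here is a rough sketch using PHP's built-in DOM extension instead of Simple HTML DOM. This is a swapped-in technique, not the answer's method: the markup is invented for illustration, and the XPath expression //article//h2 is assumed to play the role of the 'article h2' selector.

```php
<?php
// Invented markup standing in for the real page.
$markup = '<html><body>'
    . '<article><h2>First &amp; foremost</h2><h3>Sub one</h3></article>'
    . '<article><h2>Second</h2></article>'
    . '<h2>Outside any article</h2>'
    . '</body></html>';

$doc = new DOMDocument();
// libxml reports HTML5 tags such as <article> as errors; collect them
// internally instead of emitting warnings, then discard them.
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();

// Select every h2 that is a descendant of an article element.
$xpath = new DOMXPath($doc);
$titles = [];
foreach ($xpath->query('//article//h2') as $h2) {
    // textContent is already entity-decoded by the parser.
    $titles[] = trim($h2->textContent);
}
print_r($titles);
```

The h2 outside any article is skipped, which is the same filtering the combined selector is meant to provide.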

    If you need to retrieve other tags within articles as well, to store their texts in different arrays, then you could do as follows:

    $a = array();
    $b = array();
    foreach ($html->find('article') as $article) {
        // h2 texts of this article go into $a
        foreach ($article->find('h2') as $h2) {
            $a[] = html_entity_decode($h2->plaintext);
        }
        // h3 texts of this article go into $b
        foreach ($article->find('h3') as $h3) {
            $b[] = html_entity_decode($h3->plaintext);
        }
    }

    This answer was selected by the asker as the best answer.


