dscojuxf69080 2017-04-09 11:39

PHP DOM parsing code echoes each statement three times instead of once

I have this code to extract statements from multiple pages of a forum site. It works well, except that it prints each statement three times instead of once. I have checked every line and still don't understand why.

My code is:

<?php
    set_time_limit(3600);
    $i = 0;

    while($i < 100)
    {
        $e = 839303 - $i;

        require_once('dom/simple_html_dom.php'); 
        $html =file_get_html('http://www.usmleforum.com/files/forum/2017/1/'.$e.'.php');

        foreach ($html->find("tr") as $row)
        {
            $element = $row->find('td.Text2',0);

            if ($element == null) { continue; }

            $textNode = array_filter($element->nodes, function ($n)
            {
                 return $n->nodetype == 3;        //Text node type, like in jQuery     
            });

            if (!empty($textNode))
            {
                $text = current($textNode);
                echo $text."<br>"; 
            }
        }

        $i++;
    }
?>

On the other hand, if the page being scraped contains the same statement more than once, hidden somewhere in the markup, can we ask the parser to echo it only once?

Any help is appreciated.

I am also trying to parse the user details, but it is not working; it seems to skip them:

// User
$element = $html->find('td.FootNotes2',0);
if ($element == null) { continue; }
$textNode = array_filter($element->nodes, function ($n) {
    return $n->nodetype == 3;
});
if (!empty($textNode)) {
    $text = current($textNode);
    echo $text."<br><hr><hr>";
}

1 answer

  • douyong1908 2017-04-09 13:04

    You get the same text three times because of how the selection interacts with the page's HTML structure. The entire site is built from nested tables, so the query finds every <tr>...</tr> tag on the page, including the outer rows that wrap the inner tables. The loop then visits each of those rows and looks for the first td.Text2 inside it. Because that lookup searches all descendants of the row, and td.Text2 also appears multiple times on these pages, the same text is echoed once for every row that encloses it.
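To make the nesting effect concrete, here is a minimal sketch. It does not use simple_html_dom; it swaps in PHP's built-in DOMDocument and DOMXPath so it runs on its own. The markup and the helper name `firstCellPerRow` are made up for illustration and are not taken from the forum's real pages.

```php
<?php
// Made-up markup: one td.Text2 cell buried inside three nested tables,
// so three different <tr> elements all contain it as a descendant.
$markup = <<<HTML
<table><tr><td>
  <table><tr><td>
    <table><tr><td class="Text2">First post text</td></tr></table>
  </td></tr></table>
</td></tr></table>
HTML;

/**
 * Hypothetical helper: for every <tr> in the markup, collect the text of the
 * first td.Text2 found anywhere below it (mirrors $row->find('td.Text2', 0)).
 */
function firstCellPerRow(string $markup): array
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($markup);
    $xpath = new DOMXPath($doc);

    $found = [];
    foreach ($xpath->query('//tr') as $row) {
        $cell = $xpath->query('.//td[@class="Text2"]', $row)->item(0);
        if ($cell === null) { continue; }
        $found[] = trim($cell->textContent);
    }
    return $found;
}

$found = firstCellPerRow($markup);
echo count($found) . " rows matched<br>";
foreach ($found as $text) {
    echo $text . "<br>";
}
```

Each of the three rows reports the same cell, which is the same pattern as the tripled output in the question.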

    This is going to be a tricky crawl given the structure of the HTML, and you may be better off searching only for td.Text2 and taking the first one on each page instead. Below is an example of this solution.

    Something like this seems to work, but the pages are not structured identically throughout the loop, so the results are a little inconsistent:

    <?php
    require_once('dom/simple_html_dom.php');
    set_time_limit(3600);
    $i = 0;
    
    while ($i < 100) {
        $e = 839303 - $i;
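        $html = file_get_html('http://www.usmleforum.com/files/forum/2017/1/'.$e.'.php');
        $i++;

        // Skip pages that could not be fetched or parsed (file_get_html returns false)
        if (!$html) { continue; }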
    
        $element = $html->find('td.Text2',0);
    
        if ($element == null ) { continue; }
    
        $textNode = array_filter($element->nodes, function ($n) {
             return $n->nodetype == 3;        // 3 = text node (HDOM_TYPE_TEXT in simple_html_dom)
        });
    
        if (!empty($textNode)) {
            $text = current($textNode);
            echo $text."<br>";
        }
    
        // Getting User/Author
        $parent = $element->parent()->parent();
        $element = $parent->find('td.FootNotes2',1);
        if ($element == null) { continue; }
        $textNode = array_filter($element->nodes, function ($n) {
          return $n->nodetype == 3;
        });
        if (!empty($textNode)) {
          $text = current($textNode);
          echo $text."<br><hr><hr>";
        }
    }
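On the side question of echoing a repeated statement only once: one possible approach, sketched here under assumptions, is to collect the extracted texts first and drop any that were already seen. The helper name `uniqueStatements` and the sample strings are hypothetical; in the loop above you would pass in the strings taken from the text nodes instead.

```php
<?php
/**
 * Hypothetical helper: return each non-empty statement once, in first-seen order.
 * Surrounding whitespace is ignored when comparing.
 */
function uniqueStatements(array $texts): array
{
    $seen = [];
    $unique = [];
    foreach ($texts as $text) {
        $key = trim((string) $text);
        if ($key === '' || isset($seen[$key])) { continue; }
        $seen[$key] = true;   // remember this statement
        $unique[] = $key;
    }
    return $unique;
}

// Sample input standing in for texts scraped from a page.
$scraped = ['Post A', 'Post A', ' Post A ', 'Post B'];
foreach (uniqueStatements($scraped) as $statement) {
    echo $statement . "<br>";
}
```

This removes duplicates after the fact; narrowing the selector as in the code above avoids producing them in the first place.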
    
    Accepted by the asker as the best answer.