dougaoxian8922 2012-08-21 16:16
Accepted

For loop only iterates once (simplehtmldom)

I have a for loop that runs 3 times. On each pass, shell_exec() calls the phantomjs binary and returns its output, which is then passed to simplehtmldom's str_get_html().

Problem: When str_get_html($html) is called inside the for loop and $html consists of a real webpage's HTML, only the first iteration runs, not the 2nd or 3rd. However, if I pass a simple <a> tag as $html, the for loop iterates completely!

What is happening here, and how can I solve it?

Note that the only difference between the version that works and the version that loops only once is which of the two str_get_html() lines in extract_product_urls() is commented out.

Parent function (The for loop here does not iterate completely)

public function action_asos() {


    // Site details
    $base_url = "http://www.mysite.com";

    // Category details
    $category_id = 7616;
    $per_page = 500;

    // Find number of pages in category
    $num_pages = 2;

    //THIS IS THE LOOP THAT CANNOT LOOP COMPLETELY!
    // Extract Product URLs from Category page
    for($i = 0; $i <= $num_pages; $i++) {
        echo "<h2>Page $i</h2>";
        $page = $i;
        $category_url = 'http://www.mysite.com/pgecategory.aspx?cid='.$category_id.'&parentID=-1&pge='.$page.'&pgeSize='.$per_page.'&sort=1';
        $this->extract_product_urls($category_url, $base_url);
    }
    echo "Yes.";
    flush();

}

PHP Code (causes loop in parent function to loop only once)

public function extract_product_urls($category_url, $base_url) {


    set_time_limit(300);
    include_once('/home/mysite/public_html/application/libraries/simple_html_dom.php');

    // Retrieve page HTML using PhantomJS
    $html = $this->get_html($category_url);

    // Extract links
    $html = str_get_html($html);
    //$html = str_get_html('<a class="productImageLink" href="asdasd"></a>');
    foreach($html->find('.productImageLink') as $match) {
        $product_url = $base_url . $match->href;
        $product_url = substr($product_url, 0, strpos($product_url, '&'));  // remove metadata in URL string
        $this->product_urls[] = $product_url;
    }

    echo "done.";
    flush();

}

Helper functions

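/**
 * Gets the webpage's HTML (after AJAX content has loaded, using PhantomJS)
 * @param string $url URL of the page to fetch
 * @return string|null Output printed by the PhantomJS script, as returned by shell_exec()
 */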
public function get_html($url) {

    $url = escapeshellarg($url);    // prevent truncating after characters like `&`
    $script = path('base')."application/phantomjs/httpget.js";
    $output = shell_exec("phantomjs $script $url");

    return $output;

}
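
Since shell_exec() returns null when the command fails or prints nothing, it may be worth checking the result before handing it to the parser. A minimal sketch, assuming a hypothetical wrapper named run_and_capture() that is not part of the original code:

```php
<?php
/**
 * Hypothetical wrapper (not from the original post): runs a shell command
 * and returns its standard output, or null when nothing usable came back.
 */
function run_and_capture(string $command): ?string
{
    $output = shell_exec($command);

    // shell_exec() gives back a non-string value (null, or false on newer PHP)
    // when the command could not run or produced no output.
    if (!is_string($output) || trim($output) === '') {
        return null;
    }

    return $output;
}
```

In get_html(), the call could become run_and_capture("phantomjs $script $url"), with the caller skipping the page when null comes back.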

1 answer

  • duanqinbi9029 2012-08-21 16:20

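    Try checking that the parse result is an object before using it:

    $html = str_get_html($html);
    if (!is_object($html)) {
        return;  // nothing parseable came back; skip this page
    }
    foreach ($html->find('.productImageLink') as $match) {
        .
        .
        .
    }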
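
One possible explanation for the symptom, offered as an assumption since the post shows no error log: str_get_html() in simple_html_dom returns false for input that is empty or longer than the library's MAX_FILE_SIZE constant (600000 bytes in the 1.5 release, as far as I know), and calling ->find() on false is a fatal error that ends the whole request, so the parent loop never reaches its second pass. A category page with 500 products could plausibly exceed that limit. The sketch below is a hypothetical diagnostic helper, rejection_reason(), that reports why a string would likely be refused; the constant name and the limit value are assumptions to verify against your copy of the library.

```php
<?php
// Assumed to mirror simple_html_dom's MAX_FILE_SIZE; check your copy of the library.
const ASSUMED_MAX_FILE_SIZE = 600000;

/**
 * Hypothetical helper (not part of simple_html_dom): returns a short reason
 * why the input would likely be rejected, or null if it looks acceptable.
 */
function rejection_reason($html, int $limit = ASSUMED_MAX_FILE_SIZE): ?string
{
    if (!is_string($html) || $html === '') {
        return 'empty or non-string input';
    }

    $length = strlen($html);
    if ($length > $limit) {
        return sprintf('input is %d bytes, over the %d-byte limit', $length, $limit);
    }

    return null;
}
```

In extract_product_urls(), a check like this could run right before str_get_html(), echoing the reason and returning early so the parent loop moves on to the next page. Lowering $per_page is another way to keep each page under the limit.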
    
    This answer was selected by the asker as the best answer.
