doumie7914 2015-03-15 15:05

PHP cURL and XPath give inconsistent results

I'm trying to loop over a URL parameter and pass each URL to a function that fetches the HTML with cURL and runs XPath queries on it, but the results vary. Is there something special I need to consider when using cURL or XPath? Sometimes a field comes back as an empty string. The code works apart from this flaw, which is really hard to debug.

Here is the code I use.

    private function getArticles($url){

    // Instantiate cURL to grab the HTML page.
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_HEADER, false);
    curl_setopt($c, CURLOPT_USERAGENT, $this->getUserAgent());
    curl_setopt($c, CURLOPT_FAILONERROR, true);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($c, CURLOPT_AUTOREFERER, true);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_TIMEOUT, 10);

    // Grab the data.
    $html = curl_exec($c);

    // Check if the HTML didn't load right, if it didn't - report an error
    if (!$html) {
        echo "<p>cURL error number: " .curl_errno($c) . " on URL: " . $url ."</p>" .
             "<p>cURL error: " . curl_error($c) . "</p>";
    }

    // Close connection.
    curl_close($c);

    // Parse the HTML information and return the results.
    $dom = new DOMDocument(); 
    @$dom->loadHtml($html);
    $xpath = new DOMXPath($dom);

    // Get a list of articles from the section page
    $cname = $xpath->query('//*[@id="item-details"]/div/div[1]/h1');        
    $link = $xpath->query('//*[@id="item-details"]/div/ul/li[1]/a/@href');
    $streetadress = $xpath->query('//*[@id="item-details"]/div[2]/div[3]/div[1]/text()[1]');
    $zip = $xpath->query('//*[@id="item-details"]/div[2]/div[3]/div[1]/text()[2]');
    $phone1 = $xpath->query('//*[@id="item-details"]/div/h2/span[2]');
    $phone2 = $xpath->query('//*[@id="item-details"]/div/h2[2]/span[2]');       
    $ceo = $xpath->query('//*[@id="company-financials"]/div/div[2]/span');      
    $orgnr = $xpath->query('//*[@id="company-financials"]/div/div[1]/span');        
    $turnover13 = $xpath->query('//*[@class="geb-turnover1"]');
    $turnover12 = $xpath->query('//*[@class="geb-turnover2"]');
    $turnover11 = $xpath->query('//*[@class="geb-turnover3"]');
    $logo = $xpath->query('//*[@id="item-info"]/p/img/@src');
    $desc = $xpath->query('//*[@id="item-info"]/div[1]/div');

    $capturelink = "";
    // $capturelink = $this->getWebCapture($link->item(0)->nodeValue);

    return array(
    'companyname' => $cname->item(0)->nodeValue, 
    'streetadress' => $streetadress->item(0)->nodeValue,
    'zip' => $zip->item(0)->nodeValue,
    'phone1' => $phone1->item(0)->nodeValue,
    'phone2' => $phone2->item(0)->nodeValue,
    'link' => $link->item(0)->nodeValue,
    'ceo' => $ceo->item(0)->nodeValue,
    'orgnr' => $orgnr->item(0)->nodeValue,
    'turnover2013' => $turnover13->item(0)->nodeValue,
    'turnover2012' => $turnover12->item(0)->nodeValue,
    'turnover2011' => $turnover11->item(0)->nodeValue,
    'description' => $desc->item(0)->nodeValue,
    'logo' => $logo->item(0)->nodeValue,
    'capturelink' => $capturelink);
    }
    // End Get Articles

Edit:

I really tried everything on this one, but I ended up switching to phpQuery and now it works. I suspect PHP's DOM and XPath are not always a good mix, at least for me in this case.

This is how I use it instead of XPath:

    ....

    require('phpQuery.php');

    phpQuery::newDocumentHTML($html);

    $capture = "";
    // $capture = $this->getWebCapture(pq('.website')->attr('href'));

    return array(       
    'companyname' => pq('.header')->find('h1')->text(),
    'streetadress' => pq('.address-container:first-child')->text(),
    'zip' => pq('.address-container')->text(),
    'phone1' => pq('.phone-number')->text(),
    'phone2' => pq('.phone-number')->text(),
    'link' => pq('.website')->attr('href'),
    'ceo' => pq('.geb-ceo')->text(),
    'orgnr' => pq('.geb-org-number')->text(),
    'turnover2013' => pq('.geb-turnover1')->text(),
    'turnover2012' => pq('.geb-turnover2')->text(),
    'turnover2011' => pq('.geb-turnover3')->text(),
    'description' => pq('#item-info div div')->text(),
    'logo' => pq('#item-info logo img')->attr('src'),
    'capture' => $capture);     

1 answer

  • duan19913 2015-03-16 01:09

    > Is there something special I need to consider using curl or xpath?

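    Since you ask, I think you would benefit from getting more comfortable with what cURL does, what XPath does, and where the two are related and where they are not. cURL only fetches the response body over HTTP, and XPath only queries the document that DOMDocument builds from whatever string you hand it, so an empty result means either the fetch returned something other than the page you expected or the markup did not match the query.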
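To see the XPath half in isolation, here is a minimal sketch that runs two of the queries from the question against a hard-coded string, with no cURL involved. The markup and the company name are made up for illustration; they are not the real page. It shows that a query which matches nothing gives an empty node list, and that `item(0)` on such a list is `null`, which is one way an empty value can end up in the result array.

```php
<?php
// Hypothetical markup standing in for a fetched page (not the real site).
$html = '<div id="item-details"><div><div><h1>Acme AB</h1></div></div></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of silencing with @
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// This query matches the markup above.
$found = $xpath->query('//*[@id="item-details"]/div/div[1]/h1');
// This one does not: there is no #company-financials element in the string.
$missing = $xpath->query('//*[@id="company-financials"]/div/div[1]/span');

$foundCount   = $found->length;
$missingCount = $missing->length;
$name         = $found->item(0)->nodeValue;
$missingItem  = $missing->item(0); // null, so ->nodeValue on it yields nothing useful
```

Given the same input string, the same query returns the same nodes every time, so varying results would point at the HTML changing between requests rather than at XPath itself.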

    > The code works, just this flaw that is really hard to debug.

    Well, the function you've got there is pretty long and does too many things at once, which is also why it's hard to debug. Move code out of that function into subroutines that you call from it. That will also help you structure the code.
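One possible way to split it up is sketched below. The function names (`fetchHtml`, `loadXPath`, `firstValue`, `extractCompany`) are hypothetical, only three of the original queries are carried over to keep it short, and the nullable return types assume PHP 7.1 or later. Treat it as a sketch of the structure, not as a drop-in replacement.

```php
<?php
// Step 1: network only. Returns the body, or null on any cURL failure.
function fetchHtml(string $url, string $userAgent): ?string
{
    $c = curl_init($url);
    curl_setopt_array($c, [
        CURLOPT_USERAGENT      => $userAgent,
        CURLOPT_FAILONERROR    => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 10,
    ]);
    $html = curl_exec($c);
    if ($html === false || $html === '') {
        error_log('cURL error ' . curl_errno($c) . ' on ' . $url . ': ' . curl_error($c));
        $html = null;
    }
    curl_close($c);
    return $html;
}

// Step 2: parsing only. Expects a non-empty HTML string.
function loadXPath(string $html): DOMXPath
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return new DOMXPath($dom);
}

// Step 3: one query. Returns null when nothing matches instead of failing on item(0).
function firstValue(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->nodeValue);
}

// Step 4: map field names to queries (a subset of the ones in the question).
function extractCompany(string $html): array
{
    $xpath = loadXPath($html);
    $queries = [
        'companyname'  => '//*[@id="item-details"]/div/div[1]/h1',
        'orgnr'        => '//*[@id="company-financials"]/div/div[1]/span',
        'turnover2013' => '//*[@class="geb-turnover1"]',
    ];
    $result = [];
    foreach ($queries as $field => $query) {
        $result[$field] = firstValue($xpath, $query);
    }
    return $result;
}
```

With this split, the caller can skip parsing when `fetchHtml` returns `null`, and a `null` field in the result tells you exactly which query did not match, rather than leaving you with an unexplained empty string.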

    Additionally, you can keep records of what your program does. When debugging, you can then take the exact HTML of a past request (because you stored it) and verify whether your XPath queries really fit the data.
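As a rough illustration of that record keeping, the sketch below writes each response to disk and lets you re-run a single query against a stored file later. The helper names (`saveSnapshot`, `countMatches`) and the file naming scheme are assumptions made for this example.

```php
<?php
// Store the raw response so a failing run can be replayed later.
function saveSnapshot(string $dir, string $url, string $html): string
{
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    // Hypothetical naming scheme: timestamp plus a hash of the URL.
    $path = $dir . '/' . date('Ymd-His') . '-' . md5($url) . '.html';
    file_put_contents($path, $html);
    return $path;
}

// Re-run one XPath query against a stored snapshot, with no network involved.
function countMatches(string $path, string $query): int
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);
    $dom->loadHTML(file_get_contents($path));
    libxml_clear_errors();
    $nodes = (new DOMXPath($dom))->query($query);
    return $nodes === false ? 0 : $nodes->length;
}
```

Calling `saveSnapshot` right after the fetch, for every URL in the loop, means that when a field comes back empty you can open the matching file and check whether the page was an error page, a redirect target, or simply laid out differently from what the query expects. Note that two fetches of the same URL within the same second would overwrite each other under this naming scheme.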
