PHP cURL and XPath give inconsistent results

I'm trying to loop over a list of URLs and pass each one to a function that fetches the page with cURL and runs XPath queries on the HTML. But the results vary: sometimes a field comes back as an empty string. Is there something special I need to consider when using cURL or XPath? The code works otherwise; it's just this flaw, which is really hard to debug.

Here is the code I use.

    private function getArticles($url){

    // Instantiate cURL to grab the HTML page.
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_HEADER, false);
    curl_setopt($c, CURLOPT_USERAGENT, $this->getUserAgent());
    curl_setopt($c, CURLOPT_FAILONERROR, true);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($c, CURLOPT_AUTOREFERER, true);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_TIMEOUT, 10);

    // Grab the data.
    $html = curl_exec($c);

    // Check if the HTML didn't load right, if it didn't - report an error
    if (!$html) {
        echo "<p>cURL error number: " .curl_errno($c) . " on URL: " . $url ."</p>" .
             "<p>cURL error: " . curl_error($c) . "</p>";
    }

    // Close connection.
    curl_close($c);

    // Parse the HTML information and return the results.
    $dom = new DOMDocument(); 
    @$dom->loadHtml($html);
    $xpath = new DOMXPath($dom);

    // Get a list of articles from the section page
    $cname = $xpath->query('//*[@id="item-details"]/div/div[1]/h1');        
    $link = $xpath->query('//*[@id="item-details"]/div/ul/li[1]/a/@href');
    $streetadress = $xpath->query('//*[@id="item-details"]/div[2]/div[3]/div[1]/text()[1]');
    $zip = $xpath->query('//*[@id="item-details"]/div[2]/div[3]/div[1]/text()[2]');
    $phone1 = $xpath->query('//*[@id="item-details"]/div/h2/span[2]');
    $phone2 = $xpath->query('//*[@id="item-details"]/div/h2[2]/span[2]');       
    $ceo = $xpath->query('//*[@id="company-financials"]/div/div[2]/span');      
    $orgnr = $xpath->query('//*[@id="company-financials"]/div/div[1]/span');        
    $turnover13 = $xpath->query('//*[@class="geb-turnover1"]');
    $turnover12 = $xpath->query('//*[@class="geb-turnover2"]');
    $turnover11 = $xpath->query('//*[@class="geb-turnover3"]');
    $logo = $xpath->query('//*[@id="item-info"]/p/img/@src');
    $desc = $xpath->query('//*[@id="item-info"]/div[1]/div');

    $capturelink = "";
    // $capturelink = $this->getWebCapture($link->item(0)->nodeValue);

    return array(
    'companyname' => $cname->item(0)->nodeValue, 
    'streetadress' => $streetadress->item(0)->nodeValue,
    'zip' => $zip->item(0)->nodeValue,
    'phone1' => $phone1->item(0)->nodeValue,
    'phone2' => $phone2->item(0)->nodeValue,
    'link' => $link->item(0)->nodeValue,
    'ceo' => $ceo->item(0)->nodeValue,
    'orgnr' => $orgnr->item(0)->nodeValue,
    'turnover2013' => $turnover13->item(0)->nodeValue,
    'turnover2012' => $turnover12->item(0)->nodeValue,
    'turnover2011' => $turnover11->item(0)->nodeValue,
    'description' => $desc->item(0)->nodeValue,
    'logo' => $logo->item(0)->nodeValue,
    'capturelink' => $capturelink);
    }
    // End getArticles

Edit:

I really tried everything on this one, but I ended up switching to phpQuery, and now it works. I suspect that PHP's DOM extension combined with XPath is not always a good mix, at least not for me in this case.

This is how I use it instead of XPath:

    ....

    require('phpQuery.php');

    phpQuery::newDocumentHTML($html);

    $capture = "";
    // $capture = $this->getWebCapture(pq('.website')->attr('href'));

    return array(       
    'companyname' => pq('.header')->find('h1')->text(),
    'streetadress' => pq('.address-container:first-child')->text(),
    'zip' => pq('.address-container')->text(),
    'phone1' => pq('.phone-number')->text(),
    'phone2' => pq('.phone-number')->text(),
    'link' => pq('.website')->attr('href'),
    'ceo' => pq('.geb-ceo')->text(),
    'orgnr' => pq('.geb-org-number')->text(),
    'turnover2013' => pq('.geb-turnover1')->text(),
    'turnover2012' => pq('.geb-turnover2')->text(),
    'turnover2011' => pq('.geb-turnover3')->text(),
    'description' => pq('#item-info div div')->text(),
    'logo' => pq('#item-info logo img')->attr('src'),
    'capture' => $capture);

1 answer

duan19913 2015-03-16 01:09
> Is there something special I need to consider using curl or xpath?

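Since you ask, I think you would benefit from getting more comfortable with what cURL does, what XPath does, and where the two are related and where they are not. cURL only transfers the HTML over HTTP, and XPath only queries the document built from that HTML, so an empty result can come from either step independently.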
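To make that separation concrete, here is a minimal sketch that runs one of your XPath queries against plain HTML strings, with no cURL involved at all. The helper name `countMatches` and both sample pages are my own assumptions for illustration; I have not seen the real site. It only shows that the same query matches nothing once the page differs from the layout you expect, which is one possible (unconfirmed) source of your empty strings.

```php
<?php
// Hypothetical helper (not from the question): counts how many nodes an
// XPath query matches in a given HTML string. No network access is involved.
function countMatches(string $html, string $query): int
{
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($query);

    // query() returns false for a malformed expression.
    return $nodes === false ? 0 : $nodes->length;
}

// Two made-up pages: the expected layout, and a page without the element
// (for example an error or rate-limit page; this is an assumption).
$expected = '<html><body><div id="item-details"><div><div><h1>Acme AB</h1></div></div></div></body></html>';
$other    = '<html><body><p>Too many requests</p></body></html>';

$query = '//*[@id="item-details"]/div/div[1]/h1';
echo countMatches($expected, $query), "\n";
echo countMatches($other, $query), "\n";
```

If the count is zero for a page you fetched, the XPath side is behaving correctly and the question becomes why the fetched HTML differs.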

> The code works, just this flaw that is really hard to debug.

Well, the function you've got there is pretty long and does too many things at once, which is also why it's hard to debug. Move code out of that function into subroutines that you call from it. That will also help you structure the code better.
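As a rough sketch of what such a split could look like (the function names `fetchHtml`, `loadXPath`, `firstValue` and `extractCompany` are hypothetical, and only two of your fields are shown), each step becomes something you can call and inspect on its own:

```php
<?php
// Step 1: fetch only. Returns the HTML, or null when the transfer failed
// or the body was empty. Hypothetical name, options taken from the question.
function fetchHtml(string $url): ?string
{
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_FAILONERROR, true);
    curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($c, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($c);

    return is_string($html) && $html !== '' ? $html : null;
}

// Step 2: parse only. Works on any HTML string, fetched or stored.
function loadXPath(string $html): DOMXPath
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return new DOMXPath($dom);
}

// Step 3: one query. Returns null instead of failing when nothing matches.
function firstValue(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim((string) $nodes->item(0)->nodeValue);
}

// Step 4: the extraction itself, composed from the pieces above.
function extractCompany(string $html): array
{
    $xpath = loadXPath($html);
    return [
        'companyname' => firstValue($xpath, '//*[@id="item-details"]/div/div[1]/h1'),
        'orgnr'       => firstValue($xpath, '//*[@id="company-financials"]/div/div[1]/span'),
    ];
}

// Made-up page used only to exercise the parser without any network access.
$sample = '<div id="item-details"><div><div><h1> Acme AB </h1></div></div></div>';
print_r(extractCompany($sample));
```

With this shape, a `null` from `fetchHtml` points at the transfer, while a `null` field from `extractCompany` points at a query that did not match the HTML it was given.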

Additionally, you can keep records of what your program does. When debugging, you can then take the exact HTML of a past request (because you stored it) and verify whether your XPath queries really fit the data.
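One way to keep such records, sketched under my own assumptions (the helper names `saveSnapshot` and `matchesInSnapshot`, the file naming scheme and the `index.log` file are all invented for illustration): write every fetched page to disk together with its URL, then re-run a query against the stored file later.

```php
<?php
// Hypothetical helper: stores the fetched HTML and logs which URL it came
// from. Returns the path of the snapshot file.
function saveSnapshot(string $dir, string $url, string $html): string
{
    if (!is_dir($dir)) {
        mkdir($dir, 0777, true);
    }
    // URL hash plus a time-based id, so repeated requests to one URL
    // normally end up in separate files.
    $path = sprintf('%s/%s-%s.html', $dir, md5($url), uniqid());
    file_put_contents($path, $html);
    file_put_contents($dir . '/index.log', $path . "\t" . $url . "\n", FILE_APPEND);

    return $path;
}

// Hypothetical helper: re-runs one XPath query against a stored snapshot
// and returns the number of matching nodes. No network access is involved.
function matchesInSnapshot(string $path, string $query): int
{
    $html = file_get_contents($path);
    if ($html === false || $html === '') {
        return 0; // nothing was stored, so look at the fetch step first
    }
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($dom))->query($query);
    return $nodes === false ? 0 : $nodes->length;
}
```

In your loop you would call `saveSnapshot()` right after `curl_exec()`. When a row comes back empty, look up its URL in `index.log` and run `matchesInSnapshot()` on that file with the query that failed.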
