doukengsha9472 2014-01-05 11:48
浏览 24
已采纳

PHP getElementById行为与元素共享id

I'm using some simple php to scrape information from a website to allow reading it offline. The code seems to be working fine but I am worried about undefined behaviour. The site is a bit poorly coded and some of the elements I'm grabbing share the same id with another element. I'd imagine that getElementById traverses the DOM from top to bottom and the reason I'm not having an issue is because the element I need is the first instance with the id. Is there any way to ensure this behaviour? The element has no other real way of distinguishing it so selecting it by id seems to be the best option. I have included a stripped back example of the code I'm using below.

Thanks.

<?php

$curl_referer = "http://example.com/";
$curl_url = "http://example.com/content.php";

$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, 'Scraper/0.9');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_REFERER, "$curl_referer");
curl_setopt($ch, CURLOPT_URL, "$curl_url");
$output = curl_exec($ch);

$dom = new DOMDocument();
@$dom->loadHTML($output);

$content = $dom->getElementById('content');
echo $content->nodeValue;
?>
  • 写回答

1条回答 默认 最新

  • dream543211 2014-01-05 11:52
    关注

    Try using XPath expression to get the first containing id. Like that: //*[@id="content"][1]

    The PHP code will be like that:

    $xpath = new DOMXPath($dom);
    $xpath->query('//*[@id="content"][1]')->item(0)->nodeValue;
    

    And an tip: use libxml_use_internal_errors(true), you can catch they latter for logging or try tidying-up the document.

    Edit
    Hey, in your code you're setting the UA as "Scraper/0.9", most people that write a bad website doesn't look at that and doesn't do logging incoming requests in their pages, but, i don't recommend to put UA like that, just put an browser UA, like chrome's user-agent because if they're monitoring and see requests that contains this user-agent, they will be blacklist you (future).

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 装 pytorch 的时候出了好多问题,遇到这种情况怎么处理?
  • ¥20 IOS游览器某宝手机网页版自动立即购买JavaScript脚本
  • ¥15 手机接入宽带网线,如何释放宽带全部速度
  • ¥30 关于#r语言#的问题:如何对R语言中mfgarch包中构建的garch-midas模型进行样本内长期波动率预测和样本外长期波动率预测
  • ¥15 ETLCloud 处理json多层级问题
  • ¥15 matlab中使用gurobi时报错
  • ¥15 这个主板怎么能扩出一两个sata口
  • ¥15 不是,这到底错哪儿了😭
  • ¥15 2020长安杯与连接网探
  • ¥15 关于#matlab#的问题:在模糊控制器中选出线路信息,在simulink中根据线路信息生成速度时间目标曲线(初速度为20m/s,15秒后减为0的速度时间图像)我想问线路信息是什么