doujiang1939 2014-02-04 07:15

XPath returns an empty node list when scraping text

I'm building a small scraping tool that will scrape the URLs from a Google results page. I'm trying to get the text of each "cite" element, which holds the URL. I fetch the page with cURL and pass the response to DOMDocument's loadHTML. When I print_r the cURL response, the page content is displayed, so there is no problem with cURL.

Below is my code:

    $dom = new DOMDocument();
    $dom->loadHTML($result);   // $result is the HTML returned by cURL

    $xpath = new DOMXPath($dom);

    $elements = $xpath->query("//cite[@class='vurls']");

    print_r($elements);

    foreach ($elements as $entry) {
        // show cite url
        print_r($entry);
    }

When I use //cite[@class='vurls'] in the Firefox XPath checker, it evaluates and shows all the cite text, but in my code $elements is always empty.

I also tried the full path in my query:

//div[@id='ires']/ol[@id='rso']//li/div/div/div/div/cite

but it still returns an empty result.

An example query is:

http://www.google.co.uk/search?q=xpath

Can someone please tell me what I am doing wrong?


1 answer

  • doucepei5298 2014-02-04 08:22

    Google serves different HTML depending on the browser used. Have a look at the HTML you receive in PHP, not the HTML you see in Firefox. In the PHP response there is no @class attribute on the <cite/> elements, so you need to find another way to query them, e.g.:

    //div[@class='kv']/cite
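
    To make that concrete, here is a minimal sketch of the same check run from PHP. The $result string is a made-up stand-in for what cURL returned (the real markup will differ and may change at any time), and the commented-out received.html file name is only an illustration. It compares the class-based query with one that matches the structure actually present:

    ```php
    <?php
    // Hypothetical stand-in for the HTML that cURL returned; the markup Google
    // sends to a script may differ from what Firefox shows.
    $result = '<html><body><div id="ires"><ol>'
        . '<li><div class="kv"><cite>www.w3.org/TR/xpath/</cite></div></li>'
        . '<li><div class="kv"><cite>en.wikipedia.org/wiki/XPath</cite></div></li>'
        . '</ol></div></body></html>';

    // Optionally save the HTML PHP actually received so it can be inspected.
    // file_put_contents('received.html', $result);

    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($result);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);

    // Query that depends on a class attribute the received HTML does not have.
    $withClass = $xpath->query("//cite[@class='vurls']");

    // Query that matches the structure present in the received HTML.
    $withoutClass = $xpath->query("//div[@class='kv']/cite");

    // Collect the text of each matching <cite> element.
    $urls = [];
    foreach ($withoutClass as $cite) {
        $urls[] = trim($cite->textContent);
    }

    echo "with class: {$withClass->length}\n";
    echo "without class: {$withoutClass->length}\n";
    print_r($urls);
    ```

    Checking the length property of the node list is a more direct test for "empty" than print_r on the list itself, and reading textContent on each node gives the URL text.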
    

    Anyway: don't parse Google search results; they offer an API for that. Scraping websites is likely to break (because the markup will change over time, and it does so often), whereas APIs are stable.
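
    As a rough sketch of the API route, assuming the API in question is Google's Custom Search JSON API: the key and search engine ID below are placeholders you would have to supply, buildSearchUrl and extractLinks are hypothetical helper names, and the sample JSON is a trimmed, made-up response used so that the snippet runs without network access.

    ```php
    <?php
    // Placeholders (assumptions): supply your own API key and search engine ID.
    $apiKey = 'YOUR_API_KEY';
    $engineId = 'YOUR_SEARCH_ENGINE_ID';

    // Build the request URL for the Custom Search JSON API.
    function buildSearchUrl($apiKey, $engineId, $query)
    {
        return 'https://www.googleapis.com/customsearch/v1?' . http_build_query([
            'key' => $apiKey,
            'cx'  => $engineId,
            'q'   => $query,
        ]);
    }

    // Pull the result URLs out of a JSON response body.
    function extractLinks($json)
    {
        $data = json_decode($json, true);
        $links = [];
        if (is_array($data) && isset($data['items'])) {
            foreach ($data['items'] as $item) {
                if (isset($item['link'])) {
                    $links[] = $item['link'];
                }
            }
        }
        return $links;
    }

    $url = buildSearchUrl($apiKey, $engineId, 'xpath');

    // The live request is left commented out so the sketch runs offline:
    // $json = file_get_contents($url);

    // Hypothetical, trimmed response used in place of a live call.
    $json = '{"items":[{"link":"https://www.w3.org/TR/xpath/"},'
          . '{"link":"https://en.wikipedia.org/wiki/XPath"}]}';

    print_r(extractLinks($json));
    ```

    The result URLs arrive as a structured field, so nothing here depends on the page markup.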
