doujiang1939 2014-02-04 07:15

XPath returns an empty node list when scraping text

I'm building a small scraping tool that scrapes the URLs from a Google results page. I'm trying to get the text of each "cite" element, which holds the URL. I fetch the page with cURL and pass the response to DOMDocument's loadHTML. When I print_r the cURL response, the page content is displayed, so cURL itself is not the problem.

Below is my code:

    $dom = new DOMDocument();
    $dom->loadHTML($result);

    $xpath = new DOMXPath($dom);

    $elements = $xpath->query("//cite[@class='vurls']");

    print_r($elements);

    foreach ($elements as $entry) {
        // show cite url
        print_r($entry);
    }

When I use //cite[@class='vurls'] in the Firefox XPath Checker, it evaluates and shows all the cite text, but in my code $elements is always empty.

I also tried the full path in my query:

//div[@id='ires']/ol[@id='rso']//li/div/div/div/div/cite

but it still returns an empty result.

An example query is:

http://www.google.co.uk/search?q=xpath

Can someone please tell me what I am doing wrong?

1 answer

  • doucepei5298 2014-02-04 08:22

    Google serves different HTML depending on the browser used. Have a look at the HTML you receive in PHP, not in Firefox. There is no @class attribute on the <cite/> elements there, so you need to find another way to query them, e.g.:

    //div[@class='kv']/cite
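
To see why the original query comes back empty, it can help to run both expressions against the markup PHP actually receives. The following is only a sketch: the HTML fragment is a hand-written stand-in for the cURL response (not Google's real markup), the example.com URLs are placeholders, and the helper name `citeTexts` is made up for illustration.

```php
<?php
// Hypothetical stand-in for the HTML returned to cURL; not Google's real markup.
$result = '<html><body><div id="ires"><ol id="rso">'
        . '<li><div class="kv"><cite>www.example.com/one</cite></div></li>'
        . '<li><div class="kv"><cite>www.example.com/two</cite></div></li>'
        . '</ol></div></body></html>';

// Hypothetical helper: returns the trimmed text of every node matched by $query.
function citeTexts(string $html, string $query): array
{
    // Collect libxml warnings internally; real-world HTML is rarely well-formed.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// The <cite> elements in this fragment carry no class attribute, so this matches nothing.
$withClass = citeTexts($result, "//cite[@class='vurls']");

// Selecting through the parent <div> instead finds both elements.
$withoutClass = citeTexts($result, "//div[@class='kv']/cite");

print_r($withClass);
print_r($withoutClass);
```

On the real response, printing `$dom->saveHTML()` (or the raw `$result`) is probably the quickest way to check which attributes are actually present before writing the expression.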
    

    Anyway: don't parse Google search results; they offer an API for that. Scraping websites is likely to break, because the markup changes over time (and often does), whereas APIs are stable.
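
The answer does not say which API it has in mind. As one possible route, here is a rough sketch that assumes Google's Custom Search JSON API: it builds a request URL and pulls the `link` field out of each result item. The API key and engine ID are placeholders, the function names `buildSearchUrl` and `extractLinks` are invented for this example, and the sample JSON is hand-written rather than real API output.

```php
<?php
// Assumption: the Custom Search JSON API endpoint with its key, cx and q parameters.
function buildSearchUrl(string $apiKey, string $engineId, string $terms): string
{
    return 'https://www.googleapis.com/customsearch/v1?' . http_build_query([
        'key' => $apiKey,   // placeholder API key
        'cx'  => $engineId, // placeholder search engine ID
        'q'   => $terms,
    ]);
}

// Returns the "link" value of every entry under "items" in a JSON response body.
function extractLinks(string $json): array
{
    $data = json_decode($json, true);
    if (!is_array($data) || !isset($data['items']) || !is_array($data['items'])) {
        return [];
    }
    return array_column($data['items'], 'link');
}

$url = buildSearchUrl('YOUR_API_KEY', 'YOUR_ENGINE_ID', 'xpath');
echo $url, "\n";

// Hand-written sample shaped like a response; not real API output.
$sample = '{"items":[{"title":"One","link":"https://www.example.com/one"},'
        . '{"title":"Two","link":"https://www.example.com/two"}]}';
print_r(extractLinks($sample));
```

In a real script the body would come from fetching `$url` with cURL, as in the question, and the JSON structure is far less likely to change than the HTML of the results page.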
