doujiang1939 2014-02-04 07:15

XPath returns an empty node list when scraping text

I'm building a small scraping tool that will scrape the URLs from a Google results page. I'm trying to get the value of each "cite" element, which holds the URL as text. I load the web page with curl and pass the response to DOMDocument's loadHTML. When I print_r the response, the results are displayed, so curl is not the problem.

Below is my code:

    $dom = new DOMDocument();
    $dom->loadHTML($result);

    $xpath = new DOMXPath($dom);

    $elements = $xpath->query("//cite[@class='vurls']");

    print_r($elements);

    foreach ($elements as $entry)
    {
        print_r($entry);
        // show cite url
    }

When I use //cite[@class='vurls'] in the Firefox XPath Checker, it evaluates and shows all the cite text, but in my code $elements is always empty.

I also tried the full path in my query:

    //div[@id='ires']/ol[@id='rso']//li/div/div/div/div/cite

but it still returns an empty result.

An example query is:

http://www.google.co.uk/search?q=xpath

Can someone please tell me what I am doing wrong?


1 answer

  • doucepei5298 2014-02-04 08:22

    Google serves different HTML depending on the browser used. Look at the HTML you receive in PHP, not the HTML you see in Firefox. In the PHP response the <cite/> elements have no @class attribute, so you need to find another way to query them, e.g.

    //div[@class='kv']/cite
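
To make the difference concrete, here is a minimal sketch that runs both queries against a small HTML string. The markup in `$result` is hypothetical and only stands in for what curl might return; the real page will differ, so dump your own `$result` to a file and base the query on what you find there.

```php
<?php
// Hypothetical markup standing in for the HTML that curl receives in PHP.
// It is NOT Google's real markup; inspect your own $result instead.
$result = '<html><body><div id="ires"><ol>'
    . '<li><div class="kv"><cite>www.w3.org/TR/xpath/</cite></div></li>'
    . '<li><div class="kv"><cite>en.wikipedia.org/wiki/XPath</cite></div></li>'
    . '</ol></div></body></html>';

// Collect parser warnings internally instead of printing them;
// real-world HTML is rarely valid.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($result);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// The query from the question: no match, because <cite> has no class here.
$withClass = $xpath->query("//cite[@class='vurls']");

// A query written against the markup that was actually received.
$withoutClass = $xpath->query("//div[@class='kv']/cite");

// Check ->length rather than relying on print_r of the node list.
echo "with class: ", $withClass->length, "\n";
echo "without class: ", $withoutClass->length, "\n";

$urls = [];
foreach ($withoutClass as $cite) {
    // nodeValue holds the text content of the <cite> element.
    $urls[] = trim($cite->nodeValue);
}
print_r($urls);
```

Checking `->length` and `nodeValue` is usually more informative than `print_r` on the node list itself.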
    

    In any case, don't parse Google search results; Google offers an API for that. Scraped pages are likely to break your parser because their markup changes over time, and often, whereas an API is stable.
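
The answer does not say which API it means. As a rough sketch, assuming the Custom Search JSON API, the request URL and the extraction of result links could look like this. `buildSearchUrl` and `extractLinks` are hypothetical helper names, the key and engine ID are placeholders, and the JSON string is a made-up response trimmed to the one field used.

```php
<?php
// Assumption: Google's Custom Search JSON API. $apiKey and $engineId are
// placeholders for credentials you would obtain from Google.
function buildSearchUrl($apiKey, $engineId, $query)
{
    return 'https://www.googleapis.com/customsearch/v1?' . http_build_query([
        'key' => $apiKey,
        'cx'  => $engineId,
        'q'   => $query,
    ]);
}

// Collect the "link" field of every entry in the response's "items" array.
function extractLinks(array $response)
{
    $links = [];
    $items = isset($response['items']) ? $response['items'] : [];
    foreach ($items as $item) {
        if (isset($item['link'])) {
            $links[] = $item['link'];
        }
    }
    return $links;
}

// Made-up response body for illustration; a real response has many more fields.
$json = '{"items":[{"link":"https://www.w3.org/TR/xpath/"},'
      . '{"link":"https://en.wikipedia.org/wiki/XPath"}]}';

echo buildSearchUrl('YOUR_API_KEY', 'YOUR_ENGINE_ID', 'xpath'), "\n";
print_r(extractLinks(json_decode($json, true)));
```

In real use you would fetch the built URL with curl and pass the decoded body to `extractLinks`; no HTML parsing is involved, so markup changes do not matter.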
