dov6891 2013-06-17 04:59

Trying to scrape a link using PHP and DOM [duplicate]

If I have the following (X)HTML structure, how do I capture the imgur link nested deep within the div tree?

I have tried several different methods. What I really want is to build a node tree for the div with the id "siteTable", because it contains many divs that in turn contain more imgur links. In case you haven't noticed, this is the HTML for reddit.

Thanks!

<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<body class="listing-page hot-page">
    <div id="header" role="banner">
    <div class="side">
    <a name="content"></a>
    <div class="content" role="main">
    <div class="infobar welcome">
    <div id="siteTable" class="sitetable linklisting">
        <div class=" thing id-t3_1gh823 over18 odd link " data-downs="5" data-ups="90" data-fullname="t3_1gh823" onclick="click_thing(this)">
            <p class="parent"></p>
            <span class="rank" style="width:2.20ex;">1</span>
            <div class="midcol unvoted" style="width:5ex;">
            <a class="thumbnail " href="http://i.imgur.com/FZ1I9wi.jpg">

This is what I know needs to be done:

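    $dom = new DOMDocument;

    // Must be set before loading for it to have any effect.
    $dom->preserveWhiteSpace = false;

    // @ suppresses the warnings libxml raises on malformed markup.
    @$dom->loadHTML(file_get_contents($link));

    $xpath = new DOMXPath($dom);

    // The query expression is the part I don't know how to write.
    $href = $xpath->query('?????');

    print_r($href);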

2 answers

  • dongshen4129 2013-06-17 05:12

    I always try to make my XPath queries as basic, yet as specific, as possible. This makes them easier to change and debug as the page changes. It's hard to say without looking at the whole page, or at multiple reddit pages, but I am assuming that the class thumbnail is only used for the thumbnail link you want. In that case we can use a really simple (but specific) XPath query:

    // normalize-space() is needed because the markup has class="thumbnail " (trailing space).
    $link_nodes = $xpath->query('//a[normalize-space(@class)="thumbnail"]');
    if ($link_nodes->length > 0) {
      // use a foreach loop here if there may be multiple links
      $link_node = $link_nodes->item(0);
      $href = $link_node->getAttribute('href');
    }
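
The comment in the snippet above mentions a foreach loop for the case of multiple links. A minimal sketch of that variant follows; the helper name `collectThumbnailLinks`, the sample markup, and the second imgur URL are made up for illustration and are not from the question.

```php
<?php
// Hypothetical helper: returns the href of every thumbnail anchor in $html.
function collectThumbnailLinks(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them;
    // real-world pages are rarely well-formed.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // normalize-space() tolerates the trailing space in class="thumbnail ".
    $nodes = $xpath->query('//a[normalize-space(@class)="thumbnail"]');

    $links = [];
    foreach ($nodes as $node) {
        $links[] = $node->getAttribute('href');
    }
    return $links;
}

// Sample markup modeled on the question's snippet (second link is invented).
$sample = '<div id="siteTable">'
    . '<div class="thing"><a class="thumbnail " href="http://i.imgur.com/FZ1I9wi.jpg">a</a></div>'
    . '<div class="thing"><a class="thumbnail " href="http://i.imgur.com/example.jpg">b</a></div>'
    . '</div>';

print_r(collectThumbnailLinks($sample));
```

Iterating the `DOMNodeList` directly avoids the `length` check, since the loop body simply never runs when nothing matches.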
    

    Also, you may want to make sure you are getting an imgur link by enhancing the XPath query:

    $link_nodes = $xpath->query('//a[normalize-space(@class)="thumbnail"][contains(@href, "imgur.com")]');
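
Since the question asks specifically about the links inside the "siteTable" div, one possible approach is to select that div first and then run a relative query against it. This is only a sketch: the helper name `imgurLinksInSiteTable` and the sample markup are assumptions, and the real reddit page may be structured differently.

```php
<?php
// Hypothetical helper: imgur thumbnail links found inside div#siteTable only.
function imgurLinksInSiteTable(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // tolerate malformed real-world HTML
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $siteTable = $xpath->query('//div[@id="siteTable"]')->item(0);
    if ($siteTable === null) {
        return [];                      // no such container on this page
    }

    // The leading "." makes the query relative to the context node.
    $query = './/a[normalize-space(@class)="thumbnail"][contains(@href, "imgur.com")]';
    $links = [];
    foreach ($xpath->query($query, $siteTable) as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

// Sample markup: one link outside the container, two inside (one not imgur).
$sample = '<div class="side"><a class="thumbnail" href="http://i.imgur.com/outside.jpg">o</a></div>'
    . '<div id="siteTable"><div class="thing">'
    . '<a class="thumbnail " href="http://i.imgur.com/FZ1I9wi.jpg">a</a>'
    . '<a class="thumbnail" href="http://example.com/b.jpg">b</a>'
    . '</div></div>';

print_r(imgurLinksInSiteTable($sample));
```

Passing the container as the second argument to `DOMXPath::query()` keeps thumbnails elsewhere on the page, such as in a sidebar, out of the result.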
    