donglu7286 2016-02-02 07:51

I can't extract the links inside a div using only preg_match or preg_replace

This is my code:

$curl = curl_init('http://www.houseoffraser.co.uk/');
$userAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13";

curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT ,0);
curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
curl_setopt($curl, CURLOPT_TIMEOUT, 400);
ini_set('max_execution_time', 300);

$page = curl_exec($curl);

if(curl_errno($curl)) // check for execution errors
{
    echo 'Scraper error: ' . curl_error($curl);
    exit;
}

$html= curl_close($curl);

$dom = new DOMDocument();
@$dom->loadHTML($html);

$regex = '/<nav class="hof-buttons">(.*?)<\/nav>/s';

if (preg_match($regex, $page, $list)) {
    echo  preg_replace("/<\\/?a(\\s+.*?>|>)/", "", $list[0])."<br />";
} else {
    print "Not found";  
}

I am trying to get only the URL and link text from inside the div tag, but it only gives me errors. The markup in the page looks something like this:

<div class="a"><a href="abc.php">a linki</a></div> 

and the code should be something like this:

if ( preg_match($regex, $page, $list) ){}; 

echo  <a href="$list[1]"> $list[0]</a>;

But when I use this, I get either an error or no array at all. How can I capture what I need with preg_match, or how else can I get at the links inside the div?

1 answer

  • dtntjwkl83750 2016-02-02 08:11

    OK, here is the entire solution (if this is what you're looking for).
    By the way, cURL is not needed here; a plain file_get_contents() call is enough.

    I kept your three-step approach:

    • Step 1: extract between <nav>…</nav>.
    • Step 2: extract all hrefs between <div>…</div>.
    • Step 3: gather text from different sources and clean it.
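Before the full script, step 2 can be sketched in isolation on the sample markup from the question. This is only an illustration under the assumption that the link sits directly inside the div; the pattern and the $regex variable below are not taken from the real page. It also shows why `$list[0]` does not hold the link text: index 0 is always the whole match, and the captured groups start at index 1.

```php
<?php
// Sample markup copied from the question.
$html = '<div class="a"><a href="abc.php">a linki</a></div>';

// Illustrative pattern: group 1 captures the href value, group 2 the link text.
$regex = '%<div[^>]*>\s*<a\s+href=[\'"]([^\'"]*)[\'"][^>]*>(.*?)</a>\s*</div>%si';

if (preg_match($regex, $html, $list)) {
    // $list[0] is the full matched string; captures live in $list[1] and $list[2].
    echo '<a href="' . $list[1] . '">' . $list[2] . '</a>';
} else {
    echo 'Not found';
}
```

With preg_match_all() and PREG_PATTERN_ORDER, as used in the full script, the same groups arrive as arrays: index 1 lists every href and index 2 every link body.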

    Code

    <?php
    $page = file_get_contents('http://www.houseoffraser.co.uk/');
    
    if($page===false) // check for execution errors
    {
        // No cURL handle exists here, so curl_error() cannot be used.
        echo 'Scraper error: could not fetch the page';
        exit;
    }
    
    if ( preg_match_all('%<nav class=[\'"]hof-buttons-set left[\'"]>(.*?)</nav>%si', $page, $regs1, PREG_PATTERN_ORDER) ) {
        // Step 1: $regs1[1] holds the inner HTML of each matching <nav>.
        for ($x1 = 0; $x1 < count($regs1[0]); $x1++) {
            // Step 2: group 1 is the href, group 2 the inner HTML of the link.
            if ( preg_match_all('%<div.*?<a href=[\'"]([^\'"]*?)[\'"]>(.*?)</a>.*?</div>%sim', $regs1[1][$x1], $regs2, PREG_PATTERN_ORDER) ) {
                for ($x2 = 0; $x2 < count($regs2[0]); $x2++) {
                    $link = $regs2[1][$x2];
                    // Step 3: take the link text from whichever source is present.
                    if (preg_match('/<img.*? title=[\'"](.*?)[\'"]/sim', $regs2[2][$x2], $regs3)) {
                        // No text, but image with title
                        $text = $regs3[1];
                    } elseif (preg_match('%<span.*?class=[\'"]hof-label[\'"].*?>(.*?)</span>%sim', $regs2[2][$x2], $regs3)) {
                        // Text in <span class="hof-label">...</span>
                        $text = $regs3[1];
                    } else {
                        // Plain text
                        $text = $regs2[2][$x2];
                    }
                    echo '<a href="'.$link.'" title="'.$link.'" target="_blank">' . trim($text) . '</a><br />';
                }
            } else {
                echo '<span style="color:red; font-weight:bold;">HREF not found</span><br />';
            }
        }
    } else {
        echo '<span style="color:red; font-weight:bold;">NAV not found</span><br />';
        exit;
    }
    ?>
    

    Result

    text: Women
    link:http://www.houseoffraser.co.uk/Women%27s+Designer+Clothing/03,default,sc.html

    text: Dresses
    link:http://www.houseoffraser.co.uk/women%27s+designer+dresses/301,default,sc.html

    [....]
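Since the question already creates a DOMDocument, the same three steps can also be sketched with DOMDocument and DOMXPath instead of regular expressions. This is an alternative technique, not what the script above does. The sample markup, the extractNavLinks() function name, and the XPath query are assumptions made for illustration; the real page may be structured differently.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$page = '<nav class="hof-buttons-set left">'
      . '<div><a href="/women.html"><span class="hof-label">Women</span></a></div>'
      . '<div><a href="/sale.html"><img src="s.png" title="Sale" /></a></div>'
      . '<div><a href="/men.html"> Men </a></div>'
      . '</nav>';

// Hypothetical helper: returns a list of ['href' => ..., 'text' => ...] pairs.
function extractNavLinks(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them;
    // real-world HTML (and HTML5 tags such as <nav>) tends to trigger them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Steps 1 and 2: every <a href> inside a <div> inside the target <nav>.
    $query = '//nav[contains(concat(" ", normalize-space(@class), " "), " hof-buttons-set ")]'
           . '//div//a[@href]';

    $links = [];
    foreach ($xpath->query($query) as $a) {
        // Step 3: prefer an image title, otherwise use the link's text content.
        $img  = $xpath->query('.//img[@title]', $a)->item(0);
        $text = $img !== null ? $img->getAttribute('title') : $a->textContent;
        $links[] = ['href' => $a->getAttribute('href'), 'text' => trim($text)];
    }
    return $links;
}

foreach (extractNavLinks($page) as $link) {
    echo 'text: ' . $link['text'] . "\n" . 'link: ' . $link['href'] . "\n\n";
}
```

Unlike the regex version, this sketch does not single out the hof-label span; it relies on textContent, which returns all text inside the link, so the result can differ when a link contains additional text nodes.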

    This answer was accepted by the asker as the best answer.
