dongmeirang4679 2014-09-06 11:05
Views: 30

PHP web scraping using a div

I have tried everything I could find, including the approaches suggested in other questions, but nothing works.

From this page:

http://www.interparcel.com/tracking.php?action=dotrack&trackno=RE367831140GR

I want to extract this text:

Sorry, no consignment was found with these details.Error - No xml data received

I have also tried the same procedure on parcelforce.com and dhl.com, and it returns zero matches there as well.

Things I have tried (among many others):

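$nummm = 'RE367831140GR'; // tracking number from the URL above

// Concatenate the variable: inside single quotes, $nummm is not interpolated
$curl = curl_init('http://www.interparcel.com/tracking.php?action=dotrack&trackno=' . urlencode($nummm));
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

$page = curl_exec($curl);

if (curl_errno($curl)) { // check for execution errors
    echo 'Scraper error: ' . curl_error($curl);
    exit;
}

curl_close($curl);

// Use "#" as the delimiter so the "/" in </div> does not end the pattern early
$regex = '#<div class="header-description">(.*?)</div>#s';
if (preg_match($regex, $page, $list)) {
    echo $list[1]; // group 1 holds the text inside the div
} else {
    print "Not found";
}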
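
As an alternative to matching the markup with a regular expression, the same div can be selected with PHP's bundled DOM extension. The following is only a minimal sketch. It assumes the HTML returned by the server really contains a div with the class header-description; if the text is filled in later by JavaScript, no parser working on the raw response will see it. The helper name extractDivText and the inline sample markup are hypothetical and stand in for the fetched page.

```php
<?php
// Hypothetical helper: returns the trimmed text of the first <div> carrying
// the given class, or null when no such element exists.
// $class is assumed to be a plain class name without quote characters.
function extractDivText(string $html, string $class): ?string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them,
    // since real-world markup is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word inside the class attribute.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $nodes = $xpath->query($query);

    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Inline sample standing in for the downloaded page (assumed structure).
$sample = '<html><body><div class="header-description">'
        . 'Sorry, no consignment was found with these details.'
        . '</div></body></html>';

$text = extractDivText($sample, 'header-description');
echo $text === null ? "Not found\n" : $text . "\n";
```

In real use, the $page string returned by curl_exec would be passed in place of $sample. A null result then means the class is not present in the raw response, which is worth checking by saving $page to a file and inspecting it.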

<?php // File: MatchAllDivMain.php

// Read html file to be processed into $data variable
$data = file_get_contents('test.html');

// Commented regex to extract contents from <div class="main">contents</div>
//  where "contents" may contain nested <div>s.
//  Regex uses PCRE's recursive (?1) subexpression syntax to recurse into group 1
$pattern_long = '{           # recursive regex to capture contents of "main" DIV
<div\s+class="main"\s*>              # match the "main" class DIV opening tag
  (                                   # capture "main" DIV contents into $1
    (?:                               # non-cap group for nesting * quantifier
      (?: (?!<div[^>]*>|</div>). )++  # possessively match all non-DIV tag chars
    |                                 # or 
      <div[^>]*>(?1)</div>            # recursively match nested <div>xyz</div>
    )*                                # loop however deep as necessary
  )                                   # end group 1 capture
</div>                               # match the "main" class DIV closing tag
}six';  // single-line (dot matches all), ignore case and free spacing modes ON

// short version of same regex
$pattern_short = '{<div\s+class="main"\s*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>}si';

$matchcount = preg_match_all($pattern_long, $data, $matches);
// $matchcount = preg_match_all($pattern_short, $data, $matches);
echo("<pre>\n");
if ($matchcount > 0) {
    echo("$matchcount matches found.\n");
    //  print_r($matches);
    for ($i = 0; $i < $matchcount; $i++) {
        echo("\nMatch #" . ($i + 1) . ":\n");
        echo($matches[1][$i]); // print 1st capture group for match number i
    }
} else {
    echo('No matches');
}
echo("\n</pre>");
?>
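
Because the script above reads test.html, a quick way to check the recursive pattern in isolation is to run it against an inline string. This is a sketch under stated assumptions: the sample markup and the helper name mainDivContents are hypothetical, and the pattern is copied unchanged from $pattern_short.

```php
<?php
// Same expression as $pattern_short in the script above.
$pattern = '{<div\s+class="main"\s*>((?:(?:(?!<div[^>]*>|</div>).)++|<div[^>]*>(?1)</div>)*)</div>}si';

// Hypothetical sample input containing one nested <div>.
$html = '<p>intro</p>'
      . '<div class="main">Hello <div class="inner">nested</div> world</div>'
      . '<p>outro</p>';

// Hypothetical helper: returns the contents of every <div class="main">.
function mainDivContents(string $pattern, string $html): array
{
    $count = preg_match_all($pattern, $html, $matches);
    if ($count === false) {
        // preg_last_error() reports why PCRE gave up (e.g. backtrack limit).
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    return $matches[1]; // capture group 1 of each match
}

print_r(mainDivContents($pattern, $html));
```

Note that the pattern only matches an opening tag written exactly as <div class="main">, with double quotes and no other attributes, so a tag such as <div id="x" class="main"> or a different class name yields no matches.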

I have also tried the methods described in other questions, all without success. Any help on what I'm doing wrong?
