duanshan2988 2017-02-12 16:50
Views: 58
Accepted

Web scraping links and dates from an HTML table on the 3GPP website

I'm trying to scrape the zip links and their corresponding dates from the Release tab of the page linked below:

3GPP report Website.

I am able to extract the zip links using the PHP code below:

preg_match_all('/<ul class=\"rpRootGroup\">(.*?)<\/ul/s',$specpage,$zipul);
$specul = new domDocument;
@$specul->loadHTML($zipul[0][0]);
$specul->preserveWhiteSpace = true;
$xpathspecul = new DOMXPath($specul);
$rowsUL = $xpathspecul->query('//tr');
$resultul = array();
$zipf = array();
$zipuni = array();

foreach ($rowsUL as $rowul) {
    $colsul = $rowul->getElementsByTagName('td');
    foreach ($colsul as $colul) {

        if($xpathspecul->evaluate('count(.//a)', $colul) > 0) { // check if an anchor exists
            $slinkul = $xpathspecul->evaluate('string(.//a/@href)', $colul); // if there is, then echo the href value
        }
        if (isset($slinkul) && $slinkul!=null){
            $resultul[] = $slinkul;
        }
    }
}

foreach ($resultul as $ziplink){
    $chkzip = pathinfo($ziplink, PATHINFO_EXTENSION);
    if ($chkzip == 'zip' && $ziplink!==null){
        $zipf[] = trim($ziplink);
    }
}
$zipuni = array_values (array_unique($zipf));

$specpage contains the page's HTML, loaded using cURL.

Sample image of aforementioned Zip link and Date

However, I am not able to extract the corresponding dates.

Further, I am having a problem with 'array_unique': the same zip link can appear with different corresponding dates, so deduplicating on the link alone loses those pairs. However, without 'array_unique' I'm getting a lot of duplicate links.

Any help is appreciated.

1 answer

  • dongqindu8110 2017-02-12 17:52

    If you're literally just trying to grab the date (YYYY-MM-DD) and zip URL from the given page, you could just use the code below. You could easily combine this into one regex, but it's clearer to see how it works using two. Because the regex queries are so specific, I was getting precisely 21 matches per query, so it was just a matter of creating an additional array with keys so the data can be sorted with ease.

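    $url = 'https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1387';
    $data = file_get_contents($url);
    if ($data === false) {
        die('Could not fetch ' . $url);
    }
    // Non-greedy match, so two links on one line are not merged into a single match.
    preg_match_all('/http:\/\/[^"\'\s<>]*?\.zip/', $data, $links);
    // Capture group 1 holds the bare date; $dates[0] would also include the <td> tags.
    preg_match_all('/<\/td><td>\s*(\d+-\d+-\d+)\s*<\/td><td>/', $data, $dates);
    $newArr = []; // Your new array with URLs and dates

    foreach ($dates[1] as $k => $v) {
        if (!isset($links[0][$k])) {
            break; // dates and links are paired by position, so stop if a link is missing
        }
        $newArr[] = ['date' => $v, 'url' => $links[0][$k]];
        // echo is for testing purposes.
        echo 'Date: ' . $newArr[$k]['date'] . '<br>URL: ' . $newArr[$k]['url'] . '<br><br>';
    }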
    

    Output:

    Date: 2015-12-18
    URL: http://www.3gpp.org/ftp/Specs/archive/26_series/26.073/26073-d00.zip
    
    Date: 2014-09-26
    URL: http://www.3gpp.org/ftp/Specs/archive/26_series/26.073/26073-c00.zip
    
    Date: 2012-09-21
    URL: http://www.3gpp.org/ftp/Specs/archive/26_series/26.073/26073-b00.zip
    
    Date: 2011-04-05
    URL: http://www.3gpp.org/ftp/Specs/archive/26_series/26.073/26073-a00.zip
    
    Date: 2009-12-18
    URL: http://www.3gpp.org/ftp/Specs/archive/26_series/26.073/26073-900.zip
    
    Date: 2008-12-18
    URL: http://www.3gpp.org/ftp/Specs/archive/26_series/26.073/26073-800.zip
    
    Date: 2007-06-21
    URL: http://www.3gpp.org/ftp/Specs/archive/26_series/26.073/26073-700.zip
    
    Date: 2005-01-06
    URL: http://www.3gpp.org/ftp/Specs/archive/26_series/26.073/26073-600.zip
    
    Date: 2004-04-01
    URL: http://www.3gpp.org/ftp/Specs/archive/26_series/26.073/26073-530.zip
    
    Date: 2003-10-02
    URL: http://www.3gpp.org/ftp/Specs/archive/26_series/26.073/26073-520.zip
    
    etc....
    

    I've spot checked the data and the dates match up perfectly with the links.
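
On the 'array_unique' part of the question: one option is to deduplicate on the (link, date) pair rather than on the link alone. The following is only a rough sketch of that idea, not a tested solution for the live page. It assumes each table row holds both the zip anchor and a cell containing just a YYYY-MM-DD date; the sample markup, the example.org URLs, and the function name extractZipDatePairs are made up for illustration.

```php
<?php
// Hypothetical sample markup standing in for the real page; the actual
// 3GPP table layout is an assumption here.
$html = '<table>
  <tr><td><a href="http://example.org/spec-d00.zip">13.0.0</a></td><td> 2015-12-18 </td><td>x</td></tr>
  <tr><td><a href="http://example.org/spec-d00.zip">13.0.0</a></td><td> 2015-12-18 </td><td>x</td></tr>
  <tr><td><a href="http://example.org/spec-d00.zip">13.0.0</a></td><td> 2016-01-05 </td><td>x</td></tr>
  <tr><td><a href="http://example.org/notes.html">notes</a></td><td> 2014-09-26 </td><td>x</td></tr>
</table>';

// Hypothetical helper: returns a list of ['date' => ..., 'url' => ...] rows,
// unique per (url, date) combination.
function extractZipDatePairs(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $pairs = [];
    foreach ($xpath->query('//tr') as $row) {
        // Start from null on every row so a link never leaks into the next row.
        $url = null;
        foreach ($xpath->query('.//a/@href', $row) as $href) {
            $candidate = trim($href->nodeValue);
            if (strtolower(pathinfo($candidate, PATHINFO_EXTENSION)) === 'zip') {
                $url = $candidate;
                break;
            }
        }
        if ($url === null) {
            continue; // no zip link in this row
        }

        // Take the first cell whose whole text is a YYYY-MM-DD date.
        $date = null;
        foreach ($xpath->query('./td', $row) as $cell) {
            if (preg_match('/^\s*(\d{4}-\d{2}-\d{2})\s*$/', $cell->textContent, $m)) {
                $date = $m[1];
                break;
            }
        }
        if ($date === null) {
            continue; // no date cell in this row
        }

        // Keying on url + date drops exact repeats but keeps the same
        // zip link when it appears with a different date.
        $pairs[$url . '|' . $date] = ['date' => $date, 'url' => $url];
    }
    return array_values($pairs);
}

$pairs = extractZipDatePairs($html);
print_r($pairs);
```

If the real page nests tables inside rows, the '//tr' query would also visit the outer rows, so the XPath would probably need to be narrowed to the specific table.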
