doushan6161 2012-09-06 23:04

PHP regular expression - get the text from all links with a specified class [duplicate]

Possible Duplicate:
How to parse and process HTML with PHP?

I'm trying to use PHP and regex to grab all the hyperlinks from an external page. The links I care about scraping are structured as follows:

<li class="magic"><a href="http://blah.com">TargetText1</a></li>
<li class="magic"><a href="http://blah.com">TargetText2</a></li>

Please bear in mind that I'm trying to get the anchor text, NOT the URL. The code below works, but it scrapes every link on the page. I only want the links wrapped in an li with the class shown above.

 $url = "http://www.example.com"; 
 $input = @file_get_contents($url) or die("Could not access file: $url"); 

 $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

 if(preg_match_all("/$regexp/siU", $input, $matches)) { 
  print_r($matches);
 }

1 answer

  • dongzaizai2015 2012-09-06 23:11
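    <?php
    
        // $html must already hold the page markup, e.g. the string the
        // question fetches with file_get_contents().
        libxml_use_internal_errors(true); // real-world HTML is rarely valid
        $dom = new DOMDocument;
        $dom->preserveWhiteSpace = false; // only takes effect if set before loading
        $dom->loadHTML($html);
        $lis = $dom->getElementsByTagName('li');
        foreach ($lis as $li) {
            // Exact match: only class="magic", not class="magic other"
            if ($li->getAttribute('class') == 'magic') {
                $links = $li->getElementsByTagName('a');
                if ($links->length) {
                    // nodeValue of the first <a> is its anchor text
                    echo $links->item(0)->nodeValue, PHP_EOL;
                }
            }
        }
    
    ?>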
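
The loop above can also be written as a single XPath query, which may be handy when an li carries more than one class or contains several links. The following is only a sketch under assumptions: the helper name `magicLinkTexts` and the sample markup are made up for illustration and are not part of the original answer.

```php
<?php

// Hypothetical helper (not from the original answer): returns the anchor
// text of every <a> nested inside an <li> whose class list contains "magic".
function magicLinkTexts(string $html): array
{
    $dom = new DOMDocument();

    // Keep libxml from emitting warnings on imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // The concat/normalize-space idiom matches "magic" as a whole class
    // token, so class="magic extra" matches but class="magical" does not.
    $query = '//li[contains(concat(" ", normalize-space(@class), " "), " magic ")]//a';

    $texts = [];
    foreach ($xpath->query($query) as $a) {
        $texts[] = trim($a->textContent);
    }
    return $texts;
}

// Sample markup modelled on the question (assumed, for illustration only).
$html = '<ul>'
      . '<li class="magic"><a href="http://blah.com">TargetText1</a></li>'
      . '<li class="magic extra"><a href="http://blah.com">TargetText2</a></li>'
      . '<li class="other"><a href="http://blah.com">Ignored</a></li>'
      . '</ul>';

print_r(magicLinkTexts($html));
```

With the sample markup this collects TargetText1 and TargetText2 and skips the link in the li with class "other". Unlike the loop above, it returns every link inside a matching li rather than only the first one, so pick whichever behaviour fits the page being scraped.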
    
    This answer was selected as the best answer by the asker.
