dongyao5186 2013-03-28 13:08

C# regular expression in PHP?

I want my PHP program to extract all the URLs from an HTML file. When I wrote a C# program to do the same thing, I used the following regular expression and then added the "http" part to the beginning of each match to get a full URL list. Can you please tell me how I can make the regular expression from the following code work in PHP?

        List<string> links = new List<string>();
        Regex regEx;
        Match matches;

        regEx = new Regex("href=\"http\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))\"", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        for (matches = regEx.Match(downloadString); matches.Success; matches = matches.NextMatch())
        {
            links.Add("http" + matches.Groups[1].ToString());
        } //Add all the URLs to a list and return the list

        return links;

I would really appreciate it if you could show me with an example.


@julian Thank you so much for the detailed explanation. I modified the code a little and used it in the following way:

$html = file_get_contents('http://mysmallwebpage.com/');
$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');

foreach ($links as $link)
{
    $returnLink = $link->getAttribute('href');
    echo "<br />", $returnLink;
}

But the result doesn't show the full URL address. It outputs things like:

/nmsd-gallery/
/home/?currentPage=3
javascript:noop();

Can you please tell me if there is a way I can get the full URL address? For example: http://mysmallwebpage.com/


2 answers

  • duanjie2940 2013-03-28 13:56

    Hmm, these are internal links of the page. In this case you have to filter out the JavaScript links (or other unwanted targets such as images) and add the HTTP_REFERER as a prefix.

    ...

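    foreach ($links as $link)
    {
        $returnLink = $link->getAttribute('href');
        // Leave javascript: links (or other unwanted calls) unprefixed.
        // Compare with === false: stripos() returns 0 for a match at the start.
        if (stripos($returnLink, "javascript") === false)
        {
            // Only links that are not already absolute need a prefix.
            if (stripos($returnLink, "http://") === false)
            {
                // HTTP_REFERER is only set if the client sends it; if it is
                // missing, use the URL of the fetched page instead.
                $returnLink = $_SERVER['HTTP_REFERER'] . $returnLink;
            }
        }
        echo "<br />++", $returnLink;
    }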
    

    There are many more cases to check, but I think this gives you an idea.
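To illustrate a few of those extra cases, here is a minimal sketch that resolves each href against the URL of the page that was fetched instead of HTTP_REFERER. The helper name `resolveLink` and the `$baseUrl` parameter are assumptions made for this example and are not part of the answer above. It handles absolute, protocol-relative, root-relative, query-only and document-relative links, skips fragments and non-HTTP schemes such as `javascript:`, and does not normalize `../` segments.

```php
<?php
// Hypothetical helper (not from the original answer): turns an href into an
// absolute URL relative to $baseUrl, or returns null for links to skip.
function resolveLink(string $baseUrl, string $href): ?string
{
    $href = trim($href);

    // Skip empty links and same-page fragments such as "#top".
    if ($href === '' || $href[0] === '#') {
        return null;
    }

    // A leading "scheme:" means the link is already absolute or not a web link.
    if (preg_match('/^([a-z][a-z0-9+.-]*):/i', $href, $m)) {
        $scheme = strtolower($m[1]);
        return ($scheme === 'http' || $scheme === 'https') ? $href : null;
    }

    $base = parse_url($baseUrl);
    if ($base === false || !isset($base['scheme'], $base['host'])) {
        return null; // the base URL itself is unusable
    }

    // Protocol-relative link, e.g. "//cdn.example.com/x.js".
    if (strpos($href, '//') === 0) {
        return $base['scheme'] . ':' . $href;
    }

    $origin = $base['scheme'] . '://' . $base['host']
        . (isset($base['port']) ? ':' . $base['port'] : '');
    $path = $base['path'] ?? '/';

    // Root-relative link, e.g. "/nmsd-gallery/".
    if ($href[0] === '/') {
        return $origin . $href;
    }

    // Query-only link, e.g. "?currentPage=3": keep the base path.
    if ($href[0] === '?') {
        return $origin . $path . $href;
    }

    // Document-relative link: resolve against the directory of the base path.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

// Usage with the hrefs from the question.
$baseUrl = 'http://mysmallwebpage.com/';
foreach (['/nmsd-gallery/', '/home/?currentPage=3', 'javascript:noop();'] as $href) {
    $url = resolveLink($baseUrl, $href);
    if ($url !== null) {
        echo $url, "\n";
    }
}
// prints:
// http://mysmallwebpage.com/nmsd-gallery/
// http://mysmallwebpage.com/home/?currentPage=3
```

In the DOMDocument loop from the question, the same call would be `resolveLink('http://mysmallwebpage.com/', $link->getAttribute('href'))`, with null results skipped.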

    This answer was selected by the asker as the best answer.