I want my PHP program to extract all the URLs from an HTML file. When I wrote a C# program to do the same thing, I used the regular expression below and then prepended "http" to each match to get the full URL list. Can you please tell me how I can use the same regular expression in PHP?
    List<string> links = new List<string>();
    Regex regEx;
    Match matches;
    regEx = new Regex("href=\"http\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))\"", RegexOptions.IgnoreCase | RegexOptions.Compiled);
    for (matches = regEx.Match(downloadString); matches.Success; matches = matches.NextMatch())
    {
        links.Add("http" + matches.Groups[1].ToString());
    } // Add all the URLs to a list and return the list
    return links;
I would really appreciate it if you could show me how with an example.
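For reference, here is a minimal sketch of how the C# loop above might look in PHP using preg_match_all. The function name extractHttpLinks and the sample HTML are made up for illustration, and the pattern is a simplified variant (it captures the whole http/https URL inside the quotes, so nothing has to be prepended afterwards), not an exact port of the C# expression.

```php
<?php
// Hypothetical helper: collect absolute http/https href values with a regex.
function extractHttpLinks(string $html): array
{
    // Match href="http..." or href='http...' case-insensitively.
    // Group 1 captures the URL up to the next quote character.
    $pattern = '#href\s*=\s*["\'](https?://[^"\']+)["\']#i';
    preg_match_all($pattern, $html, $matches);
    return $matches[1]; // captured URLs, in document order
}

// Made-up sample input for illustration only.
$sample = '<a href="http://example.com/a">A</a> '
        . '<a HREF=\'https://example.com/b\'>B</a> '
        . '<a href="/c">C</a>';

print_r(extractHttpLinks($sample));
```

Relative links such as /c are skipped by this pattern, which mirrors what the C# version does.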
@julian Thank you so much for the detailed explanation. I modified the code a little and used it in the following way:
    $html = file_get_contents('http://mysmallwebpage.com/');
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    $links = $dom->getElementsByTagName('a');
    foreach ($links as $link)
    {
        $returnLink = $link->getAttribute('href');
        echo "<br />", $returnLink;
    }
But the result doesn't show the full URL. It outputs things like:
/nmsd-gallery/
/home/?currentPage=3
javascript:noop();
Can you please tell me if there is a way to get just the full URLs, such as:
http://mysmallwebpage.com/
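To make clearer what I am after, here is a rough sketch of the kind of conversion I mean: relative hrefs joined onto the page address, and non-web links such as javascript: dropped. The function name toAbsoluteUrl is hypothetical, and this is a simplified sketch rather than a complete resolver: it assumes the base URL has a scheme and a host, and it does not handle "../" segments or a <base> tag.

```php
<?php
// Hypothetical helper: turn an href into an absolute URL,
// or return null when it is not a web link.
// Simplified: assumes $baseUrl has a scheme and host; ignores "../" and <base>.
function toAbsoluteUrl(string $href, string $baseUrl): ?string
{
    $href = trim($href);
    if ($href === '' || $href[0] === '#') {
        return null; // empty or fragment-only link
    }
    if (preg_match('#^https?://#i', $href)) {
        return $href; // already absolute
    }
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return null; // other schemes: javascript:, mailto:, tel:, ...
    }

    $parts = parse_url($baseUrl);
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }

    if (substr($href, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $href; // protocol-relative link
    }
    if ($href[0] === '/') {
        return $root . $href; // root-relative link
    }

    // Relative to the directory part of the base path.
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}

// The three href values from the output above.
$base = 'http://mysmallwebpage.com/';
foreach (['/nmsd-gallery/', '/home/?currentPage=3', 'javascript:noop();'] as $href) {
    $url = toAbsoluteUrl($href, $base);
    if ($url !== null) {
        echo $url, "\n";
    }
}
```

In my loop this would be called on the value returned by $link->getAttribute('href'), printing only the non-null results.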