I need to extract absolute URLs from a page's source code. I am extracting URLs from the following:
>img tag SRC
>Script tag SRC (JS)
>CSS links
I'm using a separate function for each of the three. The problem is that I sometimes get relative URLs, which are of no use as they are because I have to process them further. Kindly review the following three functions and suggest improvements and corrections for how I can convert the URLs to absolute ones (after checking that they are not absolute already, of course).
Thank you!
Function for extracting Image SRC.
    function get_images(){
        $images=array();
        // Capture each <img> src value up to its image extension (jpg, png, gif, jpeg).
        $regex='/[^(<!--)]<img [^>]*src=["\']([^"\']+(jpg|png|gif|jpeg))/i';
        preg_match_all($regex, $this->source_code, $matches);
        // $matches[1] holds the captured URLs.
        foreach ($matches[1] as $key=>$value) {
            $images[$key]=$value;
        }
        return $images;
    }
Function for extracting JS links
    function get_scripts(){
        $script_links=array();
        // Capture each <script> src value up to its .js extension.
        $regex='/<script [^>]*src=["\']([^"\']+(\.js))/i';
        preg_match_all($regex, $this->source_code, $matches);
        // $matches[1] holds the captured URLs.
        foreach ($matches[1] as $key=>$value) {
            $script_links[$key]=$value;
        }
        return $script_links;
    }
Function for extracting CSS stylesheet links
    function get_css(){
        $css_links=array();
        // Capture each <link> href value up to its .css extension.
        $regex='/<link [^>]*href=["\']([^"\']+(\.css))/i';
        preg_match_all($regex, $this->source_code, $matches);
        // $matches[1] holds the captured URLs.
        foreach ($matches[1] as $key=>$value) {
            $css_links[$key]=$value;
        }
        return $css_links;
    }
Output I get when I use it on Google.com's source:
Array ( [0] => /images/icons/product/chrome-48.png [1] => http://www.google.com/images/hpp/pyramids-35.png )
The first link starts with /images/... and is not usable on its own. This is the problem I'm trying to fix for all three types of sources.
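To make the goal concrete, here is a minimal sketch of the kind of conversion I have in mind, assuming the URL of the fetched page is available as a base. The function name to_absolute_url and the $base parameter are hypothetical and not part of the class above. It only covers the common cases (already absolute, protocol-relative, root-relative and path-relative URLs) and does not claim to follow every rule of RFC 3986.

```php
<?php
// Hypothetical helper (not part of the original class): resolve $url against
// $base, the URL of the page the source code was fetched from.
// Sketch only: references that are empty or consist only of a query string
// or fragment are not resolved the way RFC 3986 specifies.
function to_absolute_url($url, $base)
{
    $url = trim($url);

    // Already absolute (starts with a scheme such as http: or https:).
    if (preg_match('~^[a-z][a-z0-9+.\-]*:~i', $url)) {
        return $url;
    }

    $parts  = parse_url($base);
    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
    $host   = isset($parts['host']) ? $parts['host'] : '';
    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';
    $root   = $scheme . '://' . $host . $port;

    // Protocol-relative URL (//host/path): reuse the scheme of the base.
    if (substr($url, 0, 2) === '//') {
        return $scheme . ':' . $url;
    }

    // Root-relative URL (/images/a.png): prepend scheme and host.
    if (substr($url, 0, 1) === '/') {
        return $root . $url;
    }

    // Path-relative URL (img/a.png, ../a.png): resolve against the directory
    // of the base path, keeping any query string or fragment untouched.
    $cut    = strcspn($url, '?#');
    $path   = substr($url, 0, $cut);
    $suffix = substr($url, $cut);

    $base_path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($base_path, 0, strrpos($base_path, '/') + 1);

    $segments = array();
    foreach (explode('/', $dir . $path) as $segment) {
        if ($segment === '..') {
            array_pop($segments);      // step up one directory
        } elseif ($segment !== '.' && $segment !== '') {
            $segments[] = $segment;    // ordinary path segment
        }
    }

    return $root . '/' . implode('/', $segments) . $suffix;
}
```

With a helper like this, each of the three functions could pass its matches through it before returning, for example `$images[$key] = to_absolute_url($value, $base);`, where `$base` would be the URL the source was downloaded from (an assumption, since the class above does not show where that URL is stored).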