Wow, that is a big expression. I found several faults in it, and I'll try to explain each of them. Let's break it apart:
$pattern ="/
Here was your first mistake. Because a forward slash appears in several parts of a URL, you should use a different delimiter. I would suggest a tilde ~, as this is not used in a URL very often. This means you don't have to keep escaping every forward slash as \/ throughout the pattern.
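As a rough illustration of the delimiter point (the subject string and variable names are made up for this example, not taken from your code), both patterns below perform the same check; only the amount of escaping differs:

```php
<?php
// Hypothetical subject string, used only for this illustration.
$subject = 'http://www.example.com/';

// With "/" as the delimiter, every literal slash needs a backslash.
$withSlashes = '/^http:\/\/www\./';

// With "~" as the delimiter, slashes are ordinary characters.
$withTildes = '~^http://www\.~';

$slashResult = preg_match($withSlashes, $subject);
$tildeResult = preg_match($withTildes, $subject);

// Both calls report a match.
var_dump($slashResult === 1, $tildeResult === 1);
```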
^(http|https|ftp)\:\/\/www\.([a-zA-Z0-9\.\-]+
This character class contains the next error. Within a character class, a dot just means a dot, so there is no need to escape it. Furthermore, because the dash is placed at the end, it does not need escaping either, as it cannot possibly mean a range. The character class can be shortened to:
[a-zA-Z0-9.-]+
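If you want to check this for yourself, a small sketch like the following compares the escaped and unescaped classes on a few made-up sample strings (the samples and variable names are my own, purely for illustration):

```php
<?php
$escaped   = '~^[a-zA-Z0-9\.\-]+$~';
$unescaped = '~^[a-zA-Z0-9.-]+$~';

// Hypothetical inputs: some should match, some should not.
$samples = ['www.example-site.com', 'sub.domain', 'under_score', 'a b', ''];

$agree = true;
foreach ($samples as $sample) {
    // Both patterns must give the same verdict for every sample.
    if (preg_match($escaped, $sample) !== preg_match($unescaped, $sample)) {
        $agree = false;
    }
}

// Inside a class the dot is literal: it matches "." but not an arbitrary character.
$dotIsLiteral = preg_match('~^[.]$~', '.') === 1
    && preg_match('~^[.]$~', 'x') === 0;

var_dump($agree, $dotIsLiteral);
```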
(\:[a-zA-Z0-9\.&amp;%\$\-]+
Here we have the next error: &amp; within the character class. This will match an &, an a, an m, a p, or a ;, not just an &. You don't need to convert it to the HTML entity, as doing so just adds every character of the entity to the class. As before, you don't need to escape the dot, or the dash if it is at the end. You also don't need to escape the dollar sign, as in a character class it just means a dollar. Remember, within a character class, all metacharacters are just standard characters except the caret ^ (when it comes first), the backslash \, the closing square bracket ], the dash - (which can be left unescaped if it's at the end), and whatever you choose as your delimiter, e.g. tilde ~. This character class can then become:
[a-zA-Z0-9.&%$-]+
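To see why an HTML entity inside a character class over-matches, here is a minimal sketch (the two test classes and the sample strings are assumptions made up for the demonstration):

```php
<?php
// A class containing the five characters of the HTML entity: & a m p ;
$entityClass = '~^[&amp;]+$~';
// A class containing only the ampersand.
$plainClass = '~^[&]+$~';

// "map" is built entirely from letters that appear in the entity.
$entityMatchesLetters   = preg_match($entityClass, 'map') === 1;
$entityMatchesSemicolon = preg_match($entityClass, ';;') === 1;

// The plain class accepts ampersands and nothing else.
$plainMatchesLetters   = preg_match($plainClass, 'map') === 1;
$plainMatchesAmpersand = preg_match($plainClass, '&&') === 1;

var_dump(
    $entityMatchesLetters,
    $entityMatchesSemicolon,
    $plainMatchesLetters,
    $plainMatchesAmpersand
);
```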
)*@)*(\.){1}
Part of this might be an error, it might not be. Basically, is there any need to capture the dot here? If there is no need to capture it, leave the brackets out. However, there is a definite error in the repetition: {1} is completely and utterly superfluous. Every token is matched exactly once unless you say otherwise, so this just makes the code messy. The above can be shortened to:
)*@)*\.
((25[0-5]|2[0-4][0-9]|[0-1]{1}
Again, the {1} is not needed. Remove it:
((25[0-5]|2[0-4][0-9]|[0-1]
[0-9]{2}|[1-9]{1}[0-9]{1}
And again, twice. This becomes:
[0-9]{2}|[1-9][0-9]
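A quick way to confirm that dropping {1} changes nothing is to run both versions of one octet group over a range of inputs. This is only a sketch: I have anchored a single octet group with ^ and $ so it can be tested in isolation, which your full pattern does not do.

```php
<?php
// One octet group from the original pattern, with the redundant {1} quantifiers...
$original  = '~^(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)$~';
// ...and the same group with them removed.
$shortened = '~^(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)$~';

$disagreements = 0;
$accepted = 0;
for ($i = 0; $i <= 999; $i++) {
    $candidate = (string) $i;
    $a = preg_match($original, $candidate);
    $b = preg_match($shortened, $candidate);
    if ($a !== $b) {
        $disagreements++;   // the two patterns gave different verdicts
    }
    if ($b === 1) {
        $accepted++;        // counts the values 0 through 255
    }
}

echo "disagreements: $disagreements, accepted: $accepted\n";
```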
You keep doing this. The next block of code you have can be shortened from:
|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])
into:
|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9])
It's not amazingly better, but every little helps. Next:
|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+
The two character classes can be simplified:
|([a-zA-Z0-9-]+\.)*[a-zA-Z0-9-]+
\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2})
This is very restrictive, but I assume you have it like this for a reason so I'll leave it.
)(\:[0-9]+)*(/
And here is the cause of your error: you did not escape the forward slash, so the regex engine treats it as the closing delimiter. However, I am going to leave it, as using a different delimiter avoids this and also tidies up your pattern.
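The failure can be reproduced on a much smaller pattern. In this sketch (the tiny patterns and the subject are made up for illustration), the unescaped slash ends the pattern early and PHP tries to read the rest as modifiers; I have used @ only to keep the warning out of the output:

```php
<?php
$subject = 'a/b';

// "/" is the delimiter, so the bare "/" inside the group closes the pattern.
// PHP then sees "b)$/" as modifiers, emits a warning, and returns false.
$broken = @preg_match('/^a(/b)$/', $subject);

// Escaping the slash fixes it...
$escaped = preg_match('/^a(\/b)$/', $subject);

// ...and so does switching the delimiter, with no escaping needed.
$tilde = preg_match('~^a(/b)$~', $subject);

var_dump($broken, $escaped, $tilde);
```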
($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$/";
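That character class can be greatly shortened now that we know we don't need to escape everything within it. It can become:
($|[a-zA-Z0-9.,?'\\+&%$#=~_-]+))*$/";
One caveat: inside a double-quoted PHP string, \\ collapses to a single backslash, so \\+ reaches the regex engine as \+, which is just a plus. Your original \\\+ reached it as \\+ and therefore also matched a literal backslash. If you need to keep that, write \\\\+ instead.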
Using everything we now know, your pattern can be made much tidier and easier to handle. It can become instead:
$pattern = "~^(http|https|ftp)://www\.([a-zA-Z0-9.-]+(:[a-zA-Z0-9.&%$-]+)*@)*((25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9])|([a-zA-Z0-9-]+\.)+(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(:[0-9]+)*(/($|[a-zA-Z0-9.,?'\\+&%$#=\~_-]+))*$~";
Now that you have a smaller expression, finding faults and customizing it further should be a little easier.
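A small test harness makes that fault-finding easier. The sketch below runs the cleaned-up pattern against a handful of URLs; the URLs are made-up examples of mine, so swap in whatever your application actually needs to accept or reject:

```php
<?php
$pattern = "~^(http|https|ftp)://www\.([a-zA-Z0-9.-]+(:[a-zA-Z0-9.&%$-]+)*@)*((25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1][0-9]{2}|[1-9][0-9]|[0-9])|([a-zA-Z0-9-]+\.)+(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(:[0-9]+)*(/($|[a-zA-Z0-9.,?'\\+&%$#=\~_-]+))*$~";

// Hypothetical sample URLs, chosen only to exercise different branches.
$urls = [
    'http://www.example.com',
    'https://www.example.co.uk/path/page.html?x=1&y=2',
    'ftp://www.example.org:21/files',
    'http://example.com',                  // no "www.", which the pattern requires
    'http://www.example.com/has space',    // a space is not in the path class
];

$results = [];
foreach ($urls as $url) {
    $results[$url] = preg_match($pattern, $url) === 1;
    echo ($results[$url] ? 'match    ' : 'no match ') . $url . "\n";
}
```

Note that the pattern insists on a leading www., so perfectly valid URLs without it are rejected.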
Just a quick note
I noticed that several of your groups begin with (\: and I have removed the backslash, as a colon does not need escaping. However, were you trying to make the group non-capturing? If so, the syntax for that is (?: at the start of the group.
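In case the difference is not obvious, here is a minimal sketch of the two forms side by side (the port-matching patterns and the subject are made up for the example). The first group matches a literal colon and captures it; the second uses (?: to group without capturing:

```php
<?php
$subject = 'example.com:8080';

// "(:" is an ordinary capturing group whose first character is a colon.
preg_match('~(:[0-9]+)$~', $subject, $capturing);

// "(?:" opens a non-capturing group; the literal colon follows it.
preg_match('~(?::[0-9]+)$~', $subject, $nonCapturing);

// $capturing holds the whole match plus one captured group;
// $nonCapturing holds only the whole match.
var_dump(count($capturing), count($nonCapturing));
```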
Edit: You can also shorten the pattern further by using shorthand character classes:
\d = [0-9]
\w = [a-zA-Z0-9_]
Adding i after the closing delimiter turns on case-insensitive matching too, which means that instead of writing [a-zA-Z] you can just write [a-z].
Also, the alternation http|https can just become https? (the s is optional).
So your pattern could be shortened further too:
$pattern = "~^(https?|ftp)://www\.([a-z\d.-]+(:[a-z\d.&%$-]+)*@)*((25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|[1-9])\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|[1-9]|0)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|[1-9]|0)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]\d|\d)|([a-z\d-]+\.)+(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-z]{2}))(:\d+)*(/($|[\w.,?'\\+&%$#=\~-]+))*$~i";
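If you want to convince yourself that each of these shorthands behaves as described, a few small checks along these lines should do (the sample strings are just made-up test inputs):

```php
<?php
// \d and [0-9] agree on a run of ASCII digits.
$digitsAgree = preg_match('~^\d+$~', '8080') === 1
    && preg_match('~^[0-9]+$~', '8080') === 1;

// \w covers letters, digits and the underscore.
$wordIncludesUnderscore = preg_match('~^\w+$~', 'snake_case9') === 1;

// https? makes the trailing "s" optional, and only that one character.
$optionalS = preg_match('~^https?$~', 'http') === 1
    && preg_match('~^https?$~', 'https') === 1
    && preg_match('~^https?$~', 'httpss') === 0;

// Without the i modifier [a-z] is lowercase only; with it, case is ignored.
$caseSensitive   = preg_match('~^[a-z]+$~', 'HTTP') === 1;
$caseInsensitive = preg_match('~^[a-z]+$~i', 'HTTP') === 1;

var_dump($digitsAgree, $wordIncludesUnderscore, $optionalS, $caseSensitive, $caseInsensitive);
```

Bear in mind that the i modifier applies to the whole pattern, so the shortened version also accepts an uppercase scheme, WWW and TLD, which the longer version did not.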