I asked a question previously on here, but decided that the question ought to be broken down into multiple ones (it helped that I debugged further to figure out more exactly what I needed!)
Another user on here provided a pretty darn good regex key to detect and hyperlink a URL, which is broken down into the following parts below:
$rexProtocol = '(https?://)?';
$rexDomain = '((?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';
$rexPort = '(:[0-9]{1,5})?';
$rexPath = '(/[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]*?)?';
$rexQuery = '(\?[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';
$rexFragment = '(#[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';
It's a great way to break a URL down to me, though this is of course coming from somebody that is working to get more familiar with the world of REGEX engines. Many good cases would be caught with this while conditional:
while (preg_match("{\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"]?(\s|$))}", $text, &$match, PREG_OFFSET_CAPTURE, $position)) {...
One thing I found slightly frustrating with this, however, is that this doesn't quite capture a link while leaving out trailing punctuation marks and other characters (it only worked with ONE punctuation mark at the end of the link, etc.). Thus, I decided to mess around with the conditional and after some tweaking and research, found the following conditional to work much better- /s
is replaced with a .
instead:
while (preg_match("{\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"\'-]?(.|$))}", $text, $match, PREG_OFFSET_CAPTURE, $position))
This effectively covers most non-alphanumeric characters trailing at the end of the URL in a sentence. You would THINK that this would cover hyphens, but for some reason, it does not- only eliminating ONE hyphen from the end of the URL and leaving the rest of them out, THUS preventing me from filtering a URL by a statement trailed by more than one hyphen. Any suggestions on maybe changing the REGEX key or something else in the code? Here's the rest of my modified code below:
function formatTextLinksVerbose($text) {
$rexProtocol = '(https?://)?';
$rexDomain = '((?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';
$rexPort = '(:[0-9]{1,5})?';
$rexPath = '(/[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]*?)?';
$rexQuery = '(\?[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';
$rexFragment = '(#[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';
$validTlds = array_fill_keys(explode(" ", ".aero .asia .biz .cat .com .coop .edu .gov .info .int .jobs .mil .mobi .museum .name .net .org .pro .tel .travel .ac .ad .ae .af .ag .ai .al .am .an .ao .aq .ar .as .at .au .aw .ax .az .ba .bb .bd .be .bf .bg .bh .bi .bj .bm .bn .bo .br .bs .bt .bv .bw .by .bz .ca .cc .cd .cf .cg .ch .ci .ck .cl .cm .cn .co .cr .cu .cv .cx .cy .cz .de .dj .dk .dm .do .dz .ec .ee .eg .er .es .et .eu .fi .fj .fk .fm .fo .fr .ga .gb .gd .ge .gf .gg .gh .gi .gl .gm .gn .gp .gq .gr .gs .gt .gu .gw .gy .hk .hm .hn .hr .ht .hu .id .ie .il .im .in .io .iq .ir .is .it .je .jm .jo .jp .ke .kg .kh .ki .km .kn .kp .kr .kw .ky .kz .la .lb .lc .li .lk .lr .ls .lt .lu .lv .ly .ma .mc .md .me .mg .mh .mk .ml .mm .mn .mo .mp .mq .mr .ms .mt .mu .mv .mw .mx .my .mz .na .nc .ne .nf .ng .ni .nl .no .np .nr .nu .nz .om .pa .pe .pf .pg .ph .pk .pl .pm .pn .pr .ps .pt .pw .py .qa .re .ro .rs .ru .rw .sa .sb .sc .sd .se .sg .sh .si .sj .sk .sl .sm .sn .so .sr .st .su .sv .sy .sz .tc .td .tf .tg .th .tj .tk .tl .tm .tn .to .tp .tr .tt .tv .tw .tz .ua .ug .uk .us .uy .uz .va .vc .ve .vg .vi .vn .vu .wf .ws .ye .yt .yu .za .zm .zw .xn--0zwm56d .xn--11b5bs3a9aj6g .xn--80akhbyknj4f .xn--9t4b11yi5a .xn--deba0ad .xn--g6w251d .xn--hgbk6aj7f53bba .xn--hlcj6aya9esc7a .xn--jxalpdlp .xn--kgbechtv .xn--zckzah .arpa"), true);
$position = 0;
$returnText = "";
while (preg_match("{\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"]?(.|$))}", $text, $match, PREG_OFFSET_CAPTURE, $position))
{
list($url, $urlPosition) = $match[0];
// Append the text leading up to the URL in return value.
$returnText .= htmlspecialchars(substr($text, $position, $urlPosition - $position));
$domain = $match[2][0];
$port = $match[3][0];
$path = $match[4][0];
// Check if the TLD is valid - or that $domain is an IP address.
$tld = strtolower(strrchr($domain, '.'));
if (preg_match('{\.[0-9]{1,3}}', $tld) || isset($validTlds[$tld]))
{
// Prepend http:// if no protocol specified
$completeUrl = $match[1][0] ? $url : "http://$url";
// Append the hyperlink.
$returnText .= '<a href="' . htmlspecialchars($completeUrl) . '">' . htmlspecialchars("$domain$port$path") . '</a>';
}
else
{
// Not a valid URL.
$returnText .= htmlspecialchars($url);
}
// Continue text parsing from after the URL.
$position = $urlPosition + strlen($url);
}
// Append and return the remainder of the text.
return($returnText . htmlspecialchars(substr($text, $position)));
}
(On a side note, I realize that htmlspecialchars is supposed to protect from user misbehavior with my form that submits to this page, but is there a place in the function where I can quit worrying about that? Should I decrypt back to the non-HTML character string OUTSIDE of the function? It's annoying to see the output include double quotes as the '&qout' character code)