Removing multiple trailing hyphens with a PHP regex (excluding the "." character???)

I asked a question here previously, but decided that it ought to be broken down into several smaller ones (it helped that I debugged further and figured out more exactly what I needed!).

Another user here provided a pretty darn good regex to detect and hyperlink a URL, broken down into the following parts:

$rexProtocol = '(https?://)?';
$rexDomain   = '((?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';
$rexPort     = '(:[0-9]{1,5})?';
$rexPath     = '(/[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]*?)?';
$rexQuery    = '(\?[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';
$rexFragment = '(#[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';

To me it's a great way to break a URL down, though of course this comes from somebody who is still getting familiar with the world of regex engines. Many good cases are caught with this while condition:

while (preg_match("{\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"]?(\s|$))}", $text, $match, PREG_OFFSET_CAPTURE, $position)) {...

One thing I found slightly frustrating, however, is that this doesn't quite capture a link while leaving out trailing punctuation marks and other characters (it only worked with ONE punctuation mark at the end of the link). After some tweaking and research, I found the following condition to work much better; the \s is replaced with a . instead:

    while (preg_match("{\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"\'-]?(.|$))}", $text, $match, PREG_OFFSET_CAPTURE, $position))

This effectively covers most non-alphanumeric characters trailing the URL in a sentence. You would THINK that it would cover hyphens too, but for some reason it does not: only ONE hyphen is stripped from the end of the URL and the rest are not, which prevents me from extracting a URL that is followed by more than one hyphen. Any suggestions on changing the regex or something else in the code? Here's the rest of my modified code below:

function formatTextLinksVerbose($text) {
    $rexProtocol = '(https?://)?';
    $rexDomain   = '((?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';
    $rexPort     = '(:[0-9]{1,5})?';
    $rexPath     = '(/[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]*?)?';
    $rexQuery    = '(\?[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';
    $rexFragment = '(#[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';

    $validTlds = array_fill_keys(explode(" ", ".aero .asia .biz .cat .com .coop .edu .gov .info .int .jobs .mil .mobi .museum .name .net .org .pro .tel .travel .ac .ad .ae .af .ag .ai .al .am .an .ao .aq .ar .as .at .au .aw .ax .az .ba .bb .bd .be .bf .bg .bh .bi .bj .bm .bn .bo .br .bs .bt .bv .bw .by .bz .ca .cc .cd .cf .cg .ch .ci .ck .cl .cm .cn .co .cr .cu .cv .cx .cy .cz .de .dj .dk .dm .do .dz .ec .ee .eg .er .es .et .eu .fi .fj .fk .fm .fo .fr .ga .gb .gd .ge .gf .gg .gh .gi .gl .gm .gn .gp .gq .gr .gs .gt .gu .gw .gy .hk .hm .hn .hr .ht .hu .id .ie .il .im .in .io .iq .ir .is .it .je .jm .jo .jp .ke .kg .kh .ki .km .kn .kp .kr .kw .ky .kz .la .lb .lc .li .lk .lr .ls .lt .lu .lv .ly .ma .mc .md .me .mg .mh .mk .ml .mm .mn .mo .mp .mq .mr .ms .mt .mu .mv .mw .mx .my .mz .na .nc .ne .nf .ng .ni .nl .no .np .nr .nu .nz .om .pa .pe .pf .pg .ph .pk .pl .pm .pn .pr .ps .pt .pw .py .qa .re .ro .rs .ru .rw .sa .sb .sc .sd .se .sg .sh .si .sj .sk .sl .sm .sn .so .sr .st .su .sv .sy .sz .tc .td .tf .tg .th .tj .tk .tl .tm .tn .to .tp .tr .tt .tv .tw .tz .ua .ug .uk .us .uy .uz .va .vc .ve .vg .vi .vn .vu .wf .ws .ye .yt .yu .za .zm .zw .xn--0zwm56d .xn--11b5bs3a9aj6g .xn--80akhbyknj4f .xn--9t4b11yi5a .xn--deba0ad .xn--g6w251d .xn--hgbk6aj7f53bba .xn--hlcj6aya9esc7a .xn--jxalpdlp .xn--kgbechtv .xn--zckzah .arpa"), true);

    $position = 0;
    $returnText = "";
    while (preg_match("{\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"]?(.|$))}", $text, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($url, $urlPosition) = $match[0];

        // Append the text leading up to the URL in return value.
        $returnText .= htmlspecialchars(substr($text, $position, $urlPosition - $position));

        $domain = $match[2][0];
        $port   = $match[3][0];
        $path   = $match[4][0];

        // Check if the TLD is valid - or that $domain is an IP address.
        $tld = strtolower(strrchr($domain, '.'));
        if (preg_match('{\.[0-9]{1,3}}', $tld) || isset($validTlds[$tld]))
        {
            // Prepend http:// if no protocol specified
            $completeUrl = $match[1][0] ? $url : "http://$url";

            // Append the hyperlink.
            $returnText .= '<a href="' . htmlspecialchars($completeUrl) . '">' . htmlspecialchars("$domain$port$path") . '</a>';
        }
        else
        {
            // Not a valid URL.
            $returnText .= htmlspecialchars($url);
        }

        // Continue text parsing from after the URL.
        $position = $urlPosition + strlen($url);
    }

    // Append and return the remainder of the text.
    return($returnText . htmlspecialchars(substr($text, $position)));
}
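
The single-hyphen behavior described above can be reproduced in isolation. The sketch below is only an illustration under assumptions, not the function's actual pattern: firstUrl() is a hypothetical helper, the URL pattern is heavily simplified, the lookahead keeps the original (\s|$) ending, and the final domain label is forced to end in a letter or digit so that the greedy domain part cannot swallow trailing hyphens. With "?" the lookahead tolerates at most one trailing punctuation character; with "*" it tolerates any number.

```php
<?php
// Hypothetical helper: returns the first URL-like match in $text, or null.
// $quantifier is applied to the trailing-punctuation class in the lookahead.
function firstUrl(string $text, string $quantifier): ?string
{
    // Simplified stand-in for the full pattern: dotted labels, a final label
    // that must end in a letter or digit, and a lazily matched path.
    $url = '(?:https?://)?(?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{1,62}[a-zA-Z0-9]'
         . '(?:/[-a-zA-Z0-9_./]*?)?';

    // The match may only end where optional punctuation is followed by
    // whitespace or the end of the string.
    $lookahead = '(?=[?.!,;:"\'-]' . $quantifier . '(?:\s|$))';

    return preg_match('~\b' . $url . $lookahead . '~', $text, $m) ? $m[0] : null;
}

$text = 'See example.com/page--- for details';

echo firstUrl($text, '?'), "\n"; // example.com/page--  (one hyphen dropped)
echo firstUrl($text, '*'), "\n"; // example.com/page    (all hyphens dropped)
```

Because the path is matched lazily, it grows one character at a time until the lookahead succeeds; with "?" that happens as soon as a single punctuation character separates the match from the whitespace.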

(On a side note, I realize that htmlspecialchars is supposed to protect against user misbehavior with the form that submits to this page, but is there a place in the function where I can quit worrying about that? Should I decode back to the plain, non-HTML string OUTSIDE of the function? It's annoying to see the output show double quotes as the '&quot;' character code.)
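
One possible cause of a visible character code in the output is escaping the same text twice. This is only an assumption, since the code that calls the function is not shown; the sample string below is made up for illustration.

```php
<?php
$raw = 'She said "hi" & left';

// Escaped once: a browser renders &quot; back as a plain double quote.
$once = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

// Escaped twice: the & of each entity is escaped again, so the browser
// shows the literal text &quot; to the reader.
$twice = htmlspecialchars($once, ENT_QUOTES, 'UTF-8');

echo $once, "\n";  // She said &quot;hi&quot; &amp; left
echo $twice, "\n"; // She said &amp;quot;hi&amp;quot; &amp;amp; left
```

If that is what is happening, escaping once at output time and leaving the function's return value untouched afterwards would avoid it.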

1 answer

dongyun3335 2015-11-05 22:34

Not an answer to your question, just a general observation.
You could factor out some of the regex parts and use named capture groups.
This way you won't have to redo the code body when you change or modify the
regex.

$prot   = '(?<Protocol>https?://)?';
$domain = '(?<Domain>(?:(?&lt){1,63}\.)+(?&lt){2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';
$port   = '(?<Port>:[0-9]{1,5})?';
$other  = '(?<Path>/(?&txt)*?)?(?<Query>\?(?&txt)+?)?(?<Fragment>\#(?&txt)+?)?';
$def    = '(?(DEFINE)(?<lt>[-a-zA-Z0-9])(?<txt>[!$-/0-9:;=@_\'a-zA-Z\x7f-\xff]))';

$regex = "$prot$domain$port$other$def"; 

while (preg_match("{\\b$regex(?=[?.!,;:\"]?(.|$))}", $text, $match, PREG_OFFSET_CAPTURE, $position))
{

}
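
As a rough sketch of what the loop body could look like with named groups, the example below reads the captures by name instead of by number. The pattern is a reduced version kept short for illustration, and the sample text and variable names are assumptions. With PREG_OFFSET_CAPTURE each entry is a pair of matched text and byte offset.

```php
<?php
// Reduced pattern for illustration only; group names follow the ones above.
$regex = '~\b(?<Protocol>https?://)?'
       . '(?<Domain>(?:(?&lt){1,63}\.)+(?&lt){2,63})'
       . '(?<Port>:[0-9]{1,5})?'
       . '(?<Path>/(?&txt)*)?'
       . '(?(DEFINE)(?<lt>[-a-zA-Z0-9])(?<txt>[-a-zA-Z0-9_./]))~';

$text = 'Docs live at http://example.com:8080/manual/index today';

$domain = $port = $path = '';
$offset = -1;

if (preg_match($regex, $text, $match, PREG_OFFSET_CAPTURE)) {
    // Groups that did not take part in the match may be missing or empty,
    // hence the ?? fallback.
    $domain = $match['Domain'][0] ?? '';
    $port   = $match['Port'][0] ?? '';
    $path   = $match['Path'][0] ?? '';
    $offset = $match[0][1]; // byte offset of the whole match in $text
}

echo "$domain | $port | $path | $offset\n"; // example.com | :8080 | /manual/index | 13
```

Adding or reordering groups in the pattern then leaves these lookups unchanged, which is the point of naming them.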

Or, if you're so inclined, format the regex and use the ignore-whitespace flag (the x modifier).

Doing it this way lets you see the group names in the expression,
which can avoid confusion in the future. A good tool for doing this is here.

while (
   preg_match(
   '~
        (?<Protocol> https?:// )?     # (1)
        (?<Domain>                    # (2)
             (?:
                  (?&lt){1,63} \.
             )+
             (?&lt){2,63} 
          |  (?: [0-9]{1,3} \. ){3}
             [0-9]{1,3} 
        )
        (?<Port> : [0-9]{1,5} )?      # (3)
        (?<Path>                      # (4)
             / (?&txt)*? 
        )?
        (?<Query>                     # (5)
             \? (?&txt)+? 
        )?
        (?<Fragment>                  # (6)
             \# (?&txt)+? 
        )?
        (?(DEFINE)
             (?<lt> [-a-zA-Z0-9] )         # (7)
             (?<txt>                       # (8)
                  [!$-/0-9:;=@_\'a-zA-Z\x7f-\xff] 
             )
        )
        (?=[?.!,;:"]?(.|$))
   ~x'
   , $text, $match, PREG_OFFSET_CAPTURE, $position))
{

}
