douzao1119 2015-11-05 18:18

Removing multiple trailing hyphens with a PHP regex (not including the "." character?)

I asked a question here previously, but decided it ought to be broken down into several smaller ones (it helped that further debugging showed me more exactly what I needed).

Another user here provided a pretty good regex for detecting and hyperlinking a URL, broken down into the following parts:

$rexProtocol = '(https?://)?';
$rexDomain   = '((?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';
$rexPort     = '(:[0-9]{1,5})?';
$rexPath     = '(/[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]*?)?';
$rexQuery    = '(\?[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';
$rexFragment = '(#[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';
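
To see how the six parts line up with capture groups, here is a minimal sketch that joins them, adds a lookahead allowing at most one punctuation mark before whitespace or end of text, and prints each group. The sample text and the `$labels` array are assumptions made up for illustration.

```php
<?php
// The six parts, copied from the breakdown above.
$rexProtocol = '(https?://)?';
$rexDomain   = '((?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';
$rexPort     = '(:[0-9]{1,5})?';
$rexPath     = '(/[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]*?)?';
$rexQuery    = '(\?[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';
$rexFragment = '(#[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';

// Lookahead: at most one punctuation mark, then whitespace or end of text.
$pattern = '{\b' . $rexProtocol . $rexDomain . $rexPort . $rexPath
         . $rexQuery . $rexFragment . '(?=[?.!,;:"]?(\s|$))}';

// Sample text is an assumption for illustration only.
$sample = 'See http://example.com:8080/path?x=1#top for details.';

// Group numbers follow the order in which the parts are concatenated.
$labels = array(1 => 'protocol', 2 => 'domain', 3 => 'port',
                4 => 'path', 5 => 'query', 6 => 'fragment');

$groups = array();
if (preg_match($pattern, $sample, $m)) {
    foreach ($labels as $index => $label) {
        $groups[$label] = $m[$index];
        echo str_pad($label, 9), $m[$index], "\n";
    }
}
```

For this sample the groups come out as `http://`, `example.com`, `:8080`, `/path`, `?x=1` and `#top`.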

To me it's a great way to break a URL down, though of course this is coming from somebody who is still getting familiar with regex engines. Many good cases are caught with this while condition:

while (preg_match("{\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"]?(\s|$))}", $text, $match, PREG_OFFSET_CAPTURE, $position)) {...

One thing I found slightly frustrating, however, is that this doesn't quite capture a link while leaving out trailing punctuation marks and other characters (it only works with ONE punctuation mark at the end of the link). So I experimented with the condition and, after some tweaking and research, found the following to work much better; the \s is replaced with a . instead:

    while (preg_match("{\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"\'-]?(.|$))}", $text, $match, PREG_OFFSET_CAPTURE, $position))

This effectively covers most non-alphanumeric characters trailing the URL in a sentence. You would THINK that it would cover hyphens too, but for some reason it does not: it only removes ONE hyphen from the end of the URL and does not handle the rest, so I can't pick out a URL that is followed by more than one hyphen. Any suggestions on changing the regex or something else in the code? Here's the rest of my modified code:

function formatTextLinksVerbose($text) {
    $rexProtocol = '(https?://)?';
    $rexDomain   = '((?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';
    $rexPort     = '(:[0-9]{1,5})?';
    $rexPath     = '(/[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]*?)?';
    $rexQuery    = '(\?[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';
    $rexFragment = '(#[!$-/0-9:;=@_\':;!a-zA-Z\x7f-\xff]+?)?';

    $validTlds = array_fill_keys(explode(" ", ".aero .asia .biz .cat .com .coop .edu .gov .info .int .jobs .mil .mobi .museum .name .net .org .pro .tel .travel .ac .ad .ae .af .ag .ai .al .am .an .ao .aq .ar .as .at .au .aw .ax .az .ba .bb .bd .be .bf .bg .bh .bi .bj .bm .bn .bo .br .bs .bt .bv .bw .by .bz .ca .cc .cd .cf .cg .ch .ci .ck .cl .cm .cn .co .cr .cu .cv .cx .cy .cz .de .dj .dk .dm .do .dz .ec .ee .eg .er .es .et .eu .fi .fj .fk .fm .fo .fr .ga .gb .gd .ge .gf .gg .gh .gi .gl .gm .gn .gp .gq .gr .gs .gt .gu .gw .gy .hk .hm .hn .hr .ht .hu .id .ie .il .im .in .io .iq .ir .is .it .je .jm .jo .jp .ke .kg .kh .ki .km .kn .kp .kr .kw .ky .kz .la .lb .lc .li .lk .lr .ls .lt .lu .lv .ly .ma .mc .md .me .mg .mh .mk .ml .mm .mn .mo .mp .mq .mr .ms .mt .mu .mv .mw .mx .my .mz .na .nc .ne .nf .ng .ni .nl .no .np .nr .nu .nz .om .pa .pe .pf .pg .ph .pk .pl .pm .pn .pr .ps .pt .pw .py .qa .re .ro .rs .ru .rw .sa .sb .sc .sd .se .sg .sh .si .sj .sk .sl .sm .sn .so .sr .st .su .sv .sy .sz .tc .td .tf .tg .th .tj .tk .tl .tm .tn .to .tp .tr .tt .tv .tw .tz .ua .ug .uk .us .uy .uz .va .vc .ve .vg .vi .vn .vu .wf .ws .ye .yt .yu .za .zm .zw .xn--0zwm56d .xn--11b5bs3a9aj6g .xn--80akhbyknj4f .xn--9t4b11yi5a .xn--deba0ad .xn--g6w251d .xn--hgbk6aj7f53bba .xn--hlcj6aya9esc7a .xn--jxalpdlp .xn--kgbechtv .xn--zckzah .arpa"), true);

    $position = 0;
    $returnText = "";
    while (preg_match("{\\b$rexProtocol$rexDomain$rexPort$rexPath$rexQuery$rexFragment(?=[?.!,;:\"]?(.|$))}", $text, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($url, $urlPosition) = $match[0];

        // Append the text leading up to the URL in return value.
        $returnText .= htmlspecialchars(substr($text, $position, $urlPosition - $position));

        $domain = $match[2][0];
        $port   = $match[3][0];
        $path   = $match[4][0];

        // Check if the TLD is valid - or that $domain is an IP address.
        $tld = strtolower(strrchr($domain, '.'));
        if (preg_match('{\.[0-9]{1,3}}', $tld) || isset($validTlds[$tld]))
        {
            // Prepend http:// if no protocol specified
            $completeUrl = $match[1][0] ? $url : "http://$url";

            // Append the hyperlink.
            $returnText .= '<a href="' . htmlspecialchars($completeUrl) . '">' . htmlspecialchars("$domain$port$path") . '</a>';
        }
        else
        {
            // Not a valid URL.
            $returnText .= htmlspecialchars($url);
        }

        // Continue text parsing from after the URL.
        $position = $urlPosition + strlen($url);
    }

    // Append and return the remainder of the text.
    return($returnText . htmlspecialchars(substr($text, $position)));
}
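
On the trailing-hyphen question, two properties of the pattern seem relevant. A lookahead only asserts what follows and never removes characters from the match, and `-` is itself allowed in the domain and path character classes, so a run of hyphens can be absorbed into the URL before the lookahead is even tried. Below is a minimal sketch of one possible workaround, assuming it is acceptable to strip trailing punctuation after matching rather than inside the regex. The helper name `splitTrailingPunctuation`, the simplified domain-only pattern and the sample strings are all hypothetical.

```php
<?php
// Simplified, domain-only version of the pattern, for illustration.
$domainOnly = '~\b((?:[-a-zA-Z0-9]{1,63}\.)+[-a-zA-Z0-9]{2,63})~';

// The hyphens belong to the domain character class, so they are swallowed.
preg_match($domainOnly, 'see example.com--- now', $m);
$swallowed = $m[1];
echo $swallowed, "\n"; // example.com---

// Hypothetical helper: split a matched candidate into the URL proper and the
// trailing punctuation that should stay as plain text.
function splitTrailingPunctuation($candidate, $trailing = "-?.!,;:\"'")
{
    $url = rtrim($candidate, $trailing);
    // Cast to string because substr() can return false on older PHP versions.
    return array($url, (string) substr($candidate, strlen($url)));
}

list($url, $rest) = splitTrailingPunctuation($swallowed);
echo $url, "\n";  // example.com
echo $rest, "\n"; // ---
```

Inside formatTextLinksVerbose, the trimmed text would be linked and `$position` advanced by the trimmed length only, so the leftover punctuation is emitted as plain text on the next pass. The same trimming would have to be applied to `$domain` (or `$path`) before the TLD check. One trade-off is that this also trims a hyphen or period that genuinely ends a URL.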

(On a side note, I realize that htmlspecialchars is there to protect against user misbehavior in the form that submits to this page, but is there a place in the function where I can stop worrying about that? Should I decode back to the plain, non-HTML string OUTSIDE of the function? It's annoying to see double quotes come out as the '&quot;' character code.)
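
On that side note, entity codes usually become visible when text is escaped more than once, for example once before it reaches the function and again inside it. A small sketch of that effect, assuming the input contains a plain double quote (the sample string is made up):

```php
<?php
$raw   = 'say "hi"';
$once  = htmlspecialchars($raw);   // escaped once: safe to place in HTML
$twice = htmlspecialchars($once);  // escaped again: the & itself gets escaped

echo $once, "\n";   // say &quot;hi&quot;
echo $twice, "\n";  // say &amp;quot;hi&amp;quot;

// htmlspecialchars_decode() reverses one level of escaping.
$decoded = htmlspecialchars_decode($once);
echo $decoded, "\n"; // say "hi"
```

Since the function already escapes everything it outputs, passing it the raw, unescaped text is one way to avoid the second form.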

1 answer

  • dongyun3335 2015-11-05 22:34

    Not an answer to your question, just a general observation.
    You could factor out some of the regex parts and use named capture groups.
    This way you won't have to redo the code body when you change or modify the
    regex.

    $prot   = '(?<Protocol>https?://)?';
    $domain = '(?<Domain>(?:(?&lt){1,63}\.)+(?&lt){2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';
    $port   = '(?<Port>:[0-9]{1,5})?';
    $other  = '(?<Path>/(?&txt)*?)?(?<Query>\?(?&txt)+?)?(?<Fragment>\#(?&txt)+?)?';
    $def    = '(?(DEFINE)(?<lt>[-a-zA-Z0-9])(?<txt>[!$-/0-9:;=@_\'a-zA-Z\x7f-\xff]))';
    
    $regex = "$prot$domain$port$other$def"; 
    
    while (preg_match("{\\b$regex(?=[?.!,;:\"]?(.|$))}", $text, $match, PREG_OFFSET_CAPTURE, $position))
    {
    
    }
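
As a usage sketch of the named groups, the loop body can read captures by name instead of by number. With PREG_OFFSET_CAPTURE each entry is a pair of matched text and byte offset. The sample text below is an assumption for illustration.

```php
<?php
$prot   = '(?<Protocol>https?://)?';
$domain = '(?<Domain>(?:(?&lt){1,63}\.)+(?&lt){2,63}|(?:[0-9]{1,3}\.){3}[0-9]{1,3})';
$port   = '(?<Port>:[0-9]{1,5})?';
$other  = '(?<Path>/(?&txt)*?)?(?<Query>\?(?&txt)+?)?(?<Fragment>\#(?&txt)+?)?';
$def    = '(?(DEFINE)(?<lt>[-a-zA-Z0-9])(?<txt>[!$-/0-9:;=@_\'a-zA-Z\x7f-\xff]))';

$regex = "$prot$domain$port$other$def";

// Sample text is made up for this sketch.
$text = 'go to http://example.com:8080 now';

if (preg_match("{\\b$regex(?=[?.!,;:\"]?(.|$))}", $text, $match, PREG_OFFSET_CAPTURE)) {
    // Each entry is array(matched text, byte offset in $text).
    list($domainText, $domainOffset) = $match['Domain'];
    echo $match['Protocol'][0], "\n"; // http://
    echo $domainText, "\n";           // example.com
    echo $domainOffset, "\n";         // 13
    echo $match['Port'][0], "\n";     // :8080
}
```

Adding another capture group before Domain would renumber the groups, but a loop body written this way would not need to change.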
    

    Or, if you're so inclined, lay the regex out over several lines and use the ignore-whitespace flag, x.

    Doing it this way lets you see each group name inside the expression,
    which can avoid confusion in the future. A good tool for doing this is here.

    while (
       preg_match(
       '~
            (?<Protocol> https?:// )?     # (1)
            (?<Domain>                    # (2)
                 (?:
                      (?&lt){1,63} \.
                 )+
                 (?&lt){2,63} 
              |  (?: [0-9]{1,3} \. ){3}
                 [0-9]{1,3} 
            )
            (?<Port> : [0-9]{1,5} )?      # (3)
            (?<Path>                      # (4)
                 / (?&txt)*? 
            )?
            (?<Query>                     # (5)
                 \? (?&txt)+? 
            )?
            (?<Fragment>                  # (6)
                 \# (?&txt)+? 
            )?
            (?(DEFINE)
                 (?<lt> [-a-zA-Z0-9] )         # (7)
                 (?<txt>                       # (8)
                      [!$-/0-9:;=@_\'a-zA-Z\x7f-\xff] 
                 )
            )
            (?=[?.!,;:"]?(.|$))
       ~x'
       , $text, $match, PREG_OFFSET_CAPTURE, $position))
    {
    
    }
    