duandie0921 2016-08-22 09:53

Regular expressions - Greek characters in a URL

I have a custom router that uses regex.

The problem is that I cannot parse Greek characters.


Here are some lines from index.php:

$router->get('/theatre/plays', 'TheatreController', 'showPlays');
$router->get('/theatre/interviews', 'TheatreController', 'showInterviews');
$router->get('/theatre/[-\w\d\!\.]+', 'TheatreController', 'single_post');

Here are some lines from Router.php:

$found = 0;
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH); // get the path component of the request URL

////// Bla Bla Bla /////////

if ( $found = preg_match("#^$value$#", $path) )
{
    //Do stuff
}

Now, when I try a URL like http://kourtis.app/theatre/α (notice that the last character is a Greek 'alpha'), it is somehow turned into http://kourtis.app/theatre/%CE%B1

I can see this when I var_dump($path) or when I copy-paste the url.


I guess it has something to do with encoding but everything (I can think of) is in utf-8 format.

Any ideas?

--------------------------------

UPDATE: Following the suggestions in the comments, the pattern /theatre/[α-ωΑ-Ω-\w\d\!\.]+, combined with urldecode to decode the percent-encoding of the $path variable, works, but only for some Greek characters.

Some characters that produce an error are: κ π ρ χ.

The question now is: why? (BTW, /theatre/.+ works for many characters.)

1 answer

  • dongyang1518 2016-08-23 12:00

    You can use

    $router->get('/theatre/[^/]+', 'TheatreController', 'single_post');
    

    as [^/]+ will match one or more characters other than / since [^...] is a negated character class that matches any char but the one(s) defined in the class.
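As a minimal sketch of how the negated class behaves, the snippet below checks a few paths against the route. The helper name `routeMatches` is hypothetical and not part of the asker's Router.php; it only reuses the `#^...$#` wrapping shown in the question.

```php
<?php
// Hypothetical helper for illustration; not taken from the asker's Router.php.
function routeMatches(string $route, string $path): bool
{
    // Same delimiters and anchors as the preg_match call in the question.
    return preg_match("#^$route$#", $path) === 1;
}

$route = '/theatre/[^/]+';

// The class excludes only "/", so the percent-encoded and the raw form both match.
var_dump(routeMatches($route, '/theatre/%CE%B1'));
var_dump(routeMatches($route, "/theatre/\u{03B1}"));

// A further path segment or an empty segment does not match.
var_dump(routeMatches($route, "/theatre/\u{03B1}/extra"));
var_dump(routeMatches($route, '/theatre/'));
```

Because the class only excludes the slash, this route accepts the path whether or not it has been percent-decoded first.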

    Note that you do not need \d if you already use \w (\w already matches digits).

    Also, you did not match diacritics with your regex. If you need to match diacritics, add \p{M} to the regex: '/theatre/[-\w\p{M}!.]+'.
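A small sketch of the diacritics case, assuming the slug arrives in decomposed form (a base letter followed by a combining mark). The sample string is made up for illustration. Only the positive case is checked, since how \w alone treats combining marks can differ between PCRE versions.

```php
<?php
// "ά" written as alpha (U+03B1) followed by a combining acute accent (U+0301).
// This sample slug is an assumption for illustration.
$decomposed = "\u{03B1}\u{0301}";
$path = '/theatre/' . $decomposed;

// \p{M} matches any Unicode mark, including the combining accent.
$markOnly = preg_match('#^\p{M}$#u', "\u{0301}");

// The character class suggested above, with the /u modifier.
$withMarks = preg_match('#^/theatre/[-\w\p{M}!.]+$#u', $path);

var_dump($markOnly, $withMarks);
```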

    Note that to allow \w to match Unicode letters/digits, you need to pass the /u modifier to the regex: $found = preg_match("#^$value$#u", $path). This both treats the pattern and the input string as UTF-8 and makes shorthand classes like \w Unicode aware.
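The following sketch illustrates the effect of the modifier on the characters from the question. It assumes the path is percent-decoded first; rawurldecode is used here instead of the asker's urldecode because it leaves "+" untouched, which is a choice of this example rather than something the question requires.

```php
<?php
// Greek letters are multi-byte in UTF-8: alpha (U+03B1) is the two bytes CE B1.
// Without /u the regex engine sees those bytes, not one character.
var_dump(bin2hex("\u{03B1}"));

// "κπρχ" as it would appear percent-encoded in the request path.
$requestPath = '/theatre/%CE%BA%CF%80%CF%81%CF%87';
$path = rawurldecode($requestPath);

// The route pattern with the redundant \d and escapes removed.
$value = '/theatre/[-\w!.]+';

// Byte-oriented matching: the outcome depends on the locale tables in use,
// so no expected result is stated for this call.
$withoutU = preg_match("#^$value$#", $path);

// UTF-8 aware matching: \w covers Unicode letters, including Greek.
$withU = preg_match("#^$value$#u", $path);

var_dump($withoutU, $withU);
```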

    Another thing: you need not escape . inside a character class.

    Pattern details:

    • #...# - regex delimiters
    • ^ - start of string
    • $value - the $value variable contents (since double quoted strings in PHP allow interpolation)
    • $ - end of string
    • #u - the closing delimiter followed by the u modifier, which enables the PCRE UTF-8 and UCP options (see the pattern modifiers page of the PHP manual for details)
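Putting the pieces together, the matching step could be sketched roughly as follows. The function `matchRoute` and the shape of the `$routes` array are assumptions made for this example; the asker's actual Router.php is only partially shown in the question.

```php
<?php
// Hypothetical route table: pattern => [controller, method].
$routes = [
    '/theatre/plays'      => ['TheatreController', 'showPlays'],
    '/theatre/interviews' => ['TheatreController', 'showInterviews'],
    '/theatre/[^/]+'      => ['TheatreController', 'single_post'],
];

// Hypothetical helper: returns the handler of the first matching route, or null.
function matchRoute(array $routes, string $requestUri): ?array
{
    $path = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($path)) {
        return null; // no usable path component
    }

    // Decode %CE%B1 and similar sequences back into UTF-8 characters.
    $path = rawurldecode($path);

    foreach ($routes as $value => $handler) {
        // With /u, preg_match returns false on invalid UTF-8 input,
        // so the strict comparison with 1 treats that as "no match".
        if (preg_match("#^$value$#u", $path) === 1) {
            return $handler;
        }
    }

    return null;
}

var_dump(matchRoute($routes, '/theatre/%CE%B1'));
var_dump(matchRoute($routes, '/theatre/plays'));
var_dump(matchRoute($routes, '/cinema/x'));
```

The specific routes are listed before the catch-all pattern, mirroring the order in the question's index.php, so that /theatre/plays is not swallowed by /theatre/[^/]+.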
    This answer was selected by the asker as the best answer.
