My goal is to protect my web site from attacks by creating a strict whitelist of allowed characters for any and all POST data received from the client side.

This is a piece of cake when staying within ASCII characters. Something like:

if(preg_match('/[^a-zA-Z0-9]/', $stringToTest))
   // Battle stations!!

However, I need to be able to allow any and all UTF-8 characters, especially Asian character sets like Japanese, Chinese, and Korean. But I don't want to exclude anybody with wacky characters, like Arabic or Russian, or whatever. One world, one love! ;)

How can I allow people to input the characters of their native language while excluding the nasties used in evil scripts, like *, ?, angle brackets, and so on?


4 answers

  • dongye1934 2011-02-22 05:01

    \w will give you word characters (letters, digits, and underscores), which is probably what you're after, and \s matches whitespace.


    if(preg_match('/[^\w\s]/u', $stringToTest))
       // Battle stations!!
 is an excellent reference for this stuff - here and here are a couple of relevant pages :)
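
    The \w and \s idea can also be written with explicit Unicode property classes, which makes the whitelist easier to read. This is only a minimal sketch: is_whitelisted() is a hypothetical helper name, and the choice of letters, combining marks, numbers, and whitespace is an assumption about what "allowed" should mean here.

```php
<?php
// Hypothetical helper: true when $input holds only letters, combining
// marks, numbers, and whitespace from any script.
function is_whitelisted(string $input): bool
{
    // \p{L} = letters, \p{M} = combining marks, \p{N} = numbers.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_match() returns false for invalid UTF-8, so such input is rejected too.
    return preg_match('/\A[\p{L}\p{M}\p{N}\s]*\z/u', $input) === 1;
}

var_dump(is_whitelisted('日本語 テキスト')); // bool(true)
var_dump(is_whitelisted('Привет мир'));      // bool(true)
var_dump(is_whitelisted('<script>'));        // bool(false)
```

    Note that this also rejects ordinary punctuation such as "?" and ".", so it only suits fields like names or tags.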

    edit: some more clarification needed, sorry!

    here's what I usually use for CJK:

    function get_CJK_ranges() {
        return array(
                    "[\x{2E80}-\x{2EFF}]",      # CJK Radicals Supplement
                    "[\x{2F00}-\x{2FDF}]",      # Kangxi Radicals
                    "[\x{2FF0}-\x{2FFF}]",      # Ideographic Description Characters
                    "[\x{3000}-\x{303F}]",      # CJK Symbols and Punctuation
                    "[\x{3040}-\x{309F}]",      # Hiragana
                    "[\x{30A0}-\x{30FF}]",      # Katakana
                    "[\x{3100}-\x{312F}]",      # Bopomofo
                    "[\x{3130}-\x{318F}]",      # Hangul Compatibility Jamo
                    "[\x{3190}-\x{319F}]",      # Kanbun
                    "[\x{31A0}-\x{31BF}]",      # Bopomofo Extended
                    "[\x{31F0}-\x{31FF}]",      # Katakana Phonetic Extensions
                    "[\x{3200}-\x{32FF}]",      # Enclosed CJK Letters and Months
                    "[\x{3300}-\x{33FF}]",      # CJK Compatibility
                    "[\x{3400}-\x{4DBF}]",      # CJK Unified Ideographs Extension A
                    "[\x{4DC0}-\x{4DFF}]",      # Yijing Hexagram Symbols
                    "[\x{4E00}-\x{9FFF}]",      # CJK Unified Ideographs
                    "[\x{A000}-\x{A48F}]",      # Yi Syllables
                    "[\x{A490}-\x{A4CF}]",      # Yi Radicals
                    "[\x{AC00}-\x{D7AF}]",      # Hangul Syllables
                    "[\x{F900}-\x{FAFF}]",      # CJK Compatibility Ideographs
                    "[\x{FE30}-\x{FE4F}]",      # CJK Compatibility Forms
                    "[\x{1D300}-\x{1D35F}]",    # Tai Xuan Jing Symbols
                    "[\x{20000}-\x{2A6DF}]",    # CJK Unified Ideographs Extension B
                    "[\x{2F800}-\x{2FA1F}]"     # CJK Compatibility Ideographs Supplement
        );
    }

    function contains_CJK($string) {
        // The /u modifier makes PCRE treat the pattern and subject as UTF-8.
        $regex = '/'.implode('|',get_CJK_ranges()).'/u';
        return preg_match($regex,$string);
    }
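
    A shorter variant of the same check can lean on PCRE script properties instead of hand-written code point ranges. This is a sketch, not a drop-in replacement: contains_cjk_script() is a hypothetical name, and the four scripts below cover fewer blocks than the list above (no CJK punctuation, Yi, or enclosed letters, for example).

```php
<?php
// Hypothetical alternative to contains_CJK() using script properties.
function contains_cjk_script(string $text): bool
{
    // Han covers Chinese characters (including kanji and hanja);
    // Hiragana and Katakana cover Japanese kana; Hangul covers Korean.
    return preg_match('/[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/u', $text) === 1;
}

var_dump(contains_cjk_script('hello'));      // bool(false)
var_dump(contains_cjk_script('hello 世界')); // bool(true)
var_dump(contains_cjk_script('한국어'));     // bool(true)
```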

    To get everything that could be a problem for escaping and other black-hat stuff, use:

    /[^\p{Punctuation}]/ ( == /[^\p{P}]/ )

    or

    /[^\x21-\x7E]/ ( == /[^!-~]/ )
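
    One thing to check before relying on \p{P}: Unicode classifies some ASCII characters, including the angle brackets, as symbols (\p{S}) rather than punctuation. The sketch below lists the printable ASCII characters that are neither alphanumeric nor covered by \p{P}; the variable name $missed is just for illustration.

```php
<?php
// Collect printable ASCII characters (! through ~) that are neither
// ASCII letters/digits nor matched by \p{P}.
$missed = '';
for ($code = 0x21; $code <= 0x7E; $code++) {
    $char = chr($code);
    if (preg_match('/[^\p{P}A-Za-z0-9]/u', $char) === 1) {
        $missed .= $char;
    }
}

// Prints the characters a punctuation-only test would let through.
echo $missed, "\n";
```

    If characters such as < and > matter, adding \p{S} to the class is one way to cover them.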

    another good link

  • doufeixi6014 2011-02-22 04:59

    Try inverting the test - use a blacklist instead of a whitelist. e.g.

    if(preg_match('/[\*\?<>]/', $stringToTest))
        // Battle stations!!

    Regex might not be quite right, but you get the idea.
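
    For what it's worth, here is how that blacklist might look as a small function. It is a sketch only: has_blacklisted_chars() is a hypothetical name, and the four characters are simply the ones from the example above, not a complete list of anything dangerous.

```php
<?php
// Hypothetical helper: true when $input contains *, ?, < or >.
function has_blacklisted_chars(string $input): bool
{
    // Inside a character class, * and ? are literal, so no backslashes are needed.
    // All four characters are single-byte ASCII, which never occurs inside a
    // multi-byte UTF-8 sequence, so the check is safe on UTF-8 text without /u.
    return preg_match('/[*?<>]/', $input) === 1;
}

var_dump(has_blacklisted_chars('こんにちは、世界'));    // bool(false)
var_dump(has_blacklisted_chars('<b>bold</b>'));         // bool(true)
var_dump(has_blacklisted_chars('Is this allowed?'));    // bool(true)
```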

  • dpbsy60000 2011-02-22 07:43

    I doubt you can protect anything this way.
    You will just complicate matters for legitimate users without stopping malicious ones.

    I would just quit a site that won't allow me to enter a question mark, a quote, or an e-mail address.
    A simple dot is surely among the "nasties used in evil scripts", but any message without it would look ugly.

    Meanwhile, SQL injection can be done using alphanumeric characters only.

    I see no sense in such a "protection".
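
    To make the SQL injection point concrete, here is a minimal sketch. The table and column names are made up, and the whitelist pattern is an assumed example of a "strict" filter (letters, numbers, whitespace); no database is involved, the code only builds the query string.

```php
<?php
// Input made only of letters, digits, and spaces.
$input = '1 OR 1';

// An assumed strict whitelist: it accepts this input.
$passesWhitelist = preg_match('/\A[\p{L}\p{N}\s]*\z/u', $input) === 1;

// Hypothetical query built by string concatenation.
$query = 'SELECT * FROM users WHERE id = ' . $input;

var_dump($passesWhitelist); // bool(true)
echo $query, "\n";          // SELECT * FROM users WHERE id = 1 OR 1
```

    In MySQL, for example, that WHERE clause is true for every row. Bound parameters (such as PDO prepared statements) avoid this no matter which characters are allowed.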

  • dstm2014 2011-04-02 00:08

    For some things you can base64-encode the input, but I've had to remove a tiny bit of functionality where that's not doable, as keeping all characters seems more important and it's certainly not worth any more time right now.
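
    As a rough illustration of the base64 idea (the sample string is made up): the encoded form uses only a small ASCII alphabet, so it can pass through a strict character filter or storage layer, and the original text comes back unchanged on decoding.

```php
<?php
$original = '日本語 <script>alert("x")</script>';

$encoded = base64_encode($original);

// The encoded form only uses A-Z, a-z, 0-9, +, / and = padding.
$isPlainAscii = preg_match('/\A[A-Za-z0-9+\/=]*\z/', $encoded) === 1;

// Strict mode (second argument true) makes base64_decode() return false
// if the input contains characters outside the base64 alphabet.
$decoded = base64_decode($encoded, true);

var_dump($isPlainAscii);          // bool(true)
var_dump($decoded === $original); // bool(true)
```

    Keep in mind that this is an encoding, not sanitisation: once decoded, the data is exactly as untrusted as before and still needs escaping wherever it is output.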


    Having said that, I came across this, but the issue then becomes efficiency, because a generic function has to cover so many characters. That isn't a huge issue, though (Chinese, Russian and Greek may have separate webpages, etc.).

