dongren2128 2009-10-30 02:13

PHP input filtering - checking for ASCII vs. checking for UTF-8

I need to ensure that all my strings are UTF-8. Would it be better to check that input coming from a user is ASCII-like or that it is UTF-8-like?

//KohanaPHP
function is_ascii($str) {
    return ! preg_match('/[^\x00-\x7F]/S', $str);
}

//Wordpress
function seems_utf8($Str) {
    for ($i=0; $i<strlen($Str); $i++) {
        if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
        elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
        elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
        elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
        elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
        elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
        else return false; # Does not match any model
        for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
            if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
            return false;
        }
    }
    return true;
}

I did some benchmarking on 100 strings (half valid UTF-8/ASCII and half not) and found that seems_utf8() takes 0.011 seconds while is_ascii() takes only 0.001. But my gut tells me that you get what you pay for, and that the UTF-8 check would be the better choice.

I'm then planning to do something like this to convert the strings that fail the check:

<?php

/* Example data */
$string[] = 'hello';
$string[] = 'asdfghjkl;qwertyuiop[]\zxcvbnm,./]12345657890-=+_)(*&^%$#@!';
$string[] = '';
$string[] = 'accentué';
$string[] = '»á½µÎ½Ï‰Î½ Ï„á½° ';
$string[] = '???R??=8 ????? ++++¦??? ???2??????';
$string[] = 'hello¦ùó 5/5¡45-52ZÜ¿»'. "0x93". octdec('77'). decbin(26). "F???pp?? ??? ". '»á½µÎ½Ï‰Î½ Ï„á½° ';


$time = microtime(true);

//Count the successes
$true = array(1 => 0, 0 => 0);

foreach($string as $s) {
    $r = seems_utf8($s);    //0.011
    $true[(int) $r]++;      // tally valid (1) vs. invalid (0) strings

    print_pre(mb_substr($s, 0, 30). ' is '. ($r ? 'UTF-8' : 'non-UTF-8'));


    if( ! $r ) {

        $e = mb_detect_encoding($s, "auto");

        print_pre('Encoding: '. $e);

        //Convert
        $s = iconv($e, 'UTF-8//TRANSLIT', $s);

        print_pre(mb_substr($s, 0, 30). ' is now '. (seems_utf8($s) ? 'valid' : 'not'). ' UTF-8');
    }

}

print_pre($true);
print_pre((microtime(TRUE) - $time). ' seconds');

function print_pre() { print '<pre>'; print_r(func_get_args()); print '</pre>'; }

4 answers

  • douwu8524 2009-12-06 00:30

    I'm not sure how necessary parts of this approach are. If you ask the user for UTF-8 input and they give you "something else", throw it away and ask again.

    The various character set detection functions out there are universally (and tragically, necessarily) imperfect. The ones in the mbstring (MB) library, as well as the ones in iconv, aren't even that advanced compared to some of the stuff that's out there. mb_detect_encoding() basically iterates through a list of character sets and returns the first one that makes the string it has in hand look valid. In this day and age it's probable that several would match (which is why the ordering is exposed through mb_detect_order()).
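
To make the ordering point concrete, here is a minimal sketch, assuming the mbstring extension is loaded. The helper name `detect_with_order()` and the sample bytes are illustrative and not part of the original post. It passes an explicit candidate list in strict mode, so you can see what comes back when more than one candidate fits the same bytes. The result for the ambiguous case can differ between PHP versions, so no output is shown for it.

```php
<?php
// Hypothetical helper: run detection against an explicit, ordered candidate list.
// With strict mode on (third argument true), false is returned when the
// bytes are not valid in any of the listed encodings.
function detect_with_order(string $bytes, array $candidates)
{
    return mb_detect_encoding($bytes, $candidates, true);
}

// "accentué" with the last letter stored as the single byte 0xE9.
// That byte is a legal character in both ISO-8859-1 and Windows-1252.
$legacy = "accentu\xE9";

// Same bytes, two candidate orders: compare what each call reports.
var_dump(detect_with_order($legacy, ['ISO-8859-1', 'Windows-1252']));
var_dump(detect_with_order($legacy, ['Windows-1252', 'ISO-8859-1']));

// With UTF-8 as the only candidate there is nothing to guess:
var_dump(detect_with_order($legacy, ['UTF-8']));            // a lone 0xE9 is not valid UTF-8
var_dump(detect_with_order("accentu\xC3\xA9", ['UTF-8']));  // the UTF-8 encoding of the same word
```

Whichever label the first two calls print, the bytes alone cannot tell you which legacy encoding the user actually meant, which is the reason to reject non-UTF-8 input rather than guess.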

    Ensure your pages are served with the correct HTTP and HTML character set declarations, and browsers should return data in the same encoding. To be extra specific, include the accept-charset attribute in your form tag. I've yet to discover a case where this was ignored that didn't represent an attack.
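
As a rough sketch of those three declarations together (HTTP header, HTML meta tag, and the form's accept-charset attribute): the function name `render_form_page()` and the field name `comment` are made up for illustration, and a real page would have more markup.

```php
<?php
// Hypothetical page builder; the "comment" field is only an example.
function render_form_page(): string
{
    return '<!DOCTYPE html>' . "\n"
        . '<html><head><meta charset="UTF-8"><title>Form</title></head>' . "\n"
        . '<body>' . "\n"
        // accept-charset tells the browser which encoding to submit in.
        . '<form method="post" action="" accept-charset="UTF-8">' . "\n"
        . '<input type="text" name="comment">' . "\n"
        . '<input type="submit" value="Send">' . "\n"
        . '</form>' . "\n"
        . '</body></html>' . "\n";
}

// HTTP-level declaration; it has to go out before any body output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo render_form_page();
```

Even with all three in place, the submitted bytes still need to be validated on the server, since a client is free to send anything.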

    To check the encoding of a byte stream, you can simply use mb_check_encoding().
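
A minimal sketch of that check, assuming the mbstring extension is available. The wrapper name `is_valid_utf8()` and the sample strings are illustrative only. It validates the bytes and reports which inputs would be rejected, in line with the "throw it away and ask again" advice rather than attempting a conversion.

```php
<?php
// Hypothetical wrapper around mb_check_encoding() from the mbstring extension.
function is_valid_utf8(string $input): bool
{
    return mb_check_encoding($input, 'UTF-8');
}

$samples = [
    'hello',            // plain ASCII, which is also valid UTF-8
    "accentu\xC3\xA9",  // "accentué" encoded as UTF-8
    "accentu\xE9",      // same word in ISO-8859-1; a lone 0xE9 byte is not valid UTF-8
];

foreach ($samples as $s) {
    // bin2hex() keeps the output readable even for invalid byte sequences.
    echo bin2hex($s), ' => ', is_valid_utf8($s) ? 'accept' : 'reject', PHP_EOL;
}
```

Because every ASCII string is also valid UTF-8, this single check covers both cases from the question; a separate is_ascii() test is only useful if you want to reject non-ASCII input altogether.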

    This answer was selected by the asker as the best answer.