PHP输入过滤 - 检查ascii与检查utf8

I need to insure that all my strings are utf8. Would it be better to check that input coming from a user is ascii-like or that it is utf8-like?

//KohanaPHP
function is_ascii($str) {
    return ! preg_match('/[^\x00-\x7F]/S', $str);
}

//Wordpress
function seems_utf8($Str) {
    for ($i=0; $i<strlen($Str); $i++) {
        if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
        elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
        elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
        elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
        elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
        elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
        else return false; # Does not match any model
        for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
            if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
            return false;
        }
    }
    return true;
}

I did some benchmarking on 100 strings (half valid utf8/ascii and half not) and found that seems_utf8() tasks 0.011 while is_ascii only takes 0.001. But my gut is telling me that you get what you pay for and the utf8 checking would be a better choice.

I'm planning on then doing something like this convert.

<?php

/* Example data */
$string[] = 'hello';
$string[] = 'asdfghjkl;qwertyuiop[]\zxcvbnm,./]12345657890-=+_)(*&^%$#@!';
$string[] = '';
$string[] = 'accentué';
$string[] = '»á½µÎ½Ï‰Î½ Ï„á½° ';
$string[] = '???R??=8 ????? ++++¦??? ???2??????';
$string[] = 'hello¦ùó 5/5¡45-52ZÜ¿»'. "0x93". octdec('77'). decbin(26). "F???pp?? ??? ". '»á½µÎ½Ï‰Î½ Ï„á½° ';


$time = microtime(true);

//Count the successes
$true = array(1 => 0, 0 => 0);

foreach($string as $s) {
    $r = seems_utf8($s);    //0.011

    print_pre(mb_substr($s, 0, 30). ' is '. ($r ? 'UTF-8' : 'non-UTF-8'));


    if( ! $r ) {

        $e = mb_detect_encoding($s, "auto");

        print_pre('Encoding: '. $e);

        //Convert
        $s = iconv($e, 'UTF-8//TRANSLIT', $s);

        print_pre(mb_substr($s, 0, 30). ' is now '. (seems_utf8($s) ? 'valid' : 'not'). ' UTF-8');
    }

}

print_pre($true);
print_pre((microtime(TRUE) - $time). ' seconds');

function print_pre() { print '<pre>'; print_r(func_get_args()); print '</pre>'; }

写回答
好问题 0 提建议
追加酬金
关注问题
分享
邀请回答
编辑收藏删除结题
收藏举报

4条回答默认最新

关注

码龄粉丝数原力等级 --

被采纳

被点赞

采纳率
douwu8524 2009-12-06 00:30
关注
I'm not sure how necessary parts of this approach are. If you ask the user for UTF-8 input, and they give you "something else" throw it away and ask again.

The various character set detecting functions out there are universally (and tragically, necessarily) imperfect. The ones in the MB library as well as the ones in iconv aren't even that advanced compared to some of the stuff that's out there. The mb_detect_encoding basically iterates through a list of character sets and returns the first one that makes the string it has in hand look valid. In this day and age it's probably that several would return true (which is why the ordering is exposed through mb_detect_order()).

Ensure your pages are provided with the correct HTTP & HTML character set declarations, and browsers should return data in the same. To be extra specific include the accept-charset declaration in your form tag. I've yet to discover a case where this was ignored that didn't represent an attack.

To check the encoding of a byte stream, you can simply use mb_check_encoding().

本回答被题主选为最佳回答 , 对您是否有帮助呢?

解决无用
评论打赏
分享
举报

评论

按下Enter换行，Ctrl+Enter发表内容

查看更多回答(3条)

报告相同问题？

关注问题

PHP输入过滤 - 检查ascii与检查utf8 php
2009-10-30 02:13

回答 4 已采纳 I'm not sure how necessary parts of this approach are. If you ask the user for UTF-8 input, and th
MySQL数据库迁移PHP的UTF-8问题 mysql php
2018-06-12 14:23

回答 3 已采纳 To check for double encoding, use SELECT HEX(col)... é should come back C3A9 (proper utf8), but i
PHP，将UTF-8转换为ASCII 8位 php
2012-08-07 09:54

回答 1 已采纳 try this for the software #2 iconv("UTF-8", "CP437", $this->_output); Extended ASCII is not
php过滤非utf-8字符串,关于utf 8：如何使用PHP截断字符串中的非ascii字符
2021-04-21 07:37

辣目洋子的博客，并尝试使用以下代码echo preg_replace('/[^a-z0-9|^.]/i', '_', iconv("UTF-8","ISO-8859-1//TRANSLIT",$string));因为文件名中有一个特殊的(非ASCII)字符，因此在使用PHP上传文件时会创建垃圾字符。我想要...
php utf-8不起作用 php
2013-10-30 12:18

回答 3 已采纳 There are several mistakes. Don't use [] or strlen() on a multibyte string. Use mb_substr() and mb
当我需要UTF-8时PHP以ASCII输出XML [重复] php xml
2014-10-09 20:21

回答 3 已采纳 It seems that your MySQL database does not contain UTF-8 encoded data. As a result it's output is
如果我在PHP中将UTF-8编码的字符串与ASCII字符串连接，那么结果字符串的编码是什么？ php
2019-01-29 17:25

回答 2 已采纳 It would depend firstly on whether you mean strict ASCII, which only includes 128 characters. Ever
detox-php:驯服有问题的文件名。更难
2021-03-10 13:28

在正常情况下，没有充分的理由将UTF-8字符音译为ASCII。但是，迫切需要替换典型Linux命令行上具有特殊含义的字符。诸如$ ， : ， ( ) ， [ ， ]类的字符或多或少都是有问题的。要求作曲家吉特 php（7.2.5或更...
PHP json_decode的UTF-8问题 json perl php
2014-03-19 17:27

回答 2 已采纳 This is entirely based on a misunderstanding. json_encode encodes non-ASCII characters as Unicode
PHP JSON_encode（）收到“格式错误的UTF-8字符，可能编码错误”（错误） php
2018-05-30 18:07

回答 2 已采纳 SOLVED! The issue was in the function mb_detect_order(), this function just don't work as I was e
PHP UTF-8 mb_convert_encode和Internet-Explorer php
2015-07-15 12:56

回答 2 已采纳 Although I prefer using urlencoded strings in address bar but for your case you can try to encode
UTF-8贯穿始终
2019-12-11 16:24

p15097962069的博客我正在设置一个新服务器，并希望在我的Web应用程序中完全支持UTF-8。我过去曾在现有服务器上尝试过此操作，但最终似乎总是不得不退回到ISO-8859-1。我到底需要在哪里设置编码/字
如何从PHP中的UTF8字符“删除变音符号”？ mysql php
2018-02-02 18:21

回答 2 已采纳 intl's Transliterator will let you define far more in-depth transliteration rules. The full docume
php 过滤用户url输入,PHP 用户提交数据之过滤、验证和转义
2021-04-12 15:23

弄哭你的博客 1、过滤输入过滤输入是指转义或删除不安全的字符。在数据到达应用的存储层之前，一定要过滤输入数据，这是第一道防线。假如网站的评论框接受HTML，用户可以随意在评论中加入恶意的标签，如下所示：这篇文章很有用！...
php 宽字节注入转成utf8,Hr-Papers|宽字节注入深度讲解
2021-04-25 02:24

瓦罗兰十字军的博客 /*Team:红日安全团队团队成员：CPRTitle：宽字节注入*/原理介绍基本概概念宽字节是相对于ascII这样单字节而言的；...一个utf-8编码的汉字，占用3个字节转义函数：为了过滤用户输入的一些数据，对特殊的字符加上反...
没有解决我的问题, 去提问

悬赏问题

¥15 表达式必须是可修改的左值
¥15 如何绘制动力学系统的相图
¥15 对接wps接口实现获取元数据
¥20 给自己本科IT专业毕业的妹m找个实习工作
¥15 用友U8：向一个无法连接的网络尝试了一个套接字操作，如何解决？
¥30 我的代码按理说完成了模型的搭建、训练、验证测试等工作(标签-网络|关键词-变化检测)
¥50 mac mini外接显示器画质字体模糊
¥15 TLS1.2协议通信解密
¥40 图书信息管理系统程序编写
¥20 Qcustomplot缩小曲线形状问题

PHP输入过滤 - 检查ascii与检查utf8

4条回答 默认 最新

悬赏问题

4条回答默认最新