douyanzhou1450 2016-09-08 01:16

How does PHP determine the character encoding of the data in a browser POST request?

When the browser sends data in the body of a POST request (i.e. the name=value pairs from form elements), how does PHP determine the character encoding so that it can decode the byte stream into characters for its own internal use?
I can understand that for some tasks PHP won't need to decode anything, e.g. for SQL INSERT queries it may simply pass the string along to the DBMS with no additional processing.
But for text processing or regex operations, I imagine PHP needs to decode the byte stream into characters before it can perform tests, pattern matches, etc. on them.
Also, because the encoding is chosen by the browser, it seems PHP will need guidance from the browser on which charset it used to encode the POST data.
Expecting this guidance to be in the request headers, I set up a text form with

<meta charset="utf-8">

in the head of the webpage containing the form. After I entered some values and submitted the form, the request headers contained no obvious information about how the browser encoded the POST data:

POST /experiments/foo.php HTTP/1.1
Host: localhost
Connection: keep-alive
Content-Length: 57
Pragma: no-cache
Cache-Control: no-cache
Origin: http://localhost
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
Content-Type: application/x-www-form-urlencoded
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Referer: http://localhost/experiments/how_does_php_encode_data_it_receives_from_browser.php
Accept-Encoding: gzip, deflate
Accept-Language: en-GB,en-US;q=0.8,en;q=0.6

Or is there something else going on, e.g. is the browser expected to encode characters to some predetermined standard?
How does PHP know how to decode the data it receives in browser POST requests?

1 answer

  • dongqing904999 2016-09-08 02:15

    Regarding GET data, the W3C HTML 4.01 specification states:

    Note. The "get" method restricts form data set values to ASCII characters.
    Only the "post" method (with enctype="multipart/form-data") is specified to cover the entire [ISO10646] character set.

    So with GET the browser seems to be locked into ASCII, whereas if the form element has the attribute enctype="multipart/form-data" the standard supports the larger [ISO10646] character set.
    And I guess that, because it is closer to a pure byte stream, the default Content-Type of application/x-www-form-urlencoded can carry any character encoding. In particular, this article states:
    http://www.herongyang.com/PHP/Non-ASCII-Form-Basic-Rules.html

    URL encoding converts all non ASCII bytes in the form of "%xx", "xx" is the HEX value of the byte.
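    To make the quoted rule concrete, here is a minimal sketch of how the same form value produces different request bodies depending on the charset the sender picks. It uses Python's standard library only to show the bytes on the wire; the field name and value are made up for illustration and say nothing about PHP internals.

```python
from urllib.parse import urlencode

# Hypothetical form data: one field containing a non-ASCII character.
fields = {"name": "café"}

# The percent-escapes are just the bytes of the value in the chosen charset.
as_utf8 = urlencode(fields, encoding="utf-8")
as_latin1 = urlencode(fields, encoding="latin-1")

print(as_utf8)    # name=caf%C3%A9  (two bytes for "é")
print(as_latin1)  # name=caf%E9     (one byte for "é")
```

    Neither body says which charset produced it, which is the gap discussed in this answer.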

    So this seems to explain which charsets the browser can possibly send, but not how it tells PHP which charset it actually sent (with the exception of GET, which, per the note above, PHP can assume is ASCII). From what I can understand, there is essentially no direct guidance from the browser as to the character encoding of the form data it is sending.
    I could be wrong, though, and would be interested in any feedback on or alternatives to this theory.
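    As a rough illustration of why a wrong guess on the receiving side matters, the sketch below decodes one and the same urlencoded body under two assumed charsets. It uses Python's standard library as a neutral stand-in for the server; the body string is a made-up example.

```python
from urllib.parse import parse_qs

# Hypothetical request body: "café" as sent by a browser using UTF-8.
body = "name=caf%C3%A9"

# The receiver has to assume a charset; the body itself carries no label.
decoded_as_utf8 = parse_qs(body, encoding="utf-8")
decoded_as_latin1 = parse_qs(body, encoding="latin-1")

print(decoded_as_utf8)    # {'name': ['café']}
print(decoded_as_latin1)  # {'name': ['cafÃ©']}  (mojibake, but no error)
```

    Both calls succeed, so a mismatch shows up only as corrupted text, not as a decoding failure.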
    Otherwise, from what I can tell, the integrity of the scheme essentially relies on the server simply "remembering" which

    <meta charset="utf-8">
    

    or

    <form ... accept-charset="utf-8">
    

    values it sent to users (and hoping users didn't change the character encoding via browser settings), and expecting that the browser will faithfully send subsequent requests in that charset.
    In other words, if a web designer on your team who is responsible for the HTML sets the meta tag <meta charset="utf-8">, they would need to tell the database admin to set up the database schema, tables, etc. to expect UTF-8.
    This is because the server-side devs/DBAs can't dynamically determine the encoding of each submission (e.g. if a form submission came from a user in a different country whose browser may be set to a different charset) and potentially reject it or log a warning.
    Basically, it seems the devs need to explicitly set the charset for every HTML page containing forms, e.g. with <meta charset="utf-8">, and then just trust that the browser will send the POST data in the same charset as the HTML containing the form.
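    Although the request carries no charset label, a server that expects UTF-8 can at least check that the submitted bytes are well-formed in that charset. The following is a minimal sketch of that idea in Python; `decode_form` and its parameters are hypothetical names, and this is not how PHP itself handles POST data.

```python
from urllib.parse import parse_qsl

def decode_form(body, expected_charset="utf-8"):
    """Decode a urlencoded body, or return None if its bytes are not
    valid in the charset the server expects (hypothetical helper)."""
    try:
        return parse_qsl(
            body,
            keep_blank_values=True,
            encoding=expected_charset,
            errors="strict",  # raise instead of silently replacing bad bytes
        )
    except UnicodeDecodeError:
        return None

good = decode_form("name=caf%C3%A9")  # valid UTF-8 bytes
bad = decode_form("name=caf%E9")      # a lone 0xE9 byte is not valid UTF-8

print(good)  # [('name', 'café')]
print(bad)   # None
```

    A passing check only shows the bytes are plausible for the expected charset. It cannot prove which charset the browser used, and a single-byte charset such as Latin-1 would accept any byte sequence.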
