2014-06-17 00:11
Views: 39


I'm working on an iOS app with a PHP + MySQL backend. The app has a chat section that needs to support emoji. My tables use the utf8_unicode_ci collation. If I don't call 'set names utf8' in my scripts, emoji actually work: whatever is entered into the database is returned to the clients unchanged.

The problem is that this (if I understand it correctly) stores special characters incorrectly in the database, which breaks string comparison (i.e. ï is no longer treated as equal to i when comparing strings).

However, if I do call 'set names utf8', the emoji characters are suddenly inserted as a bunch of question marks.

Any suggestions on the proper way of handling this? Thanks!


1 answer

  • dongye1143
    dongye1143 2014-06-17 00:48

    The first issue is whether the database uses a diacritic-insensitive comparison. The second is composed characters: ï can be expressed either as one precomposed Unicode code point (U+00EF) or as two code points, the base letter i (U+0069) followed by a combining diaeresis (U+0308). This is a combining sequence, not a surrogate pair. NSString has methods to convert a string to a precomposed or decomposed form: precomposedStringWith* and decomposedStringWith*.
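
    To make the two spellings of ï mentioned above concrete, here is a minimal sketch that is not part of the original answer. The variable names are illustrative; it only uses NSString's normalization methods and the NSDiacriticInsensitiveSearch comparison option, and it assumes a Foundation environment such as iOS or macOS.

    ```objc
    #import <Foundation/Foundation.h>

    int main(void) {
        @autoreleasepool {
            NSString *precomposed = @"\u00EF";   // ï as a single code point
            NSString *decomposed  = @"i\u0308";  // i followed by a combining diaeresis

            // The two spellings differ in length (counted in UTF-16 code units).
            NSLog(@"lengths: %lu vs %lu",
                  (unsigned long)precomposed.length,
                  (unsigned long)decomposed.length);

            // After normalizing both to the precomposed form they compare equal.
            BOOL sameAfterNormalizing =
                [[precomposed precomposedStringWithCanonicalMapping]
                    isEqualToString:[decomposed precomposedStringWithCanonicalMapping]];
            NSLog(@"equal after normalization: %d", sameAfterNormalizing);

            // A diacritic-insensitive comparison treats ï and i as the same.
            NSComparisonResult r = [precomposed compare:@"i"
                                                options:NSDiacriticInsensitiveSearch];
            NSLog(@"ï vs i, diacritic-insensitive: %@",
                  r == NSOrderedSame ? @"same" : @"different");
        }
        return 0;
    }
    ```

    Normalizing on the client before sending text to the server is one way to keep comparisons consistent, although whether ï and i compare equal inside MySQL still depends on the column's collation.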

    It seems that MySQL supports two older forms of Unicode: ucs2 (an older encoding that was superseded by UTF-16), which uses 16 bits per character, and utf8, which uses up to 3 bytes per character. The bad news is that neither form supports plane 1 characters (mainly emoji), which require at least 17 bits. MySQL 5.5.3 and up also support utf8mb4, utf16, and utf32, which cover both BMP and supplementary characters (read: emoji), so using utf8mb4 for both the tables and the connection should let emoji be stored. See MySQL Unicode Character Sets.
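
    The 3-byte limit mentioned above can be checked from the client side. The following is a rough sketch, not part of the original answer: FitsInThreeByteUTF8 is a hypothetical helper name, and it relies on the fact that NSString stores text as UTF-16, where any character outside the BMP appears as a pair of surrogate code units.

    ```objc
    #import <Foundation/Foundation.h>

    // Hypothetical helper: YES if every character is in the BMP, meaning
    // each one needs at most 3 bytes when encoded as UTF-8.
    static BOOL FitsInThreeByteUTF8(NSString *s) {
        for (NSUInteger i = 0; i < s.length; i++) {
            unichar c = [s characterAtIndex:i];
            if (c >= 0xD800 && c <= 0xDFFF) {
                return NO; // surrogate code unit: part of a non-BMP character
            }
        }
        return YES;
    }

    int main(void) {
        @autoreleasepool {
            NSString *accented = @"\u00EF";     // ï, inside the BMP
            NSString *emoji    = @"\U0001F600"; // an emoji in plane 1

            NSLog(@"ï: %lu UTF-8 bytes, fits in 3 bytes per character: %d",
                  (unsigned long)[accented lengthOfBytesUsingEncoding:NSUTF8StringEncoding],
                  FitsInThreeByteUTF8(accented));
            NSLog(@"emoji: %lu UTF-8 bytes, fits in 3 bytes per character: %d",
                  (unsigned long)[emoji lengthOfBytesUsingEncoding:NSUTF8StringEncoding],
                  FitsInThreeByteUTF8(emoji));
        }
        return 0;
    }
    ```

    Here ï encodes to 2 bytes while the emoji encodes to 4, which is one byte more than MySQL's utf8 character set can hold per character.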

    Here is some code and results to demonstrate the different Unicode byte representations.
    Unicode is a 21-bit code space.
    UTF-32 directly represents the code points and clearly shows decomposed sequences (a base character followed by a combining mark).
    UTF-8 and UTF-16 use one or more code units (bytes for UTF-8, 16-bit units for UTF-16) to represent a Unicode character.

    NSLog(@"character: %@", @"Å");
    NSLog(@"decomposedStringWithCanonicalMapping UTF8: %@", [[@"Å" decomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF8StringEncoding]);
    NSLog(@"decomposedStringWithCanonicalMapping UTF16: %@", [[@"Å" decomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF16BigEndianStringEncoding]);
    NSLog(@"decomposedStringWithCanonicalMapping UTF32: %@", [[@"Å" decomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF32BigEndianStringEncoding]);

    NSLog(@"precomposedStringWithCanonicalMapping UTF8: %@", [[@"Å" precomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF8StringEncoding]);
    NSLog(@"precomposedStringWithCanonicalMapping UTF16: %@", [[@"Å" precomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF16BigEndianStringEncoding]);
    NSLog(@"precomposedStringWithCanonicalMapping UTF32: %@", [[@"Å" precomposedStringWithCanonicalMapping] dataUsingEncoding:NSUTF32BigEndianStringEncoding]);

    NSLog(@"character: %@", @"
