UTF-8 and Unicode conversion question

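1. How do I convert a string that contains Chinese characters to Unicode, and then how do I convert it back?

Note: the string contains both English and Chinese characters.
The code should be written in C.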


3 answers

窝米逗佛~ 2020-01-21 11:01


https://www.cnblogs.com/cfas/p/7931787.html
Try this:

#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Helper that the answer calls but does not show. Assumed here to behave
 * like the one in the linked article: returns 0 for an ASCII byte, -1 for a
 * continuation byte (10xxxxxx), otherwise the sequence length (2..6)
 * announced by the lead byte. */
int enc_get_utf8_size( unsigned char c )
{
    if ( c < 0x80 ) return 0;
    if ( c < 0xC0 ) return -1;
    if ( c < 0xE0 ) return 2;
    if ( c < 0xF0 ) return 3;
    if ( c < 0xF8 ) return 4;
    if ( c < 0xFC ) return 5;
    return 6;
}

/*****************************************************************************
* Convert a UTF-8 encoded string to Unicode code points (UCS-4).
*
* Parameters:
*    pInput      points to the input buffer, UTF-8 encoded
*    length      number of bytes in pInput
*    Unic        points to the output buffer; each code point is stored
*                as 4 bytes, so it must hold at least 4 * length bytes.
*
* Return value:
*    The number of bytes written to Unic on success; 0 on failure.
*
* Notes:
*     1. UTF-8 has no byte-order issue, but UCS-2 and UCS-4 do;
*        the byte order is either big endian or little endian;
*        Intel processors are little endian, and little endian is used here
*        (the low address holds the low byte).
*     2. 5- and 6-byte sequences come from the original UTF-8 definition;
*        RFC 3629 limits UTF-8 to 4 bytes (U+10FFFF).
****************************************************************************/
int enc_utf8_to_unicode_one( const char *pInput, unsigned int length, char *Unic )
{
    assert( pInput != NULL && Unic != NULL );

    // Work on unsigned bytes so that shifts and masks never see negative values.
    const unsigned char *p = ( const unsigned char * )pInput;
    unsigned char *pOutput = ( unsigned char * )Unic;
    int n = 0;   // bytes written to the output so far

    while ( length > 0 )
    {
        int utfbytes = enc_get_utf8_size( *p );
        unsigned long code;
        int i;

        if ( utfbytes == 0 )
        {
            // ASCII: the byte is the code point.
            code = *p;
            utfbytes = 1;
        }
        else if ( utfbytes >= 2 && utfbytes <= 6 )
        {
            if ( ( unsigned int )utfbytes > length )
                return 0;   // sequence is cut off at the end of the input

            // Payload bits of the lead byte: 110xxxxx has 5, 1110xxxx has 4, ...
            code = *p & ( 0x7F >> utfbytes );
            for ( i = 1; i < utfbytes; ++i )
            {
                // Every following byte must look like 10xxxxxx.
                if ( ( p[ i ] & 0xC0 ) != 0x80 )
                    return 0;
                code = ( code << 6 ) | ( p[ i ] & 0x3F );
            }
        }
        else
            return 0;   // stray continuation byte

        // Store the code point little endian, low byte first.
        pOutput[ n ]     = ( unsigned char )( code & 0xFF );
        pOutput[ n + 1 ] = ( unsigned char )( ( code >> 8 ) & 0xFF );
        pOutput[ n + 2 ] = ( unsigned char )( ( code >> 16 ) & 0xFF );
        pOutput[ n + 3 ] = ( unsigned char )( ( code >> 24 ) & 0xFF );
        n += 4;

        p += utfbytes;
        length -= utfbytes;
    }
    return n;
}
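#include <string.h>   // for strlen

int main( int argc, char** argv )
{
    // "成都" in UTF-8, written as escaped bytes so that the result does not
    // depend on the encoding of the source file.
    const char *utf8 = "\xE6\x88\x90\xE9\x83\xBD";
    char Unic[ 512 ] = { 0 };

    ( void )argc;
    ( void )argv;
    enc_utf8_to_unicode_one( utf8, ( unsigned int )strlen( utf8 ), Unic );
    return 0;
}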
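
The question also asks how to convert back. A minimal sketch of that direction in C follows, assuming the code points are already available as an array of unsigned long values. The names unicode_to_utf8_one and unicode_to_utf8 are hypothetical and do not come from the linked article. The sketch follows RFC 3629, so it emits at most 4 bytes per character and rejects surrogates and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: encode one code point as UTF-8.
 * Writes up to 4 bytes to out and returns how many were written.
 * Returns 0 for a surrogate (U+D800..U+DFFF) or a value above U+10FFFF. */
size_t unicode_to_utf8_one( unsigned long cp, unsigned char *out )
{
    assert( out != NULL );

    if ( cp < 0x80 )                      /* 0xxxxxxx */
    {
        out[ 0 ] = ( unsigned char )cp;
        return 1;
    }
    if ( cp < 0x800 )                     /* 110xxxxx 10xxxxxx */
    {
        out[ 0 ] = ( unsigned char )( 0xC0 | ( cp >> 6 ) );
        out[ 1 ] = ( unsigned char )( 0x80 | ( cp & 0x3F ) );
        return 2;
    }
    if ( cp >= 0xD800 && cp <= 0xDFFF )   /* surrogates are not characters */
        return 0;
    if ( cp < 0x10000 )                   /* 1110xxxx 10xxxxxx 10xxxxxx */
    {
        out[ 0 ] = ( unsigned char )( 0xE0 | ( cp >> 12 ) );
        out[ 1 ] = ( unsigned char )( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
        out[ 2 ] = ( unsigned char )( 0x80 | ( cp & 0x3F ) );
        return 3;
    }
    if ( cp <= 0x10FFFF )                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    {
        out[ 0 ] = ( unsigned char )( 0xF0 | ( cp >> 18 ) );
        out[ 1 ] = ( unsigned char )( 0x80 | ( ( cp >> 12 ) & 0x3F ) );
        out[ 2 ] = ( unsigned char )( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
        out[ 3 ] = ( unsigned char )( 0x80 | ( cp & 0x3F ) );
        return 4;
    }
    return 0;
}

/* Hypothetical helper: convert count code points to a NUL-terminated
 * UTF-8 string. Returns the number of bytes written, not counting the
 * terminator, or 0 on an invalid code point or when out is too small. */
size_t unicode_to_utf8( const unsigned long *cps, size_t count,
                        char *out, size_t outsize )
{
    size_t n = 0;
    size_t i;

    assert( cps != NULL && out != NULL );
    if ( outsize == 0 )
        return 0;

    for ( i = 0; i < count; ++i )
    {
        unsigned char buf[ 4 ];
        size_t len = unicode_to_utf8_one( cps[ i ], buf );

        /* Keep one byte free for the terminating NUL. */
        if ( len == 0 || n + len + 1 > outsize )
            return 0;
        memcpy( out + n, buf, len );
        n += len;
    }
    out[ n ] = '\0';
    return n;
}
```

For a mixed string such as "A成都", the input would be the three code points 'A', 0x6210 and 0x90FD, and the output buffer needs room for at least 8 bytes including the terminator.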
