du548397507 2018-03-29 05:55

Groovy character encoding results do not match for unsupported characters

Here is my analysis so far and where I am stuck. Take the character “―”, which is supported by the UTF-8 character set but, as far as I can tell, not by EUC-JP. In PHP, unpack('C*', $input_string) followed by var_dump() shows the bytes of a string. For the EUC-JP encoded input it returns:

[161, 189, 10] //Note: [3]=>int(10) for Line Feed.

Similarly, for the UTF-8 encoded input it returns:

[226, 128, 141, 10] //Note: [4]=>int(10) for Line Feed.

But when I tried the same thing in Groovy, it behaved completely differently. For EUC-JP the bytes are:

[-95, -67, 10] //Note: [3]=>int(10) for Line Feed.

For UTF-8,

[-30, -128, -107, 10] //Note: [4]=>int(10) for Line Feed.

N.B. I read the data directly from two text files, encoded as EUC-JP and UTF-8 respectively. The last byte of every array above is LF (Line Feed). Because the two languages show different byte values for the same character and the same encoding, I cannot get the hashes they produce to match.

Here is the code so far, starting with PHP:

<?php
$myfile = fopen("euc_jp.txt", "r") or die("Unable to open file!");
$str1 = fread($myfile,filesize("euc_jp.txt"));

echo "Read From File EUC-JP:<br/>";
echo $str1;
$byte_array1 = unpack('C*', $str1);
echo "<br/>Byte Dump of EUC-JP File Content:<br/>";
var_dump($byte_array1);

echo "<br/><br/>";
$myfile2 = fopen("utf_8.txt", "r") or die("Unable to open file!");
$str2 = fread($myfile2,filesize("utf_8.txt"));

echo "<br/><br/>";
echo "Read From File UTF-8:<br/>";
echo $str2;
$byte_array2 = unpack('C*', $str2);
echo "<br/>Byte Dump of UTF-8 File Content:<br/>";
var_dump($byte_array2);


// Name the source encoding explicitly instead of relying on mb_internal_encoding().
$encodedToEucJp = mb_convert_encoding($str2, "EUC-JP", "UTF-8");
echo "<br/><br/>After conversion (UTF-8) to (EUC-JP): <br/>";
echo $encodedToEucJp;

echo "<br/><br/>";

echo "Hash Generation Directly From EUC-JP:<br/>";
print_r(md5($str1));

echo "<br/><br/>";
echo "Hash Generation From UTF-8 File Content After Encoded to EUC-JP:<br/>";
print_r(md5($encodedToEucJp));

fclose($myfile);
fclose($myfile2);
?>

For Groovy,

println(new File('/var/www/html/euc_jp.txt').getText('EUC-JP').getBytes("EUC-JP"))
println(new File('/var/www/html/utf_8.txt').getText('UTF-8').getBytes("UTF-8"))

This is where I am stuck. First, the byte representation differs between the two languages. If this is not a limitation of Groovy or Java 8, how do I produce the same bytes that PHP produces? Second, what is the Groovy equivalent of PHP's native mb_convert_encoding() function, so that I can convert a string between encodings even when it contains characters that one of the encodings does not support?
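On the second point, one possible Java counterpart (which should also work from Groovy) is to decode the bytes with the source charset and re-encode them with the target charset. The sketch below is only an illustration: the class and method names are hypothetical, `String.getBytes(Charset)` silently replaces unmappable characters with the target charset's replacement byte (typically `?`), which may or may not match what PHP does under its current `mb_substitute_character()` setting, and whether EUC-JP is available depends on the JDK build.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class EncodingConvert {
    // Hypothetical helper: decode with `from`, re-encode with `to`.
    // Characters that `to` cannot represent become its replacement byte(s).
    static byte[] convert(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    // Lower-case hex MD5 of the raw bytes, the same form PHP's md5() returns.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xFF));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // UTF-8 bytes of U+2015 followed by a line feed, as in the question.
        byte[] utf8 = {(byte) 0xE2, (byte) 0x80, (byte) 0x95, 0x0A};
        if (Charset.isSupported("EUC-JP")) {
            byte[] euc = convert(utf8, StandardCharsets.UTF_8, Charset.forName("EUC-JP"));
            System.out.println(md5Hex(euc));
        } else {
            System.out.println("EUC-JP is not available in this JDK");
        }
    }
}
```

Hashing the byte array rather than the decoded string keeps the comparison with PHP's `md5()` on the same footing, since PHP strings are byte sequences.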


1 answer

  • douzhangkui2467 2018-03-29 06:05

    Java and Groovy bytes are two's complement signed values, i.e. the value of one byte is between -128 and +127.

    To calculate the corresponding unsigned value for the same one byte, add 256 to a negative value, e.g. -95 + 256 = 161

    So, what you see are the same bytes. It's just that PHP prints the values as unsigned, and Groovy prints the values as signed. They are still the same 8 bits in a byte.

    Unsigned             Signed                 Hex
    161, 189, 10      == -95, -67, 10        == A1, BD, 0A
    226, 128, 141, 10 == -30, -128, -115, 10 == E2, 80, 8D, 0A
    226, 128, 149, 10 == -30, -128, -107, 10 == E2, 80, 95, 0A
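The signed-to-unsigned mapping in the table can be reproduced with a short Java sketch (the same code should also run as Groovy). The class and method names are made up for illustration; the only real trick is masking with `0xFF`, which is equivalent to adding 256 to negative values.

```java
import java.util.Arrays;

class UnsignedBytes {
    // Map each signed Java byte (-128..127) to its unsigned value (0..255).
    static int[] toUnsigned(byte[] bytes) {
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF; // same result as Byte.toUnsignedInt(bytes[i]) on Java 8+
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] eucJp = {-95, -67, 10};       // what Groovy printed for the EUC-JP file
        byte[] utf8 = {-30, -128, -107, 10}; // what Groovy printed for the UTF-8 file
        System.out.println(Arrays.toString(toUnsigned(eucJp))); // [161, 189, 10]
        System.out.println(Arrays.toString(toUnsigned(utf8)));  // [226, 128, 149, 10]
    }
}
```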
    

    E2 80 8D is UTF-8 for Unicode Character 'ZERO WIDTH JOINER' (U+200D).

    E2 80 95 is UTF-8 for Unicode Character 'HORIZONTAL BAR' (U+2015).
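To check which character a UTF-8 byte sequence encodes, the bytes can be decoded and the first code point inspected. This is a minimal sketch with hypothetical class and method names; `Character.getName` is a standard JDK method, and the name it reports depends on the Unicode data shipped with the JDK.

```java
import java.nio.charset.StandardCharsets;

class CodePointLookup {
    // Decode the bytes as UTF-8 and return the first Unicode code point.
    static int firstCodePoint(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).codePointAt(0);
    }

    // Format as "U+XXXX NAME" using the JDK's Unicode name table.
    static String describe(byte[] utf8) {
        int cp = firstCodePoint(utf8);
        return String.format("U+%04X %s", cp, Character.getName(cp));
    }

    public static void main(String[] args) {
        byte[] fromPhpDump = {(byte) 0xE2, (byte) 0x80, (byte) 0x8D};    // 226, 128, 141
        byte[] fromGroovyDump = {(byte) 0xE2, (byte) 0x80, (byte) 0x95}; // 226, 128, 149
        System.out.println(describe(fromPhpDump));
        System.out.println(describe(fromGroovyDump));
    }
}
```

The two sequences decode to different code points, so the PHP and Groovy dumps quoted in the question were not produced from identical UTF-8 input.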

    If you want to print the values of a byte array as readable text, I suggest you print the bytes using 2-digit HEX. That way every byte is consistently 2 hex-digits long.
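A two-digit hex dump along those lines could look like the following Java sketch (it should run as Groovy too). `HexDump` and `toHex` are illustrative names, not part of any library.

```java
class HexDump {
    // Render every byte as exactly two upper-case hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 first so negative bytes are not sign-extended.
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(new byte[]{-95, -67, 10}));       // A1 BD 0A
        System.out.println(toHex(new byte[]{-30, -128, -107, 10})); // E2 80 95 0A
    }
}
```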

    The asker selected this answer as the best answer.
