du548397507
2018-03-29 05:55

Groovy character encoding results do not match for unsupported characters

Here is my analysis so far, and where I am stuck. Take the character below, which is supported by the UTF-8 character set but not by EUC-JP: “―”. In PHP, unpack('C*', $input_string) followed by var_dump() shows a string as an array of byte values. For the EUC-JP encoded text it returns:

[161, 189, 10] //Note: [3]=>int(10) for Line Feed.

Similarly, when I produce the byte array for the UTF-8 encoded text, it returns:

[226, 128, 141, 10] //Note: [4]=>int(10) for Line Feed.

But when I try the same thing in Groovy, it behaves completely differently. For EUC-JP the bytes are as follows:

[-95, -67, 10] //Note: [3]=>int(10) for Line Feed.

For UTF-8,

[-30, -128, -107, 10] //Note: [4]=>int(10) for Line Feed.

N.B.: I read the data directly from two different text files, encoded in EUC-JP and UTF-8 respectively. The last byte of each array above is LF (Line Feed). Because the byte values differ between the two languages for the same character encoding, I cannot get the hashes they produce to match.

Here is the code so far, starting with PHP:

<?php
$myfile = fopen("euc_jp.txt", "r") or die("Unable to open file!");
$str1 = fread($myfile,filesize("euc_jp.txt"));

echo "Read From File EUC-JP:<br/>";
echo $str1;
$byte_array1 = unpack('C*', $str1);
echo "<br/>Byte Dump of EUC-JP File Content:<br/>";
var_dump($byte_array1);

echo "<br/><br/>";
$myfile2 = fopen("utf_8.txt", "r") or die("Unable to open file!");
$str2 = fread($myfile2,filesize("utf_8.txt"));

echo "<br/><br/>";
echo "Read From File UTF-8:<br/>";
echo $str2;
$byte_array2 = unpack('C*', $str2);
echo "<br/>Byte Dump of UTF-8 File Content:<br/>";
var_dump($byte_array2);


$encodedToEucJp = mb_convert_encoding($str2, "EUC-JP", "UTF-8"); // target encoding first, then the source encoding
echo "<br/><br/>After conversion (UTF-8) to (EUC-JP): <br/>";
echo $encodedToEucJp;

echo "<br/><br/>";

echo "Hash Generation Directly From EUC-JP:<br/>";
print_r(md5($str1));

echo "<br/><br/>";
echo "Hash Generation From UTF-8 File Content After Encoded to EUC-JP:<br/>";
print_r(md5($encodedToEucJp));

fclose($myfile);
fclose($myfile2);
?>

And in Groovy:

println(new File('/var/www/html/euc_jp.txt').getText('EUC-JP').getBytes("EUC-JP"))
println(new File('/var/www/html/utf_8.txt').getText('UTF-8').getBytes("UTF-8"))

So this is where I am stuck. First, the byte representation differs between the two languages; if that is not a limitation of Groovy (and Java 8), how do I produce the same bytes that PHP produces? Second, what is the equivalent of the native PHP function mb_convert_encoding(), so that I can convert a string from one encoding to another even when it contains characters that are not supported by both encodings?


1 answer

  • douzhangkui2467 2018-03-29 06:05
    Accepted

    Java and Groovy bytes are two's complement signed values, i.e. the value of one byte is between -128 and +127.

    To calculate the corresponding unsigned value for the same one byte, add 256 to a negative value, e.g. -95 + 256 = 161
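
    The "add 256" rule above can be sketched in Java roughly as follows. The class and method names are made up for illustration. Masking with `& 0xFF` is the usual idiom and works the same way in Groovy; `Byte.toUnsignedInt(b)` (Java 8 and later) should give the same result.

    ```java
    import java.util.Arrays;

    final class UnsignedBytes {
        // Reinterpret a signed Java byte (-128..127) as an unsigned value (0..255).
        // For negative inputs this is the same as adding 256.
        static int toUnsigned(byte b) {
            return b & 0xFF;
        }

        // Convert a whole byte array to the unsigned values PHP would show.
        static int[] toUnsignedArray(byte[] bytes) {
            int[] out = new int[bytes.length];
            for (int i = 0; i < bytes.length; i++) {
                out[i] = toUnsigned(bytes[i]);
            }
            return out;
        }

        public static void main(String[] args) {
            byte[] eucJp = {-95, -67, 10}; // the values Groovy printed for the EUC-JP file
            System.out.println(Arrays.toString(toUnsignedArray(eucJp))); // [161, 189, 10]
        }
    }
    ```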

    So, what you see are the same bytes. It's just that PHP prints the values as unsigned, and Groovy prints the values as signed. They are still the same 8 bits in a byte.
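
    Since the bits are identical, a hash computed over them should not depend on how the values are printed. Here is a minimal sketch, assuming the goal is to reproduce the lowercase hex string that PHP's `md5()` returns; the `Md5Hex` class and `md5Hex` method are hypothetical names, while `MessageDigest` is the standard JDK API.

    ```java
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    final class Md5Hex {
        // MD5 over the raw bytes, rendered as 32 lowercase hex digits.
        static String md5Hex(byte[] data) {
            try {
                MessageDigest md = MessageDigest.getInstance("MD5");
                StringBuilder sb = new StringBuilder();
                for (byte b : md.digest(data)) {
                    sb.append(String.format("%02x", b & 0xFF));
                }
                return sb.toString();
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("MD5 is not available", e);
            }
        }

        public static void main(String[] args) {
            byte[] signedForm = {-95, -67, 10};                        // as Groovy prints them
            byte[] hexForm = {(byte) 0xA1, (byte) 0xBD, (byte) 0x0A};  // same bytes written in hex
            // Both arrays hold the same bits, so both digests are equal.
            System.out.println(md5Hex(signedForm).equals(md5Hex(hexForm))); // true
        }
    }
    ```

    If the PHP and Groovy hashes still differ after this, the input bytes themselves differ (for example 8D versus 95 in the UTF-8 file), not their signedness.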

    Unsigned             Signed                 Hex
    161, 189, 10      == -95, -67, 10        == A1, BD, 0A
    226, 128, 141, 10 == -30, -128, -115, 10 == E2, 80, 8D, 0A
    226, 128, 149, 10 == -30, -128, -107, 10 == E2, 80, 95, 0A
    

    E2 80 8D is UTF-8 for Unicode Character 'ZERO WIDTH JOINER' (U+200D).

    E2 80 95 is UTF-8 for Unicode Character 'HORIZONTAL BAR' (U+2015).
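
    To check which character a byte sequence really encodes, one option is to decode it and list the code points. This is only a sketch; `CodePoints` and `describeUtf8` are names invented for this example.

    ```java
    import java.nio.charset.StandardCharsets;

    final class CodePoints {
        // Decode UTF-8 bytes and list each code point in U+XXXX notation.
        static String describeUtf8(byte[] utf8) {
            String s = new String(utf8, StandardCharsets.UTF_8);
            StringBuilder sb = new StringBuilder();
            for (int cp : s.codePoints().toArray()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(String.format("U+%04X", cp));
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            byte[] fromPhpDump = {(byte) 0xE2, (byte) 0x80, (byte) 0x8D};    // 226, 128, 141
            byte[] fromGroovyDump = {(byte) 0xE2, (byte) 0x80, (byte) 0x95}; // -30, -128, -107
            System.out.println(describeUtf8(fromPhpDump));    // U+200D
            System.out.println(describeUtf8(fromGroovyDump)); // U+2015
        }
    }
    ```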

    If you want to print the values of a byte array as readable text, I suggest you print the bytes using 2-digit HEX. That way every byte is consistently 2 hex-digits long.
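
    A possible way to do that in Java is sketched below; `HexDump` and `toHex` are illustrative names, not an existing API.

    ```java
    final class HexDump {
        // Render each byte as exactly two uppercase hex digits, separated by spaces.
        static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(String.format("%02X", b & 0xFF));
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(toHex(new byte[]{-95, -67, 10}));        // A1 BD 0A
            System.out.println(toHex(new byte[]{-30, -128, -107, 10})); // E2 80 95 0A
        }
    }
    ```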
