du548397507 2018-03-29 05:55

Groovy character encoding results do not match for unsupported characters

This is my analysis and my obstacle so far. Take the character “―”, which is supported by the “UTF-8” character set but, as far as I can tell, not by “EUC-JP”. In PHP, unpack('C*', input_string) followed by var_dump() shows any string as a byte array. For the “EUC-JP” encoding, it returns,

[161, 189, 10] //Note: [3]=>int(10) for Line Feed.

Similarly, when I produce the byte array with the “UTF-8” encoding, it returns,

[226, 128, 141,10] //Note: [4]=>int(10) for Line Feed.

But when I try the same thing in Groovy, it behaves completely differently. For EUC-JP the bytes are as follows,

[-95, -67, 10] //Note: [3]=>int(10) for Line Feed.

For UTF-8,

[-30, -128, -107, 10] //Note: [4]=>int(10) for Line Feed.

N.B. I read the data directly from two different text files, encoded with EUC-JP and UTF-8 respectively. The last byte of each array above is LF (Line Feed). Because the bytes differ between the two languages for the same character encoding, the hashes they produce cannot be matched.

Here is the code sample so far, starting with PHP,

<?php
$myfile = fopen("euc_jp.txt", "r") or die("Unable to open file!");
$str1 = fread($myfile,filesize("euc_jp.txt"));

echo "Read From File EUC-JP:<br/>";
echo $str1;
$byte_array1 = unpack('C*', $str1);
echo "<br/>Byte Dump of EUC-JP File Content:<br/>";
var_dump($byte_array1);

echo "<br/><br/>";
$myfile2 = fopen("utf_8.txt", "r") or die("Unable to open file!");
$str2 = fread($myfile2,filesize("utf_8.txt"));

echo "<br/><br/>";
echo "Read From File UTF-8:<br/>";
echo $str2;
$byte_array2 = unpack('C*', $str2);
echo "<br/>Byte Dump of UTF-8 File Content:<br/>";
var_dump($byte_array2);


$encodedToEucJp = mb_convert_encoding($str2, "EUC-JP", "UTF-8"); // source encoding stated explicitly
echo "<br/><br/>After conversion (UTF-8) to (EUC-JP): <br/>";
echo $encodedToEucJp;

echo "<br/><br/>";

echo "Hash Generation Directly From EUC-JP:<br/>";
print_r(md5($str1));

echo "<br/><br/>";
echo "Hash Generation From UTF-8 File Content After Encoded to EUC-JP:<br/>";
print_r(md5($encodedToEucJp));

fclose($myfile);
fclose($myfile2);
?>

For Groovy,

println(new File('/var/www/html/euc_jp.txt').getText('EUC-JP').getBytes("EUC-JP"))
println(new File('/var/www/html/utf_8.txt').getText('UTF-8').getBytes("UTF-8"))

This is my obstacle so far. First, the byte representation differs between the two languages; if this is not a limitation of Groovy and Java 8, how do I produce the same bytes that PHP produces? Second, what is the equivalent of the native PHP function mb_convert_encoding(), so that I can convert a string from one encoding to another even when it contains characters that are not supported by both encodings?

1 answer

  • douzhangkui2467 2018-03-29 06:05

    Java and Groovy bytes are two's complement signed values, i.e. the value of one byte is between -128 and +127.

    To calculate the corresponding unsigned value of the same byte, add 256 to a negative value, e.g. -95 + 256 = 161.
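As a minimal Java sketch of that arithmetic (the class name `UnsignedBytes` and the helper `toUnsigned` are made up for illustration; Groovy can call the same JDK methods), masking a byte with `0xFF` yields the unsigned value that PHP prints. `Byte.toUnsignedInt`, available since Java 8, should give the same result.

```java
import java.util.Arrays;

public class UnsignedBytes {
    // Hypothetical helper: maps each signed byte (-128..127) to its unsigned value (0..255).
    static int[] toUnsigned(byte[] bytes) {
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF; // equivalent to adding 256 to a negative value
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] eucJp = {-95, -67, 10};       // values Groovy printed for the EUC-JP file
        byte[] utf8 = {-30, -128, -107, 10}; // values Groovy printed for the UTF-8 file
        System.out.println(Arrays.toString(toUnsigned(eucJp))); // [161, 189, 10]
        System.out.println(Arrays.toString(toUnsigned(utf8)));  // [226, 128, 149, 10]
    }
}
```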

    So, what you see are the same bytes. It's just that PHP prints the values as unsigned, and Groovy prints the values as signed. They are still the same 8 bits in a byte.

    Unsigned             Signed                 Hex
    161, 189, 10      == -95, -67, 10        == A1, BD, 0A
    226, 128, 141, 10 == -30, -128, -115, 10 == E2, 80, 8D, 0A
    226, 128, 149, 10 == -30, -128, -107, 10 == E2, 80, 95, 0A
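Since the question is ultimately about matching hashes, here is a small Java sketch, under the assumption that the digest is taken over the raw file bytes as PHP's md5() does. The class `SameBytesSameHash` and the helper `md5Hex` are hypothetical names. It suggests that the signed or unsigned way of writing the values has no effect on the digest, because the underlying bytes are identical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SameBytesSameHash {
    // Hypothetical helper: MD5 over the raw bytes, rendered as 32 lowercase hex digits.
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xFF));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a standard JDK algorithm, so this is not expected to happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] signedForm = {-95, -67, 10};                        // as Groovy prints them
        byte[] unsignedForm = {(byte) 161, (byte) 189, (byte) 10}; // as PHP prints them
        System.out.println(md5Hex(signedForm).equals(md5Hex(unsignedForm))); // true
    }
}
```

If the two languages still produce different hashes, the difference would have to come from the input bytes themselves (for example 141 versus 149 in the third UTF-8 byte), not from how each language displays them.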
    

    E2 80 8D is UTF-8 for Unicode Character 'ZERO WIDTH JOINER' (U+200D).

    E2 80 95 is UTF-8 for Unicode Character 'HORIZONTAL BAR' (U+2015).

    If you want to print the values of a byte array as readable text, I suggest you print the bytes using 2-digit HEX. That way every byte is consistently 2 hex-digits long.
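One way to follow that suggestion, sketched in Java (the class `HexDump` and the helper `toHex` are illustrative names, not part of any library), is to format each byte as two uppercase hex digits:

```java
public class HexDump {
    // Hypothetical helper: renders each byte as two uppercase hex digits, space-separated.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask so negative bytes stay 2 digits
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(new byte[]{-95, -67, 10}));        // A1 BD 0A
        System.out.println(toHex(new byte[]{-30, -128, -107, 10})); // E2 80 95 0A
    }
}
```

The same output from both languages would make a byte-for-byte comparison straightforward, regardless of signedness.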

    This answer was accepted by the asker as the best answer.
