This is my analysis and the obstacle so far. Take the character below, which is supported by the “UTF-8” character set but not by “EUC-JP”:
“―”
In PHP, unpack('C*', $str) followed by var_dump() shows any string as a byte array. For the “EUC-JP” encoding it returns,
[161, 189, 10] //Note: [3]=>int(10) for Line Feed.
Similarly, when I produce the byte array for the “UTF-8” encoding, it returns,
[226, 128, 141, 10] //Note: [4]=>int(10) for Line Feed.
But when I tried the same thing in Groovy, it behaved completely differently. For EUC-JP the byte arrangement is as follows,
[-95, -67, 10] //Note: [3]=>int(10) for Line Feed.
For UTF-8,
[-30, -128, -107, 10] //Note: [4]=>int(10) for Line Feed.
N.B. I read the data directly from two different text files, encoded in EUC-JP and UTF-8 respectively. The last byte of each array above is the LF (Line Feed). Because the byte arrangement for the same character encoding differs between the two languages, I cannot get the generated hashes to match.
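A note on the negative numbers: Java's byte type is signed (-128 to 127), and Groovy prints byte arrays with that signed view, whereas PHP's unpack('C*') reports unsigned values (0 to 255). Masking each byte with 0xFF gives the unsigned form. The following is a minimal sketch in plain Java; the class and method names (UnsignedBytes, toUnsigned) are made up for illustration, and the input arrays are simply the Groovy outputs quoted above.

```java
import java.util.Arrays;

class UnsignedBytes {
    // Map each signed Java byte (-128..127) to its unsigned value (0..255).
    static int[] toUnsigned(byte[] bytes) {
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] eucJp = {-95, -67, 10};        // Groovy output for the EUC-JP file
        byte[] utf8 = {-30, -128, -107, 10};  // Groovy output for the UTF-8 file
        System.out.println(Arrays.toString(toUnsigned(eucJp))); // prints [161, 189, 10]
        System.out.println(Arrays.toString(toUnsigned(utf8)));  // prints [226, 128, 149, 10]
    }
}
```

Only the printed representation changes here. The underlying bits of each byte are untouched, so a hash computed over the byte array does not depend on whether the values are displayed as signed or unsigned.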
Here is the code sample so far, starting with PHP:
<?php
// Read the EUC-JP file as raw bytes.
$myfile = fopen("euc_jp.txt", "r") or die("Unable to open file!");
$str1 = fread($myfile, filesize("euc_jp.txt"));
echo "Read From File EUC-JP:<br/>";
echo $str1;
// unpack('C*') returns the bytes as unsigned integers (array keys start at 1).
$byte_array1 = unpack('C*', $str1);
echo "<br/>Byte Dump of EUC-JP File Content:<br/>";
var_dump($byte_array1);
echo "<br/><br/>";
// Read the UTF-8 file as raw bytes.
$myfile2 = fopen("utf_8.txt", "r") or die("Unable to open file!");
$str2 = fread($myfile2, filesize("utf_8.txt"));
echo "<br/><br/>";
echo "Read From File UTF-8:<br/>";
echo $str2;
$byte_array2 = unpack('C*', $str2);
echo "<br/>Byte Dump of UTF-8 File Content:<br/>";
var_dump($byte_array2);
// Convert the UTF-8 content to EUC-JP, naming the source encoding explicitly
// instead of relying on the mbstring internal encoding.
$encodedToEucJp = mb_convert_encoding($str2, "EUC-JP", "UTF-8");
echo "<br/><br/>After conversion (UTF-8) to (EUC-JP): <br/>";
echo $encodedToEucJp;
echo "<br/><br/>";
// Compare the MD5 of the native EUC-JP bytes with the MD5 of the converted bytes.
echo "Hash Generation Directly From EUC-JP:<br/>";
print_r(md5($str1));
echo "<br/><br/>";
echo "Hash Generation From UTF-8 File Content After Encoded to EUC-JP:<br/>";
print_r(md5($encodedToEucJp));
fclose($myfile);
fclose($myfile2);
?>
For Groovy,
println(new File('/var/www/html/euc_jp.txt').getText('EUC-JP').getBytes("EUC-JP"))
println(new File('/var/www/html/utf_8.txt').getText('UTF-8').getBytes("UTF-8"))
This is my obstacle so far. First, the byte representation differs between the two languages. If this is not a limitation of Groovy (and Java 8), how do I produce the same byte arrangement that PHP produces? Second, what is the equivalent of the native PHP function mb_convert_encoding(), so that I can convert a string from one encoding to another even when it contains characters that are not supported by both encodings?
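For reference, one possible JVM-side counterpart to mb_convert_encoding() is to decode the bytes with the source charset and re-encode them with the target charset. The sketch below is plain Java, callable from Groovy. The names EncodingConvert, convert and md5Hex are hypothetical, and it assumes the JDK in use ships the EUC-JP charset. It is an illustration under those assumptions, not a guaranteed byte-for-byte match for PHP's mbstring, whose conversion tables may differ for some characters.

```java
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class EncodingConvert {
    // Decode the bytes with the source charset, then re-encode with the target charset.
    // String.getBytes(Charset) substitutes the target charset's replacement bytes
    // for characters it cannot map.
    static byte[] convert(byte[] input, String from, String to) {
        String text = new String(input, Charset.forName(from));
        return text.getBytes(Charset.forName(to));
    }

    // Lowercase hex MD5 over the raw bytes, the same output format as PHP's md5().
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b & 0xFF));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Hiragana "a" (U+3042) encoded as UTF-8, used here as sample input.
        byte[] utf8 = {(byte) 0xE3, (byte) 0x81, (byte) 0x82};
        byte[] eucJp = convert(utf8, "UTF-8", "EUC-JP");
        System.out.println(md5Hex(eucJp));
    }
}
```

The hash is computed over the converted byte array rather than over a String, which mirrors what md5($encodedToEucJp) does in the PHP sample. If the two sides still disagree for a particular character, comparing the unsigned byte dumps of the converted output is a reasonable first check.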