To qualify my answer (to the downvoter):
Q: I have heard that UTF-8 does not support some Japanese characters. Is this correct?
A: There is a lot of misinformation floating around about the support
of Chinese, Japanese and Korean (CJK) characters. The Unicode Standard
supports all of the CJK characters from JIS X 0208, JIS X 0212, JIS X
0221, or JIS X 0213, for example, and many more. This is true no
matter which encoding form of Unicode is used: UTF-8, UTF-16, or
UTF-32.
Unicode supports over 80,000 CJK characters right now, and work is
underway to encode further additions. The International Standard
ISO/IEC 10646 and the Unicode Standard are completely synchronized in
repertoire and content. And that means that Unicode has the same
repertoire as GB 18030, since that also is synchronized with ISO 10646
— although with a different ordering and byte format.
From: The Unicode Consortium.
My Answer:
Rather than strpos, use mb_strpos (or mb_stripos for a case-insensitive search) from the PHP Multibyte String functions to find and replace characters. These functions work on characters rather than bytes, which should help your script detect and translate the non-Latin characters.
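To illustrate the difference, here is a minimal sketch. The file name is a made-up example, and it assumes the string is UTF-8 and the mbstring extension is loaded:

```php
<?php
// Hypothetical file name; the \u{...} escapes (PHP 7+) spell "日本語.TXT"
// without depending on the encoding this source file is saved in.
$name = "\u{65E5}\u{672C}\u{8A9E}.TXT";

// strpos counts bytes: each of these kanji takes 3 bytes in UTF-8.
$bytePos = strpos($name, '.');

// mb_strpos counts characters in the encoding you pass.
$charPos = mb_strpos($name, '.', 0, 'UTF-8');

// mb_stripos does the same, but ignores case.
$extPos = mb_stripos($name, '.txt', 0, 'UTF-8');

// Cut by character offset rather than byte offset.
$base = mb_substr($name, 0, $charPos, 'UTF-8');
```

Here strpos reports 9 while the mb_ functions report 3, so mixing byte offsets with character-based functions would cut the name in the wrong place.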
If the uploaded file name ($_FILES['var']['name']) is already incorrect in the PHP script (from output such as print_r($_FILES)) then you need to ensure you are correctly encoding the HTML form with accept-charset='UTF-8' (or SJIS, etc.). I would hope you're already well ahead of me on this.
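If you cannot control the form, one possible fallback is to check the name on arrival and convert it. This is only a sketch: normalize_upload_name is a hypothetical helper (not part of PHP), and the 'SJIS' fallback is an assumption about what the browser sent.

```php
<?php
// Hypothetical helper: return the uploaded name as UTF-8.
// $fallback is an assumption about the encoding a misconfigured form sends.
function normalize_upload_name(string $name, string $fallback = 'SJIS'): string
{
    // Valid UTF-8 (including plain ASCII) is passed through untouched.
    if (mb_check_encoding($name, 'UTF-8')) {
        return $name;
    }
    // Otherwise treat the bytes as the fallback encoding and convert.
    return mb_convert_encoding($name, 'UTF-8', $fallback);
}

// Usage in an upload handler (assumes the form field is named 'var'):
// $safeName = normalize_upload_name($_FILES['var']['name']);
```

The validity check is a heuristic, so it is still better to fix the form's encoding at the source where you can.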
It may also be advisable to set a few defaults, again using the PHP mb_ functions. Add these at the top of your PHP page:
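mb_internal_encoding('UTF-8'); // or whatever character set works for you
mb_http_output('SJIS'); // encoding used when output passes through mb_output_handler
// mb_http_input() only reports the detected input encoding; it cannot set it.
// Convert incoming values yourself, e.g. with mb_convert_encoding().
mb_regex_encoding('UTF-8');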
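As a rough illustration of what the internal encoding setting changes, the sketch below assumes UTF-8 input and uses a made-up sample string:

```php
<?php
mb_internal_encoding('UTF-8');

// Sample string "日本語", written with escapes so the file's own encoding does not matter.
$s = "\u{65E5}\u{672C}\u{8A9E}";

// With no encoding argument, mb_ functions fall back to the internal encoding.
$chars = mb_strlen($s);

// strlen always counts bytes.
$bytes = strlen($s);

// The source encoding also defaults to the internal encoding here.
$sjis = mb_convert_encoding($s, 'SJIS');
$sjisBytes = strlen($sjis);
```

The same three characters take 9 bytes in UTF-8 and 6 bytes in Shift JIS, which is why byte-based functions give different answers depending on the encoding.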
Out of interest:
http://www.unicode.org/reports/tr37/
and
http://david.latapie.name/blog/shift-jis-utf-8/