donv29560 2016-01-21 21:25
Accepted

Restoring normal characters from ISO 8859-1 octal escapes

I'm currently converting our old project database into a new format in a new database. Some of the old data was probably escaped by a smartphone app. An entry now looks like this:

Tak hur\341 v posteli po pr\341ci a jde se sp\355nkat

The real entry should look like this:

Tak hurá v posteli po práci a jde se spínkat

There are also entries like

Som nen\\355 ja len chodiaca kapuc\\341 pra\\u0161iva ignorujuca

which don't look like ISO 8859-1, especially the \\u0161 part.

Is there a PHP function I can use to convert this back to a readable version? Thanks!

1 answer

  • duanjun7801 2016-01-21 23:37

    Simple workaround:

    The first string contains only octal ISO-8859-1 escapes, while the second has doubled backslashes and mixes ISO-8859-1 octal escapes with UTF-16 \u escapes (why they are mixed is the real question). The code below takes the octal codes, converts each one to its byte, and then encodes the string into UTF-8. The UTF-16 codes are already in hex, so they only need to be packed to binary and encoded into UTF-8.
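To make those steps concrete, here is a minimal sketch that decodes one escape of each kind by hand. It is not the answer's code: it assumes the mbstring extension is loaded, uses `octdec`/`chr` for the octal case, and the variable names are only illustrative.

```php
<?php
// Octal escape \341: octal 341 is 0xE1, which is "á" in ISO-8859-1.
$byte = chr(octdec('341'));                        // one raw byte, 0xE1
$fromOctal = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

// \u0161: the four hex digits form one UTF-16 code unit.
$unit = pack('H*', '0161');                        // two raw bytes, 0x01 0x61 (big-endian)
$fromUnicode = mb_convert_encoding($unit, 'UTF-8', 'UTF-16BE');

echo $fromOctal, ' ', $fromUnicode, "\n";          // prints "á š"
```

Naming `UTF-16BE` explicitly avoids relying on how a bare `UTF-16` label is interpreted when the input has no byte order mark.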

    For future reference on charsets: http://www.fileformat.info/info/charset/index.htm

    <?php
            // Inside PHP double quotes, "\341" is already the raw byte 0xE1,
            // while "\\355" is a literal backslash followed by the digits 355.
            $string = "Tak hur\341 v posteli po pr\341ci a jde se sp\355nkat";
            $string2 = "Som nen\\355 ja len chodiaca kapuc\\341 pra\\u0161iva ignorujuca";
    
            print decode_str($string2)."<br>";
            print decode_str($string);
    
    
            // Decode the octal escapes first, then the \uXXXX escapes.
            function decode_str($string){
                return utf16_to_utf8(iso_to_utf8($string));
            }
    
            // Replace each \NNN octal escape with its ISO-8859-1 byte,
            // then convert the whole string from ISO-8859-1 to UTF-8.
            function iso_to_utf8($string){
                // [0-3][0-7]{2} limits the match to valid octal byte values (000-377).
                // chr(octdec()) also handles values below 0x10, which pack("H*") would pad incorrectly.
                $string = preg_replace_callback('#\\\\([0-3][0-7]{2})#', function($m){
                    return chr(octdec($m[1]));
                }, $string);
                return mb_convert_encoding($string,"UTF-8","ISO-8859-1");
            }
    
            // Replace each \uXXXX escape with the UTF-8 encoding of that UTF-16 code unit.
            function utf16_to_utf8($string){
                // The hex digits are packed to two bytes and read as big-endian UTF-16.
                return preg_replace_callback('#\\\\u([0-9a-fA-F]{4})#', function($m){
                    return mb_convert_encoding(pack("H*",$m[1]),"UTF-8","UTF-16BE");
                }, $string);
            }
    
        ?>
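The code above treats every \uXXXX escape as a standalone code unit, so a character outside the Basic Multilingual Plane, which arrives as two escapes forming a surrogate pair, would not decode correctly. If the escapes are JSON-style (an assumption, since the source of the data is unknown), one possible sketch is to let `json_decode` handle each run of escapes. The helper name `decode_unicode_escapes` is hypothetical, and the sketch assumes the text holds a single literal backslash before each `u`, as in the PHP literals above.

```php
<?php
// Hypothetical helper: decode JSON-style \uXXXX escapes, including surrogate pairs.
function decode_unicode_escapes(string $text): string
{
    $result = preg_replace_callback(
        '/(?:\\\\u[0-9a-fA-F]{4})+/',
        function (array $m): string {
            // Wrap the run of escapes in quotes so json_decode sees a JSON string.
            $decoded = json_decode('"' . $m[0] . '"');
            // Keep the original text if the run is invalid (for example a lone surrogate).
            return is_string($decoded) ? $decoded : $m[0];
        },
        $text
    );

    // preg_replace_callback returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

echo decode_unicode_escapes('pra\u0161iva'), "\n";         // prints "prašiva"
echo decode_unicode_escapes('smile \ud83d\ude00'), "\n";   // decodes the pair to U+1F600
```

Octal escapes would still need to be decoded separately, and before this step if the ISO-8859-1 to UTF-8 conversion is applied to the whole string.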
    