douhao3562 2011-03-21 11:44

How do I handle different encodings of xls files in PHP?

I'm developing a PHP script that parses data from xls files, using the phpexcelreader library. It mostly works, but I stumbled upon a strange problem: some files are parsed incorrectly. It looks like xls files may use different character encodings internally. At least, when I pipe the output of my script through iconv -f cp1251 -t utf8, the strings come out correct.

Phpexcelreader has an option for specifying the output encoding, but it looks like it lacks the ability to detect the input encoding. Any ideas?


  • douwo1862 2011-03-21 12:16

    The _defaultEncoding property of the workbook object can be set to the charset used by the Excel file; the reader then uses it when converting strings to UTF-16LE. The reader makes no attempt to identify the file's internal charset itself, though.

    You can add that detection yourself. Define

    define('SPREADSHEET_EXCEL_READER_TYPE_CODEPAGE',  0x0042);
    

    among the other SPREADSHEET_EXCEL_READER_TYPE definitions, then modify the switch statement starting at line 464 to include a case for SPREADSHEET_EXCEL_READER_TYPE_CODEPAGE. The logic for this case needs to be something like:

    $length = $this->_GetInt2d($this->_data, $pos + 2);
    $recordData = substr($this->_data, $pos + 4, $length);
    
    // move stream pointer to next record
    $pos += 4 + $length;
    
    // offset: 0; size: 2; code page identifier
    $codepage = $this->_GetInt2d($recordData, 0);
    $codepage = $this->_CodePageNumberToName($codepage);
    

    Recreate the _GetInt2d method (that seems to have been stripped from the code at some point) as

    function _GetInt2d($data, $pos)
    {
        return ord($data[$pos]) | (ord($data[$pos + 1]) << 8);
    }
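
To check the record-parsing logic outside the reader, here is a minimal standalone sketch. It assumes the usual BIFF record layout implied by the snippet above (2-byte record type, 2-byte length, then the body, all little-endian). The names getInt2d and readRecord are hypothetical helpers for illustration and are not part of phpexcelreader; the byte stream is synthetic.

```php
<?php
// Hypothetical standalone copy of the answer's _GetInt2d:
// read an unsigned 16-bit little-endian integer at $pos.
function getInt2d(string $data, int $pos): int
{
    return ord($data[$pos]) | (ord($data[$pos + 1]) << 8);
}

// Hypothetical helper: read one record (type, body) starting at $pos
// and report where the next record begins.
function readRecord(string $data, int $pos): array
{
    $type   = getInt2d($data, $pos);        // offset 0: record type
    $length = getInt2d($data, $pos + 2);    // offset 2: body length
    $body   = substr($data, $pos + 4, $length);

    return ['type' => $type, 'body' => $body, 'next' => $pos + 4 + $length];
}

// Synthetic CODEPAGE record: type 0x0042, length 2, code page 1251.
// pack('v', ...) writes unsigned 16-bit little-endian values.
$stream   = pack('vvv', 0x0042, 2, 1251);
$record   = readRecord($stream, 0);
$codepage = getInt2d($record['body'], 0);
```

With this input, $record['type'] is 0x0042 and $codepage is 1251, which is the number you would then pass to _CodePageNumberToName.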
    

    and create a _CodePageNumberToName method to return the codepage name from its numeric value:

    function _CodePageNumberToName($codePage = '1252')
    {
        switch ($codePage) {
            case 367:   return 'ASCII';     break;  //  ASCII
            case 437:   return 'CP437';     break;  //  OEM US
            case 720:   throw new Exception('Code page 720 not supported.');
                                            break;  //  OEM Arabic
            case 737:   return 'CP737';     break;  //  OEM Greek
            case 775:   return 'CP775';     break;  //  OEM Baltic
            case 850:   return 'CP850';     break;  //  OEM Latin I
            case 852:   return 'CP852';     break;  //  OEM Latin II (Central European)
            case 855:   return 'CP855';     break;  //  OEM Cyrillic
            case 857:   return 'CP857';     break;  //  OEM Turkish
            case 858:   return 'CP858';     break;  //  OEM Multilingual Latin I with Euro
            case 860:   return 'CP860';     break;  //  OEM Portuguese
            case 861:   return 'CP861';     break;  //  OEM Icelandic
            case 862:   return 'CP862';     break;  //  OEM Hebrew
            case 863:   return 'CP863';     break;  //  OEM Canadian (French)
            case 864:   return 'CP864';     break;  //  OEM Arabic
            case 865:   return 'CP865';     break;  //  OEM Nordic
            case 866:   return 'CP866';     break;  //  OEM Cyrillic (Russian)
            case 869:   return 'CP869';     break;  //  OEM Greek (Modern)
            case 874:   return 'CP874';     break;  //  ANSI Thai
            case 932:   return 'CP932';     break;  //  ANSI Japanese Shift-JIS
            case 936:   return 'CP936';     break;  //  ANSI Chinese Simplified GBK
            case 949:   return 'CP949';     break;  //  ANSI Korean (Wansung)
            case 950:   return 'CP950';     break;  //  ANSI Chinese Traditional BIG5
            case 1200:  return 'UTF-16LE';  break;  //  UTF-16 (BIFF8)
            case 1250:  return 'CP1250';    break;  //  ANSI Latin II (Central European)
            case 1251:  return 'CP1251';    break;  //  ANSI Cyrillic
            case 0:                                 //  CodePage is not always correctly set when the xls file was saved by Apple's Numbers program
            case 1252:  return 'CP1252';    break;  //  ANSI Latin I (BIFF4-BIFF7)
            case 1253:  return 'CP1253';    break;  //  ANSI Greek
            case 1254:  return 'CP1254';    break;  //  ANSI Turkish
            case 1255:  return 'CP1255';    break;  //  ANSI Hebrew
            case 1256:  return 'CP1256';    break;  //  ANSI Arabic
            case 1257:  return 'CP1257';    break;  //  ANSI Baltic
            case 1258:  return 'CP1258';    break;  //  ANSI Vietnamese
            case 1361:  return 'CP1361';    break;  //  ANSI Korean (Johab)
            case 10000: return 'MAC';       break;  //  Apple Roman
            case 32768: return 'MAC';       break;  //  Apple Roman
            case 32769: throw new Exception('Code page 32769 not supported.');
                                            break;  //  ANSI Latin I (BIFF2-BIFF3)
            case 65001: return 'UTF-8';     break;  //  Unicode (UTF-8)
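        }
        // No case matched: fail loudly instead of silently returning null
        throw new Exception('Unknown codepage: ' . $codePage);
    }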
    

    Then store the returned value in $_defaultEncoding.
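
To see why the detected name matters, here is a small sketch of what a conversion with the right source charset looks like. It is not the reader's own conversion code; toUtf8 is a hypothetical helper, and it assumes PHP's iconv extension is available. The sample bytes are the CP1251 encoding of a Russian word, matching the cp1251 case from the question.

```php
<?php
// Hypothetical helper: convert raw bytes to UTF-8 using an encoding
// name such as the one returned by _CodePageNumberToName.
function toUtf8(string $raw, string $sourceEncoding): string
{
    $converted = iconv($sourceEncoding, 'UTF-8', $raw);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $sourceEncoding");
    }
    return $converted;
}

// "Привет" encoded as CP1251, one byte per letter.
$raw  = "\xCF\xF0\xE8\xE2\xE5\xF2";
$utf8 = toUtf8($raw, 'CP1251');

echo $utf8, PHP_EOL;
```

Decoding the same bytes as CP1252 would yield accented Latin letters instead, which is the kind of garbage the question describes.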

    Alternatively, switch to an Excel reader that handles the codepage correctly in the first place.
