dqfkd82886 2012-09-12 14:47

fgetcsv is eating the first letter of a string if it is an umlaut

I am importing contents from an Excel-generated CSV-file into an XML document like:

$csv = fopen($csvfile, 'r'); // the mode must be a quoted string
$words = array();

while (($pair = fgetcsv($csv)) !== FALSE) {
    array_push($words, array('en' => $pair[0], 'de' => $pair[1]));
}

The inserted data are English/German expressions.

I insert these values into an XML structure and output the XML as follows:

$dictionary = new SimpleXMLElement('<dictionary></dictionary>');
//do things
$dom = dom_import_simplexml($dictionary) -> ownerDocument;
$dom -> formatOutput = true;

header('Content-Type: text/xml; charset=utf-8'); // XML MIME type; the charset belongs here, not in Content-Encoding

echo $dom -> saveXML();
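
For illustration only, here is a minimal sketch of how the word pairs could be written into the dictionary before it is serialized. The element names `entry`, `en` and `de` and the function name `buildDictionaryXml` are assumptions; the original does not show this step. The sketch declares UTF-8 in the XML prolog and assigns text through properties, which escapes characters such as `&`.

```php
<?php
// Hypothetical helper: the element names "entry", "en" and "de" are assumed,
// since the original "do things" step is not shown.
function buildDictionaryXml(array $words)
{
    $dictionary = new SimpleXMLElement(
        '<?xml version="1.0" encoding="UTF-8"?><dictionary></dictionary>'
    );

    foreach ($words as $word) {
        $entry = $dictionary->addChild('entry');
        // Property assignment creates the child element and escapes
        // special characters like "&" in the text.
        $entry->en = $word['en'];
        $entry->de = $word['de'];
    }

    // Same pretty-printing approach as in the question.
    $dom = dom_import_simplexml($dictionary)->ownerDocument;
    $dom->formatOutput = true;

    return $dom->saveXML();
}
```

Calling `buildDictionaryXml($words)` with the array built from the CSV would return the XML string that is then echoed.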

This works fine, but I am encountering one really strange problem. When the first letter of a string is an umlaut (as in Österreich or Ägypten), that character is omitted, resulting in gypten or sterreich. If the umlaut is in the middle of the string (Russische Föderation), it is transferred correctly. The same goes for other non-ASCII characters such as ß or é.

All files are UTF-8 encoded and served in UTF-8.
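
Because Excel can export CSV in a legacy code page rather than UTF-8, it may be worth checking the claim above against the raw bytes. The following is a small sketch under that assumption; the helper names `isValidUtf8` and `stripUtf8Bom` are made up for this example.

```php
<?php
// Hypothetical helper: returns true if $bytes is well-formed UTF-8.
// An empty pattern with the /u modifier makes PCRE validate the subject.
function isValidUtf8($bytes)
{
    return preg_match('//u', $bytes) === 1;
}

// Hypothetical helper: removes a leading UTF-8 byte order mark, if present.
function stripUtf8Bom($bytes)
{
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return substr($bytes, 3);
    }
    return $bytes;
}
```

For example, `isValidUtf8(file_get_contents($csvfile))` returning false would suggest the file is in a single-byte encoding and needs converting before import.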

This seems rather strange and bug-like to me, but maybe I am missing something; there are a lot of smart people around here.


  • doutan2111 2012-09-13 08:20

    Ok, so this seems to be a bug in fgetcsv.

    I am now processing the CSV data on my own (a little cumbersome), but it is working and I do not have any encoding issues at all.

    This is (a not-yet-optimized version of) what I am doing:

    $rawCSV = file_get_contents($csvfile);
    
    $lines = preg_split ('/$\R?^/m', $rawCSV); //split on line breaks in all operating systems: http://stackoverflow.com/a/7498886/797194
    
    foreach ($lines as $line) {
        array_push($words, getCSVValues($line));
    }
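
How a `$`/`^` based pattern treats `\r` depends on PCRE's newline convention, so a more explicit alternative is sketched here. It assumes, like the approach above, that no field contains an embedded line break; the function name `splitLines` is hypothetical.

```php
<?php
// Hypothetical helper: split raw text on CRLF, CR or LF line endings.
// The alternation is byte-based, so multi-byte UTF-8 characters are untouched.
function splitLines($raw)
{
    // PREG_SPLIT_NO_EMPTY drops blank lines, including the empty string
    // that a trailing line break would otherwise produce.
    return preg_split('/\r\n|\r|\n/', $raw, -1, PREG_SPLIT_NO_EMPTY);
}
```

Each returned line has no trailing line-break characters, so it can be passed straight to a per-line parser.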
    

    The getCSVValues function comes from here and is needed to handle CSV lines like this one, where a quoted field itself contains commas:

    "I'm a string, what should I do when I need commas?",Howdy there
    

    It looks like:

    function getCSVValues($string, $separator=","){
    
        $elements = explode($separator, $string);
    
        for ($i = 0; $i < count($elements); $i++) {
            $nquotes = substr_count($elements[$i], '"');
            if ($nquotes %2 == 1) {
                for ($j = $i+1; $j < count($elements); $j++) {
                    if (substr_count($elements[$j], '"') %2 == 1) { // Look for an odd-number of quotes
                        // Put the quoted string's pieces back together again
                        array_splice($elements, $i, $j-$i+1,
                            implode($separator, array_slice($elements, $i, $j-$i+1)));
                        break;
                    }
                }
            }
            if ($nquotes > 0) {
                // Remove first and last quotes, then merge pairs of quotes
                $qstr =& $elements[$i];
                $qstr = substr_replace($qstr, '', strpos($qstr, '"'), 1);
                $qstr = substr_replace($qstr, '', strrpos($qstr, '"'), 1);
                $qstr = str_replace('""', '"', $qstr);
            }
        }
        return $elements;
    
    }
    

    Quite a bit of a workaround, but it seems to work fine.

    EDIT:

    There is also a filed bug for this; apparently the behavior depends on the locale settings.
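
Given the locale dependence, one thing to try before hand-rolling a parser is switching the character-type locale to a UTF-8 one ahead of calling fgetcsv. This is a sketch, not a verified fix for every setup: it assumes at least one of the candidate locales is installed on the server, and the names `useUtf8Ctype` and `readWordPairs` are made up for this example.

```php
<?php
// Hypothetical helper: try several UTF-8 locales for LC_CTYPE.
// setlocale() accepts an array and returns the first locale it could set,
// or false if none of them is available on this system.
function useUtf8Ctype(array $candidates = ['de_DE.UTF-8', 'en_US.UTF-8', 'C.UTF-8'])
{
    return setlocale(LC_CTYPE, $candidates);
}

// Hypothetical helper: read English/German pairs as in the question.
function readWordPairs($csvfile)
{
    useUtf8Ctype();

    $words = [];
    $csv = fopen($csvfile, 'r');
    if ($csv === false) {
        return $words;
    }

    // Delimiter, enclosure and escape are passed explicitly (the usual defaults).
    while (($pair = fgetcsv($csv, 0, ',', '"', '\\')) !== false) {
        if (count($pair) < 2) {
            continue; // skip blank or incomplete lines
        }
        $words[] = ['en' => $pair[0], 'de' => $pair[1]];
    }

    fclose($csv);
    return $words;
}
```

If `useUtf8Ctype()` returns false, none of the candidates exists on the machine and the locale is left unchanged, in which case the manual parsing above remains an option.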
