dqfkd82886 2012-09-12 14:47
浏览 53
已采纳

如果它是变音符号,fgetcsv正在吃字符串的第一个字母

I am importing contents from an Excel-generated CSV-file into an XML document like:

$csv = fopen($csvfile, r);
$words = array();

while (($pair = fgetcsv($csv)) !== FALSE) {
    array_push($words, array('en' => $pair[0], 'de' => $pair[1]));
}

The inserted data are English/German expressions.

I insert these values into an XML structure and output the XML as following:

$dictionary = new SimpleXMLElement('<dictionary></dictionary>');
//do things
$dom = dom_import_simplexml($dictionary) -> ownerDocument;
$dom -> formatOutput = true;

header('Content-encoding: utf-8'); //<3 UTF-8
header('Content-type: text/xml'); //Headers set to correct mime-type for XML output!!!!

echo $dom -> saveXML();

This is working fine, yet I am encountering one really strange problem. When the first letter of a String is an Umlaut (like in Österreich or Ägypten) the character will be omitted, resulting in gypten or sterreich. If the Umlaut is in the middle of the String (Russische Föderation) it gets transferred correctly. Same goes for things like ß or é or whatever.

All files are UTF-8 encoded and served in UTF-8.

This seems rather strange and bug-like to me, yet maybe I am missing something, there's a lot of smart people around here.

  • 写回答

5条回答 默认 最新

  • doutan2111 2012-09-13 08:20
    关注

    Ok, so this seems to be a bug in fgetcsv.

    I am now processing the CSV data on my own (a little cumbersome), but it is working and I do not have any encoding issues at all.

    This is (a not-yet-optimized version of) what I am doing:

    $rawCSV = file_get_contents($csvfile);
    
    $lines = preg_split ('/$\R?^/m', $rawCSV); //split on line breaks in all operating systems: http://stackoverflow.com/a/7498886/797194
    
    foreach ($lines as $line) {
        array_push($words, getCSVValues($line));
    }
    

    The getCSVValues is coming from here and is needed to deal with CSV lines like this (commas!):

    "I'm a string, what should I do when I need commas?",Howdy there
    

    It looks like:

    function getCSVValues($string, $separator=","){
    
        $elements = explode($separator, $string);
    
        for ($i = 0; $i < count($elements); $i++) {
            $nquotes = substr_count($elements[$i], '"');
            if ($nquotes %2 == 1) {
                for ($j = $i+1; $j < count($elements); $j++) {
                    if (substr_count($elements[$j], '"') %2 == 1) { // Look for an odd-number of quotes
                        // Put the quoted string's pieces back together again
                        array_splice($elements, $i, $j-$i+1,
                            implode($separator, array_slice($elements, $i, $j-$i+1)));
                        break;
                    }
                }
            }
            if ($nquotes > 0) {
                // Remove first and last quotes, then merge pairs of quotes
                $qstr =& $elements[$i];
                $qstr = substr_replace($qstr, '', strpos($qstr, '"'), 1);
                $qstr = substr_replace($qstr, '', strrpos($qstr, '"'), 1);
                $qstr = str_replace('""', '"', $qstr);
            }
        }
        return $elements;
    
    }
    

    Quite a bit of a workaround, but it seems to work fine.

    EDIT:

    There's a also a filed bug for this, apparently this depends on the locale settings.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(4条)

报告相同问题?

悬赏问题

  • ¥15 求MCSCANX 帮助
  • ¥15 机器学习训练相关模型
  • ¥15 Todesk 远程写代码 anaconda jupyter python3
  • ¥15 我的R语言提示去除连锁不平衡时clump_data报错,图片以下所示,卡了好几天了,苦恼不知道如何解决,有人帮我看看怎么解决吗?
  • ¥15 在获取boss直聘的聊天的时候只能获取到前40条聊天数据
  • ¥20 关于URL获取的参数,无法执行二选一查询
  • ¥15 液位控制,当液位超过高限时常开触点59闭合,直到液位低于低限时,断开
  • ¥15 marlin编译错误,如何解决?
  • ¥15 VUE项目怎么运行,系统打不开
  • ¥50 pointpillars等目标检测算法怎么融合注意力机制