dpdhsq0783
2013-04-30 20:11

PHP UTF-8 decoding of XML returns question marks

I have some problems using XML. I know this is a common question, but the answers I found didn't fix my problem. When I add é, ä, or another special character to my XML file with PHP's DOMDocument, it saves the é as xE9 and the ä as xE4. I don't know whether that is OK, but when I display the output it shows question marks in those places. I have tried a lot: removing and adding the encoding in the XML header in the PHP DOMDocument, using file_get_contents together with PHP's utf8_decode to get the XML, and using ISO instead, but nothing solved my problem. Instead I sometimes got PHP XML parse errors. I must be doing something wrong, but what, and how can I solve it? My XML file looks like this (the xE9 and the xE4 have black backgrounds):

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <row id="1">
    <question>blah</question>
    <answer>blah</answer>
  </row>
  <row id="2">
    <question>xE9</question>
    <answer>xE4</answer>
  </row>
</root>

And here is part of my PHP XML class:

function __construct($filePath) {
    $this->file = $filePath;
    $this->label = array('Vraag', 'Antwoord');
    $xmlStr = file_get_contents($filePath);
    $xmlStr = utf8_decode($xmlStr);
    $this->xmlDoc = new DOMDocument('1.0', 'UTF-8');
    $this->xmlDoc->preserveWhiteSpace = false;
    $this->xmlDoc->formatOutput = true;
    //$this->xmlDoc->load($filePath);   
    $this->xmlDoc->loadXML($xmlStr);
}       

This is the function that adds a new row:

//creates new xml row and saves it in xml file
function addNewRow($question, $answer) {
    $nextAttr = $this->getNextRowId();
    $parentNode = $this->xmlDoc->documentElement;
    $rowNode = $this->xmlDoc->createElement('row');
    $rowNode = $parentNode->appendChild($rowNode);
    $rowNode->setAttribute('id', $nextAttr);    
    $q = $this->xmlDoc->createElement('question');
    $q = $rowNode->appendChild($q);
    $qText = $this->xmlDoc->createTextNode($question);
    $qText = $q->appendChild($qText);
    $a = $this->xmlDoc->createElement('answer');
    $a = $rowNode->appendChild($a);
    $aText = $this->xmlDoc->createTextNode($answer);
    $aText = $a->appendChild($aText);
    $this->xmlDoc->save($this->file);
}

Everything works fine until I add special characters. Those are shown as question marks.

1 answer

  • doubo4824 2013-05-01 23:32
    Accepted

    Okay, the following is a bit rough and verbose, especially as you have already tried so much. Try to keep fresh eyes, and bear in mind that with encodings a single small mistake is often enough to break everything. That is why it is important to understand properly which mechanics are at work here.

    I will try to address some of the mechanics at work in PHP's DOMDocument. You might find this interesting or daunting, and in the end the solution may be very simple and not even require changing your PHP code. I'd like to cover it anyway, because it is not well documented on Stack Overflow or in the PHP manual, and more reference material helps.

    By default, XML is in UTF-8. UTF-8 is pretty much the perfect choice for the internet nowadays. That is not true in every case, but generally it is a safe bet. So XML on its own, with its default encoding UTF-8, is perfectly fine.

    What does this mean for DOMDocument? Just that DOMDocument uses this encoding by default and we do not need to care about it. Here is a simple demonstration, with the output shown in the comment:

    $doc = new DOMDocument();
    $doc->save('php://output');
    # <?xml version="1.0"?>
    

    This very short example shows the default UTF-8 encoding that PHP uses for DOMDocument. Even though the document does not contain a root node yet, it already signals the default XML encoding, UTF-8, by not specifying one in the XML declaration: <?xml version="1.0"?>.

    So you might say "but I want to state the encoding explicitly", and sure you can. This is what the encoding parameter of DOMDocument is for when you call the constructor:

    $doc = new DOMDocument('1.0', 'UTF-8');
                                   #####  Encoding Parameter
    $doc->save('php://output');
    # <?xml version="1.0" encoding="UTF-8"?>
    

    As this shows, whatever we pass as the first (version) and second (encoding) parameter is written out. So yes, we can do things that are not allowed. But what is allowed in the XML declaration? In practice the XML version is 1.0 (XML 1.1 exists but is rarely used), so the version parameter should be 1.0. And what is allowed for the encoding? The XML specification refers to the IANA character sets; in short, it should (not must) be one of these common ones: UTF-8, UTF-16, ISO-10646-UCS-2, ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS, EUC-JP. Okay wow, this is already a long list.

    So let's take a look at what PHP's DOMDocument allows us in practice:

    $doc = new DOMDocument('♥♥ love, hugs and kisses ♥♥', 'UTF-8');
    $doc->save('php://output');
    # <?xml version="♥♥ love, hugs and kisses ♥♥" encoding="UTF-8"?>
    

    The encoding works as expected and the version is purely cosmetic, but it shows that Unicode characters encoded as UTF-8 are in use. Now let's change the encoding to something different:

    $doc = new DOMDocument('♥♥ love, hugs and kisses ♥♥', 'ISO-8859-1');
    $doc->save('php://output');
    # <?xml version="&#9829;&#9829; love, hugs and kisses &#9829;&#9829;" encoding="ISO-8859-1"?>
    

    Because the Unicode hearts have no place in ISO-8859-1, they are replaced with the corresponding numeric character reference (&#9829;). And what happens if we add an ISO-8859-1 character such as ö (the binary string "\xF6" in PHP) directly in there?

    $doc = new DOMDocument("♥♥ l\xF6ve, hugs and kisses ♥♥", 'ISO-8859-1');
    $doc->save('php://output');
    # Warning: DOMDocument::save(): output conversion failed due to conv error, 
    #          bytes 0xF6 0x76 0x65 0x2C
    #                ^^^^  |    |    |
    #                "ö"   v    e   ","
    

    This does not work. DOMDocument tells us that the input we provided cannot be converted into ISO-8859-1 output. This is expected: DOMDocument expects all input to be UTF-8. So let's take the ö from Unicode this time:

    $doc = new DOMDocument('♥♥ löve, hugs and kisses ♥♥', 'ISO-8859-1');
    $doc->save('php://output');
    # <?xml version="&#9829;&#9829; l�ve, hugs and kisses &#9829;&#9829;" encoding="ISO-8859-1"?>
    

    This now looks fine despite the question mark in a diamond. The display on my computer uses UTF-8, so it cannot show the ISO-8859-1 byte for ö. My display therefore replaces it with �, the Unicode 'REPLACEMENT CHARACTER' (U+FFFD). The output itself is correct: the "ö" now works.
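
    Because a terminal or editor may hide what was actually written, it can help to look at the raw bytes instead. The following is only a minimal sketch: the element name and the regular expression are chosen for illustration, and it assumes PHP 7 or later for the "\u{...}" escape.

```php
<?php
// Build a small ISO-8859-1 document from UTF-8 input.
$doc = new DOMDocument('1.0', 'ISO-8859-1');
$doc->appendChild($doc->createElement('question'))
    ->appendChild($doc->createTextNode("l\u{F6}ve")); // "\u{F6}" is ö as UTF-8 (bytes C3 B6)

$xml = $doc->saveXML();

// Pull out the element content (byte-wise match, no /u flag) and dump it as hex.
preg_match('~<question>(.*)</question>~s', $xml, $m);
echo bin2hex($m[1]), "\n"; // 6cf67665: the ö is the single byte F6
```

    The two UTF-8 bytes that went in come out as the single ISO-8859-1 byte F6, which is exactly what a UTF-8 display cannot render.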

    This so far makes clear that you can only pass UTF-8 encoded strings into DOMDocument and that is regardless of the XML encoding you have specified for that document.
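
    One way to enforce this rule is to validate strings before they reach the DOM. The sketch below assumes the mbstring extension is available; requireUtf8() is a hypothetical helper name, not part of DOMDocument.

```php
<?php
// Hypothetical helper: reject anything that is not valid UTF-8 before it reaches the DOM.
function requireUtf8(string $text): string
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8: ' . bin2hex($text));
    }
    return $text;
}

$doc  = new DOMDocument('1.0', 'UTF-8');
$root = $doc->appendChild($doc->createElement('root'));

// Valid UTF-8 passes through unchanged.
$root->appendChild($doc->createTextNode(requireUtf8("l\u{F6}ve")));

// A lone ISO-8859-1 byte is caught before it can end up in the file.
try {
    $root->appendChild($doc->createTextNode(requireUtf8("l\xF6ve")));
    $rejected = false;
} catch (InvalidArgumentException $e) {
    $rejected = true;
}
var_dump($rejected); // bool(true)
```

    Failing loudly at this point is a design choice: it is easier to trace than a file that silently contains a stray byte.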

    So let's break this rule with a UTF-8 document as in your question and add some non-UTF-8 text, for example in ISO-8859-1 or Windows-1252:

    $doc = new DOMDocument('1.0', 'UTF-8');
    
    $doc->appendChild($doc->createElement('root'))
        ->appendChild($doc->createElement('question'))
        ->appendChild($doc->createTextNode("l\xF6ve, hugs and kisses"));
    
    $doc->save('php://output');
    # <?xml version="1.0" encoding="UTF-8"?>
    # <root><question>l�ve, hugs and kisses</question></root>
    

    Depending on the program you use to view the output, it might not show the replacement character � but just "xF6". I would guess that is the case with your file editor.

    So this is also the solution: when you pass string data into DOMDocument, ensure it is UTF-8 encoded:

    ->appendChild($doc->createTextNode(utf8_encode("l\xF6ve, hugs and kisses")));
                                       ########### (works with ISO-8859-1 only (!))
    
    # <?xml version="1.0" encoding="UTF-8"?>
    # <root><question>löve, hugs and kisses</question></root>
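
    Since utf8_encode() only handles ISO-8859-1 and is deprecated as of PHP 8.2, a conversion that names the source encoding explicitly may be preferable. This is a sketch assuming the mbstring extension is loaded and that you actually know the source encoding; toUtf8() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: convert text from a known legacy encoding to UTF-8.
function toUtf8(string $text, string $sourceEncoding): string
{
    return mb_convert_encoding($text, 'UTF-8', $sourceEncoding);
}

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->appendChild($doc->createElement('root'))
    ->appendChild($doc->createElement('question'))
    ->appendChild($doc->createTextNode(toUtf8("l\xF6ve, hugs and kisses", 'ISO-8859-1')));

$xml = $doc->saveXML();
var_dump(strpos($xml, "l\u{F6}ve") !== false); // bool(true)

// Windows-1252 has characters ISO-8859-1 lacks, e.g. the euro sign at byte 0x80.
var_dump(toUtf8("\x80", 'Windows-1252') === "\u{20AC}"); // bool(true)
```

    Guessing the source encoding is unreliable, so this only helps when the origin of the data is known.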
    

    Or, in your case, tell the browser that your website expects UTF-8. Then you don't need to re-encode anything, because the browser already sends the data in the right encoding. The W3C has collected some useful resources on this topic that I suggest you read now.
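
    For the browser side, a minimal sketch might look like this. The form markup, the field names, and renderForm() are assumptions for illustration, since the question does not show the page that submits the data.

```php
<?php
// Hypothetical page renderer: declare UTF-8 so the browser submits form data as UTF-8.
function renderForm(): string
{
    return <<<'HTML'
<!DOCTYPE html>
<html>
<head><meta charset="UTF-8"><title>Add row</title></head>
<body>
<form method="post" accept-charset="UTF-8">
  <input name="question"> <input name="answer">
  <button type="submit">Save</button>
</form>
</body>
</html>
HTML;
}

// Send the matching HTTP header when running under a web server.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
echo renderForm();
```

    With the page served this way, the submitted values should already be UTF-8 and can go to addNewRow() unchanged. The utf8_decode() call in the question's constructor would then work against this, because it converts the file contents away from the UTF-8 that DOMDocument expects; loading the file directly with load($filePath) avoids that step.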
