doumi1099 2012-12-19 20:28
Accepted

Two question marks in diamonds instead of an upside-down exclamation mark

I'm processing some text files containing Spanish text in PHP, using eclipse-php on Mac OS X 10. The encoding is set to UTF-8, and everything works except for one small problem: every ¡ (upside-down exclamation mark) is replaced with � � (two black diamonds containing question marks, separated by a space) in the output text file. None of the other characters (¿ñáéíóúü) give me any trouble. I had a similar problem on my Windows Vista machine (it replaced every ¡ with é). Any ideas why this one character breaks in UTF-8, and how I can fix it?

Here's the code I'm using. I didn't include it originally because it is so long and I'm not sure where the problem lies. As you can see I've tried to incorporate shiplu.mokadd.im's suggestion, but I'm still getting the � �.

<?php

ini_set("auto_detect_line_endings", true);

$sourceH = fopen("MainInput.txt", "r") or die("Can't open MainInput.txt.");
$sourceData = array();
$tracker = 0;

while (!feof($sourceH)){
    $sourceData[$tracker] = fgets($sourceH);
    $sourceData[$tracker] = preg_split("/\t/", $sourceData[$tracker]);
    $tracker++;
}

$i = $tracker--;

$chars_hi = 'ABCDEFGHIJKLMNÑOPQRSTUVWXYZÁÉÍÓÚÜ';
$chars_lo = 'abcdefghijklmnñopqrstuvwxyzáéíóúü';
$characters = "ABCDEFGHIJKLMNÑOPQRSTUVWXYZÁÉÍÓÚÜabcdefghijklmnñopqrstuvwxyzáéíóúü1234567890'-";

function lowercase($s) {
    global $chars_hi, $chars_lo;
    return strtr($s, $chars_hi, $chars_lo);
}

$myNewFile = "Processing/Prepared.txt";
$fhNew = fopen($myNewFile, 'w') or die("can't open Prepared\n");
$newText = "";

for ($n = 1; $n < $i; $n++) {

    $myFile = $sourceData[$n][1];
    $fh = fopen($myFile,'r') or die("can't open file ".$sourceData[$n][1]."\n");
    fwrite($fhNew, "\n\nStartFile ".$sourceData[$n][0]."\n\n");
    $position = 0;
    $speaker = ">>u";

    while (!feof($fh)){
        $newText = fgets($fh);
        $isLast = false;
        $isFirst = true;
        $new = "";
        if (mb_strpos($newText, ">> i") !== false or mb_strpos($newText, ">>i") !== false or mb_strpos($newText, ">i") !== false or mb_strpos($newText, "> i") !== false) {
            $speaker = ">>i";
        }
        elseif (mb_strpos($newText, ">> s") !== false or mb_strpos($newText, ">>s") !== false or mb_strpos($newText, ">s") !== false or mb_strpos($newText, "> s") !== false) {
            $speaker = ">>s";
        }
        for ($in = 0; $in < mb_strlen($newText); $in++) {
            if (mb_strpos($characters, $newText[$in]) !== false) {
                if ($isFirst == true) {
                    $new = $new." ".$newText[$in];
                    $isFirst = false;
                    $isLast = true;
                }
                else {
                    $new = $new.$newText[$in];
                }
            }
            elseif ($isLast == true) {
                $isLast = false;
                $isFirst = true;
                $new = $new."   ".($in + $position)."   ".$speaker."    ".$newText[$in];
            }
            else {
                $new = $new.$newText[$in];
            }
        }
        $position += mb_strlen($newText);
        $newText = $new;
        $newText = lowercase($newText);
        fwrite($fhNew, $newText."\n");
    }
    fclose($fh);
}
fclose($fhNew);

?>

1 answer

  • douguan1887 2012-12-20 01:30

    You cannot do stuff like this:

    $new = $new." ".$newText[$in];
    

    Specifically, $newText[$in]. That does byte-level access, but in UTF-8 a character can consist of multiple bytes. So when you hack and slash bytes like this, you separate UTF-8 bytes that belong together, resulting in �.

    For example, run this PHP script (Saved in text editor as UTF-8):

    <?php
    header("Content-Type: text/html; charset=UTF-8");
    $text = "ä";
    echo $text[0] . " " . $text[1];
    

    The result is � �.

    You must fix all of your code where you are doing [] access on strings. You can replace $string[$i] with mb_substr( $string, $i, 1, "UTF-8" );
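
    As a rough sketch of that replacement, the loop below walks a string one character at a time using mb_strlen and mb_substr. The helper name utf8_chars is made up for illustration and is not part of the original code; it assumes the mbstring extension is loaded, the file is saved as UTF-8, and the input is valid UTF-8.

```php
<?php
// Hypothetical helper (not from the original code): split a UTF-8 string
// into an array of whole characters instead of single bytes.
function utf8_chars($string) {
    $chars = array();
    $length = mb_strlen($string, "UTF-8"); // length in characters, not bytes
    for ($i = 0; $i < $length; $i++) {
        // mb_substr counts in characters, so multi-byte sequences stay intact
        $chars[] = mb_substr($string, $i, 1, "UTF-8");
    }
    return $chars;
}

$text = "¡Hola!";
echo strlen($text), "\n";                    // 7: byte count (¡ takes two bytes)
echo mb_strlen($text, "UTF-8"), "\n";        // 6: character count
echo implode(" ", utf8_chars($text)), "\n";  // ¡ H o l a !
```

    The gap between the two counts is why indexing with [] lands in the middle of ¡.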

    Also, have you set mb_internal_encoding to "UTF-8"? Otherwise it will most likely not default to UTF-8 when you call mb_* functions without explicit encoding.
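
    A minimal sketch of setting it once at the top of the script, so that later mb_* calls without an encoding argument use UTF-8. The sample string is made up for illustration, and the mbstring extension is assumed to be available.

```php
<?php
// Set the default encoding used by mb_* functions when none is passed.
mb_internal_encoding("UTF-8");

$line = "¡Qué día!"; // illustrative sample text, saved as UTF-8

// No explicit encoding argument: both calls fall back to the internal encoding.
echo mb_strlen($line), "\n";        // 9
echo mb_substr($line, 0, 1), "\n";  // ¡
echo mb_internal_encoding(), "\n";  // UTF-8
```

    Passing "UTF-8" explicitly to each call, as in the mb_substr suggestion above, works too and does not depend on global state.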

    I also recommend using something like mb_convert_case($str, MB_CASE_LOWER, "UTF-8"); over your custom lowercase function.
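
    The three-argument form of strtr maps single bytes to single bytes, so feeding it multi-byte characters such as Ñ or Á does not translate them as whole characters. A possible drop-in replacement might look like the sketch below; the function name lowercase_utf8 is hypothetical, and the mbstring extension is assumed.

```php
<?php
// Hypothetical replacement for the custom lowercase() helper:
// lowercases whole UTF-8 characters instead of translating bytes.
function lowercase_utf8($s) {
    return mb_convert_case($s, MB_CASE_LOWER, "UTF-8");
}

echo lowercase_utf8("¡ÑANDÚ Y PINGÜINO!"), "\n"; // ¡ñandú y pingüino!
```

    This also removes the need to maintain the $chars_hi and $chars_lo tables by hand.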

