duan0821 2014-05-21 11:11
浏览 62
已采纳

比较字符串与php中的重音符号

I'm having problems when comparing two strings which contains accents. This is my case:

The first string is: Master The second string is: Máster Diseño Producción

Then, I need to remove the word Máster from the second string, because it's contained in the first string.

I have created a function for clean each string:

function sanear_string($cadena)
{
    $cadena = trim($cadena);

    $cadena = str_replace(
        array('á', 'à', 'ä', 'â', 'ª', 'Á', 'À', 'Â', 'Ä'),
        array('a', 'a', 'a', 'a', 'a', 'A', 'A', 'A', 'A'),
        $cadena
    );

    $cadena = str_replace(
        array('é', 'è', 'ë', 'ê', 'É', 'È', 'Ê', 'Ë'),
        array('e', 'e', 'e', 'e', 'E', 'E', 'E', 'E'),
        $cadena
    );

    $cadena = str_replace(
        array('í', 'ì', 'ï', 'î', 'Í', 'Ì', 'Ï', 'Î'),
        array('i', 'i', 'i', 'i', 'I', 'I', 'I', 'I'),
        $cadena
    );

    $cadena = str_replace(
        array('ó', 'ò', 'ö', 'ô', 'Ó', 'Ò', 'Ö', 'Ô'),
        array('o', 'o', 'o', 'o', 'O', 'O', 'O', 'O'),
        $cadena
    );

    $cadena = str_replace(
        array('ú', 'ù', 'ü', 'û', 'Ú', 'Ù', 'Û', 'Ü'),
        array('u', 'u', 'u', 'u', 'U', 'U', 'U', 'U'),
        $cadena
    );

    $cadena = str_replace(
        array('ñ', 'Ñ', 'ç', 'Ç'),
        array('n', 'N', 'c', 'C',),
        $cadena
    );

    //Esta parte se encarga de eliminar cualquier caracter extraño
    $cadena = str_replace(
        array("\\", "¨", "º", "-", "~",
            "#", "@", "|", "!", "\"",
            "·", "$", "%", "&", "/",
            "(", ")", "?", "'", "¡",
            "¿", "[", "^", "`", "]",
            "+", "}", "{", "¨", "´",
            ">", "<", ";", ",", ":",
            ".", " "),
        '',
        $cadena
    );


    return $cadena;
}

And it helps me to the problem of accents. Now I can use strpos to compare both strings...if result is > 0 then I know that the word is contained... but I need some help more.... Thanks in advance,

  • 写回答

3条回答 默认 最新

  • dtng5978 2014-05-21 12:00
    关注

    As usual when dealing with charset problems, you need to be extra careful about the character counts between multibyte strings and plain ASCII strings.

    Your biggest problem here is that you remove some pre-defined characters from the cleaned string, rendering character count coherence between the sanitized string and the original, thus greatly hardening the removal.

    I'll use a modified version of your sanitizing function:

    function sanitize($cadena) {
        $cadena = str_replace(
            array('á', 'à', 'ä', 'â', 'ª', 'Á', 'À', 'Â', 'Ä'),
            array('a', 'a', 'a', 'a', 'a', 'A', 'A', 'A', 'A'),
            $cadena
        );
    
        $cadena = str_replace(
            array('é', 'è', 'ë', 'ê', 'É', 'È', 'Ê', 'Ë'),
            array('e', 'e', 'e', 'e', 'E', 'E', 'E', 'E'),
            $cadena
        );
    
        $cadena = str_replace(
            array('í', 'ì', 'ï', 'î', 'Í', 'Ì', 'Ï', 'Î'),
            array('i', 'i', 'i', 'i', 'I', 'I', 'I', 'I'),
            $cadena
        );
    
        $cadena = str_replace(
            array('ó', 'ò', 'ö', 'ô', 'Ó', 'Ò', 'Ö', 'Ô'),
            array('o', 'o', 'o', 'o', 'O', 'O', 'O', 'O'),
            $cadena
        );
    
        $cadena = str_replace(
            array('ú', 'ù', 'ü', 'û', 'Ú', 'Ù', 'Û', 'Ü'),
            array('u', 'u', 'u', 'u', 'U', 'U', 'U', 'U'),
            $cadena
        );
    
        $cadena = str_replace(
            array('ñ', 'Ñ', 'ç', 'Ç'),
            array('n', 'N', 'c', 'C',),
            $cadena
        );
    
    
        return strtolower($cadena);
    }
    

    The remove_word function follows:

    function remove_word($haystack , $needle) {
        // sanitize input strings
        $haystack_san = sanitize($haystack);
        $needle_san = sanitize($needle);
    
        // Check for character loss
        if (mb_strlen($haystack_san, 'UTF-8') != mb_strlen($haystack, 'UTF-8') || mb_strlen($needle_san, 'UTF-8') != mb_strlen($needle, 'UTF-8')) {
            // Here for debugging purposes. You may want to drop it in production.
            echo "Lost some chars on the way. Aborting.
    ";
            echo "     haystack: $haystack (".mb_strlen($haystack, "UTF-8").")
    ";
            echo " haystack_san: $haystack_san (".mb_strlen($haystack_san, "UTF-8").")
    ";
            echo "       needle: $needle (".mb_strlen($needle, "UTF-8").")
    ";
            echo "   needle_san: $needle_san (".mb_strlen($needle_san, "UTF-8").")
    ";
            return;
        }
    
        // Check if $needle is found in $haystack
        if (($pos = strpos($haystack_san, $needle_san)) !== false) {
            // Get the string before the word
            $new = mb_substr($haystack, 0, $pos, 'UTF-8');
            // If applicable, get the string after
            if (mb_strlen($haystack, 'UTF-8') - $pos - mb_strlen($needle, 'UTF-8') > 0)
                $new .= mb_substr($haystack, $pos + mb_strlen($needle), NULL, 'UTF-8');
            // Return it
            return $new;
        }
    
        // If the word wasn't found, return $haystack as-is
        return $haystack;
    }
    
    echo remove_word("Hola, Máster Diseño Producción", "Master");
    // "Hola,  Diseño Producción"
    

    Note that:

    • This assumes your strings are UTF-8
    • The code relies on mb_* function to handle multi-byte characters
    • This only replaces the first occurence of the word (you may call remove_word until the string no longer changes if you want to replace all occurences)
    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(2条)

报告相同问题?

悬赏问题

  • ¥100 set_link_state
  • ¥15 虚幻5 UE美术毛发渲染
  • ¥15 CVRP 图论 物流运输优化
  • ¥15 Tableau online 嵌入ppt失败
  • ¥100 支付宝网页转账系统不识别账号
  • ¥15 基于单片机的靶位控制系统
  • ¥15 真我手机蓝牙传输进度消息被关闭了,怎么打开?(关键词-消息通知)
  • ¥15 装 pytorch 的时候出了好多问题,遇到这种情况怎么处理?
  • ¥20 IOS游览器某宝手机网页版自动立即购买JavaScript脚本
  • ¥15 手机接入宽带网线,如何释放宽带全部速度