duan0821
2014-05-21 11:11

Comparing strings with accents in PHP

I'm having trouble comparing two strings that contain accents. This is my case:

The first string is: Master
The second string is: Máster Diseño Producción

I need to remove the word Máster from the second string, because it is contained in the first string.

I have created a function to clean each string:

function sanear_string($cadena)
{
    $cadena = trim($cadena);

    $cadena = str_replace(
        array('á', 'à', 'ä', 'â', 'ª', 'Á', 'À', 'Â', 'Ä'),
        array('a', 'a', 'a', 'a', 'a', 'A', 'A', 'A', 'A'),
        $cadena
    );

    $cadena = str_replace(
        array('é', 'è', 'ë', 'ê', 'É', 'È', 'Ê', 'Ë'),
        array('e', 'e', 'e', 'e', 'E', 'E', 'E', 'E'),
        $cadena
    );

    $cadena = str_replace(
        array('í', 'ì', 'ï', 'î', 'Í', 'Ì', 'Ï', 'Î'),
        array('i', 'i', 'i', 'i', 'I', 'I', 'I', 'I'),
        $cadena
    );

    $cadena = str_replace(
        array('ó', 'ò', 'ö', 'ô', 'Ó', 'Ò', 'Ö', 'Ô'),
        array('o', 'o', 'o', 'o', 'O', 'O', 'O', 'O'),
        $cadena
    );

    $cadena = str_replace(
        array('ú', 'ù', 'ü', 'û', 'Ú', 'Ù', 'Û', 'Ü'),
        array('u', 'u', 'u', 'u', 'U', 'U', 'U', 'U'),
        $cadena
    );

    $cadena = str_replace(
        array('ñ', 'Ñ', 'ç', 'Ç'),
        array('n', 'N', 'c', 'C',),
        $cadena
    );

    // This part removes any stray characters
    $cadena = str_replace(
        array("\\", "¨", "º", "-", "~",
            "#", "@", "|", "!", "\"",
            "·", "$", "%", "&", "/",
            "(", ")", "?", "'", "¡",
            "¿", "[", "^", "`", "]",
            "+", "}", "{", "¨", "´",
            ">", "<", ";", ",", ":",
            ".", " "),
        '',
        $cadena
    );


    return $cadena;
}

That takes care of the accents. Now I can use strpos to compare both strings: if the result is > 0, I know the word is contained. But I still need help removing the word. Thanks in advance.


3 answers

  • dtng5978 2014-05-21 12:00
    Accepted

    As usual when dealing with charset problems, you need to be extra careful about the character counts between multibyte strings and plain ASCII strings.

    Your biggest problem here is that you remove some pre-defined characters from the cleaned string. That breaks the one-to-one character correspondence between the sanitized string and the original, which makes the removal much harder.
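
    To make the byte-versus-character point concrete, here is a small sketch. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are only illustrative.

    ```php
    <?php
    // Assumption: this file is UTF-8 encoded and mbstring is available.
    $word = 'Máster';

    $bytes = strlen($word);             // counts bytes: 'á' takes two bytes in UTF-8
    $chars = mb_strlen($word, 'UTF-8'); // counts characters

    $haystack   = 'Máster Diseño';
    $byteOffset = strpos($haystack, 'Diseño');                // offset in bytes
    $charOffset = mb_strpos($haystack, 'Diseño', 0, 'UTF-8'); // offset in characters

    echo "bytes: $bytes, characters: $chars\n";
    echo "byte offset: $byteOffset, character offset: $charOffset\n";
    ```

    Here the byte count and byte offset are one higher than their character counterparts, which is why byte offsets from strpos should not be mixed with the mb_* functions on the original string.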

    I'll use a modified version of your sanitizing function:

    function sanitize($cadena) {
        $cadena = str_replace(
            array('á', 'à', 'ä', 'â', 'ª', 'Á', 'À', 'Â', 'Ä'),
            array('a', 'a', 'a', 'a', 'a', 'A', 'A', 'A', 'A'),
            $cadena
        );
    
        $cadena = str_replace(
            array('é', 'è', 'ë', 'ê', 'É', 'È', 'Ê', 'Ë'),
            array('e', 'e', 'e', 'e', 'E', 'E', 'E', 'E'),
            $cadena
        );
    
        $cadena = str_replace(
            array('í', 'ì', 'ï', 'î', 'Í', 'Ì', 'Ï', 'Î'),
            array('i', 'i', 'i', 'i', 'I', 'I', 'I', 'I'),
            $cadena
        );
    
        $cadena = str_replace(
            array('ó', 'ò', 'ö', 'ô', 'Ó', 'Ò', 'Ö', 'Ô'),
            array('o', 'o', 'o', 'o', 'O', 'O', 'O', 'O'),
            $cadena
        );
    
        $cadena = str_replace(
            array('ú', 'ù', 'ü', 'û', 'Ú', 'Ù', 'Û', 'Ü'),
            array('u', 'u', 'u', 'u', 'U', 'U', 'U', 'U'),
            $cadena
        );
    
        $cadena = str_replace(
            array('ñ', 'Ñ', 'ç', 'Ç'),
            array('n', 'N', 'c', 'C',),
            $cadena
        );
    
    
        return strtolower($cadena);
    }
    

    The remove_word function follows:

    function remove_word($haystack , $needle) {
        // sanitize input strings
        $haystack_san = sanitize($haystack);
        $needle_san = sanitize($needle);
    
        // Check for character loss
        if (mb_strlen($haystack_san, 'UTF-8') != mb_strlen($haystack, 'UTF-8') || mb_strlen($needle_san, 'UTF-8') != mb_strlen($needle, 'UTF-8')) {
            // Here for debugging purposes. You may want to drop it in production.
            echo "Lost some chars on the way. Aborting.\n";
            echo "     haystack: $haystack (".mb_strlen($haystack, "UTF-8").")\n";
            echo " haystack_san: $haystack_san (".mb_strlen($haystack_san, "UTF-8").")\n";
            echo "       needle: $needle (".mb_strlen($needle, "UTF-8").")\n";
            echo "   needle_san: $needle_san (".mb_strlen($needle_san, "UTF-8").")\n";
            return;
        }
    
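        // Check if $needle is found in $haystack. mb_strpos returns a character
        // offset, which matches the mb_substr calls below even when multi-byte
        // characters outside the replacement lists survive sanitizing.
        if (($pos = mb_strpos($haystack_san, $needle_san, 0, 'UTF-8')) !== false) {
            // Get the string before the word
            $new = mb_substr($haystack, 0, $pos, 'UTF-8');
            // If applicable, get the string after
            if (mb_strlen($haystack, 'UTF-8') - $pos - mb_strlen($needle, 'UTF-8') > 0)
                $new .= mb_substr($haystack, $pos + mb_strlen($needle, 'UTF-8'), NULL, 'UTF-8');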
            // Return it
            return $new;
        }
    
        // If the word wasn't found, return $haystack as-is
        return $haystack;
    }
    
    echo remove_word("Hola, Máster Diseño Producción", "Master");
    // "Hola,  Diseño Producción"
    

    Note that:

    • This assumes your strings are UTF-8
    • The code relies on the mb_* functions to handle multi-byte characters
    • This only removes the first occurrence of the word (you may call remove_word until the string no longer changes if you want to remove all occurrences)
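
    The last note can be sketched as a loop. This is only an illustration, assuming UTF-8 input and mbstring: fold_accents, remove_first and remove_all are hypothetical compact stand-ins for the functions above, and the map covers just the Spanish accents used in the example.

    ```php
    <?php
    // Hypothetical stand-in for sanitize(): one-to-one replacements only,
    // so character positions stay aligned with the original string.
    function fold_accents($s) {
        $map = array(
            'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u', 'ñ' => 'n',
            'Á' => 'A', 'É' => 'E', 'Í' => 'I', 'Ó' => 'O', 'Ú' => 'U', 'Ñ' => 'N',
        );
        return strtolower(strtr($s, $map));
    }

    // Hypothetical stand-in for remove_word(): removes the first match only.
    function remove_first($haystack, $needle) {
        $pos = mb_strpos(fold_accents($haystack), fold_accents($needle), 0, 'UTF-8');
        if ($pos === false) {
            return $haystack;
        }
        $len = mb_strlen($needle, 'UTF-8');
        return mb_substr($haystack, 0, $pos, 'UTF-8')
             . mb_substr($haystack, $pos + $len, null, 'UTF-8');
    }

    // Repeat the single removal until the string no longer changes.
    function remove_all($haystack, $needle) {
        do {
            $before   = $haystack;
            $haystack = remove_first($haystack, $needle);
        } while ($haystack !== $before);
        return $haystack;
    }

    $result = remove_all('Máster Diseño master Producción', 'Master');
    echo $result, "\n";
    ```

    Both the accented and the lower-case occurrence are removed; the surrounding spaces are left in place, so you may want to collapse whitespace afterwards.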
  • duanjian4331 2014-05-21 11:35

    if result is > 0 then I know that the word is contained

    Not exactly. strpos() will return 0 if substring offset is zero, as in the case of strings: 'Master' and 'Master Diseno Produccion' (assuming your accents removal function works as expected). What you need is strict (===) comparison to false, e.g.:

    if(strpos($haystack, $needle) !== false) {
        // $needle exists in $haystack
    } else {
        // no $needle in $haystack.
    }
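
    A quick way to see the difference, using the already-sanitized strings from the question (variable names are only illustrative):

    ```php
    <?php
    $haystack = 'Master Diseno Produccion';
    $needle   = 'Master';

    $pos = strpos($haystack, $needle); // the match starts at offset 0

    $loose  = ($pos > 0);       // false, although the needle is present
    $strict = ($pos !== false); // true

    var_dump($pos, $loose, $strict);
    ```

    The `> 0` test misses every match at the very start of the string, which is exactly the case in the question.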
    

    That said, if your goal is to remove the $substr from $str, use:

    str_replace($substr, '', $str)
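
    One caveat, shown as a small sketch that assumes a UTF-8 source file: str_replace compares bytes exactly, so the unaccented needle does not match the accented word on its own.

    ```php
    <?php
    // Assumption: this file is UTF-8 encoded.
    $str = 'Máster Diseño Producción';

    $unchanged = str_replace('Master', '', $str); // no match: 'á' is not 'a'
    $removed   = str_replace('Máster', '', $str); // exact spelling matches

    var_dump($unchanged === $str, $removed);
    ```

    So this shortcut only works when $substr is spelled exactly as it appears in $str, accents and case included.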
    
  • dtdsbakn210537 2018-11-30 01:06

    Here's my go at it, based on the answer above and covering more characters:

    /**
     * sanitize
     * 
     * @see     https://stackoverflow.com/a/23782573/115025
     * @access  public
     * @param   string $str
     * @return  string
     */
    function sanitize(string $str): string
    {
        $str = str_replace(
            array('à', 'á', 'â', 'ä', 'æ', 'ã', 'å', 'ā', 'À', 'Á', 'Â', 'Ä', 'Æ', 'Ã', 'Å', 'Ā'),
            array('a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A'),
            $str
        );
        $str = str_replace(
            array('ç', 'ć', 'č', 'Ç', 'Ć', 'Č'),
            array('c', 'c', 'c', 'C', 'C', 'C'),
            $str
        );
        $str = str_replace(
            array('è', 'é', 'ê', 'ë', 'ē', 'ė', 'ę', 'È', 'É', 'Ê', 'Ë', 'Ē', 'Ė', 'Ę'),
            array('e', 'e', 'e', 'e', 'e', 'e', 'e', 'E', 'E', 'E', 'E', 'E', 'E', 'E'),
            $str
        );
        $str = str_replace(
            array('î', 'ï', 'í', 'ī', 'į', 'ì', 'Î', 'Ï', 'Í', 'Ī', 'Į', 'Ì'),
            array('i', 'i', 'i', 'i', 'i', 'i', 'I', 'I', 'I', 'I', 'I', 'I'),
            $str
        );
        $str = str_replace(
            array('ł', 'Ł'),
            array('l', 'L'),
            $str
        );
        $str = str_replace(
            array('ñ', 'ń', 'Ñ', 'Ń'),
            array('n', 'n', 'N', 'N'),
            $str
        );
        $str = str_replace(
            array('ô', 'ö', 'ò', 'ó', 'œ', 'ø', 'ō', 'õ', 'Ô', 'Ö', 'Ò', 'Ó', 'Œ', 'Ø', 'Ō', 'Õ'),
            array('o', 'o', 'o', 'o', 'o', 'o', 'o', 'o', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'),
            $str
        );
        $str = str_replace(
            array('ß', 'ś', 'š', 'Ś', 'Š'),
            array('ss', 's', 's', 'S', 'S'),
            $str
        );
        $str = str_replace(
            array('û', 'ü', 'ù', 'ú', 'ū', 'Û', 'Ü', 'Ù', 'Ú', 'Ū'),
            array('u', 'u', 'u', 'u', 'u', 'U', 'U', 'U', 'U', 'U'),
            $str
        );
        $str = str_replace(
            array('ÿ', 'Ÿ'),
            array('y', 'Y'),
            $str
        );
        $str = str_replace(
            array('ž', 'ź', 'ż', 'Ž', 'Ź', 'Ż'),
            array('z', 'z', 'z', 'Z', 'Z', 'Z'),
            $str
        );
        return strtolower($str);
    }
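
    One thing to watch if this version is combined with the remove_word function from the accepted answer: the 'ß' to 'ss' mapping changes the character count, so the length check there would abort. A minimal sketch, assuming a UTF-8 source file and mbstring:

    ```php
    <?php
    // Assumption: this file is UTF-8 encoded and mbstring is available.
    $original = 'Straße';
    $folded   = str_replace('ß', 'ss', $original);

    $lenOriginal = mb_strlen($original, 'UTF-8'); // one character for 'ß'
    $lenFolded   = mb_strlen($folded, 'UTF-8');   // two characters for 'ss'

    echo "$original ($lenOriginal) -> $folded ($lenFolded)\n";
    ```

    If positions in the sanitized string must map back to the original, a one-to-one replacement for 'ß' (or skipping it) keeps the two strings aligned.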
    
