douwei8672 2018-04-19 11:31
Views: 98
Accepted

How can I sort strings in Unicode using a predefined alphabet?

I have a MySQL table with words in Unicode using signs like š, etc. The columns in the table are defined as utf8mb4_general_ci and recognize these signs.

In the header of the webpage I put

<meta http-equiv="Content-Type" content="text/html; charset=utf8mb4">

This webpage contains a form sending data to a php page. In the beginning of the php page I put:

mysqli_set_charset($con,"utf8mb4");

In this page, I run a MySQL query and get an array. This array ($result) must be sorted by its keys using a lookup array of characters that I have produced, which includes single- and multi-byte characters.

This is the array:

Array ( 
[nṯr] => Array ( [0] => Ka.C.Coptite.urkVIII,176b [1] => Ka.C.Coptite.urkVIII,177,1 ) 
[n] => Array ( [0] => Ka.C.Coptite.urkVIII,176c [1] => Ka.C.Coptite.urkVIII,177,1 [2] => Ka.C.Coptite.urkVIII,177,2 ) 
[nḫȝḫȝ] => Array ( [0] => Ka.C.Coptite.urkVIII,176c ) 
[nwj] => Array ( [0] => Ka.C.Coptite.urkVIII,176c ) 
[nfr] => Array ( [0] => Ka.C.Coptite.urkVIII,176c [1] => Ka.C.Coptite.urkVIII,177,2 ) 
[nḥḥ] => Array ( [0] => Ka.C.Coptite.urkVIII,176e [1] => Ka.C.Coptite.urkVIII,177,1 [2] => Ka.C.Coptite.urkVIII,177,1 ) 
[nḏ] => Array ( [0] => Ka.C.Coptite.urkVIII,177,1 ) 
)

What I do is:

uksort($result, 'compare_keys_by_alphabet');

This refers to the function:

function compare_keys_by_alphabet($a, $b)
{
    static $alphabet = array( 1 => "-" , 2 => "," , 3 => ".", 4 => "ȝ", 5 => "j", 6 => "ʿ", 7 => "w", 8 => "b", 9 => "p", 10 => "f", 11 => "m", 12 => "n", 13 => "r", 14 => "h", 15 => "ḥ", 16 => "ḫ", 17 => "ẖ", 18 => "s", 19 => "š", 20 => "q", 21 => "k", 22 => "g", 23 => "t", 24 => "ṯ", 25 => "d", 26 => "ḏ", 27 => "⸗", 28 => "/", 29 => "(", 30 => ")", 31 => "[", 32 => "]", 33 => "<", 34 => ">", 35 => "{", 36 => "}", 37 => "'", 38 => "*", 39 => "#", 40 => "I", 41 => "0", 42 => "1", 43 => "2", 44 => "3", 45 => "4", 46 => "5", 47 => "6", 48 => "7", 49 => "8", 50 => "9", 51 => "&", 52 => "@", 53 => "%");

    return compare_by_alphabet($alphabet, $a, $b);
}

using:

function compare_by_alphabet(array $alphabet, $str1, $str2) {
    $c = max(strlen($str1), strlen($str2));

    for ($i = 0; $i < $c; $i++) {
        $s1 = $str1[$i];
        $s2 = $str2[$i];
        //if ($s1===$s2) continue;
        $i1 = array_search($s1, $alphabet);
        //if ($i1===false) continue;
        $i2 = array_search($s2, $alphabet);
        //sif ($i2===false) continue;
        if ($i2==$i1) continue;
        if ($i1 < $i2) return -1;
        else return 1;
    }
    return 0;
}

This worked perfectly with the non-Unicode alphabet:

static $alphabet2 = array( 1 => '-' , 2 => ',' , 3 => '.' , 4 => "A", 5 => "j", 6 => "a", 7 => "w", 8 => "b", 9 => "p", 10 => "f", 11 => "m", 12 => "n", 13 => "r", 14 => "h", 15 => "H", 16 => "x", 17 => "X", 18 => "s", 19 => "S", 20 => "q", 21 => "k", 22 => "g", 23 => "t", 24 => "T", 25 => "d", 26 => "D", 27 => "=", 28 => "/", 29 => "(", 30 => ")", 31 => "[", 32 => "]", 33 => "<", 34 => ">", 35 => "{", 36 => "}", 37 => "'", 38 => "*", 39 => "#", 40 => "I", 41 => "1", 42 => "2", 43 => "3", 44 => "4", 45 => "5", 46 => "6", 47 => "7", 48 => "8", 49 => "9", 50 => "0", 51 => "&", 52 => "@", 53 => "%");

but once I replaced, for example, H (nr 15) in alphabet2 with ḥ in alphabet1, it didn't work anymore.

I suppose it has to do with recognizing the unicode, because as long as the words do not contain any special signs, the order is correct; but all words containing special signs are put at the beginning of the result.

I tried to look at unicode normalization; but I'm really only an amateur, so this is quite difficult.

Is this the problem or is there another problem and how can I fix it?

2 answers

  • dqtu14636 2018-04-19 14:42

    I've left all of my testing echoes in my code block and merely commented them out in case you wanted to see what is being generated throughout the process.

    I took some liberties with your code. I didn't like the function calling the function, and I condensed your lookup array into a string with a leading space. The leading space has the same effect as your indexed array that starts from 1. Converting the lookup from an array to a string means I can use mb_strpos() instead of array_search().

    The crucial point to fix in your code was in the looping, specifically accessing the letters with [$i]. You see, you cannot treat these multibyte characters as single byte characters -- you must use mb_substr() to access the "whole" letter.
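To make the byte-versus-character difference concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and that the key uses the precomposed character ḥ (U+1E25), written as an escape so the byte count is unambiguous. The variable names are illustrative only. It may also explain why words with special signs ended up first: a lone byte is never found in the lookup array, so array_search() returns false, which compares as smaller than any position.

```php
<?php
// "nḥḥ" written with explicit code points: n + U+1E25 + U+1E25.
$word = "n\u{1E25}\u{1E25}";

// strlen() counts bytes: "n" is 1 byte, each U+1E25 is 3 bytes in UTF-8.
$byteLength = strlen($word);
// mb_strlen() counts characters when told the encoding.
$charLength = mb_strlen($word, 'UTF-8');

// $word[1] is only the first of the three bytes of "ḥ",
// so it can never match an entry in the lookup alphabet.
$firstByte = $word[1];
// mb_substr() returns the whole character at character offset 1.
$wholeChar = mb_substr($word, 1, 1, 'UTF-8');

var_dump($byteLength, $charLength);
var_dump($firstByte === "\u{1E25}", $wholeChar === "\u{1E25}");
```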

    Giving $alphabet and $encoding default values means you don't have to write a second "helper" function to pass all of the necessary data. uksort() will pass its expected two arguments and everything goes ahead smoothly.

    One final piece of advice is: mb_ functions are expensive, so always try to return in your code as soon as possible and leave the mb_ functions farther "downscript" whenever logically possible.

    Here is my suggested code: (Demo)

    function alphabetize_custom($a, $b, $alphabet = " -,.ȝjʿwbpfmnrhḥḫẖsšqkgtṯdḏ⸗/()[]<>{}'*#I0123456789&@%", $encoding = 'UTF-8') {
        //echo "\n----\n$a =vs= $b";
        $mb_length = max(mb_strlen($a, $encoding), mb_strlen($b, $encoding));
        for ($i = 0; $i < $mb_length; ++$i) {
            //echo "\n";
            $a_char = mb_substr($a, $i, 1, $encoding);
            $b_char = mb_substr($b, $i, 1, $encoding);
            //echo "$a_char -vs- $b_char\n";
            //echo "(" , mb_strlen($a_char, $encoding), " & ", mb_strlen($b_char, $encoding), ")\n";
            if ($a_char === $b_char) {/*echo "identical, continue";*/ continue;}
            if (!mb_strlen($a_char, $encoding)) { /* echo "a is empty -1";*/ return -1;}
            if (!mb_strlen($b_char, $encoding)) { /*echo "b is empty 1";*/ return 1;}
            $a_offset = mb_strpos($alphabet, $a_char, 0, $encoding);
            $b_offset = mb_strpos($alphabet, $b_char, 0, $encoding);
            //echo "[" , $a_offset, " & ", $b_offset, "]\n";
            if ($a_offset == $b_offset) { /*echo "== offsets, continue";*/ continue;}
            if ($a_offset < $b_offset) { /*echo "a offset -1";*/ return -1;}
            //echo "b offset 1";
            return 1;
        }
        //echo "0";
        return 0;
    }
    
    $result = [
        "nṯr" => ["Ka.C.Coptite.urkVIII,176b", "Ka.C.Coptite.urkVIII,177,1"],
        "n" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,2"],
        "nḫȝḫȝ" => ["Ka.C.Coptite.urkVIII,176c"],
        "nwj" => ["Ka.C.Coptite.urkVIII,176c"],
        "nfr" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,2"],
        "nḥḥ" => ["Ka.C.Coptite.urkVIII,176e", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,1"],
        "nḏ" => ["Ka.C.Coptite.urkVIII,177,1"]
    ];
    
    uksort($result, 'alphabetize_custom');
    
    var_export($result);
    

    Output:

    array (
      'n' => 
      array (
        0 => 'Ka.C.Coptite.urkVIII,176c',
        1 => 'Ka.C.Coptite.urkVIII,177,1',
        2 => 'Ka.C.Coptite.urkVIII,177,2',
      ),
      'nwj' => 
      array (
        0 => 'Ka.C.Coptite.urkVIII,176c',
      ),
      'nfr' => 
      array (
        0 => 'Ka.C.Coptite.urkVIII,176c',
        1 => 'Ka.C.Coptite.urkVIII,177,2',
      ),
      'nḥḥ' => 
      array (
        0 => 'Ka.C.Coptite.urkVIII,176e',
        1 => 'Ka.C.Coptite.urkVIII,177,1',
        2 => 'Ka.C.Coptite.urkVIII,177,1',
      ),
      'nḫȝḫȝ' => 
      array (
        0 => 'Ka.C.Coptite.urkVIII,176c',
      ),
      'nṯr' => 
      array (
        0 => 'Ka.C.Coptite.urkVIII,176b',
        1 => 'Ka.C.Coptite.urkVIII,177,1',
      ),
      'nḏ' => 
      array (
        0 => 'Ka.C.Coptite.urkVIII,177,1',
      ),
    )
    

    Just for comparison's sake, I wrote an alternative code block that uses array_search(), as your original code does. Not surprisingly, it appears to be more efficient according to the speed tests on 3v4l.org. This is likely due to the removal of some of the mb_ function calls, which I previously described as "expensive". The following snippet provides the same output.

    Code: (Demo)

    function alphabetize_custom($a, $b) {
        $alphabet = [' ', '-', ',', '.', 'ȝ', 'j', 'ʿ', 'w', 'b', 'p', 'f', 'm', 'n', 'r', 'h', 'ḥ', 'ḫ', 'ẖ', 's', 'š', 'q', 'k', 'g', 't', 'ṯ', 'd', 'ḏ', '⸗', '/', '(', ')', '[', ']', '<', '>', '{', '}', "'", '*', '#', 'I', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '&', '@', '%'];
        unset($alphabet[0]);  // removes dummy first key, effectively starting the keys from 1
        $encoding = 'UTF-8';
    
        $mb_length = max(mb_strlen($a, $encoding), mb_strlen($b, $encoding));
        for ($i = 0; $i < $mb_length; ++$i) {
            $a_char = mb_substr($a, $i, 1, $encoding);
            $b_char = mb_substr($b, $i, 1, $encoding);
            if ($a_char === $b_char) continue;
    
            $a_key = array_search($a_char, $alphabet);
            $b_key = array_search($b_char, $alphabet);
            if ($a_key === $b_key) continue;
    
            return $a_key - $b_key;
        }
        return 0;
    }
    
    $result = [
        "nṯr" => ["Ka.C.Coptite.urkVIII,176b", "Ka.C.Coptite.urkVIII,177,1"],
        "n" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,2"],
        "nḫȝḫȝ" => ["Ka.C.Coptite.urkVIII,176c"],
        "nwj" => ["Ka.C.Coptite.urkVIII,176c"],
        "nfr" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,2"],
        "nḥḥ" => ["Ka.C.Coptite.urkVIII,176e", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,1"],
        "nḏ" => ["Ka.C.Coptite.urkVIII,177,1"]
    ];
    
    uksort($result, 'alphabetize_custom');
    
    var_export($result);
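Following the same "do less work per comparison" idea, one further variation is to convert each word to a list of alphabet positions once and then compare only integers. This is a sketch rather than part of the answer's tested code: word_to_ranks(), compare_ranks() and $cache are hypothetical names, and sorting characters that are missing from the alphabet last is an assumption. It splits strings with preg_split() and the "u" modifier instead of mb_ functions, and it sorts a plain list of the keys to stay short.

```php
<?php
// Alphabet order copied from the question's lookup array.
$alphabet = "-,.ȝjʿwbpfmnrhḥḫẖsšqkgtṯdḏ⸗/()[]<>{}'*#I0123456789&@%";

// Empty pattern + "u" modifier: one whole UTF-8 character per element.
$letters = preg_split('//u', $alphabet, -1, PREG_SPLIT_NO_EMPTY);
// character => position, built once instead of searching on every comparison.
$rank = array_flip($letters);

// Hypothetical helper: converts a word to a list of alphabet positions.
function word_to_ranks(string $word, array $rank): array
{
    $ranks = [];
    foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        // Assumption: characters outside the alphabet sort after all known ones.
        $ranks[] = $rank[$char] ?? PHP_INT_MAX;
    }
    return $ranks;
}

// Hypothetical comparator: lexicographic comparison of two rank lists.
function compare_ranks(array $a, array $b): int
{
    $shared = min(count($a), count($b));
    for ($i = 0; $i < $shared; ++$i) {
        if ($a[$i] !== $b[$i]) {
            return $a[$i] <=> $b[$i];
        }
    }
    // All shared positions match, so the shorter word sorts first.
    return count($a) <=> count($b);
}

$words = ["nṯr", "n", "nḫȝḫȝ", "nwj", "nfr", "nḥḥ", "nḏ"];

// Convert every word once, then sort using the cached rank lists.
$cache = [];
foreach ($words as $word) {
    $cache[$word] = word_to_ranks($word, $rank);
}
usort($words, function ($x, $y) use ($cache) {
    return compare_ranks($cache[$x], $cache[$y]);
});

print_r($words);
```

To sort the keys of $result itself, the same closure could be passed to uksort() after filling $cache from array_keys($result). Whether this is actually faster for a given data set would need to be measured.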
    