How do I sort strings in Unicode using a predefined alphabet?

I have a MySQL table with words in Unicode using signs like ḥ, š, etc. The columns in the table are defined as utf8mb4_general_ci and recognize the above signs.

In the header of the webpage I put

<meta http-equiv="Content-Type" content="text/html; charset=utf8mb4">

This webpage contains a form sending data to a php page. In the beginning of the php page I put:

mysqli_set_charset($con,"utf8mb4");

On this page I run a MySQL search and get an array ($result). This array must be sorted by its keys using a lookup array of characters that I have produced, which includes both single-byte and multi-byte characters.

This is the array:

Array ( 
[nṯr] => Array ( [0] => Ka.C.Coptite.urkVIII,176b [1] => Ka.C.Coptite.urkVIII,177,1 ) 
[n] => Array ( [0] => Ka.C.Coptite.urkVIII,176c [1] => Ka.C.Coptite.urkVIII,177,1 [2] => Ka.C.Coptite.urkVIII,177,2 ) 
[nḫȝḫȝ] => Array ( [0] => Ka.C.Coptite.urkVIII,176c ) 
[nwj] => Array ( [0] => Ka.C.Coptite.urkVIII,176c ) 
[nfr] => Array ( [0] => Ka.C.Coptite.urkVIII,176c [1] => Ka.C.Coptite.urkVIII,177,2 ) 
[nḥḥ] => Array ( [0] => Ka.C.Coptite.urkVIII,176e [1] => Ka.C.Coptite.urkVIII,177,1 [2] => Ka.C.Coptite.urkVIII,177,1 ) 
[nḏ] => Array ( [0] => Ka.C.Coptite.urkVIII,177,1 ) 
)

What I do is:

uksort($result, 'compare_keys_by_alphabet');

This refers to the function:

function compare_keys_by_alphabet($a, $b)
{
    static $alphabet = array( 1 => "-" , 2 => "," , 3 => ".", 4 => "ȝ", 5 => "j", 6 => "ʿ", 7 => "w", 8 => "b", 9 => "p", 10 => "f", 11 => "m", 12 => "n", 13 => "r", 14 => "h", 15 => "ḥ", 16 => "ḫ", 17 => "ẖ", 18 => "s", 19 => "š", 20 => "q", 21 => "k", 22 => "g", 23 => "t", 24 => "ṯ", 25 => "d", 26 => "ḏ", 27 => "⸗", 28 => "/", 29 => "(", 30 => ")", 31 => "[", 32 => "]", 33 => "<", 34 => ">", 35 => "{", 36 => "}", 37 => "'", 38 => "*", 39 => "#", 40 => "I", 41 => "0", 42 => "1", 43 => "2", 44 => "3", 45 => "4", 46 => "5", 47 => "6", 48 => "7", 49 => "8", 50 => "9", 51 => "&", 52 => "@", 53 => "%");

    return compare_by_alphabet($alphabet, $a, $b);
}

using:

function compare_by_alphabet(array $alphabet, $str1, $str2) {
    $c = max(strlen($str1), strlen($str2));

    for ($i = 0; $i < $c; $i++) {
        $s1 = $str1[$i];
        $s2 = $str2[$i];
        //if ($s1===$s2) continue;
        $i1 = array_search($s1, $alphabet);
        //if ($i1===false) continue;
        $i2 = array_search($s2, $alphabet);
        //sif ($i2===false) continue;
        if ($i2==$i1) continue;
        if ($i1 < $i2) return -1;
        else return 1;
    }
    return 0;
}

This worked perfectly with the non-Unicode alphabet:

static $alphabet2 = array( 1 => '-' , 2 => ',' , 3 => '.' , 4 => "A", 5 => "j", 6 => "a", 7 => "w", 8 => "b", 9 => "p", 10 => "f", 11 => "m", 12 => "n", 13 => "r", 14 => "h", 15 => "H", 16 => "x", 17 => "X", 18 => "s", 19 => "S", 20 => "q", 21 => "k", 22 => "g", 23 => "t", 24 => "T", 25 => "d", 26 => "D", 27 => "=", 28 => "/", 29 => "(", 30 => ")", 31 => "[", 32 => "]", 33 => "<", 34 => ">", 35 => "{", 36 => "}", 37 => "'", 38 => "*", 39 => "#", 40 => "I", 41 => "1", 42 => "2", 43 => "3", 44 => "4", 45 => "5", 46 => "6", 47 => "7", 48 => "8", 49 => "9", 50 => "0", 51 => "&", 52 => "@", 53 => "%");

but once I replaced, for example, H (no. 15) in alphabet2 with ḥ in $alphabet, it no longer worked.

I suppose it has to do with recognizing the unicode, because as long as the words do not contain any special signs, the order is correct; but all words containing special signs are put at the beginning of the result.

I tried to look at unicode normalization; but I'm really only an amateur, so this is quite difficult.

Is this the problem or is there another problem and how can I fix it?

2 Answers

I've left all of my testing echoes in my code block and merely commented them out in case you wanted to see what is being generated throughout the process.

I took some liberties with your code. I didn't like the function calling the function, and I condensed your lookup array into a string with a leading space. The leading space has the same effect as your array indexed from 1: the first real character sits at offset 1. Converting the lookup from an array to a string means I can use mb_strpos() instead of array_search().

The crucial point to fix in your code was in the looping, specifically accessing the letters with [$i]. You see, you cannot treat these multibyte characters as single byte characters -- you must use mb_substr() to access the "whole" letter.
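To make the byte-versus-character difference concrete, here is a minimal sketch using one of the keys from the question. The shortened two-entry lookup array is hypothetical, and the "\u{...}" escapes assume PHP 7 or later; they are only there to make the exact code point (ḥ, U+1E25) unambiguous.

```php
<?php
// "nḥḥ" written with explicit code points so the bytes are unambiguous.
$word = "n\u{1E25}\u{1E25}";

// strlen() counts bytes, mb_strlen() counts characters.
// In UTF-8, "n" takes 1 byte and each "ḥ" takes 3 bytes.
$byteLength = strlen($word);
$charLength = mb_strlen($word, 'UTF-8');

// $word[1] is only the first byte of "ḥ", not the whole letter.
$firstByteOfSecondChar = $word[1];
// mb_substr() returns the complete character at position 1.
$secondChar = mb_substr($word, 1, 1, 'UTF-8');

// Shortened, hypothetical lookup array (the real one has 53 entries).
$alphabet = [1 => 'n', 2 => "\u{1E25}"];

$byteKey = array_search($firstByteOfSecondChar, $alphabet, true); // not found
$charKey = array_search($secondChar, $alphabet, true);            // found at key 2

var_dump($byteLength, $charLength, $byteKey, $charKey);
```

Because a lone byte is never found, array_search() returns false for it, and false compares as smaller than any alphabet position. That would explain why words with special signs ended up at the beginning of the result.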

Giving $alphabet and $encoding default values means you don't have to write a second "helper" function to pass all of the necessary data. uksort() passes its expected two arguments and everything goes ahead smoothly.

One final piece of advice: mb_ functions are expensive, so return as early as possible and keep the mb_ calls as far "downscript" as the logic allows.

Here is my suggested code: (Demo)

function alphabetize_custom($a, $b, $alphabet = " -,.ȝjʿwbpfmnrhḥḫẖsšqkgtṯdḏ⸗/()[]<>{}'*#I0123456789&@%", $encoding = 'UTF-8') {
    //echo "\n----\n$a =vs= $b";
    $mb_length = max(mb_strlen($a, $encoding), mb_strlen($b, $encoding));
    for ($i = 0; $i < $mb_length; ++$i) {
        //echo "\n";
        $a_char = mb_substr($a, $i, 1, $encoding);
        $b_char = mb_substr($b, $i, 1, $encoding);
        //echo "$a_char -vs- $b_char\n";
        //echo "(" , mb_strlen($a_char, $encoding), " & ", mb_strlen($b_char, $encoding), ")\n";
        if ($a_char === $b_char) {/*echo "identical, continue";*/ continue;}
        if (!mb_strlen($a_char, $encoding)) { /* echo "a is empty -1";*/ return -1;}
        if (!mb_strlen($b_char, $encoding)) { /*echo "b is empty 1";*/ return 1;}
        $a_offset = mb_strpos($alphabet, $a_char, 0, $encoding);
        $b_offset = mb_strpos($alphabet, $b_char, 0, $encoding);
        //echo "[" , $a_offset, " & ", $b_offset, "]\n";
        if ($a_offset == $b_offset) { /*echo "== offsets, continue";*/ continue;}
        if ($a_offset < $b_offset) { /*echo "a offset -1";*/ return -1;}
        //echo "b offset 1";
        return 1;
    }
    //echo "0";
    return 0;
}

$result = [
    "nṯr" => ["Ka.C.Coptite.urkVIII,176b", "Ka.C.Coptite.urkVIII,177,1"],
    "n" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,2"],
    "nḫȝḫȝ" => ["Ka.C.Coptite.urkVIII,176c"],
    "nwj" => ["Ka.C.Coptite.urkVIII,176c"],
    "nfr" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,2"],
    "nḥḥ" => ["Ka.C.Coptite.urkVIII,176e", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,1"],
    "nḏ" => ["Ka.C.Coptite.urkVIII,177,1"]
];

uksort($result, 'alphabetize_custom');

var_export($result);

Output:

array (
  'n' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176c',
    1 => 'Ka.C.Coptite.urkVIII,177,1',
    2 => 'Ka.C.Coptite.urkVIII,177,2',
  ),
  'nwj' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176c',
  ),
  'nfr' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176c',
    1 => 'Ka.C.Coptite.urkVIII,177,2',
  ),
  'nḥḥ' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176e',
    1 => 'Ka.C.Coptite.urkVIII,177,1',
    2 => 'Ka.C.Coptite.urkVIII,177,1',
  ),
  'nḫȝḫȝ' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176c',
  ),
  'nṯr' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176b',
    1 => 'Ka.C.Coptite.urkVIII,177,1',
  ),
  'nḏ' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,177,1',
  ),
)

Just for comparison's sake, I wrote an alternative code block that uses array_search() as your original code does. Not surprisingly, it appears to be more efficient according to the speed tests on 3v4l.org. This is likely due to the removal of four mb_ function calls per iteration (two mb_strlen() and two mb_strpos()), which I previously described as "expensive". The following snippet produces the same output.

Code: (Demo)

function alphabetize_custom($a, $b) {
    $alphabet = [' ', '-', ',', '.', 'ȝ', 'j', 'ʿ', 'w', 'b', 'p', 'f', 'm', 'n', 'r', 'h', 'ḥ', 'ḫ', 'ẖ', 's', 'š', 'q', 'k', 'g', 't', 'ṯ', 'd', 'ḏ', '⸗', '/', '(', ')', '[', ']', '<', '>', '{', '}', "'", '*', '#', 'I', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '&', '@', '%'];
    unset($alphabet[0]);  // removes dummy first key, effectively starting the keys from 1
    $encoding = 'UTF-8';

    $mb_length = max(mb_strlen($a, $encoding), mb_strlen($b, $encoding));
    for ($i = 0; $i < $mb_length; ++$i) {
        $a_char = mb_substr($a, $i, 1, $encoding);
        $b_char = mb_substr($b, $i, 1, $encoding);
        if ($a_char === $b_char) continue;

        $a_key = array_search($a_char, $alphabet);
        $b_key = array_search($b_char, $alphabet);
        if ($a_key === $b_key) continue;

        return $a_key - $b_key;
    }
    return 0;
}

$result = [
    "nṯr" => ["Ka.C.Coptite.urkVIII,176b", "Ka.C.Coptite.urkVIII,177,1"],
    "n" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,2"],
    "nḫȝḫȝ" => ["Ka.C.Coptite.urkVIII,176c"],
    "nwj" => ["Ka.C.Coptite.urkVIII,176c"],
    "nfr" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,2"],
    "nḥḥ" => ["Ka.C.Coptite.urkVIII,176e", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,1"],
    "nḏ" => ["Ka.C.Coptite.urkVIII,177,1"]
];

uksort($result, 'alphabetize_custom');

var_export($result);
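
As a further variation on the array_search() version above, the sketch below splits each string into characters once and looks positions up in a flipped array. It is not part of the original answer: the helper name make_alphabet_comparator() is hypothetical, ranking unknown characters as -1 is an assumption of this sketch, and it assumes valid UTF-8 input and PHP 7 or later.

```php
<?php
// Hypothetical helper: builds a comparison callback for a given alphabet string.
function make_alphabet_comparator(string $alphabet): callable
{
    // Split the alphabet into whole UTF-8 characters and map each to its position.
    $chars = preg_split('//u', $alphabet, -1, PREG_SPLIT_NO_EMPTY);
    $rank = array_flip($chars);

    return function ($a, $b) use ($rank): int {
        // Cast to string because purely numeric array keys arrive as integers.
        $aChars = preg_split('//u', (string) $a, -1, PREG_SPLIT_NO_EMPTY);
        $bChars = preg_split('//u', (string) $b, -1, PREG_SPLIT_NO_EMPTY);
        $shared = min(count($aChars), count($bChars));

        for ($i = 0; $i < $shared; ++$i) {
            if ($aChars[$i] === $bChars[$i]) {
                continue;
            }
            // Assumption of this sketch: characters missing from the
            // alphabet get rank -1, so they sort before known characters.
            $aRank = $rank[$aChars[$i]] ?? -1;
            $bRank = $rank[$bChars[$i]] ?? -1;
            if ($aRank !== $bRank) {
                return $aRank <=> $bRank;
            }
        }
        // Every shared position ties: the shorter string sorts first.
        return count($aChars) <=> count($bChars);
    };
}

// Usage with a few keys from the question (values omitted for brevity).
$compare = make_alphabet_comparator("-,.ȝjʿwbpfmnrhḥḫẖsšqkgtṯdḏ⸗/()[]<>{}'*#I0123456789&@%");
$sample = ["nṯr" => [], "n" => [], "nwj" => [], "nfr" => []];
uksort($sample, $compare);
var_export(array_keys($sample));
```

Whether this is actually faster than the versions above would need to be measured; the point is only that the alphabet is parsed once rather than on every comparison.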
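doulu1945: I was thinking about the code block I posted and realized I could produce a better version. I recommend using it instead of my first attempt. If you don't mind the long $alphabet syntax, you can skip the unset() call and just declare the array as you originally did. (more than a year ago)
douhao6557: This works very well. Thank you for introducing me to the mb_ functions. (more than a year ago)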

The charset in the meta tag needs to be UTF-8. That is what the outside world calls it; MySQL calls it utf8mb4.

Inside MySQL, declare the collation of the columns you want to be ordered with COLLATE utf8mb4_unicode_520_ci. With that, MySQL can do the work for you:

SELECT ... ORDER BY col ...
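
As a rough sketch of what that could look like from PHP: the table name words, the column name word, and the VARCHAR(255) type below are all hypothetical placeholders, not taken from the question. Note that this collation sorts by MySQL's Unicode rules, which need not match the custom alphabet order from the question.

```php
<?php
// Hypothetical table and column names; adjust to the real schema.
$table  = 'words';
$column = 'word';

// One-time schema change so the column itself uses the suggested collation.
// VARCHAR(255) is an assumed column type.
$alterSql = "ALTER TABLE `$table` MODIFY `$column` VARCHAR(255) "
          . "CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci";

// Alternatively, request the collation per query without altering the table.
$selectSql = "SELECT `$column` FROM `$table` "
           . "ORDER BY `$column` COLLATE utf8mb4_unicode_520_ci";

// With the connection from the question, this could be run as, for example:
// $rows = mysqli_query($con, $selectSql);
echo $alterSql, "\n", $selectSql, "\n";
```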
