douwei8672 2018-04-19 11:31

How do I sort Unicode strings using a predefined alphabet?

I have a MySQL table with Unicode words that use signs like ḥ, ḫ, š, etc. The columns in the table are defined as utf8mb4_general_ci and recognize the above signs.

In the header of the webpage I put

<meta http-equiv="Content-Type" content="text/html; charset=utf8mb4">

This webpage contains a form sending data to a php page. In the beginning of the php page I put:

mysqli_set_charset($con,"utf8mb4");

In this page, I run a MySQL query and get back an array. This array ($result) must be sorted by its keys using a lookup array of characters that I have produced, which includes single-byte and multibyte characters.

This is the array:

Array ( 
[nṯr] => Array ( [0] => Ka.C.Coptite.urkVIII,176b [1] => Ka.C.Coptite.urkVIII,177,1 ) 
[n] => Array ( [0] => Ka.C.Coptite.urkVIII,176c [1] => Ka.C.Coptite.urkVIII,177,1 [2] => Ka.C.Coptite.urkVIII,177,2 ) 
[nḫȝḫȝ] => Array ( [0] => Ka.C.Coptite.urkVIII,176c ) 
[nwj] => Array ( [0] => Ka.C.Coptite.urkVIII,176c ) 
[nfr] => Array ( [0] => Ka.C.Coptite.urkVIII,176c [1] => Ka.C.Coptite.urkVIII,177,2 ) 
[nḥḥ] => Array ( [0] => Ka.C.Coptite.urkVIII,176e [1] => Ka.C.Coptite.urkVIII,177,1 [2] => Ka.C.Coptite.urkVIII,177,1 ) 
[nḏ] => Array ( [0] => Ka.C.Coptite.urkVIII,177,1 ) 
)

What I do is:

uksort($result, 'compare_keys_by_alphabet');

This refers to the function:

function compare_keys_by_alphabet($a, $b)
{
    static $alphabet = array( 1 => "-" , 2 => "," , 3 => ".", 4 => "ȝ", 5 => "j", 6 => "ʿ", 7 => "w", 8 => "b", 9 => "p", 10 => "f", 11 => "m", 12 => "n", 13 => "r", 14 => "h", 15 => "ḥ", 16 => "ḫ", 17 => "ẖ", 18 => "s", 19 => "š", 20 => "q", 21 => "k", 22 => "g", 23 => "t", 24 => "ṯ", 25 => "d", 26 => "ḏ", 27 => "⸗", 28 => "/", 29 => "(", 30 => ")", 31 => "[", 32 => "]", 33 => "<", 34 => ">", 35 => "{", 36 => "}", 37 => "'", 38 => "*", 39 => "#", 40 => "I", 41 => "0", 42 => "1", 43 => "2", 44 => "3", 45 => "4", 46 => "5", 47 => "6", 48 => "7", 49 => "8", 50 => "9", 51 => "&", 52 => "@", 53 => "%");

    return compare_by_alphabet($alphabet, $a, $b);
}

using:

function compare_by_alphabet(array $alphabet, $str1, $str2) {
    $c = max(strlen($str1), strlen($str2));

    for ($i = 0; $i < $c; $i++) {
        $s1 = $str1[$i];
        $s2 = $str2[$i];
        //if ($s1===$s2) continue;
        $i1 = array_search($s1, $alphabet);
        //if ($i1===false) continue;
        $i2 = array_search($s2, $alphabet);
        //sif ($i2===false) continue;
        if ($i2==$i1) continue;
        if ($i1 < $i2) return -1;
        else return 1;
    }
    return 0;
}

This worked perfectly with the non-Unicode alphabet:

static $alphabet2 = array( 1 => '-' , 2 => ',' , 3 => '.' , 4 => "A", 5 => "j", 6 => "a", 7 => "w", 8 => "b", 9 => "p", 10 => "f", 11 => "m", 12 => "n", 13 => "r", 14 => "h", 15 => "H", 16 => "x", 17 => "X", 18 => "s", 19 => "S", 20 => "q", 21 => "k", 22 => "g", 23 => "t", 24 => "T", 25 => "d", 26 => "D", 27 => "=", 28 => "/", 29 => "(", 30 => ")", 31 => "[", 32 => "]", 33 => "<", 34 => ">", 35 => "{", 36 => "}", 37 => "'", 38 => "*", 39 => "#", 40 => "I", 41 => "1", 42 => "2", 43 => "3", 44 => "4", 45 => "5", 46 => "6", 47 => "7", 48 => "8", 49 => "9", 50 => "0", 51 => "&", 52 => "@", 53 => "%");

but once I replaced, for example, H (no. 15) in alphabet2 with ḥ in alphabet1, it didn't work anymore.

I suppose it has to do with recognizing the unicode, because as long as the words do not contain any special signs, the order is correct; but all words containing special signs are put at the beginning of the result.

I tried to look at unicode normalization; but I'm really only an amateur, so this is quite difficult.

Is this the problem or is there another problem and how can I fix it?


dqtu14636 2018-04-19 14:42

I've left all of my testing echoes in my code block and merely commented them out in case you wanted to see what is being generated throughout the process.

I took some liberties with your code. I didn't like one function calling another, so I condensed your lookup array into a string with a leading space. The space occupies offset 0, so every real character sits at the same position it has in your indexed array that starts from 1. Converting the lookup from an array to a string means I can use mb_strpos() instead of array_search().
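
As a minimal sketch of why the lookup needs mb_strpos() rather than strpos(): the shortened $alphabet below is for illustration only, and the "\u{...}" escapes assume PHP 7 or later with the mbstring extension loaded.

```php
<?php
// Shortened lookup string for illustration; the leading space occupies offset 0.
$alphabet = " -,.\u{021D}j\u{02BF}w"; // " -,.ȝjʿw"

// mb_strpos() counts characters, so "j" lands on 5, the same index it has
// in the question's 1-based array.
$charOffset = mb_strpos($alphabet, "j", 0, 'UTF-8');

// strpos() counts bytes; "ȝ" takes 2 bytes in UTF-8, so "j" is pushed to 6.
$byteOffset = strpos($alphabet, "j");

var_dump($charOffset, $byteOffset);
```

The two offsets drift further apart the more multibyte characters precede the one being searched for, which is why byte-based lookups give a wrong order.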

The crucial fix is in the loop, specifically where the letters are accessed with [$i]. That syntax returns a single byte, so you cannot use it on multibyte characters; you must use mb_substr() to get the "whole" letter. For the same reason, the loop length has to come from mb_strlen() rather than strlen().
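
A small sketch of the difference, using one key from the question's array. The variable names are made up for illustration, and the "\u{1E25}" escape (the precomposed "ḥ") assumes PHP 7 or later.

```php
<?php
// Sample key from the question, written with escapes so the code points are unambiguous.
$word = "n\u{1E25}\u{1E25}"; // "nḥḥ"

// strlen() counts bytes; mb_strlen() counts characters.
$byteLength = strlen($word);             // "n" is 1 byte, each "ḥ" is 3 bytes in UTF-8
$charLength = mb_strlen($word, 'UTF-8');

// $word[1] is only the first byte of "ḥ", so it can never match an alphabet entry.
$firstByte = $word[1];
// mb_substr() returns the complete character at character position 1.
$wholeChar = mb_substr($word, 1, 1, 'UTF-8');

var_dump($byteLength, $charLength);
var_dump($firstByte === "\u{1E25}");
var_dump($wholeChar === "\u{1E25}");
```

Because array_search() never finds a lone byte in the alphabet, it returns false, which compares as lower than every real index. That matches the symptom in the question: words with special signs are all pushed to the beginning.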

Giving $alphabet and $encoding default values in the function signature means you don't have to write a second "helper" function to pass in the necessary data. uksort() passes its expected two arguments, the defaults cover the rest, and everything goes ahead smoothly.

One final piece of advice is: mb_ functions are expensive, so always try to return in your code as soon as possible and leave the mb_ functions farther "downscript" whenever logically possible.

Here is my suggested code: (Demo)

function alphabetize_custom($a, $b, $alphabet = " -,.ȝjʿwbpfmnrhḥḫẖsšqkgtṯdḏ⸗/()[]<>{}'*#I0123456789&@%", $encoding = 'UTF-8') {
    //echo "\n----\n$a =vs= $b";
    $mb_length = max(mb_strlen($a, $encoding), mb_strlen($b, $encoding));
    for ($i = 0; $i < $mb_length; ++$i) {
        //echo "\n";
        $a_char = mb_substr($a, $i, 1, $encoding);
        $b_char = mb_substr($b, $i, 1, $encoding);
        //echo "$a_char -vs- $b_char\n";
        //echo "(" , mb_strlen($a_char, $encoding), " & ", mb_strlen($b_char, $encoding), ")\n";
        if ($a_char === $b_char) {/*echo "identical, continue";*/ continue;}
        if (!mb_strlen($a_char, $encoding)) { /* echo "a is empty -1";*/ return -1;}
        if (!mb_strlen($b_char, $encoding)) { /*echo "b is empty 1";*/ return 1;}
        $a_offset = mb_strpos($alphabet, $a_char, 0, $encoding);
        $b_offset = mb_strpos($alphabet, $b_char, 0, $encoding);
        //echo "[" , $a_offset, " & ", $b_offset, "]\n";
        if ($a_offset == $b_offset) { /*echo "== offsets, continue";*/ continue;}
        if ($a_offset < $b_offset) { /*echo "a offset -1";*/ return -1;}
        //echo "b offset 1";
        return 1;
    }
    //echo "0";
    return 0;
}

$result = [
    "nṯr" => ["Ka.C.Coptite.urkVIII,176b", "Ka.C.Coptite.urkVIII,177,1"],
    "n" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,2"],
    "nḫȝḫȝ" => ["Ka.C.Coptite.urkVIII,176c"],
    "nwj" => ["Ka.C.Coptite.urkVIII,176c"],
    "nfr" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,2"],
    "nḥḥ" => ["Ka.C.Coptite.urkVIII,176e", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,1"],
    "nḏ" => ["Ka.C.Coptite.urkVIII,177,1"]
];

uksort($result, 'alphabetize_custom');

var_export($result);

Output:

array (
  'n' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176c',
    1 => 'Ka.C.Coptite.urkVIII,177,1',
    2 => 'Ka.C.Coptite.urkVIII,177,2',
  ),
  'nwj' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176c',
  ),
  'nfr' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176c',
    1 => 'Ka.C.Coptite.urkVIII,177,2',
  ),
  'nḥḥ' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176e',
    1 => 'Ka.C.Coptite.urkVIII,177,1',
    2 => 'Ka.C.Coptite.urkVIII,177,1',
  ),
  'nḫȝḫȝ' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176c',
  ),
  'nṯr' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176b',
    1 => 'Ka.C.Coptite.urkVIII,177,1',
  ),
  'nḏ' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,177,1',
  ),
)

Just for comparison's sake, I wrote an alternative code block that uses array_search() as your original code does. Not surprisingly, it appears to be more efficient according to the speed tests on 3v4l.org. This is likely due to the removal of four mb_ calls inside the loop (two mb_strlen() and two mb_strpos()), which I previously mentioned to be "expensive". The following snippet provides the same output.

Code: (Demo)

function alphabetize_custom($a, $b) {
    $alphabet = [' ', '-', ',', '.', 'ȝ', 'j', 'ʿ', 'w', 'b', 'p', 'f', 'm', 'n', 'r', 'h', 'ḥ', 'ḫ', 'ẖ', 's', 'š', 'q', 'k', 'g', 't', 'ṯ', 'd', 'ḏ', '⸗', '/', '(', ')', '[', ']', '<', '>', '{', '}', "'", '*', '#', 'I', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '&', '@', '%'];
    unset($alphabet[0]);  // removes dummy first key, effectively starting the keys from 1
    $encoding = 'UTF-8';

    $mb_length = max(mb_strlen($a, $encoding), mb_strlen($b, $encoding));
    for ($i = 0; $i < $mb_length; ++$i) {
        $a_char = mb_substr($a, $i, 1, $encoding);
        $b_char = mb_substr($b, $i, 1, $encoding);
        if ($a_char === $b_char) continue;

        $a_key = array_search($a_char, $alphabet);
        $b_key = array_search($b_char, $alphabet);
        if ($a_key === $b_key) continue;

        return $a_key - $b_key;
    }
    return 0;
}

$result = [
    "nṯr" => ["Ka.C.Coptite.urkVIII,176b", "Ka.C.Coptite.urkVIII,177,1"],
    "n" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,2"],
    "nḫȝḫȝ" => ["Ka.C.Coptite.urkVIII,176c"],
    "nwj" => ["Ka.C.Coptite.urkVIII,176c"],
    "nfr" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,2"],
    "nḥḥ" => ["Ka.C.Coptite.urkVIII,176e", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,1"],
    "nḏ" => ["Ka.C.Coptite.urkVIII,177,1"]
];

uksort($result, 'alphabetize_custom');

var_export($result);
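
A further variation, offered only as a sketch and not part of the tested code above: split each key into characters once and look ranks up in a flipped array, so no mb_ call runs inside the loop. The helper name make_alphabet_comparator() is hypothetical, the array values are placeholders, and sorting unknown characters last is an assumption (the snippets above sort them first). It assumes PHP 7 or later and precomposed characters.

```php
<?php
// Hypothetical helper: builds a comparator for uksort() from an alphabet string.
function make_alphabet_comparator(string $alphabet): callable
{
    // Split the alphabet into characters once (the "u" modifier makes PCRE treat
    // the subject as UTF-8), then flip it into a character => rank map.
    $rank = array_flip(preg_split('//u', $alphabet, -1, PREG_SPLIT_NO_EMPTY));

    return function ($a, $b) use ($rank): int {
        $aChars = preg_split('//u', (string) $a, -1, PREG_SPLIT_NO_EMPTY);
        $bChars = preg_split('//u', (string) $b, -1, PREG_SPLIT_NO_EMPTY);
        $shared = min(count($aChars), count($bChars));

        for ($i = 0; $i < $shared; ++$i) {
            if ($aChars[$i] === $bChars[$i]) {
                continue;
            }
            // Assumption: characters missing from the alphabet sort after all known ones.
            $aRank = $rank[$aChars[$i]] ?? PHP_INT_MAX;
            $bRank = $rank[$bChars[$i]] ?? PHP_INT_MAX;
            if ($aRank !== $bRank) {
                return $aRank <=> $bRank;
            }
        }
        // One key is a prefix of the other: the shorter key comes first.
        return count($aChars) <=> count($bChars);
    };
}

// Placeholder values; only the keys matter for the sort.
$result = ["nṯr" => 1, "n" => 2, "nḫȝḫȝ" => 3, "nwj" => 4, "nfr" => 5, "nḥḥ" => 6, "nḏ" => 7];
uksort($result, make_alphabet_comparator("-,.ȝjʿwbpfmnrhḥḫẖsšqkgtṯdḏ⸗/()[]<>{}'*#I0123456789&@%"));
print_r(array_keys($result));
```

With the keys from the question this should give the same order as the output shown earlier: n, nwj, nfr, nḥḥ, nḫȝḫȝ, nṯr, nḏ.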
