dongyan3616 2019-03-07 03:39
Views: 95
Accepted

Different UTF-8 comma variants? [,] [,] - cURL response for MySQL data

I'm prepping a cURL response so that particular data can be inserted into a MySQL table.

I noticed some special characters in the saved data for certain URLs.

$curldata = curl_exec($curl);
$encoding = mb_detect_encoding($curldata);

brought back ASCII encoding.

Okay, don't want that.

The tables in my database are InnoDB tables with the utf8mb4_unicode_ci collation.

Added this to my curl options:

curl_setopt($curl, CURLOPT_ENCODING, 1);

And an iconv function based on the above mb_detect_encoding / $encoding variable upon save.

$curldata = iconv($encoding, "UTF-8", $curldata);

// save to file to test output
file_put_contents('test.html', $curldata);

Not sure if this is the best way to go about this, but my test.html output no longer has any encoding for special characters, so... (perhaps) mission accomplished.
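One way to make the conversion step depend less on a loose guess is to run mb_detect_encoding in strict mode against an explicit candidate list, and to convert only when the result is not already UTF-8. The following is a minimal sketch under the assumption that the mbstring extension is loaded; the helper name `to_utf8` and the default candidate list are hypothetical and would need adjusting to the sites being fetched.

```php
<?php
// Hypothetical helper: the name and the candidate list are assumptions,
// not part of the original question.
function to_utf8(string $raw, array $candidates = ['UTF-8', 'ISO-8859-1']): string
{
    // Strict mode only reports an encoding in which the bytes are valid.
    $encoding = mb_detect_encoding($raw, $candidates, true);

    if ($encoding === false || $encoding === 'UTF-8') {
        // Nothing detected, or already UTF-8: leave the bytes untouched.
        return $raw;
    }

    return mb_convert_encoding($raw, 'UTF-8', $encoding);
}
```

Plain ASCII input passes through unchanged, since ASCII bytes are already valid UTF-8, which is why a detection result of ASCII does not by itself call for a conversion.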

As I parse through the data, I then notice this character.

Not an ordinary comma... [Comparison: ,/,]

But it acts like one. Try a ctrl+f search for a comma: it treats the two as the same, and both are detected as UTF-8 characters - var_dump(mb_detect_encoding(','));

I look at my table row and see that it was inserted as

8,8

If I search for , it does indeed bring back the rows where , is present.

Vice versa, if I search for , it brings back all rows where either that character or an ordinary comma occurs.

Basically for all intents and purposes it is a comma, yet obviously isn't.

This is of course workable, but rather annoying and feels riddled with inconsistency.

Can anyone explain why the two commas are the same, yet obviously different?

Is there a way to prevent these odd characters from entering my cURL response, or further downstream in my DOM parsing and PDO insert?

edit:

If relevant,

// dom
$dom = new DOMDocument('1.0', 'utf-8');
libxml_use_internal_errors(true);
$dom->preserveWhiteSpace = FALSE;
$dom->loadHTML(mb_convert_encoding($curldata, 'HTML-ENTITIES', 'UTF-8'));

// pdo
$pdoquery = "INSERT INTO `table` (`Attr`) VALUES (?)";
$value = "8,8"; // contains the odd comma, not an ordinary one
$stmt = $pdo->prepare($pdoquery);
$stmt->execute([$value]);

edit 2:

Well, it appears to be a FULLWIDTH COMMA..

var_dump(utf8_to_unicode(','));

string '%uff0c' (length=6)

var_dump(utf8_to_unicode(','));

string '%2c' (length=3)

Starting to make more sense... now to figure out how to prevent such characters from entering the curl response/DOM/database...
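The difference between the two characters can also be checked with built-in functions alone. This is a small sketch, assuming PHP 7.2 or later with mbstring (for mb_ord); the helper name `describe_char` is hypothetical.

```php
<?php
// Hypothetical helper: formats a character's code point and its UTF-8 bytes.
function describe_char(string $char): string
{
    return sprintf('U+%04X bytes=%s', mb_ord($char, 'UTF-8'), bin2hex($char));
}

echo describe_char("\u{FF0C}"), PHP_EOL; // FULLWIDTH COMMA: U+FF0C bytes=efbc8c
echo describe_char(","), PHP_EOL;        // COMMA:           U+002C bytes=2c
```

The fullwidth comma takes three bytes in UTF-8 while the ordinary comma takes one, so the two are distinct strings in PHP even where a search treats them as equal.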

2 answers

  • dragon0023 2019-03-07 05:08

    You might want the function mb_convert_kana which can convert characters of different widths into a uniform width.

    $s = 'This is a string with ,, (commas having different widths)';
    
    echo 'original : ', $s, PHP_EOL;
    echo 'converted: ', mb_convert_kana($s, 'a');
    

    result:

    original : This is a string with ,, (commas having different widths)
    converted: This is a string with ,, (commas having different widths)
    

    PHP documentation: mb_convert_kana
    For background on what character width means, see also http://unicode.org/reports/tr11-2/

    By convention, 1/2 Em wide characters of East Asian legacy encodings are called "half-width" (or hankaku characters in Japanese), the others are called correspondingly "full-width" (or zenkaku) characters.
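Applied to the value from the question, the conversion could be wrapped in a small function that runs just before the PDO insert. This is a sketch only, assuming mbstring is available; the name `normalize_width` is hypothetical, and the encoding is passed explicitly so the result does not depend on the configured internal encoding.

```php
<?php
// Hypothetical helper: folds fullwidth ASCII variants to their halfwidth forms.
function normalize_width(string $text): string
{
    // 'a' converts fullwidth alphanumerics and most fullwidth ASCII
    // punctuation (including U+FF0C) to their halfwidth equivalents.
    return mb_convert_kana($text, 'a', 'UTF-8');
}

$value = normalize_width("8\u{FF0C}8"); // fullwidth comma between the digits
var_dump($value === '8,8');             // bool(true)
```

Note that the 'a' option also folds fullwidth letters and digits, which may not be wanted for East Asian text; if only the comma matters, a plain str_replace on that one character is the narrower choice.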

    Selected as the best answer by the asker.