I'm prepping a cURL response so that particular data can be inserted into a MySQL table.
I noticed some special characters in the saved data for certain URLs.
$curldata = curl_exec($curl);
$encoding = mb_detect_encoding($curldata);
This brought back ASCII encoding.
Okay, I don't want that.
The tables in my database are InnoDB with a utf8mb4_unicode_ci collation.
Added this to my curl options:
curl_setopt($curl, CURLOPT_ENCODING, 1);
And added an iconv call upon save, based on the $encoding value from mb_detect_encoding above:
$curldata = iconv($encoding, "UTF-8", $curldata);
// save to file to test output
file_put_contents('test.html', $curldata);
Not sure if this is the best way to go about this, but my test.html output no longer shows any encoding problems with special characters, so... (perhaps) mission accomplished.
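For context, ASCII is a subset of UTF-8, so a string that really is pure ASCII is already valid UTF-8 and converting it should leave it unchanged. A minimal sketch for checking validity directly instead of relying on detection; the helper name isValidUtf8 is hypothetical and not part of my actual script:

```php
<?php
// Hypothetical helper: returns true when $data is well-formed UTF-8.
// An empty pattern with the /u modifier makes PCRE validate the subject;
// preg_match() returns false (not 0) when the subject is malformed.
function isValidUtf8(string $data): bool
{
    return preg_match('//u', $data) === 1;
}

var_dump(isValidUtf8("plain ascii"));   // pure ASCII is also valid UTF-8
var_dump(isValidUtf8("8\u{FF0C}8"));    // contains U+FF0C, still valid UTF-8
var_dump(isValidUtf8("\xFF\xFE"));      // these bytes cannot occur in UTF-8
```

If the check passes, the response could be stored as-is and the iconv step would only be needed for responses that fail it.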
As I parse through the data, I then notice this character:
，
Not an ordinary comma... [Comparison: ，/,]
But it acts like one. Try doing a ctrl+f and searching for a comma.
It treats them as the same, and both are reported as UTF-8: var_dump(mb_detect_encoding('，'));
I look at my table row, and see it was inserted as such:
8，8
If I search for a , it does indeed bring back the instances where ， is present.
Vice versa, if I search for ， it brings back all instances where either that character or an ordinary comma occurs.
Basically for all intents and purposes it is a comma, yet obviously isn't.
This is of course workable, but rather annoying and feels riddled with inconsistency.
Can anyone explain why the two commas are the same, yet obviously different?
Is there a solution for me to prevent these odd characters from entering my cURL response, or further in, within my DOM handling and PDO insert?
edit:
If relevant,
// dom
$dom = new DOMDocument('1.0', 'utf-8');
libxml_use_internal_errors(true);
$dom->preserveWhiteSpace = FALSE;
$dom->loadHTML(mb_convert_encoding($curldata, 'HTML-ENTITIES', 'UTF-8'));
// pdo
$pdoquery = "INSERT INTO `table` (`Attr`) VALUES (?)";
$value = "8，8";
$stmt = $pdo->prepare($pdoquery);
$stmt->execute([$value]);
edit 2:
Well, it appears to be a FULLWIDTH COMMA (U+FF0C)...
var_dump(utf8_to_unicode('，'));
string '%uff0c' (length=6)
var_dump(utf8_to_unicode(','));
string '%2c' (length=3)
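utf8_to_unicode() above is not a PHP built-in. A dependency-free way to see the same difference is to dump the raw bytes with bin2hex(); this is only a sketch, and the variable names are illustrative:

```php
<?php
$fullwidth = "\u{FF0C}"; // FULLWIDTH COMMA, written as an escape (PHP 7+)
$ascii     = ",";        // ordinary comma, U+002C

echo bin2hex($fullwidth), "\n"; // efbc8c (three bytes in UTF-8)
echo bin2hex($ascii), "\n";     // 2c (one byte)

// PHP compares strings byte by byte, so the two are not equal here.
var_dump($fullwidth === $ascii); // bool(false)
```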
Starting to make more sense... now to figure out how to prevent such characters from entering the curl response/DOM/database...
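One possible approach is to fold such characters to ASCII before the insert. The sketch below assumes it is acceptable for this data to replace every fullwidth ASCII variant (U+FF01 to U+FF5E) with its plain counterpart; the function name foldFullwidthAscii is hypothetical, and it uses no extensions:

```php
<?php
// Hypothetical helper: fold fullwidth ASCII variants (U+FF01..U+FF5E)
// to their plain ASCII counterparts (U+0021..U+007E).
function foldFullwidthAscii(string $text): string
{
    static $map = null;
    if ($map === null) {
        $map = [];
        for ($cp = 0xFF01; $cp <= 0xFF5E; $cp++) {
            // Encode the code point as a 3-byte UTF-8 sequence by hand.
            $utf8 = chr(0xE0 | ($cp >> 12))
                  . chr(0x80 | (($cp >> 6) & 0x3F))
                  . chr(0x80 | ($cp & 0x3F));
            // The ASCII counterpart sits exactly 0xFEE0 below.
            $map[$utf8] = chr($cp - 0xFEE0);
        }
    }
    // strtr() with an array replaces each matching byte sequence.
    return strtr($text, $map);
}

$value = foldFullwidthAscii("8\u{FF0C}8");
var_dump($value === "8,8"); // bool(true)
```

If the intl extension is available, Normalizer::normalize($text, Normalizer::FORM_KC) would be a broader alternative, since NFKC also folds other compatibility characters; whether that wider folding is wanted depends on the data.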