Question:
As mentioned in the Notes section of the cppreference C reference page for the function c16rtomb:
In C11 as published, unlike mbrtoc16, which converts variable-width multibyte (such as UTF-8) to variable-width 16-bit (such as UTF-16) encoding, this function can only convert single-unit 16-bit encoding, meaning it cannot convert UTF-16 to UTF-8 despite that being the original intent of this function. This was corrected by the post-C11 defect report DR488.
Below this passage, the reference page provides example source code, introduced by the following sentence:
Note: this example assumes the fix for the defect report 488 is applied.
That phrase implies there is a way to take DR488 and somehow "apply" the fix to the C11 standard function c16rtomb.
I would like to know how to apply the fix for GCC, because it seems to me the fix has already been applied in Visual Studio 2017 Visual C++, as of v141.
The behavior seen in GCC, when debugging the code in GDB, is consistent with what was found in DR488, as follows:
Section 7.28.1 describes the function c16rtomb(). In particular, it states "When c16 is not a valid wide character, an encoding error occurs". "wide character" is defined in section 3.7.3 as "value representable by an object of type wchar_t, capable of representing any character in the current locale". This wording seems to imply that, e.g. for the common cases (e.g, an implementation that defines __STDC_UTF_16__ and a program that uses an UTF-8 locale), c16rtomb() will return -1 when it encounters a character that is encoded as multiple char16_t (for UTF-16 a wide character can be encoded as a surrogate pair consisting of two char16_t). In particular, c16rtomb() will not be able to process strings generated by mbrtoc16().
The boldfaced text is the behavior described.
Source code:
#include <stdio.h>
#include <uchar.h>

#define __STD_UTF_16__

int main() {
    char16_t* ptr_string = (char16_t*) u"我是誰";

    // C++ disallows variable-length arrays.
    // GCC uses GNU C++, which has a C++ extension for variable-length arrays.
    // It is not a truly standard feature in C++ pedantic mode at all.
    // https://stackoverflow.com/questions/40633344/variable-length-arrays-in-c14
    char buffer[64];
    char* bufferOut = buffer;

    // Must zero this object before attempting to use mbstate_t at all.
    mbstate_t multiByteState = {};

    // c16 = 16-bit characters, or char16_t typed characters
    // r = representation
    // tomb = to multibyte strings
    while (*ptr_string) {
        char16_t character = *ptr_string;
        size_t size = c16rtomb(bufferOut, character, &multiByteState);
        if (size == (size_t) -1)
            break;
        bufferOut += size;
        ptr_string++;
    }

    size_t bufferOutSize = bufferOut - buffer;
    printf("Size: %zu - ", bufferOutSize);
    for (int i = 0; i < bufferOutSize; i++) {
        printf("%#x ", +(unsigned char) buffer[i]);
    }

    // This statement is used to set a breakpoint. It does not do anything else.
    int debug = 0;
    return 0;
}
Output from Visual Studio:
Size: 9 - 0xe6 0x88 0x91 0xe6 0x98 0xaf 0xe8 0xaa 0xb0
Output from GCC:
Size: 0 -
Reposted from: https://stackoverflow.com/questions/53148386/c-unicode-how-do-i-apply-c11-standard-amendment-dr488-fix-to-c11-standard-funct