Non-ASCII charset as C strings
I am developing software with multi-language support. I have to use single-byte character sets, which means I cannot use UTF-8. My encodings are:
- ENG: ASCII
- UKR: KOI8-U
- ARA: ISO8859-6
- SPA: ISO8859-1
I use Notepad++ as my editor. When I receive the translation for a new language, I simply increase the array size and switch the encoding of the C file to that language's encoding. For example, this is how my array looks when the file is viewed in each encoding:
#define MAX_CHAR_PER_LINE 10
enum Langs {
    en,
    uk,
    es,
    ar,
    MAX_LANG
};
// ASCII
const char settingStr[][MAX_LANG][MAX_CHAR_PER_LINE] = {
    //...
    { "SETTINGS", "îáìáûôõ÷áîîñ", "AJUSTES", "ÇÙÏÇÏÇÊ" },
    //...
};
// KOI8-U
const char settingStr[][MAX_LANG][MAX_CHAR_PER_LINE] = {
    //...
    { "SETTINGS", "НАЛАШТУВАННЯ", "AJUSTES", "гыогогй" },
    //...
};
// ISO8859-6
const char settingStr[][MAX_LANG][MAX_CHAR_PER_LINE] = {
    //...
    { "SETTINGS", "ففََّ", "AJUSTES", "اعدادات" },
    //...
};
When I inspect the C file with a hex viewer, I can confirm that the byte values of the characters are correct for the specified encodings.
My problem is that I get compile warnings, and also logic errors at run time.
Sample code for OnlineGDB:
#include <stdio.h>
const char settingStr[][4][10] = {
    //...
    { "SETTINGS", "ففََّ", "ÇÙÏÇÏÇÊ", "AJUSTES" },
    //...
};
int main() {
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 10; j++)
            printf("0x%02X,", settingStr[0][i][j]);
        printf("\n");
    }
    return 0;
}
I suspect that the GCC preprocessor cannot parse these strings. Should I add some kind of compiler flag? I do not want to fill my array with hex values.
Comments (1)
The compiler error message is misleading: the string "ÇÙÏÇÏÇÊ" is probably encoded in UTF-8 (by your editor or some other tool during transmission) and uses 14 bytes (plus a null terminator). The compiler points to the error (the 6th character in the string), but the terminal supports UTF-8, so the 14 bytes of "ÇÙÏÇÏÇÊ" appear as only 7 characters, which misaligns the ^~~~~~~ marker printed on the next line. The other string, "ففََّ", is probably misencoded too, causing extra misalignment.
The problem is your editing environment: the translation was given back to you encoded in UTF-8, which is the de facto standard today. More precisely, it may have been encoded twice: first in the 1-byte ISO8859-6 encoding for Arabic, then re-encoded to UTF-8 as if it were ISO8859-1 by mistake.
You cannot easily mix different encodings in the same file. It is very confusing for everyone: the translator, the programmer, the compiler, the users...
Here are different options to avoid these issues:
- You should seriously reconsider the design choice and use UTF-8. The source code with all translations will be readable in all languages, which is safer and simpler to audit. Depending on the runtime environment, this might simplify or complicate the display.
- You could store the strings in a separate file for each translation, each encoded with the appropriate encoding, and retrieve them at run time. This is friendlier for translators but requires substantial changes in the software.
- You could encode the translated strings in ASCII with octal or hexadecimal escape sequences. This avoids re-encoding problems, as well as any compiler misinterpretation of the historic encodings used in Far-Eastern countries. You can use a small program to convert the strings to C source code.