UTF-32 to UTF-8 converter in C: buffer is full of nulls/zeros
I've been trying forever to get this working. The program is supposed to take two arguments, one for the buffer size and another for a file name, and convert that file from UTF-32 to UTF-8. I've been using the fgetc() function to fill an int array with the Unicode code points. I've tested it by printing out the contents of my buffer, and it holds all these null characters instead of one code point per element.
For example, for a file consisting of only the character 'A':
buffer [0] is 0
buffer [1] is 0
buffer [2] is 0
buffer [3] is 41
The codepoints for anything above U+7F end up getting split apart.
Here is the code for initializing my buffer:
    int main(int argc, char** argv) {
        if (argc != 3) {
            printf("Must input a buffer size and a file name :D");
            return 0;
        }
        FILE* input = fopen(argv[2], "r");
        if (!input) {
            printf("The file %s does not exist.", argv[1]);
            return 0;
        } else {
            int bufferLimit = atoi(argv[1]);
            int buffer[bufferLimit];
            int charReplaced = 0;
            int fileEndReached = 0;
            int i = 0;
            int j = 0;
            while(1) {
                // fill the buffer with the characters from the file.
                for(i = 0; i < bufferLimit; i++){
                    buffer[i] = fgetc(input);
                    // if EOF reached, move onto next step and mark that
                    // it has finished.
                    if (buffer[i] == EOF) {
                        fileEndReached = 1;
                        break;
                    }
                }
                // output buffer of chars until EOF or end of buffer
                for(j = 0; j <= i; j++) {
                    if(buffer[j] == EOF) {
                        break;
                    }
                    // check for Character Replacements
                    charReplaced += !convert(buffer[j]);
                }
                if(fileEndReached != 0) {
                    break;
                }
            }
            //return a 1 if any Character Replacements were used
            if(charReplaced != 0) {
                return 1;
            }
        }
    }
Comments (1)
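fgetc() returns a single byte, not a Unicode code point. A UTF-32 file stores each code point in four bytes, so a file containing only 'A' (U+0041) is read as 00 00 00 41 in big-endian order, which is exactly what your buffer shows.
Everything built on that false assumption falls apart from there.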
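To make that concrete, here is a minimal sketch of reading whole code points instead of single bytes. It assumes the file is big-endian UTF-32 with no BOM (which is what the 00 00 00 41 output suggests) and that it is opened in binary mode ("rb"). The names read_utf32be and encode_utf8 are hypothetical helpers made up for this example; they are not part of the question's code and are not necessarily how its convert() works.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: read one UTF-32 code unit, ASSUMING big-endian
 * byte order and no BOM. Returns 1 on success, 0 on EOF or if fewer
 * than four bytes remain. */
int read_utf32be(FILE *in, uint32_t *cp)
{
    unsigned char b[4];
    if (fread(b, 1, sizeof b, in) != sizeof b)
        return 0;
    *cp = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
          ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    return 1;
}

/* Hypothetical helper: encode one code point as UTF-8 into out, which
 * must have room for 4 bytes. Returns the number of bytes written, or
 * 0 if cp is a surrogate or lies above U+10FFFF. */
size_t encode_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not valid */
        return 0;
    if (cp < 0x10000) {                   /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

In the question's fill loop, the fgetc() call would be replaced by something like read_utf32be(input, &cp), with the loop stopping when it returns 0 rather than comparing a buffer element against EOF. If the file turns out to be little-endian or starts with a BOM, the byte assembly has to change accordingly.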