UTF-8, CString and CFile? (C++, MFC)

Posted 2024-08-22 19:44:21

I'm currently working on an MFC program that has to work specifically with UTF-8. At some point I have to write UTF-8 data to a file; to do that, I'm using CFile and CString.

When I write UTF-8 data (Russian characters, to be more precise) to a file, the output looks like

Ðàñïå÷àòàíî:
Ñèñòåìà
Ïðîèçâîäñòâî

and so on. This is certainly not UTF-8. To read this data properly I have to change my system settings; switching the encoding table used for non-ASCII characters to a Russian one does work, but then all my Latin-based non-ASCII characters fail.
Anyway, here is how I write the file:

CFile CSVFile( m_sCible, CFile::modeCreate|CFile::modeWrite);
CString sWorkingLine;
//Add stuff into sWorkingline
CSVFile.Write(sWorkingLine,sWorkingLine.GetLength());
//Clean sWorkingline and start over

Am I missing something? Shall I use something else instead? Is there some kind of catch I've missed?
I'll be tuned in for your wisdom and experience, fellow programmers.

EDIT:
Of course, right after asking the question, I finally found something that might be interesting, which can be found here. Thought I might share it.

EDIT 2:

Okay, so I added the BOM to my file, which now shows Chinese characters, probably because I didn't convert my line to UTF-8. To add the BOM I did...

char BOM[3]={0xEF, 0xBB, 0xBF};
CSVFile.Write(BOM,3);

And after that, I added...

    TCHAR TestLine;
    //Convert the line to UTF-8 multibyte.
    WideCharToMultiByte (CP_UTF8,0,sWorkingLine,sWorkingLine.GetLength(),TestLine,strlen(TestLine)+1,NULL,NULL);
    //Add the line to file.
    CSVFile.Write(TestLine,strlen(TestLine)+1);

But then I cannot compile, as I don't really know how to get the length of TestLine. strlen doesn't seem to accept TCHAR.
Fixed: I used a static length of 1000 instead.

EDIT 3:

So, I added this code...

    wchar_t NewLine[1000];
    wcscpy( NewLine, CT2CW( (LPCTSTR) sWorkingLine ));
    TCHAR* TCHARBuf = new TCHAR[1000];

    //Convert the line to UTF-8 multibyte.
    WideCharToMultiByte (CP_UTF8,0,NewLine,1000,TCHARBuf,1000,NULL,NULL);

    //Find how many characters we have to add
    size_t size = 0;
    HRESULT hr = StringCchLength(TCHARBuf, MAX_PATH, &size);

    //Add the line to the file
    CSVFile.Write(TCHARBuf,size);

It compiles fine, but when I look at my new file, it's exactly the same as when I didn't have all this new code (e.g. Ðàñïå÷àòàíî:). It feels like I haven't taken a single step forward, although I guess only a small thing separates me from victory.

EDIT 4:

I removed previously added code, as Nate asked, and I decided to use his code instead, meaning that now, when I get to add my line, I have...

        CT2CA outputString(sWorkingLine, CP_UTF8);

    //Add line to file.
    CSVFile.Write(outputString,::strlen(outputString));

Everything compiles fine, but the Russian characters are shown as ???????. Getting closer, but still not there.
By the way, I'd like to thank everyone who has tried to help me; it is MUCH appreciated. I've been stuck on this for a while now, and I can't wait for this problem to be gone.

FINAL EDIT (I hope)
The way I originally obtained my UTF-8 characters was wrong for my new way of outputting the text (I was re-encoding them without realizing it). After changing that, I got acceptable results. Adding the UTF-8 BOM at the beginning of the file lets other programs, such as Excel, read it as Unicode.

Hurray! Thank you everyone!

素手挽清风 2024-08-29 19:44:21

When you output data you need to do the following (this assumes you are compiling in Unicode mode, which is highly recommended):

CString russianText = L"Привет мир";

CFile yourFile(_T("yourfile.txt"), CFile::modeWrite | CFile::modeCreate);

CT2CA outputString(russianText, CP_UTF8);
yourFile.Write(outputString, ::strlen(outputString));
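To make the conversion step less of a black box, here is a portable sketch in standard C++ of what a UTF-16 to UTF-8 conversion does. It is not the ATL implementation: `append_utf8` and `utf16_to_utf8` are names made up for this illustration, and `std::u16string` merely stands in for a Unicode-build CString.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append one Unicode code point to `out` as UTF-8 (1 to 4 bytes).
void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert UTF-16 to UTF-8. Unpaired surrogates become U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate without a preceding high one
        }
        append_utf8(out, cp);
    }
    return out;
}
```

For "Привет мир" this turns 10 UTF-16 code units into 19 bytes, which is why the length passed to Write has to be the byte length of the converted string (the ::strlen call above) and not the character count of the CString.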

If _UNICODE is not defined (you are working in multi-byte mode instead), you need to know which code page your input text is in and convert it to something you can use. This example takes Russian text in UTF-16 format and saves it as UTF-8:

// Example 1: convert from Russian text in UTF-16 (note the "L"
// in front of the string), into UTF-8.
CW2A russianTextAsUtf8(L"Привет мир", CP_UTF8);
yourFile.Write(russianTextAsUtf8, ::strlen(russianTextAsUtf8));

More likely, your Russian text is in some other code page, such as KOI8-R. In that case, you need to convert from the other code page into UTF-16, and then convert the UTF-16 into UTF-8. You cannot convert directly from KOI8-R to UTF-8 using the conversion macros because they always try to convert narrow text to the system code page. So the easy way is to do this:

// Example 2: convert from Russian text in KOI8-R (code page 20866)
// to UTF-16, and then to UTF-8. Conversions between UTFs are
// lossless.
CA2W russianTextAsUtf16("\xf0\xd2\xc9\xd7\xc5\xd4 \xcd\xc9\xd2", 20866);
CW2A russianTextAsUtf8(russianTextAsUtf16, CP_UTF8);
yourFile.Write(russianTextAsUtf8, ::strlen(russianTextAsUtf8));
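The two-step route (legacy code page to UTF-16, then UTF-16 to UTF-8) can also be sketched in portable C++ to show what each step does. Everything here is an assumption for illustration: the function names are invented, and the table only covers ASCII plus the KOI8-R letters at bytes 0xC0..0xFF (every other byte, including ё/Ё and the box-drawing characters, is mapped to U+FFFD). In real MFC code, CA2W with code page 20866 as shown above does the full job.

```cpp
#include <cassert>
#include <string>

// Unicode code points for KOI8-R bytes 0xC0..0xDF (lowercase Cyrillic).
// Bytes 0xE0..0xFF hold the same letters in uppercase, and in Unicode
// each uppercase letter sits exactly 0x20 below its lowercase form.
static const char16_t kKoi8rLower[32] = {
    0x044E, 0x0430, 0x0431, 0x0446, 0x0434, 0x0435, 0x0444, 0x0433,
    0x0445, 0x0438, 0x0439, 0x043A, 0x043B, 0x043C, 0x043D, 0x043E,
    0x043F, 0x044F, 0x0440, 0x0441, 0x0442, 0x0443, 0x0436, 0x0432,
    0x044C, 0x044B, 0x0437, 0x0448, 0x044D, 0x0449, 0x0447, 0x044A,
};

// Step 1: KOI8-R bytes -> UTF-16 (partial table, see the note above).
std::u16string koi8r_to_utf16(const std::string& in) {
    std::u16string out;
    for (unsigned char b : in) {
        if (b < 0x80) {
            out += static_cast<char16_t>(b);                             // ASCII
        } else if (b >= 0xE0) {
            out += static_cast<char16_t>(kKoi8rLower[b - 0xE0] - 0x20);  // uppercase
        } else if (b >= 0xC0) {
            out += kKoi8rLower[b - 0xC0];                                // lowercase
        } else {
            out += u'\uFFFD';  // not covered by this simplified table
        }
    }
    return out;
}

// Step 2: UTF-16 -> UTF-8, for text without surrogate pairs only.
std::string bmp_utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (char16_t c : in) {
        assert(c < 0xD800 || c > 0xDFFF);  // surrogates are out of scope here
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

Fed the same KOI8-R bytes as Example 2, step 1 yields the UTF-16 text "Привет мир" and step 2 yields its 19-byte UTF-8 form.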

You don't need a BOM (it's optional; I wouldn't use it unless there was a specific reason to do so).
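If you do decide you need one (some programs, Excel among them, rely on it to detect UTF-8), the BOM is just the three bytes EF BB BF written before any text. The following is a minimal sketch using standard C++ streams instead of CFile; the helper names and the file-handling details are assumptions for illustration. Storing the BOM as a string literal also sidesteps the narrowing complaints some compilers may raise for `char BOM[3]={0xEF, 0xBB, 0xBF};`.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// The UTF-8 byte order mark: EF BB BF.
static const char kUtf8Bom[] = "\xEF\xBB\xBF";

// Write already-encoded UTF-8 bytes to `path`, optionally preceded by a BOM.
bool write_utf8_file(const std::string& path, const std::string& utf8, bool with_bom) {
    assert(!path.empty());
    // Binary mode keeps the runtime from translating any bytes.
    std::ofstream file(path, std::ios::binary | std::ios::trunc);
    if (!file) return false;
    if (with_bom) file.write(kUtf8Bom, 3);
    file.write(utf8.data(), static_cast<std::streamsize>(utf8.size()));
    file.flush();
    return file.good();
}

// Read a whole file back as raw bytes.
std::string read_file_bytes(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    std::string bytes;
    char c;
    while (file.get(c)) bytes += c;
    return bytes;
}

// True if `bytes` starts with the UTF-8 BOM.
bool has_utf8_bom(const std::string& bytes) {
    return bytes.compare(0, 3, kUtf8Bom, 3) == 0;
}
```

The BOM is written once, at the very start of the file, and the text that follows must already be UTF-8; a BOM in front of text in another encoding only mislabels it.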

Make sure you read this: http://msdn.microsoft.com/en-us/library/87zae4a3(VS.80).aspx. If you incorrectly use CT2CA (for example, using the assignment operator) you will run into trouble. The linked documentation page shows examples of how to use and how not to use it.

Further information:

The C in CT2CA indicates const. I use it when possible, but some conversions only support the non-const version (e.g. CW2A).
The T in CT2CA indicates that you are converting from an LPCTSTR. Thus it compiles whether or not your code is built with the _UNICODE flag (in a multi-byte build the source is already narrow text, so see Example 2 for converting it). You could also use CW2A (where W indicates wide characters).
The A in CT2CA indicates that you are converting to an "ANSI" (8-bit char) string.
Finally, the second parameter to CT2CA indicates the code page you are converting to.

To do the reverse conversion (from UTF-8 to LPCTSTR), you could do:

// russianTextAsUtf8 holds UTF-8 bytes, as in the examples above.
CString myString(CA2CT(russianTextAsUtf8, CP_UTF8));

In this case, we are converting from an "ANSI" string in UTF-8 format, to an LPCTSTR. The LPCTSTR is always assumed to be UTF-16 (if _UNICODE is defined) or the current system code page (if _UNICODE is not defined).
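For completeness, here is a portable sketch of the reverse direction, decoding UTF-8 bytes into UTF-16. It is an illustration under stated assumptions, not what CA2CT does internally: `utf8_to_utf16` is an invented name, malformed or truncated input is replaced with U+FFFD, and overlong encodings are not rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-16. Malformed or truncated sequences yield U+FFFD.
// Simplification: overlong encodings are accepted rather than rejected.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out += u'\uFFFD'; ++i; continue; }  // invalid lead byte
        ++i;

        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;  // missing or invalid continuation byte
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }

        if (!ok || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out += u'\uFFFD';
        } else if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            // Outside the BMP: emit a surrogate pair.
            cp -= 0x10000;
            assert(cp <= 0xFFFFF);
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

Note that the input length is counted in bytes while the output length is counted in 16-bit code units, so the two generally differ for non-ASCII text.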
