Unable to read Unicode (Japanese) from a file
Hi, I have a file containing Japanese text, saved as a Unicode file.
I need to read from the file and display the information on standard output.
I am using Visual Studio 2008.
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main()
{
    wstring line;
    wifstream myfile("D:\sample.txt"); //file containing japanese characters, saved as unicode file
    //myfile.imbue(locale("Japanese_Japan"));
    if(!myfile)
        cout << "While opening a file an error is encountered" << endl;
    else
        cout << "File is successfully opened" << endl;
    //wcout.imbue (locale("Japanese_Japan"));
    while ( myfile.good() )
    {
        getline(myfile,line);
        wcout << line << endl;
    }
    myfile.close();
    system("PAUSE");
    return 0;
}
This program generates some random output, and I don't see any Japanese text on the screen.
Comments (4)
Oh boy. Welcome to the Fun, Fun world of character encodings.
The first thing you need to know is that your console is not Unicode on Windows. The only way you'll ever see Japanese characters in a console application is if you set your non-Unicode (ANSI) locale to Japanese, which will also make backslashes look like yen symbols and break paths containing European accented characters for programs using the ANSI Windows API (which was supposed to have been deprecated when Windows XP came around, but people still use to this day...)
So first thing you'll want to do is build a GUI program instead. But I'll leave that as an exercise to the interested reader.
Second, there are a lot of ways to represent text. You first need to figure out the encoding in use. Is it UTF-8? UTF-16 (and if so, little or big endian?) Shift-JIS? EUC-JP? You can only use a wstream to read directly if the file is in little-endian UTF-16. And even then you need to futz with its internal buffer. Anything other than UTF-16 and you'll get unreadable junk. And this is all only the case on Windows as well! Other OSes may have a different wstream representation. It's best not to use wstreams at all really.
So, let's assume it's not UTF-16 (for full generality). In this case you must read it as a char stream, not using a wstream. You must then convert this character string into UTF-16 (assuming you're using Windows! Other OSes tend to use UTF-8 char*s). On Windows this can be done with MultiByteToWideChar. Make sure you pass in the right code page value; CP_ACP or CP_OEMCP are almost always the wrong answer.
Now, you may be wondering how to determine which code page (i.e., character encoding) is correct. The short answer is you don't. There is no prima facie way of looking at a text string and saying which encoding it is. Sure, there may be hints, e.g., if you see a byte order mark, chances are it's whatever variant of Unicode makes that mark. But in general, you have to be told by the user, or make an attempt to guess, relying on the user to correct you if you're wrong, or you have to select a fixed character set and not attempt to support any others.
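The "read it as a char stream" step can be sketched roughly as follows. This is only an illustration, not code from the answer: read_all_bytes is a hypothetical helper name, and the sketch stops before any conversion because the right code page has to come from the user.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: return the file's contents as raw, undecoded bytes.
std::string read_all_bytes(const std::string& path)
{
    assert(!path.empty() && "caller must supply a file name");
    // Binary mode: no newline translation, so every byte arrives unchanged.
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);
    std::ostringstream bytes;
    bytes << in.rdbuf();  // copy the whole stream buffer in one go
    return bytes.str();
}
```

The returned bytes are what would then be handed to MultiByteToWideChar, together with whatever code page was chosen (932 for Shift-JIS, for example).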
Someone here had the same problem with Russian characters (he's using basic_ifstream<wchar_t>, which should be the same as wifstream according to this page). In the comments of that question they also link to this, which should help you further.
If I understood everything correctly, it seems that wifstream reads the characters correctly but your program tries to convert them to whatever locale your program is running in.
Two errors:
And do not mix cout and wcout.
Also check that your file is encoded in UTF-16, Little-Endian. If it is not, you will have trouble reading it.
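One cheap way to check is to look at the first bytes of the file for a byte order mark. The sketch below is an illustration only: the Bom enum and detect_bom are hypothetical names, and a missing mark does not tell you what the encoding is.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical names for this sketch; none of them come from the answer above.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Report which byte order mark, if any, the buffer starts with.
// Bom::None proves nothing: the data may still be UTF-8, Shift-JIS, EUC-JP...
// A UTF-32 little-endian mark (FF FE 00 00) would be reported as Utf16LE here.
Bom detect_bom(const std::string& bytes)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(bytes.data());
    const std::size_t n = bytes.size();
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return Bom::Utf8;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return Bom::Utf16LE;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return Bom::Utf16BE;
    return Bom::None;
}

// Small usage check on hand-written byte sequences.
void check_detect_bom()
{
    assert(detect_bom(std::string("\xFF\xFE\x42\x30", 4)) == Bom::Utf16LE);
    assert(detect_bom("\xEF\xBB\xBF" "abc") == Bom::Utf8);
    assert(detect_bom("plain") == Bom::None);
}
```

Only the Utf16LE result matches what this answer asks for; anything else means the file needs converting before a wide stream can make sense of it.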
wfstream uses wfilebuf for the actual reading and writing of the data. wfilebuf defaults to using a char buffer internally which means that the text in the file is assumed narrow, and converted to wide before you see it. Since the text was actually wide, you get a mess.
The solution is to replace the wfilebuf buffer with a wide one.
You probably also need to open the file as binary.
Make sure the buffer outlives the stream object!
See details here: http://msdn.microsoft.com/en-us/library/tzf8k3z8(v=VS.80).aspx
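For comparison, here is a different, portable route to the same result. It does not replace the wfilebuf buffer as described above; it assumes the file has already been read as raw bytes in binary mode and is UTF-16 little-endian, and it pairs the bytes up by hand. The helper name utf16le_to_units is hypothetical, and std::u16string needs C++11, which Visual Studio 2008 does not provide.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: combine UTF-16 little-endian bytes into 16-bit code units.
// Surrogate pairs are left as two units, exactly as UTF-16 stores them.
std::u16string utf16le_to_units(const std::string& bytes)
{
    assert(bytes.size() % 2 == 0 && "UTF-16 data must have an even byte count");
    std::u16string units;
    units.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned char lo = static_cast<unsigned char>(bytes[i]);
        const unsigned char hi = static_cast<unsigned char>(bytes[i + 1]);
        // Little-endian: the low byte comes first in the file.
        units.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    // Drop a leading byte order mark (U+FEFF) so it does not end up in the text.
    if (!units.empty() && units[0] == 0xFEFF)
        units.erase(0, 1);
    return units;
}
```

On Windows, where wchar_t is 16 bits wide, the resulting units can be copied into a std::wstring one for one.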