C#: How can I determine whether my encoding is correct?
I am quite new to files, streams and different code pages.
Here is my problem:
I receive text files; some of them were created with the code page Windows-1252, some are still IBM850, and some are UTF-8. When I import them, my database shows all kinds of symbols in place of ä, ö, ü, ß, because I read them with the wrong code page. Only when I import them with the right code page does everything work fine.
This is what I thought would be a good approach:
Convert ä, ö, ü, ß to byte arrays using a code page X,
e.g.:
byte[] myAeKl = Encoding.GetEncoding("IBM850").GetBytes("ä");
byte[] myAeGr = Encoding.GetEncoding("IBM850").GetBytes("Ä");
Then go through the text file and compare the bytes of each letter with the arrays above.
If a match is found, use that code page; otherwise try another code page.
This is what I don't understand:
How can I compare the bytes of the letters in the text file with the byte arrays of the letters I am looking for?
E.g.:
if (Textfile.Letter == myAeKl || Textfile.Letter == myAeGr)
...
Is there any other way to get the right code page?
Do I have the right approach to the solution?
Comments (2)
There isn't a foolproof method, unfortunately, since a certain stream of bytes can be meaningful in more than one encoding.
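To make the ambiguity concrete, here is a small sketch that decodes the same single byte under two of the code pages from the question. It assumes modern .NET (.NET Core 3.0 or later, .NET 5+), where legacy code pages have to be registered through CodePagesEncodingProvider first; on .NET Framework they are available without that call. The class and method names are made up for illustration.

```csharp
using System;
using System.Text;

public static class AmbiguityDemo
{
    // Decodes the given bytes using the code page with the given number.
    public static string DecodeAs(int codePage, byte[] data)
    {
        // Assumption: modern .NET, where code pages such as 850 and 1252
        // are only available after this provider has been registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage).GetString(data);
    }
}
```

Calling AmbiguityDemo.DecodeAs(1252, new byte[] { 0xE4 }) yields "ä", while AmbiguityDemo.DecodeAs(850, new byte[] { 0xE4 }) yields "õ". Both results are valid text, so the byte alone cannot tell you which code page was intended.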
One way is to use guesswork and heuristics based on other business data. Can you infer the encoding from the file name, or from some other metadata, such as the sender's name? If so, try to filter on that.
If not, you can try digging and guessing. If the files can be large, as you say, just peek at a sample of the text (say, the first 512 bytes; that should be enough). Do you have any way of guessing what the content might be? Is it free text in English, Hebrew, or some other language? If so, look for common words in the 512-byte sample. Do the files have a fixed format? If so, look for it. Then run these tests on live samples, review the results, tweak the tests, and try again until you have a relatively good chance of recognizing the encoding.
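For the three encodings in the question, the sampling idea could look roughly like the sketch below. It is only a heuristic under stated assumptions: the text is German, so ä, ö, ü, Ä, Ö, Ü, ß are the "common" non-ASCII letters, and in both single-byte code pages each of them is exactly one byte, so the comparison is a plain byte-by-byte check. The class name, method names, and returned labels are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingGuesser
{
    // Byte values of ä ö ü Ä Ö Ü ß in each single-byte code page.
    private static readonly byte[] Cp850Umlauts  = { 0x84, 0x94, 0x81, 0x8E, 0x99, 0x9A, 0xE1 };
    private static readonly byte[] Cp1252Umlauts = { 0xE4, 0xF6, 0xFC, 0xC4, 0xD6, 0xDC, 0xDF };

    // Reads at most sampleSize bytes from the start of the file.
    public static byte[] ReadSample(string path, int sampleSize = 512)
    {
        using (var stream = File.OpenRead(path))
        {
            var buffer = new byte[sampleSize];
            int total = 0, read;
            while (total < sampleSize &&
                   (read = stream.Read(buffer, total, sampleSize - total)) > 0)
            {
                total += read;
            }
            Array.Resize(ref buffer, total);
            return buffer;
        }
    }

    // Returns "ascii", "utf-8", "ibm850", "windows-1252" or "unknown".
    public static string Guess(byte[] sample)
    {
        // Pure ASCII reads the same in all three encodings.
        if (Array.TrueForAll(sample, b => b < 0x80)) return "ascii";

        if (IsValidUtf8(sample)) return "utf-8";

        // Count how often the sample hits each code page's umlaut bytes.
        int cp850 = 0, cp1252 = 0;
        foreach (byte b in sample)
        {
            if (Array.IndexOf(Cp850Umlauts, b) >= 0) cp850++;
            if (Array.IndexOf(Cp1252Umlauts, b) >= 0) cp1252++;
        }
        if (cp850 > cp1252) return "ibm850";
        if (cp1252 > cp850) return "windows-1252";
        return "unknown";
    }

    private static bool IsValidUtf8(byte[] data)
    {
        var strict = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            // flush: false tolerates a multi-byte sequence that was cut off
            // at the end of the sample. Known limitation: a sample whose only
            // non-ASCII byte is a lone lead byte at the very end passes too.
            strict.GetDecoder().GetCharCount(data, 0, data.Length, flush: false);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

A call such as EncodingGuesser.Guess(EncodingGuesser.ReadSample(path)) gives a first guess. An "unknown" result means the sample contained non-ASCII bytes but no clear majority, and the file needs a larger sample or a manual look.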
Good luck!
I would try to load the file with one encoding and, if I encounter unexpected characters, load it with the other one.
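One possible reading of this suggestion, sketched below, is to try UTF-8 in strict mode first and treat an invalid byte sequence as the "unexpected characters" signal. That interpretation, the class name, and the method name are assumptions for illustration; the fallback encoding is passed in by the caller.

```csharp
using System;
using System.Text;

public static class FallbackDecoder
{
    // Tries strict UTF-8 first. If the bytes are not valid UTF-8,
    // decodes them with the supplied fallback encoding instead.
    // Note: a leading UTF-8 byte order mark, if present, is kept
    // in the returned string as U+FEFF.
    public static string Decode(byte[] data, Encoding fallback)
    {
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            return strictUtf8.GetString(data);
        }
        catch (DecoderFallbackException)
        {
            // Windows-1252 and IBM850 text with umlauts is almost never
            // valid UTF-8, so it ends up here.
            return fallback.GetString(data);
        }
    }
}
```

A call might look like FallbackDecoder.Decode(File.ReadAllBytes(path), Encoding.GetEncoding(1252)); on modern .NET this needs Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) beforehand. This only separates UTF-8 from the single-byte code pages: nearly every byte sequence decodes without error in both Windows-1252 and IBM850, so telling those two apart still needs a content check.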