C#: How can I determine whether my encoding is correct?

Posted on 2024-12-29 05:31:06

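I am quite new to files, streams and different code pages. Here is my problem:

I receive text files; some of them were created with the code page Windows-1252, some are still IBM850, and some are UTF-8. When I import them, my database shows all kinds of symbols in place of ä, ö, ü, ß, because I read them with the wrong code page. Only when I import them with the right code page does everything work.

This is what I thought could be a good approach:

Convert ä, ö, ü, ß to byte arrays with a code page X,

e.g.: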

byte[] myAeKl = Encoding.GetEncoding("IBM850").GetBytes("ä");

byte[] myAeGr = Encoding.GetEncoding("IBM850").GetBytes("Ä");

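Then go through the text file and compare each letter's byte array with the ones above. If a match is found, use that code page; otherwise try another code page.

This is what I don't understand: how can I compare the bytes of the letters in the text file with the byte arrays of the letters I am looking for? E.g.: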

if (Textfile.Letter == myAeKl || Textfile.Letter == myAeGr)
...

Is there any other way to get the right code page? Is my approach to the solution the right one?

夏有森光若流苏 2025-01-05 05:31:06

Unfortunately, there is no foolproof way, because a given byte stream can be valid in more than one encoding.
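To make the ambiguity concrete, here is a minimal sketch that decodes the same bytes with two of the code pages from the question. The class and method names (`AmbiguityDemo`, `DecodeBoth`) are made up for illustration. The `RegisterProvider` call is an assumption about the runtime: it is needed on .NET Core / .NET 5+ to make legacy code pages available, and can be dropped on .NET Framework, where they are built in.

```csharp
using System;
using System.Text;

public static class AmbiguityDemo
{
    // Decodes the same bytes with Windows-1252 and with IBM850.
    // Both calls succeed; neither code page can reject a byte as "wrong".
    public static (string AsWin1252, string AsIbm850) DecodeBoth(byte[] bytes)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // have to be registered before GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding win1252 = Encoding.GetEncoding(1252);
        Encoding ibm850 = Encoding.GetEncoding(850);
        return (win1252.GetString(bytes), ibm850.GetString(bytes));
    }
}
```

For example, the single byte 0xE4 is "ä" in Windows-1252 but "õ" in IBM850, while 0x84 is "ä" in IBM850 and "„" in Windows-1252. Both readings are legal text, so only knowledge about the expected content can tell them apart.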

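One approach is to use other business data for guessing and heuristics. Can you infer the encoding from the file name? From some other metadata, such as the sender's name? If so, try to filter on that.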

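If not, you can dig and guess. If the files can be large, as you say, just read a sample of the text (say, the first 512 bytes, which should be enough). Do you have a way to guess what the content is? Is it free text in English, Hebrew or a similar language? If so, look for common words in the 512-byte sample. Does the file have a fixed format? If so, look for that. Then run these tests on a live sample, look at the results, tune the tests and try again, until you have a relatively good chance of identifying the encoding.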
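The sample-and-guess idea could be sketched roughly as follows for the three encodings named in the question. This is one possible heuristic, not a definitive detector, and it adds two assumptions that are not part of the answer: a strict UTF-8 validity check is tried first, and the "common words" test is replaced by counting the German letters the asker expects (ä, ö, ü, ß and their capitals). The names `EncodingGuesser`, `ReadSample` and `Guess` are hypothetical.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class EncodingGuesser
{
    // Letters expected in correctly decoded text: ä ö ü ß Ä Ö Ü
    // (written as escapes so the source file's own encoding does not matter).
    private const string ExpectedLetters = "\u00E4\u00F6\u00FC\u00DF\u00C4\u00D6\u00DC";

    // Reads at most maxBytes from the start of the file.
    public static byte[] ReadSample(string path, int maxBytes = 512)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            byte[] buffer = new byte[maxBytes];
            int read = stream.Read(buffer, 0, buffer.Length);
            Array.Resize(ref buffer, read);
            return buffer;
        }
    }

    // Returns the guessed encoding, or null if no candidate looked plausible.
    public static Encoding Guess(byte[] sample)
    {
        // Assumption: .NET Core / .NET 5+; not needed on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Text in a legacy code page that contains umlauts is very unlikely
        // to be valid UTF-8, so a strict check is a strong first test.
        if (IsValidUtf8(sample)) return Encoding.UTF8;

        Encoding best = null;
        int bestScore = 0;
        foreach (int codePage in new[] { 1252, 850 })
        {
            Encoding candidate = Encoding.GetEncoding(codePage);
            // Score = number of expected letters in the decoded sample.
            int score = candidate.GetString(sample)
                                 .Count(c => ExpectedLetters.IndexOf(c) >= 0);
            if (score > bestScore)
            {
                best = candidate;
                bestScore = score;
            }
        }
        return best;
    }

    private static bool IsValidUtf8(byte[] bytes)
    {
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        try
        {
            // flush: false tolerates a multi-byte character that was cut off
            // at the end of the sample.
            strict.GetDecoder().GetCharCount(bytes, 0, bytes.Length, flush: false);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

A caller might use `Encoding enc = EncodingGuesser.Guess(EncodingGuesser.ReadSample(path));` and fall back to a default or to manual review when the result is null. Note that a pure ASCII sample passes the UTF-8 check, which is harmless because ASCII bytes decode identically in all three encodings. If the first 512 bytes contain none of the expected letters, a larger sample or a different test would be needed.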

Good luck!
