How can I detect the character encoding of a text file?
I am trying to detect which character encoding is used in my file.
I tried to get the standard encoding with this code:
public static Encoding GetFileEncoding(string srcFile)
{
    // *** Use Encoding.Default (ANSI code page) when no BOM is found
    Encoding enc = Encoding.Default;

    // *** Detect byte order mark if any - otherwise assume default
    byte[] buffer = new byte[5];
    using (FileStream file = new FileStream(srcFile, FileMode.Open, FileAccess.Read))
        file.Read(buffer, 0, 5);

    if (buffer[0] == 0xEF && buffer[1] == 0xBB && buffer[2] == 0xBF)
        enc = Encoding.UTF8;
    else if (buffer[0] == 0xFF && buffer[1] == 0xFE && buffer[2] == 0 && buffer[3] == 0)
        enc = Encoding.UTF32;                 // UTF-32 LE; must be tested before UTF-16 LE
    else if (buffer[0] == 0 && buffer[1] == 0 && buffer[2] == 0xFE && buffer[3] == 0xFF)
        enc = new UTF32Encoding(true, true);  // UTF-32 BE
    else if (buffer[0] == 0xFF && buffer[1] == 0xFE)
        enc = Encoding.Unicode;               // 1200 utf-16 (little-endian)
    else if (buffer[0] == 0xFE && buffer[1] == 0xFF)
        enc = Encoding.BigEndianUnicode;      // 1201 unicodeFFFE (big-endian)
    else if (buffer[0] == 0x2B && buffer[1] == 0x2F && buffer[2] == 0x76)
        enc = Encoding.UTF7;

    return enc;
}
My first five bytes are 60, 118, 56, 46 and 49.
Is there a chart that shows which encoding matches those first five bytes?
You can't depend on the file having a BOM. UTF-8 doesn't require it. And non-Unicode encodings don't even have a BOM. There are, however, other ways to detect the encoding.
UTF-32
BOM is 00 00 FE FF (for BE) or FF FE 00 00 (for LE).
But UTF-32 is easy to detect even without a BOM. This is because the Unicode code point range is restricted to U+10FFFF, and thus UTF-32 units always have the pattern 00 {00-10} xx xx (for BE) or xx xx {00-10} 00 (for LE). If the data has a length that's a multiple of 4, and follows one of these patterns, you can safely assume it's UTF-32. False positives are nearly impossible due to the rarity of 00 bytes in byte-oriented encodings.
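The pattern check described above can be sketched as follows (a heuristic illustration; the function name is mine, not from any standard API):

```csharp
using System;

// Heuristic: true if every 4-byte unit matches 00 {00-10} xx xx (big-endian)
// or xx xx {00-10} 00 (little-endian), per the code point limit U+10FFFF.
static bool LooksLikeUtf32(byte[] data, bool bigEndian)
{
    if (data.Length == 0 || data.Length % 4 != 0) return false;
    for (int i = 0; i < data.Length; i += 4)
    {
        byte msb   = bigEndian ? data[i]     : data[i + 3]; // most significant byte
        byte plane = bigEndian ? data[i + 1] : data[i + 2]; // Unicode plane byte
        if (msb != 0x00 || plane > 0x10) return false;
    }
    return true;
}

// "A" encoded as UTF-32BE is 00 00 00 41
Console.WriteLine(LooksLikeUtf32(new byte[] { 0x00, 0x00, 0x00, 0x41 }, bigEndian: true)); // True
```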
US-ASCII
No BOM, but you don't need one. ASCII can be easily identified by the lack of bytes in the 80-FF range.
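That check is a one-liner in practice (sketch, my own helper name):

```csharp
using System;

// US-ASCII uses only bytes 00-7F; any byte in the 80-FF range rules it out.
static bool IsAscii(byte[] data)
{
    foreach (byte b in data)
        if (b > 0x7F) return false;
    return true;
}

Console.WriteLine(IsAscii(new byte[] { 0x48, 0x69 })); // "Hi" -> True
Console.WriteLine(IsAscii(new byte[] { 0x48, 0xE9 })); // "Hé" in Latin-1 -> False
```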
UTF-8
BOM is EF BB BF. But you can't rely on this. Lots of UTF-8 files don't have a BOM, especially if they originated on non-Windows systems.
But you can safely assume that if a file validates as UTF-8, it is UTF-8. False positives are rare.
Specifically, given that the data is not ASCII, the false positive rate for a 2-byte sequence is only 3.9% (1920/49152). For a 7-byte sequence, it's less than 1%. For a 12-byte sequence, it's less than 0.1%. For a 24-byte sequence, it's less than 1 in a million.
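In .NET, "validates as UTF-8" can be tested with a strict decoder that throws instead of substituting U+FFFD (a sketch; the helper name is mine):

```csharp
using System;
using System.Text;

// Strict decoder: throws DecoderFallbackException on any invalid UTF-8 sequence.
static bool ValidatesAsUtf8(byte[] data)
{
    var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                  throwOnInvalidBytes: true);
    try { strict.GetString(data); return true; }
    catch (DecoderFallbackException) { return false; }
}

Console.WriteLine(ValidatesAsUtf8(new byte[] { 0xC3, 0xA4 })); // "ä" in UTF-8 -> True
Console.WriteLine(ValidatesAsUtf8(new byte[] { 0xC3, 0x28 })); // broken continuation -> False
```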
UTF-16
BOM is FE FF (for BE) or FF FE (for LE). Note that the UTF-16LE BOM is found at the start of the UTF-32LE BOM, so check UTF-32 first.
If you happen to have a file that consists mainly of ISO-8859-1 characters, having half of the file's bytes be 00 would also be a strong indicator of UTF-16.
Otherwise, the only reliable way to recognize UTF-16 without a BOM is to look for surrogate pairs (D[8-B]xx D[C-F]xx), but non-BMP characters are too rarely used to make this approach practical.
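The zero-byte heuristic for mostly-Latin text can be sketched like this (an illustration with thresholds of my own choosing, not a rigorous detector):

```csharp
using System;

// For mostly-Latin UTF-16 text, nearly every other byte is 00: at odd
// offsets for little-endian, at even offsets for big-endian.
static bool LooksLikeUtf16(byte[] data, out bool littleEndian)
{
    int evenZeros = 0, oddZeros = 0;
    for (int i = 0; i < data.Length; i++)
        if (data[i] == 0) { if (i % 2 == 0) evenZeros++; else oddZeros++; }

    int units = data.Length / 2;
    littleEndian = oddZeros > evenZeros;
    // Require zeros in more than half of the 2-byte units.
    return units > 0 && Math.Max(evenZeros, oddZeros) > units / 2;
}

// "ab" in UTF-16LE is 61 00 62 00
Console.WriteLine(LooksLikeUtf16(new byte[] { 0x61, 0x00, 0x62, 0x00 }, out bool le) && le); // True
```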
XML
If your file starts with the bytes 3C 3F 78 6D 6C (i.e., the ASCII characters "<?xml"), then look for an encoding= declaration. If present, then use that encoding. If absent, then assume UTF-8, which is the default XML encoding.
If you need to support EBCDIC, also look for the equivalent sequence 4C 6F A7 94 93.
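A minimal sketch of reading the declaration, assuming an ASCII-compatible encoding (an EBCDIC file would need a separate path; the function name is mine):

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Returns the encoding name declared in an XML prolog, "utf-8" if the
// prolog has no declaration, or null if the data is not XML at all.
static string GetXmlDeclaredEncoding(byte[] data)
{
    if (data.Length < 5 || data[0] != 0x3C || data[1] != 0x3F ||
        data[2] != 0x78 || data[3] != 0x6D || data[4] != 0x6C)
        return null; // does not start with "<?xml"

    string prolog = Encoding.ASCII.GetString(data, 0, Math.Min(data.Length, 100));
    var m = Regex.Match(prolog, "encoding=[\"']([^\"']+)[\"']");
    return m.Success ? m.Groups[1].Value : "utf-8"; // UTF-8 is the XML default
}

byte[] xml = Encoding.ASCII.GetBytes("<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>");
Console.WriteLine(GetXmlDeclaredEncoding(xml)); // iso-8859-1
```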
In general, if you have a file format that contains an encoding declaration, then look for that declaration rather than trying to guess the encoding.
None of the above
There are hundreds of other encodings, which require more effort to detect. I recommend trying Mozilla's charset detector or a .NET port of it.
A reasonable default
If you've ruled out the UTF encodings, and don't have an encoding declaration or statistical detection that points to a different encoding, assume ISO-8859-1 or the closely related Windows-1252. (Note that the latest HTML standard requires an “ISO-8859-1” declaration to be interpreted as Windows-1252.) Being Windows' default code page for English (and other popular languages like Spanish, Portuguese, German, and French), it's the most commonly encountered encoding other than UTF-8.
If you want to pursue a "simple" solution, you might find this class I put together useful:
http://www.architectshack.com/TextFileEncodingDetector.ashx
It does the BOM detection automatically first, and then tries to differentiate between BOM-less Unicode encodings and some other default encoding (generally Windows-1252, incorrectly labelled as Encoding.ASCII in .NET).
As noted above, a "heavier" solution involving NCharDet or MLang may be more appropriate, and as I note on the overview page of this class, it is best to provide some form of interactivity with the user if at all possible, because there is simply no 100% detection rate possible!
Snippet in case the site is offline:
Use StreamReader and direct it to detect the encoding for you. Then use the Code Page Identifiers (https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx) to switch logic depending on the result.
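For example (a minimal sketch using a temporary file):

```csharp
using System;
using System.IO;
using System.Text;

// Write a small UTF-8 file with a BOM, then let StreamReader sniff the BOM.
string path = Path.Combine(Path.GetTempPath(), "encoding-demo.txt");
File.WriteAllText(path, "héllo wörld", new UTF8Encoding(encoderShouldEmitUTF8Identifier: true));

string detected;
using (var reader = new StreamReader(path, Encoding.ASCII,
                                     detectEncodingFromByteOrderMarks: true))
{
    reader.Peek(); // CurrentEncoding is only reliable after the first read
    detected = reader.CurrentEncoding.WebName;
}
File.Delete(path);

Console.WriteLine(detected); // utf-8
```

Note that without a BOM, StreamReader simply falls back to the encoding you passed in, so this only helps for BOM-carrying files.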
Several answers are here, but nobody has posted useful code.
Here is my code that detects all encodings that Microsoft detects in the StreamReader class of Framework 4.
Obviously you must call this function immediately after opening the stream, before reading anything else from it, because the BOM is the first bytes in the stream.
This function requires a Stream that can seek (for example a FileStream). If you have a Stream that cannot seek, you must write more complicated code that returns a byte buffer with the bytes that have already been read but are not a BOM.
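The original snippet was not captured here; a sketch of such a seekable-stream BOM check might look like this (my own reconstruction of the described approach, not the author's code):

```csharp
using System;
using System.IO;
using System.Text;

// Reads up to 4 bytes, then seeks past the BOM (or back to the start if none).
// Returns null when no BOM is present.
static Encoding DetectBom(Stream stream)
{
    byte[] bom = new byte[4];
    int read = stream.Read(bom, 0, 4);

    if (read >= 4 && bom[0] == 0xFF && bom[1] == 0xFE && bom[2] == 0 && bom[3] == 0)
        { stream.Position = 4; return Encoding.UTF32; }                // UTF-32 LE
    if (read >= 4 && bom[0] == 0 && bom[1] == 0 && bom[2] == 0xFE && bom[3] == 0xFF)
        { stream.Position = 4; return new UTF32Encoding(true, true); } // UTF-32 BE
    if (read >= 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF)
        { stream.Position = 3; return Encoding.UTF8; }
    if (read >= 2 && bom[0] == 0xFF && bom[1] == 0xFE)
        { stream.Position = 2; return Encoding.Unicode; }              // UTF-16 LE
    if (read >= 2 && bom[0] == 0xFE && bom[1] == 0xFF)
        { stream.Position = 2; return Encoding.BigEndianUnicode; }     // UTF-16 BE

    stream.Position = 0; // no BOM: leave the stream at the beginning
    return null;
}
```

Testing UTF-32 LE before UTF-16 LE matters, because FF FE is a prefix of FF FE 00 00.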
I use Ude that is a C# port of Mozilla Universal Charset Detector. It is easy to use and gives some really good results.
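Typical usage looks roughly like this (a sketch assuming the Ude NuGet package is referenced; "unknown.txt" is a placeholder file name):

```csharp
using System;
using System.IO;
using Ude;

byte[] data = File.ReadAllBytes("unknown.txt");

var detector = new CharsetDetector();
detector.Feed(data, 0, data.Length);
detector.DataEnd();

if (detector.Charset != null)
    Console.WriteLine($"{detector.Charset} (confidence {detector.Confidence})");
else
    Console.WriteLine("Detection failed.");
```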
Yes, there is one here: http://en.wikipedia.org/wiki/Byte_order_mark#Representations_of_byte_order_marks_by_encoding.
You should read this: How can I detect the encoding/codepage of a text file
The solution for all Germans => ÄÖÜäöüß
This function opens the file and determines the encoding by the BOM.
If the BOM is missing, the file will be interpreted as ANSI, but if it contains UTF-8 encoded German umlauts, it will be detected as UTF-8.
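The original code was not reproduced here; a sketch of the described approach might be (my own reconstruction: BOM first, then strict UTF-8 validation, then the ANSI fallback):

```csharp
using System;
using System.Text;

// BOM check first; otherwise strict-validate as UTF-8; otherwise fall back
// to Encoding.Default (the ANSI code page on .NET Framework).
static Encoding DetectGermanFriendly(byte[] data)
{
    if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Encoding.UTF8;
    if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE) return Encoding.Unicode;
    if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF) return Encoding.BigEndianUnicode;

    var strict = new UTF8Encoding(false, throwOnInvalidBytes: true);
    try { strict.GetString(data); return Encoding.UTF8; }            // ÄÖÜäöüß validate as UTF-8
    catch (DecoderFallbackException) { return Encoding.Default; }    // ANSI fallback
}

Console.WriteLine(Equals(DetectGermanFriendly(Encoding.UTF8.GetBytes("äöü")), Encoding.UTF8)); // True
```

This works because an ANSI umlaut byte (e.g. E4 for ä) followed by an ASCII byte is not a valid UTF-8 sequence, so ANSI text with umlauts fails the strict validation.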
If your file starts with the bytes 60, 118, 56, 46 and 49, then you have an ambiguous case. It could be UTF-8 (without a BOM) or any single-byte encoding such as ASCII, ANSI, ISO-8859-1, etc.