How can I detect the character encoding of a text file?

Posted on 2024-10-08 16:41:52

I am trying to detect which character encoding is used in my file.

I tried this code to get the standard encoding:

public static Encoding GetFileEncoding(string srcFile)
    {
      // *** Use Default of Encoding.Default (Ansi CodePage)
      Encoding enc = Encoding.Default;

      // *** Detect byte order mark if any - otherwise assume default
      byte[] buffer = new byte[5];
      FileStream file = new FileStream(srcFile, FileMode.Open);
      file.Read(buffer, 0, 5);
      file.Close();

      if (buffer[0] == 0xef && buffer[1] == 0xbb && buffer[2] == 0xbf)
        enc = Encoding.UTF8;
      else if (buffer[0] == 0xfe && buffer[1] == 0xff)
        enc = Encoding.Unicode;
      else if (buffer[0] == 0 && buffer[1] == 0 && buffer[2] == 0xfe && buffer[3] == 0xff)
        enc = Encoding.UTF32;
      else if (buffer[0] == 0x2b && buffer[1] == 0x2f && buffer[2] == 0x76)
        enc = Encoding.UTF7;
      else if (buffer[0] == 0xFE && buffer[1] == 0xFF)      
        // 1201 unicodeFFFE Unicode (Big-Endian)
        enc = Encoding.GetEncoding(1201);      
      else if (buffer[0] == 0xFF && buffer[1] == 0xFE)      
        // 1200 utf-16 Unicode
        enc = Encoding.GetEncoding(1200);


      return enc;
    }

My first five bytes are 60, 118, 56, 46 and 49.

Is there a chart that shows which encoding matches those first five bytes?

淡墨 2024-10-15 16:41:52

You can't depend on the file having a BOM. UTF-8 doesn't require it. And non-Unicode encodings don't even have a BOM. There are, however, other ways to detect the encoding.

UTF-32

BOM is 00 00 FE FF (for BE) or FF FE 00 00 (for LE).

But UTF-32 is easy to detect even without a BOM. This is because the Unicode code point range is restricted to U+10FFFF, and thus UTF-32 units always have the pattern 00 {00-10} xx xx (for BE) or xx xx {00-10} 00 (for LE). If the data has a length that's a multiple of 4, and follows one of these patterns, you can safely assume it's UTF-32. False positives are nearly impossible due to the rarity of 00 bytes in byte-oriented encodings.
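A minimal C# sketch of that pattern check (assuming the raw bytes are already in memory; the method name is mine):

static bool LooksLikeUtf32BigEndian(byte[] data)
{
    // UTF-32BE units are 00 {00-10} xx xx because code points stop at U+10FFFF.
    // For the LE variant, check data[i + 3] == 0x00 and data[i + 2] <= 0x10 instead.
    if (data.Length == 0 || data.Length % 4 != 0)
        return false;

    for (int i = 0; i < data.Length; i += 4)
    {
        if (data[i] != 0x00 || data[i + 1] > 0x10)
            return false;
    }
    return true;
}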

US-ASCII

No BOM, but you don't need one. ASCII can be easily identified by the lack of bytes in the 80-FF range.
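A corresponding sketch for the ASCII check (again, the method name is mine):

static bool LooksLikeAscii(byte[] data)
{
    // Pure US-ASCII never uses bytes in the 80-FF range.
    foreach (byte b in data)
        if (b >= 0x80)
            return false;
    return true;
}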

UTF-8

BOM is EF BB BF. But you can't rely on this. Lots of UTF-8 files don't have a BOM, especially if they originated on non-Windows systems.

But you can safely assume that if a file validates as UTF-8, it is UTF-8. False positives are rare.

Specifically, given that the data is not ASCII, the false positive rate for a 2-byte sequence is only 3.9% (1920/49152). For a 7-byte sequence, it's less than 1%. For a 12-byte sequence, it's less than 0.1%. For a 24-byte sequence, it's less than 1 in a million.
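One way to run that validation in .NET, as a sketch: a strict UTF8Encoding throws on any invalid sequence instead of substituting U+FFFD.

using System.Text;

static bool ValidatesAsUtf8(byte[] data)
{
    try
    {
        // false = no BOM emitted, true = throw on invalid byte sequences
        new UTF8Encoding(false, true).GetString(data);
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}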

UTF-16

BOM is FE FF (for BE) or FF FE (for LE). Note that the UTF-16LE BOM is found at the start of the UTF-32LE BOM, so check UTF-32 first.

If you happen to have a file that consists mainly of ISO-8859-1 characters, having half of the file's bytes be 00 would also be a strong indicator of UTF-16.

Otherwise, the only reliable way to recognize UTF-16 without a BOM is to look for surrogate pairs (D[8-B]xx D[C-F]xx), but non-BMP characters are too rarely used to make this approach practical.
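A sketch of the null-byte heuristic for mostly-Latin text (the 80%/20% thresholds are arbitrary illustrations, not standard values):

using System.Text;

static Encoding GuessUtf16ByNullBytes(byte[] data)
{
    int evenNulls = 0, oddNulls = 0;
    for (int i = 0; i < data.Length; i++)
    {
        if (data[i] != 0) continue;
        if (i % 2 == 0) evenNulls++; else oddNulls++;
    }

    double half = data.Length / 2.0;
    if (oddNulls > half * 0.8 && evenNulls < half * 0.2)
        return Encoding.Unicode;            // UTF-16LE: high bytes (odd positions) are mostly 00
    if (evenNulls > half * 0.8 && oddNulls < half * 0.2)
        return Encoding.BigEndianUnicode;   // UTF-16BE: high bytes (even positions) are mostly 00
    return null;                            // inconclusive
}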

XML

If your file starts with the bytes 3C 3F 78 6D 6C (i.e., the ASCII characters "<?xml"), then look for an encoding= declaration. If present, then use that encoding. If absent, then assume UTF-8, which is the default XML encoding.

If you need to support EBCDIC, also look for the equivalent sequence 4C 6F A7 94 93.

In general, if you have a file format that contains an encoding declaration, then look for that declaration rather than trying to guess the encoding.
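A sketch of reading the XML declaration this way, covering only the common ASCII-compatible case (a real XML parser also handles the UTF-16/UTF-32 and EBCDIC prolog variants):

using System;
using System.Text;
using System.Text.RegularExpressions;

static Encoding GetXmlDeclaredEncoding(byte[] data)
{
    // Look only at the start of the file, decoded permissively as ASCII.
    string prolog = Encoding.ASCII.GetString(data, 0, Math.Min(data.Length, 200));
    if (!prolog.StartsWith("<?xml"))
        return null;                                   // not XML, caller decides

    Match m = Regex.Match(prolog, "encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");
    // Note: GetEncoding throws for unknown names; wrap in try/catch in real code.
    return m.Success ? Encoding.GetEncoding(m.Groups[1].Value) : Encoding.UTF8;
}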

None of the above

There are hundreds of other encodings, which require more effort to detect. I recommend trying Mozilla's charset detector or a .NET port of it.

A reasonable default

If you've ruled out the UTF encodings, and don't have an encoding declaration or statistical detection that points to a different encoding, assume ISO-8859-1 or the closely related Windows-1252. (Note that the latest HTML standard requires an “ISO-8859-1” declaration to be interpreted as Windows-1252.) Being Windows' default code page for English (and other popular languages like Spanish, Portuguese, German, and French), it's the most commonly encountered encoding other than UTF-8.

日久见人心 2024-10-15 16:41:52

If you want to pursue a "simple" solution, you might find this class I put together useful:

http://www.architectshack.com/TextFileEncodingDetector.ashx

It does the BOM detection automatically first, and then tries to differentiate between BOM-less Unicode encodings and some other default encoding (generally Windows-1252, which is incorrectly labelled as Encoding.ASCII in .Net).

As noted above, a "heavier" solution involving NCharDet or MLang may be more appropriate, and as I note on the overview page of this class, it is best to provide some form of interactivity with the user whenever possible, because there simply is no 100% reliable detection rate!

Snippet in case the site is offline:

using System;
using System.Text;
using System.Text.RegularExpressions;
using System.IO;

namespace KlerksSoft
{
    public static class TextFileEncodingDetector
    {
        /*
         * Simple class to handle text file encoding woes (in a primarily English-speaking tech 
         *      world).
         * 
         *  - This code is fully managed, no shady calls to MLang (the unmanaged codepage
         *      detection library originally developed for Internet Explorer).
         * 
         *  - This class does NOT try to detect arbitrary codepages/charsets, it really only
         *      aims to differentiate between some of the most common variants of Unicode 
         *      encoding, and a "default" (western / ascii-based) encoding alternative provided
         *      by the caller.
         *      
         *  - As there is no "Reliable" way to distinguish between UTF-8 (without BOM) and 
         *      Windows-1252 (in .Net, also incorrectly called "ASCII") encodings, we use a 
         *      heuristic - so the more of the file we can sample the better the guess. If you 
         *      are going to read the whole file into memory at some point, then best to pass 
         *      in the whole byte byte array directly. Otherwise, decide how to trade off 
         *      reliability against performance / memory usage.
         *      
         *  - The UTF-8 detection heuristic only works for western text, as it relies on 
         *      the presence of UTF-8 encoded accented and other characters found in the upper 
         *      ranges of the Latin-1 and (particularly) Windows-1252 codepages.
         *  
         *  - For more general detection routines, see existing projects / resources:
         *    - MLang - Microsoft library originally for IE6, available in Windows XP and later APIs now (I think?)
         *      - MLang .Net bindings: http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
         *    - CharDet - Mozilla browser's detection routines
         *      - Ported to Java then .Net: http://www.conceptdevelopment.net/Localization/NCharDet/
         *      - Ported straight to .Net: http://code.google.com/p/chardetsharp/source/browse
         *  
         * Copyright Tao Klerks, 2010-2012, [email protected]
         * Licensed under the modified BSD license:
         * 
Redistribution and use in source and binary forms, with or without modification, are 
permitted provided that the following conditions are met:
 - Redistributions of source code must retain the above copyright notice, this list of 
conditions and the following disclaimer.
 - Redistributions in binary form must reproduce the above copyright notice, this list 
of conditions and the following disclaimer in the documentation and/or other materials
provided with the distribution.
 - The name of the author may not be used to endorse or promote products derived from 
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, 
INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY 
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, 
WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY 
OF SUCH DAMAGE.
         * 
         * CHANGELOG:
         *  - 2012-02-03: 
         *    - Simpler methods, removing the silly "DefaultEncoding" parameter (with "??" operator, saves no typing)
         *    - More complete methods
         *      - Optionally return indication of whether BOM was found in "Detect" methods
         *      - Provide straight-to-string method for byte arrays (GetStringFromByteArray)
         */

        const long _defaultHeuristicSampleSize = 0x10000; //completely arbitrary - inappropriate for high numbers of files / high speed requirements

        public static Encoding DetectTextFileEncoding(string InputFilename)
        {
            using (FileStream textfileStream = File.OpenRead(InputFilename))
            {
                return DetectTextFileEncoding(textfileStream, _defaultHeuristicSampleSize);
            }
        }

        public static Encoding DetectTextFileEncoding(FileStream InputFileStream, long HeuristicSampleSize)
        {
            bool uselessBool = false;
            return DetectTextFileEncoding(InputFileStream, _defaultHeuristicSampleSize, out uselessBool);
        }

        public static Encoding DetectTextFileEncoding(FileStream InputFileStream, long HeuristicSampleSize, out bool HasBOM)
        {
            if (InputFileStream == null)
                throw new ArgumentNullException("Must provide a valid Filestream!", "InputFileStream");

            if (!InputFileStream.CanRead)
                throw new ArgumentException("Provided file stream is not readable!", "InputFileStream");

            if (!InputFileStream.CanSeek)
                throw new ArgumentException("Provided file stream cannot seek!", "InputFileStream");

            Encoding encodingFound = null;

            long originalPos = InputFileStream.Position;

            InputFileStream.Position = 0;


            //First read only what we need for BOM detection
            byte[] bomBytes = new byte[InputFileStream.Length > 4 ? 4 : InputFileStream.Length];
            InputFileStream.Read(bomBytes, 0, bomBytes.Length);

            encodingFound = DetectBOMBytes(bomBytes);

            if (encodingFound != null)
            {
                InputFileStream.Position = originalPos;
                HasBOM = true;
                return encodingFound;
            }


            //BOM Detection failed, going for heuristics now.
            //  create sample byte array and populate it
            byte[] sampleBytes = new byte[HeuristicSampleSize > InputFileStream.Length ? InputFileStream.Length : HeuristicSampleSize];
            Array.Copy(bomBytes, sampleBytes, bomBytes.Length);
            if (InputFileStream.Length > bomBytes.Length)
                InputFileStream.Read(sampleBytes, bomBytes.Length, sampleBytes.Length - bomBytes.Length);
            InputFileStream.Position = originalPos;

            //test byte array content
            encodingFound = DetectUnicodeInByteSampleByHeuristics(sampleBytes);

            HasBOM = false;
            return encodingFound;
        }

        public static Encoding DetectTextByteArrayEncoding(byte[] TextData)
        {
            bool uselessBool = false;
            return DetectTextByteArrayEncoding(TextData, out uselessBool);
        }

        public static Encoding DetectTextByteArrayEncoding(byte[] TextData, out bool HasBOM)
        {
            if (TextData == null)
                throw new ArgumentNullException("Must provide a valid text data byte array!", "TextData");

            Encoding encodingFound = null;

            encodingFound = DetectBOMBytes(TextData);

            if (encodingFound != null)
            {
                HasBOM = true;
                return encodingFound;
            }
            else
            {
                //test byte array content
                encodingFound = DetectUnicodeInByteSampleByHeuristics(TextData);

                HasBOM = false;
                return encodingFound;
            }
        }

        public static string GetStringFromByteArray(byte[] TextData, Encoding DefaultEncoding)
        {
            return GetStringFromByteArray(TextData, DefaultEncoding, _defaultHeuristicSampleSize);
        }

        public static string GetStringFromByteArray(byte[] TextData, Encoding DefaultEncoding, long MaxHeuristicSampleSize)
        {
            if (TextData == null)
                throw new ArgumentNullException("Must provide a valid text data byte array!", "TextData");

            Encoding encodingFound = null;

            encodingFound = DetectBOMBytes(TextData);

            if (encodingFound != null)
            {
                //For some reason, the default encodings don't detect/swallow their own preambles!!
                return encodingFound.GetString(TextData, encodingFound.GetPreamble().Length, TextData.Length - encodingFound.GetPreamble().Length);
            }
            else
            {
                byte[] heuristicSample = null;
                if (TextData.Length > MaxHeuristicSampleSize)
                {
                    heuristicSample = new byte[MaxHeuristicSampleSize];
                    Array.Copy(TextData, heuristicSample, MaxHeuristicSampleSize);
                }
                else
                {
                    heuristicSample = TextData;
                }

                encodingFound = DetectUnicodeInByteSampleByHeuristics(TextData) ?? DefaultEncoding;
                return encodingFound.GetString(TextData);
            }
        }


        public static Encoding DetectBOMBytes(byte[] BOMBytes)
        {
            if (BOMBytes == null)
                throw new ArgumentNullException("Must provide a valid BOM byte array!", "BOMBytes");

            if (BOMBytes.Length < 2)
                return null;

            if (BOMBytes[0] == 0xff 
                && BOMBytes[1] == 0xfe 
                && (BOMBytes.Length < 4 
                    || BOMBytes[2] != 0 
                    || BOMBytes[3] != 0
                    )
                )
                return Encoding.Unicode;

            if (BOMBytes[0] == 0xfe 
                && BOMBytes[1] == 0xff
                )
                return Encoding.BigEndianUnicode;

            if (BOMBytes.Length < 3)
                return null;

            if (BOMBytes[0] == 0xef && BOMBytes[1] == 0xbb && BOMBytes[2] == 0xbf)
                return Encoding.UTF8;

            if (BOMBytes[0] == 0x2b && BOMBytes[1] == 0x2f && BOMBytes[2] == 0x76)
                return Encoding.UTF7;

            if (BOMBytes.Length < 4)
                return null;

            if (BOMBytes[0] == 0xff && BOMBytes[1] == 0xfe && BOMBytes[2] == 0 && BOMBytes[3] == 0)
                return Encoding.UTF32;

            if (BOMBytes[0] == 0 && BOMBytes[1] == 0 && BOMBytes[2] == 0xfe && BOMBytes[3] == 0xff)
                return Encoding.GetEncoding(12001);

            return null;
        }

        public static Encoding DetectUnicodeInByteSampleByHeuristics(byte[] SampleBytes)
        {
            long oddBinaryNullsInSample = 0;
            long evenBinaryNullsInSample = 0;
            long suspiciousUTF8SequenceCount = 0;
            long suspiciousUTF8BytesTotal = 0;
            long likelyUSASCIIBytesInSample = 0;

            //Cycle through, keeping count of binary null positions, possible UTF-8 
            //  sequences from upper ranges of Windows-1252, and probable US-ASCII 
            //  character counts.

            long currentPos = 0;
            int skipUTF8Bytes = 0;

            while (currentPos < SampleBytes.Length)
            {
                //binary null distribution
                if (SampleBytes[currentPos] == 0)
                {
                    if (currentPos % 2 == 0)
                        evenBinaryNullsInSample++;
                    else
                        oddBinaryNullsInSample++;
                }

                //likely US-ASCII characters
                if (IsCommonUSASCIIByte(SampleBytes[currentPos]))
                    likelyUSASCIIBytesInSample++;

                //suspicious sequences (look like UTF-8)
                if (skipUTF8Bytes == 0)
                {
                    int lengthFound = DetectSuspiciousUTF8SequenceLength(SampleBytes, currentPos);

                    if (lengthFound > 0)
                    {
                        suspiciousUTF8SequenceCount++;
                        suspiciousUTF8BytesTotal += lengthFound;
                        skipUTF8Bytes = lengthFound - 1;
                    }
                }
                else
                {
                    skipUTF8Bytes--;
                }

                currentPos++;
            }

            //1: UTF-16 LE - in english / european environments, this is usually characterized by a 
            //  high proportion of odd binary nulls (starting at 0), with (as this is text) a low 
            //  proportion of even binary nulls.
            //  The thresholds here used (less than 20% nulls where you expect non-nulls, and more than
            //  60% nulls where you do expect nulls) are completely arbitrary.

            if (((evenBinaryNullsInSample * 2.0) / SampleBytes.Length) < 0.2 
                && ((oddBinaryNullsInSample * 2.0) / SampleBytes.Length) > 0.6
                )
                return Encoding.Unicode;


            //2: UTF-16 BE - in english / european environments, this is usually characterized by a 
            //  high proportion of even binary nulls (starting at 0), with (as this is text) a low 
            //  proportion of odd binary nulls.
            //  The thresholds here used (less than 20% nulls where you expect non-nulls, and more than
            //  60% nulls where you do expect nulls) are completely arbitrary.

            if (((oddBinaryNullsInSample * 2.0) / SampleBytes.Length) < 0.2 
                && ((evenBinaryNullsInSample * 2.0) / SampleBytes.Length) > 0.6
                )
                return Encoding.BigEndianUnicode;


            //3: UTF-8 - Martin Dürst outlines a method for detecting whether something CAN be UTF-8 content 
            //  using regexp, in his w3c.org unicode FAQ entry: 
            //  http://www.w3.org/International/questions/qa-forms-utf-8
            //  adapted here for C#.
            string potentiallyMangledString = Encoding.ASCII.GetString(SampleBytes);
            Regex UTF8Validator = new Regex(@"\A(" 
                + @"[\x09\x0A\x0D\x20-\x7E]"
                + @"|[\xC2-\xDF][\x80-\xBF]"
                + @"|\xE0[\xA0-\xBF][\x80-\xBF]"
                + @"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"
                + @"|\xED[\x80-\x9F][\x80-\xBF]"
                + @"|\xF0[\x90-\xBF][\x80-\xBF]{2}"
                + @"|[\xF1-\xF3][\x80-\xBF]{3}"
                + @"|\xF4[\x80-\x8F][\x80-\xBF]{2}"
                + @")*\z");
            if (UTF8Validator.IsMatch(potentiallyMangledString))
            {
                //Unfortunately, just the fact that it CAN be UTF-8 doesn't tell you much about probabilities.
                //If all the characters are in the 0-127 range, no harm done, most western charsets are same as UTF-8 in these ranges.
                //If some of the characters were in the upper range (western accented characters), however, they would likely be mangled to 2-byte by the UTF-8 encoding process.
                // So, we need to play stats.

                // The "Random" likelihood of any pair of randomly generated characters being one 
                //   of these "suspicious" character sequences is:
                //     128 / (256 * 256) = 0.2%.
                //
                // In western text data, that is SIGNIFICANTLY reduced - most text data stays in the <127 
                //   character range, so we assume that more than 1 in 500,000 of these character 
                //   sequences indicates UTF-8. The number 500,000 is completely arbitrary - so sue me.
                //
                // We can only assume these character sequences will be rare if we ALSO assume that this
                //   IS in fact western text - in which case the bulk of the UTF-8 encoded data (that is 
                //   not already suspicious sequences) should be plain US-ASCII bytes. This, I 
                //   arbitrarily decided, should be 80% (a random distribution, eg binary data, would yield 
                //   approx 40%, so the chances of hitting this threshold by accident in random data are 
                //   VERY low). 

                if ((suspiciousUTF8SequenceCount * 500000.0 / SampleBytes.Length >= 1) //suspicious sequences
                    && (
                           //all suspicious, so cannot evaluate proportion of US-Ascii
                           SampleBytes.Length - suspiciousUTF8BytesTotal == 0 
                           ||
                           likelyUSASCIIBytesInSample * 1.0 / (SampleBytes.Length - suspiciousUTF8BytesTotal) >= 0.8
                       )
                    )
                    return Encoding.UTF8;
            }

            return null;
        }

        private static bool IsCommonUSASCIIByte(byte testByte)
        {
            if (testByte == 0x0A //lf
                || testByte == 0x0D //cr
                || testByte == 0x09 //tab
                || (testByte >= 0x20 && testByte <= 0x2F) //common punctuation
                || (testByte >= 0x30 && testByte <= 0x39) //digits
                || (testByte >= 0x3A && testByte <= 0x40) //common punctuation
                || (testByte >= 0x41 && testByte <= 0x5A) //capital letters
                || (testByte >= 0x5B && testByte <= 0x60) //common punctuation
                || (testByte >= 0x61 && testByte <= 0x7A) //lowercase letters
                || (testByte >= 0x7B && testByte <= 0x7E) //common punctuation
                )
                return true;
            else
                return false;
        }

        private static int DetectSuspiciousUTF8SequenceLength(byte[] SampleBytes, long currentPos)
        {
            int lengthFound = 0;

            if (SampleBytes.Length >= currentPos + 1 
                && SampleBytes[currentPos] == 0xC2
                )
            {
                if (SampleBytes[currentPos + 1] == 0x81 
                    || SampleBytes[currentPos + 1] == 0x8D 
                    || SampleBytes[currentPos + 1] == 0x8F
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0x90 
                    || SampleBytes[currentPos + 1] == 0x9D
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] >= 0xA0 
                    && SampleBytes[currentPos + 1] <= 0xBF
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length >= currentPos + 1 
                && SampleBytes[currentPos] == 0xC3
                )
            {
                if (SampleBytes[currentPos + 1] >= 0x80 
                    && SampleBytes[currentPos + 1] <= 0xBF
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length >= currentPos + 1 
                && SampleBytes[currentPos] == 0xC5
                )
            {
                if (SampleBytes[currentPos + 1] == 0x92 
                    || SampleBytes[currentPos + 1] == 0x93
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0xA0 
                    || SampleBytes[currentPos + 1] == 0xA1
                    )
                    lengthFound = 2;
                else if (SampleBytes[currentPos + 1] == 0xB8 
                    || SampleBytes[currentPos + 1] == 0xBD 
                    || SampleBytes[currentPos + 1] == 0xBE
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length >= currentPos + 1 
                && SampleBytes[currentPos] == 0xC6
                )
            {
                if (SampleBytes[currentPos + 1] == 0x92)
                    lengthFound = 2;
            }
            else if (SampleBytes.Length >= currentPos + 1 
                && SampleBytes[currentPos] == 0xCB
                )
            {
                if (SampleBytes[currentPos + 1] == 0x86 
                    || SampleBytes[currentPos + 1] == 0x9C
                    )
                    lengthFound = 2;
            }
            else if (SampleBytes.Length >= currentPos + 2 
                && SampleBytes[currentPos] == 0xE2
                )
            {
                if (SampleBytes[currentPos + 1] == 0x80)
                {
                    if (SampleBytes[currentPos + 2] == 0x93 
                        || SampleBytes[currentPos + 2] == 0x94
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0x98 
                        || SampleBytes[currentPos + 2] == 0x99 
                        || SampleBytes[currentPos + 2] == 0x9A
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0x9C 
                        || SampleBytes[currentPos + 2] == 0x9D 
                        || SampleBytes[currentPos + 2] == 0x9E
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xA0 
                        || SampleBytes[currentPos + 2] == 0xA1 
                        || SampleBytes[currentPos + 2] == 0xA2
                        )
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xA6)
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xB0)
                        lengthFound = 3;
                    if (SampleBytes[currentPos + 2] == 0xB9 
                        || SampleBytes[currentPos + 2] == 0xBA
                        )
                        lengthFound = 3;
                }
                else if (SampleBytes[currentPos + 1] == 0x82 
                    && SampleBytes[currentPos + 2] == 0xAC
                    )
                    lengthFound = 3;
                else if (SampleBytes[currentPos + 1] == 0x84 
                    && SampleBytes[currentPos + 2] == 0xA2
                    )
                    lengthFound = 3;
            }

            return lengthFound;
        }

    }
}

吐个泡泡 2024-10-15 16:41:52

Use StreamReader and direct it to detect the encoding for you:

using (var reader = new System.IO.StreamReader(path, true))
{
    var currentEncoding = reader.CurrentEncoding;
}

And use the list of Code Page Identifiers (https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx) to switch your logic based on the detected encoding.
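One caveat worth noting: this detection is BOM-based, and in my experience CurrentEncoding only reflects the detected encoding after the first read from the stream, so a sketch like the following (with a placeholder path) peeks before asking:

using System;
using System.IO;
using System.Text;

using (var reader = new StreamReader("file.txt", Encoding.Default, true)) // true = detect BOM
{
    reader.Peek();                                     // forces the first read, which triggers BOM detection
    Encoding currentEncoding = reader.CurrentEncoding;
    Console.WriteLine(currentEncoding.WebName);
}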

别理我 2024-10-15 16:41:52

There are several answers here, but nobody has posted useful code.

Here is my code that detects all the encodings that Microsoft detects in the StreamReader class of Framework 4.

Obviously you must call this function immediately after opening the stream, before reading anything else from it, because the BOM is the first few bytes in the stream.

This function requires a Stream that can seek (for example a FileStream). If you have a Stream that cannot seek, you must write more complicated code that returns a byte buffer with the bytes that have already been read but are not a BOM.

public static Encoding DetectEncoding(String s_Path)
{
    using (FileStream i_Stream = new FileStream(s_Path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    {
        return DetectEncoding(i_Stream);
    }
}

/// <summary>
/// UTF8    : EF BB BF
/// UTF16 BE: FE FF
/// UTF16 LE: FF FE
/// UTF32 BE: 00 00 FE FF
/// UTF32 LE: FF FE 00 00
/// </summary>
public static Encoding DetectEncoding(Stream i_Stream)
{
    if (!i_Stream.CanSeek || !i_Stream.CanRead)
        throw new Exception("DetectEncoding() requires a seekable and readable Stream");

    // Try to read 4 bytes. If the stream is shorter, less bytes will be read.
    Byte[] u8_Buf = new Byte[4];
    int s32_Count = i_Stream.Read(u8_Buf, 0, 4);
    if (s32_Count >= 2)
    {
        if (u8_Buf[0] == 0xFE && u8_Buf[1] == 0xFF)
        {
            i_Stream.Position = 2;
            return new UnicodeEncoding(true, true);
        }

        if (u8_Buf[0] == 0xFF && u8_Buf[1] == 0xFE)
        {
            if (s32_Count >= 4 && u8_Buf[2] == 0 && u8_Buf[3] == 0)
            {
                i_Stream.Position = 4;
                return new UTF32Encoding(false, true);
            }
            else
            {
                i_Stream.Position = 2;
                return new UnicodeEncoding(false, true);
            }
        }

        if (s32_Count >= 3 && u8_Buf[0] == 0xEF && u8_Buf[1] == 0xBB && u8_Buf[2] == 0xBF)
        {
            i_Stream.Position = 3;
            return Encoding.UTF8;
        }

        if (s32_Count >= 4 && u8_Buf[0] == 0 && u8_Buf[1] == 0 && u8_Buf[2] == 0xFE && u8_Buf[3] == 0xFF)
        {
            i_Stream.Position = 4;
            return new UTF32Encoding(true, true);
        }
    }

    i_Stream.Position = 0;
    return Encoding.Default;
}
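A short usage sketch for the function above (the path is a placeholder, and the method is assumed to be in scope); because DetectEncoding leaves the stream positioned just past any BOM, the same stream can be handed straight to a StreamReader:

using (FileStream stream = new FileStream(@"C:\temp\input.txt", FileMode.Open, FileAccess.Read))
{
    Encoding enc = DetectEncoding(stream);             // positions the stream after the BOM, if any
    using (var reader = new StreamReader(stream, enc, false)) // false = don't re-detect the BOM
    {
        string text = reader.ReadToEnd();
        Console.WriteLine(enc.EncodingName);
    }
}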

怎樣才叫好 2024-10-15 16:41:52

I use Ude, which is a C# port of the Mozilla Universal Charset Detector. It is easy to use and gives some really good results.
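A sketch of typical Ude usage, written from memory of the library's API (the package name, path, and member names below are assumptions; check the project's README):

using System;
using System.IO;
using Ude; // the Ude NuGet package (assumed)

using (FileStream fs = File.OpenRead("file.txt")) // placeholder path
{
    CharsetDetector detector = new CharsetDetector();
    detector.Feed(fs);        // there is also a Feed(byte[], int, int) overload, as far as I recall
    detector.DataEnd();

    if (detector.Charset != null)
        Console.WriteLine("Detected {0} with confidence {1}", detector.Charset, detector.Confidence);
    else
        Console.WriteLine("Detection failed.");
}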

秋心╮凉 2024-10-15 16:41:52

The solution for all Germans => ÄÖÜäöüß

This function opens the file and determines the encoding from the BOM.
If the BOM is missing, the file will be interpreted as ANSI, but if it contains UTF-8 encoded German umlauts, it will be detected as UTF-8.

private static Encoding GetEncoding(string sFileName)
{
    using (var reader = new StreamReader(sFileName, Encoding.Default, true))
    {
        string sContent = "";
        if (reader.Peek() >= 0) // you need this!
            sContent = reader.ReadToEnd();
        Encoding MyEncoding = reader.CurrentEncoding;
        if (MyEncoding == Encoding.Default) // Ansi detected (this happens if BOM is missing)
        { // Look, if there are typical UTF8 chars in this file...
            string sUmlaute = "ÄÖÜäöüß";
            bool bUTF8CharDetected = false;
            for (int z=0; z<sUmlaute.Length; z++)
            {
                string sUTF8Letter = sUmlaute.Substring(z, 1);
                string sUTF8LetterInAnsi = Encoding.Default.GetString(Encoding.UTF8.GetBytes(sUTF8Letter));
                if (sContent.Contains(sUTF8LetterInAnsi))
                {
                    bUTF8CharDetected = true;
                    break;
                }
            }
            if (bUTF8CharDetected) MyEncoding = Encoding.UTF8;
        }
        return MyEncoding;
    }
}

谁对谁错谁最难过 2024-10-15 16:41:52

If your file starts with the bytes 60, 118, 56, 46 and 49, then you have an ambiguous case. It could be UTF-8 (without a BOM) or any single-byte encoding such as ASCII, ANSI, ISO-8859-1, etc.
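For illustration, those five bytes are plain 7-bit values, so they decode to the same characters under ASCII, ISO-8859-1, Windows-1252, and UTF-8 alike:

using System;
using System.Text;

byte[] firstBytes = { 60, 118, 56, 46, 49 };             // 0x3C 0x76 0x38 0x2E 0x31
Console.WriteLine(Encoding.ASCII.GetString(firstBytes)); // prints "<v8.1"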
