当前位置：文江博客话题详情

查找任何文件编码的有效方法

发布于 2024-09-25 14:52:17 字数 90 浏览 3 评论 0原文

是的，这是一个最常见的问题，这件事对我来说很模糊，因为我对此了解不多。

但我想要一种非常精确的方法来查找文件编码。与 Notepad++ 一样精确。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

我只土不豪 2024-10-02 14:52:17

StreamReader.CurrentEncoding 属性很少为我返回正确的文本文件编码。通过分析文件的字节顺序标记 (BOM)，我在确定文件的字节顺序方面取得了更大的成功。如果文件没有 BOM，则无法确定文件的编码。

*已更新 2020 年 4 月 8 日，包括 UTF-32LE 检测并返回 UTF-32BE 的正确编码

/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Defaults to ASCII when detection of the text file's endianness fails.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
public static Encoding GetEncoding(string filename)
{
    // Read the BOM
    var bom = new byte[4];
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        file.Read(bom, 0, 4);
    }

    // Analyze the BOM
    if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
    if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
    if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; //UTF-32LE
    if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
    if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
    if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true);  //UTF-32BE

    // We actually have no idea what the encoding is if we reach this point, so
    // you may wish to return null instead of defaulting to ASCII
    return Encoding.ASCII;
}

The StreamReader.CurrentEncoding property rarely returns the correct text file encoding for me. I've had greater success determining a file's endianness, by analyzing its byte order mark (BOM). If the file does not have a BOM, this cannot determine the file's encoding.

*UPDATED 4/08/2020 to include UTF-32LE detection and return correct encoding for UTF-32BE

/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Defaults to ASCII when detection of the text file's endianness fails.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
public static Encoding GetEncoding(string filename)
{
    // Read the BOM
    var bom = new byte[4];
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        file.Read(bom, 0, 4);
    }

    // Analyze the BOM
    if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
    if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
    if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; //UTF-32LE
    if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
    if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
    if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true);  //UTF-32BE

    // We actually have no idea what the encoding is if we reach this point, so
    // you may wish to return null instead of defaulting to ASCII
    return Encoding.ASCII;
}

回复收藏 0 原文

弱骨蛰伏 2024-10-02 14:52:17

使用 StreamReader 类，以下代码对我来说效果很好：

  using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, true))
  {
      reader.Peek(); // you need this!
      var encoding = reader.CurrentEncoding;
  }

技巧是使用 Peek 调用，否则，.NET 不会执行任何操作（并且它还没有执行任何操作）阅读序言、BOM）。当然，如果您在检查编码之前使用任何其他 ReadXXX 调用，它也可以工作。

如果文件没有 BOM，则将使用 defaultEncodingIfNoBom 编码。还有一个没有此参数的 StreamReader 构造函数重载（在这种情况下，在任何读取之前编码将默认设置为 UTF8），但我建议在您的上下文中定义您认为默认的编码。

我已经使用带有 UTF8、UTF16/Unicode（LE 和 BE）和 UTF32（LE 和 BE）BOM 的文件成功测试了这一点。它不适用于 UTF7。

The following code works fine for me, using the StreamReader class:

  using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, true))
  {
      reader.Peek(); // you need this!
      var encoding = reader.CurrentEncoding;
  }

The trick is to use the Peek call, otherwise, .NET has not done anything (and it hasn't read the preamble, the BOM). Of course, if you use any other ReadXXX call before checking the encoding, it works too.

If the file has no BOM, then the defaultEncodingIfNoBom encoding will be used. There is also a StreamReader constructor overload without this argument (in this case, the encoding will by default be set to UTF8 before any read), but I recommend to define what you consider the default encoding in your context.

I have tested this successfully with files with BOM for UTF8, UTF16/Unicode (LE & BE) and UTF32 (LE & BE). It does not work for UTF7.

回复收藏 0 原文

怼怹恏 2024-10-02 14:52:17

提供 @CodesInChaos 提议的步骤的实现细节：

1) 检查是否有字节顺序标记

2) 检查文件是否有效 UTF8

3) 使用本地“ANSI”代码页（Microsoft 定义的 ANSI）

步骤 2 有效因为除了 UTF8 之外，代码页中的大多数非 ASCII 序列都不是有效的 UTF8。 https://stackoverflow.com/a/4522251/867248 更详细地解释了该策略。

using System; using System.IO; using System.Text;

// Using encoding from BOM or UTF8 if no BOM found,
// check if the file is valid, by reading all lines
// If decoding fails, use the local "ANSI" codepage

public string DetectFileEncoding(Stream fileStream)
{
    var Utf8EncodingVerifier = Encoding.GetEncoding("utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());
    using (var reader = new StreamReader(fileStream, Utf8EncodingVerifier,
           detectEncodingFromByteOrderMarks: true, leaveOpen: true, bufferSize: 1024))
    {
        string detectedEncoding;
        try
        {
            while (!reader.EndOfStream)
            {
                var line = reader.ReadLine();
            }
            detectedEncoding = reader.CurrentEncoding.BodyName;
        }
        catch (Exception e)
        {
            // Failed to decode the file using the BOM/UT8. 
            // Assume it's local ANSI
            detectedEncoding = "ISO-8859-1";
        }
        // Rewind the stream
        fileStream.Seek(0, SeekOrigin.Begin);
        return detectedEncoding;
   }
}


[Test]
public void Test1()
{
    Stream fs = File.OpenRead(@".\TestData\TextFile_ansi.csv");
    var detectedEncoding = DetectFileEncoding(fs);

    using (var reader = new StreamReader(fs, Encoding.GetEncoding(detectedEncoding)))
    {
       // Consume your file
        var line = reader.ReadLine();
        ...

Providing the implementation details for the steps proposed by @CodesInChaos:

1) Check if there is a Byte Order Mark

2) Check if the file is valid UTF8

3) Use the local "ANSI" codepage (ANSI as Microsoft defines it)

Step 2 works because most non ASCII sequences in codepages other that UTF8 are not valid UTF8. https://stackoverflow.com/a/4522251/867248 explains the tactic in more details.

using System; using System.IO; using System.Text;

// Using encoding from BOM or UTF8 if no BOM found,
// check if the file is valid, by reading all lines
// If decoding fails, use the local "ANSI" codepage

public string DetectFileEncoding(Stream fileStream)
{
    var Utf8EncodingVerifier = Encoding.GetEncoding("utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());
    using (var reader = new StreamReader(fileStream, Utf8EncodingVerifier,
           detectEncodingFromByteOrderMarks: true, leaveOpen: true, bufferSize: 1024))
    {
        string detectedEncoding;
        try
        {
            while (!reader.EndOfStream)
            {
                var line = reader.ReadLine();
            }
            detectedEncoding = reader.CurrentEncoding.BodyName;
        }
        catch (Exception e)
        {
            // Failed to decode the file using the BOM/UT8. 
            // Assume it's local ANSI
            detectedEncoding = "ISO-8859-1";
        }
        // Rewind the stream
        fileStream.Seek(0, SeekOrigin.Begin);
        return detectedEncoding;
   }
}


[Test]
public void Test1()
{
    Stream fs = File.OpenRead(@".\TestData\TextFile_ansi.csv");
    var detectedEncoding = DetectFileEncoding(fs);

    using (var reader = new StreamReader(fs, Encoding.GetEncoding(detectedEncoding)))
    {
       // Consume your file
        var line = reader.ReadLine();
        ...

回复收藏 0 原文

碍人泪离人颜 2024-10-02 14:52:17

检查一下。

UDE

这是 Mozilla 通用字符集检测器的端口，您可以像这样使用它......

public static void Main(String[] args)
{
    string filename = args[0];
    using (FileStream fs = File.OpenRead(filename)) {
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null) {
            Console.WriteLine("Charset: {0}, confidence: {1}", 
                 cdet.Charset, cdet.Confidence);
        } else {
            Console.WriteLine("Detection failed.");
        }
    }
}

Check this.

UDE

This is a port of Mozilla Universal Charset Detector and you can use it like this...

public static void Main(String[] args)
{
    string filename = args[0];
    using (FileStream fs = File.OpenRead(filename)) {
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null) {
            Console.WriteLine("Charset: {0}, confidence: {1}", 
                 cdet.Charset, cdet.Confidence);
        } else {
            Console.WriteLine("Detection failed.");
        }
    }
}

回复收藏 0 原文

神经暖 2024-10-02 14:52:17

我会尝试以下步骤：

1）检查是否有字节顺序标记

2）检查文件是否有效 UTF8

3）使用本地“ANSI”代码页（Microsoft 定义的 ANSI）

步骤 2 有效，因为大多数非 ASCII UTF8 以外的代码页中的序列不是有效的 UTF8。

回复收藏 0 原文

别想她 2024-10-02 14:52:17

.NET 不是很有帮助，但您可以尝试以下算法：

尝试通过 BOM（字节顺序标记）查找编码...很可能找不到
尝试解析为不同的编码

这是调用：

var encoding = FileHelper.GetEncoding(filePath);
if (encoding == null)
    throw new Exception("The file encoding is not supported. Please choose one of the following encodings: UTF8/UTF7/iso-8859-1");

这是代码:

public class FileHelper
{
    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM) and if not found try parsing into diferent encodings       
    /// Defaults to UTF8 when detection of the text file's endianness fails.
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding or null.</returns>
    public static Encoding GetEncoding(string filename)
    {
        var encodingByBOM = GetEncodingByBOM(filename);
        if (encodingByBOM != null)
            return encodingByBOM;

        // BOM not found :(, so try to parse characters into several encodings
        var encodingByParsingUTF8 = GetEncodingByParsing(filename, Encoding.UTF8);
        if (encodingByParsingUTF8 != null)
            return encodingByParsingUTF8;

        var encodingByParsingLatin1 = GetEncodingByParsing(filename, Encoding.GetEncoding("iso-8859-1"));
        if (encodingByParsingLatin1 != null)
            return encodingByParsingLatin1;

        var encodingByParsingUTF7 = GetEncodingByParsing(filename, Encoding.UTF7);
        if (encodingByParsingUTF7 != null)
            return encodingByParsingUTF7;

        return null;   // no encoding found
    }

    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM)  
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding.</returns>
    private static Encoding GetEncodingByBOM(string filename)
    {
        // Read the BOM
        var byteOrderMark = new byte[4];
        using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
        {
            file.Read(byteOrderMark, 0, 4);
        }

        // Analyze the BOM
        if (byteOrderMark[0] == 0x2b && byteOrderMark[1] == 0x2f && byteOrderMark[2] == 0x76) return Encoding.UTF7;
        if (byteOrderMark[0] == 0xef && byteOrderMark[1] == 0xbb && byteOrderMark[2] == 0xbf) return Encoding.UTF8;
        if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
        if (byteOrderMark[0] == 0xfe && byteOrderMark[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
        if (byteOrderMark[0] == 0 && byteOrderMark[1] == 0 && byteOrderMark[2] == 0xfe && byteOrderMark[3] == 0xff) return Encoding.UTF32;

        return null;    // no BOM found
    }

    private static Encoding GetEncodingByParsing(string filename, Encoding encoding)
    {            
        var encodingVerifier = Encoding.GetEncoding(encoding.BodyName, new EncoderExceptionFallback(), new DecoderExceptionFallback());

        try
        {
            using (var textReader = new StreamReader(filename, encodingVerifier, detectEncodingFromByteOrderMarks: true))
            {
                while (!textReader.EndOfStream)
                {                        
                    textReader.ReadLine();   // in order to increment the stream position
                }

                // all text parsed ok
                return textReader.CurrentEncoding;
            }
        }
        catch (Exception ex) { }

        return null;    // 
    }
}

.NET is not very helpful, but you can try the following algorithm:

try to find the encoding by BOM(byte order mark) ... very likely not to be found
try parsing into different encodings

Here is the call:

var encoding = FileHelper.GetEncoding(filePath);
if (encoding == null)
    throw new Exception("The file encoding is not supported. Please choose one of the following encodings: UTF8/UTF7/iso-8859-1");

Here is the code:

public class FileHelper
{
    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM) and if not found try parsing into diferent encodings       
    /// Defaults to UTF8 when detection of the text file's endianness fails.
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding or null.</returns>
    public static Encoding GetEncoding(string filename)
    {
        var encodingByBOM = GetEncodingByBOM(filename);
        if (encodingByBOM != null)
            return encodingByBOM;

        // BOM not found :(, so try to parse characters into several encodings
        var encodingByParsingUTF8 = GetEncodingByParsing(filename, Encoding.UTF8);
        if (encodingByParsingUTF8 != null)
            return encodingByParsingUTF8;

        var encodingByParsingLatin1 = GetEncodingByParsing(filename, Encoding.GetEncoding("iso-8859-1"));
        if (encodingByParsingLatin1 != null)
            return encodingByParsingLatin1;

        var encodingByParsingUTF7 = GetEncodingByParsing(filename, Encoding.UTF7);
        if (encodingByParsingUTF7 != null)
            return encodingByParsingUTF7;

        return null;   // no encoding found
    }

    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM)  
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding.</returns>
    private static Encoding GetEncodingByBOM(string filename)
    {
        // Read the BOM
        var byteOrderMark = new byte[4];
        using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
        {
            file.Read(byteOrderMark, 0, 4);
        }

        // Analyze the BOM
        if (byteOrderMark[0] == 0x2b && byteOrderMark[1] == 0x2f && byteOrderMark[2] == 0x76) return Encoding.UTF7;
        if (byteOrderMark[0] == 0xef && byteOrderMark[1] == 0xbb && byteOrderMark[2] == 0xbf) return Encoding.UTF8;
        if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
        if (byteOrderMark[0] == 0xfe && byteOrderMark[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
        if (byteOrderMark[0] == 0 && byteOrderMark[1] == 0 && byteOrderMark[2] == 0xfe && byteOrderMark[3] == 0xff) return Encoding.UTF32;

        return null;    // no BOM found
    }

    private static Encoding GetEncodingByParsing(string filename, Encoding encoding)
    {            
        var encodingVerifier = Encoding.GetEncoding(encoding.BodyName, new EncoderExceptionFallback(), new DecoderExceptionFallback());

        try
        {
            using (var textReader = new StreamReader(filename, encodingVerifier, detectEncodingFromByteOrderMarks: true))
            {
                while (!textReader.EndOfStream)
                {                        
                    textReader.ReadLine();   // in order to increment the stream position
                }

                // all text parsed ok
                return textReader.CurrentEncoding;
            }
        }
        catch (Exception ex) { }

        return null;    // 
    }
}

回复收藏 0 原文

黯然#的苍凉 2024-10-02 14:52:17

@nonoandy 提出的解决方案非常有趣，我已经成功地测试了它并且似乎工作得很好。

所需的 nuget 包是 Microsoft.ProgramSynthesis.Detection （目前版本 8.17.0）

我建议使用 EncodingTypeUtils.GetDotNetName 而不是使用开关来获取 <代码>编码实例：

using System.Text;
using Microsoft.ProgramSynthesis.Detection.Encoding;

...

public Encoding? DetectEncoding(Stream stream)
{
    try
    {
        if (stream.CanSeek)
        {
            // Read from the beginning if possible
            stream.Seek(0, SeekOrigin.Begin);
        }

        // Detect encoding type (enum)
        var encodingType = EncodingIdentifier.IdentifyEncoding(stream);
        
        // Get the corresponding encoding name to be passed to System.Text.Encoding.GetEncoding
        var encodingDotNetName = EncodingTypeUtils.GetDotNetName(encodingType);

        if (!string.IsNullOrEmpty(encodingDotNetName))
        {
            return Encoding.GetEncoding(encodingDotNetName);
        }
    }
    catch (Exception e)
    {
        // Handle exception (log, throw, etc...)
    }

    // In case of error return null or a default value
    return null;
}

The solution proposed by @nonoandy is really interesting, I have succesfully tested it and seems to be working perfectly.

The nuget package needed is Microsoft.ProgramSynthesis.Detection (version 8.17.0 at the moment)

I suggest to use the EncodingTypeUtils.GetDotNetName instead of using a switch for getting the Encoding instance:

using System.Text;
using Microsoft.ProgramSynthesis.Detection.Encoding;

...

public Encoding? DetectEncoding(Stream stream)
{
    try
    {
        if (stream.CanSeek)
        {
            // Read from the beginning if possible
            stream.Seek(0, SeekOrigin.Begin);
        }

        // Detect encoding type (enum)
        var encodingType = EncodingIdentifier.IdentifyEncoding(stream);
        
        // Get the corresponding encoding name to be passed to System.Text.Encoding.GetEncoding
        var encodingDotNetName = EncodingTypeUtils.GetDotNetName(encodingType);

        if (!string.IsNullOrEmpty(encodingDotNetName))
        {
            return Encoding.GetEncoding(encodingDotNetName);
        }
    }
    catch (Exception e)
    {
        // Handle exception (log, throw, etc...)
    }

    // In case of error return null or a default value
    return null;
}

回复收藏 0 原文

凉栀 2024-10-02 14:52:17

在此处查找 c#

https ://msdn.microsoft.com/en-us/library/system.io.streamreader.currentencoding%28v=vs.110%29.aspx

string path = @"path\to\your\file.ext";

using (StreamReader sr = new StreamReader(path, true))
{
    while (sr.Peek() >= 0)
    {
        Console.Write((char)sr.Read());
    }

    //Test for the encoding after reading, or at least
    //after the first read.
    Console.WriteLine("The encoding used was {0}.", sr.CurrentEncoding);
    Console.ReadLine();
    Console.WriteLine();
}

Look here for c#

https://msdn.microsoft.com/en-us/library/system.io.streamreader.currentencoding%28v=vs.110%29.aspx

string path = @"path\to\your\file.ext";

using (StreamReader sr = new StreamReader(path, true))
{
    while (sr.Peek() >= 0)
    {
        Console.Write((char)sr.Read());
    }

    //Test for the encoding after reading, or at least
    //after the first read.
    Console.WriteLine("The encoding used was {0}.", sr.CurrentEncoding);
    Console.ReadLine();
    Console.WriteLine();
}

回复收藏 0 原文

黒涩兲箜 2024-10-02 14:52:17

以下代码是我的Powershell代码，用于确定某些cpp或h或ml文件是否使用ISO-8859-1(Latin-1)或不带BOM的UTF-8编码，如果两者都不是，则假设它是GB18030。我是一名在法国工作的中国人，MSVC 在法国计算机上保存为 Latin-1，在中国计算机上保存为 GB，因此这有助于我在系统和同事之间进行源文件交换时避免编码问题。

方法很简单，如果所有字符都在x00-x7E之间，ASCII、UTF-8和Latin-1都是一样的，但是如果我用UTF-8读取非ASCII文件，我们会发现特殊字符�出现，所以尝试用 Latin-1 来阅读。在Latin-1中，\x7F和\xAF之间是空的，而GB在x00-xFF之间使用完整的，所以如果我在两者之间有任何一个，它不是Latin-1

代码是用PowerShell编写的，但使用.net，所以很容易被翻译成 C# 或 F#

$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
foreach($i in Get-ChildItem .\ -Recurse -include *.cpp,*.h, *.ml) {
    $openUTF = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::UTF8)
    $contentUTF = $openUTF.ReadToEnd()
    [regex]$regex = '�'
    $c=$regex.Matches($contentUTF).count
    $openUTF.Close()
    if ($c -ne 0) {
        $openLatin1 = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('ISO-8859-1'))
        $contentLatin1 = $openLatin1.ReadToEnd()
        $openLatin1.Close()
        [regex]$regex = '[\x7F-\xAF]'
        $c=$regex.Matches($contentLatin1).count
        if ($c -eq 0) {
            [System.IO.File]::WriteAllLines($i, $contentLatin1, $Utf8NoBomEncoding)
            $i.FullName
        } 
        else {
            $openGB = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('GB18030'))
            $contentGB = $openGB.ReadToEnd()
            $openGB.Close()
            [System.IO.File]::WriteAllLines($i, $contentGB, $Utf8NoBomEncoding)
            $i.FullName
        }
    }
}
Write-Host -NoNewLine 'Press any key to continue...';
$null = $Host.UI.RawUI.ReadKey('NoEcho,IncludeKeyDown');

The following codes are my Powershell codes to determinate if some cpp or h or ml files are encodeding with ISO-8859-1(Latin-1) or UTF-8 without BOM, if neither then suppose it to be GB18030. I am a Chinese working in France and MSVC saves as Latin-1 on french computer and saves as GB on Chinese computer so this helps me avoid encoding problem when do source file exchanges between my system and my colleagues.

The way is simple, if all characters are between x00-x7E, ASCII, UTF-8 and Latin-1 are all the same, but if I read a non ASCII file by UTF-8, we will find the special character � show up, so try to read with Latin-1. In Latin-1, between \x7F and \xAF is empty, while GB uses full between x00-xFF so if I got any between the two, it's not Latin-1

The code is written in PowerShell, but uses .net so it's easy to be translated into C# or F#

$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
foreach($i in Get-ChildItem .\ -Recurse -include *.cpp,*.h, *.ml) {
    $openUTF = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::UTF8)
    $contentUTF = $openUTF.ReadToEnd()
    [regex]$regex = '�'
    $c=$regex.Matches($contentUTF).count
    $openUTF.Close()
    if ($c -ne 0) {
        $openLatin1 = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('ISO-8859-1'))
        $contentLatin1 = $openLatin1.ReadToEnd()
        $openLatin1.Close()
        [regex]$regex = '[\x7F-\xAF]'
        $c=$regex.Matches($contentLatin1).count
        if ($c -eq 0) {
            [System.IO.File]::WriteAllLines($i, $contentLatin1, $Utf8NoBomEncoding)
            $i.FullName
        } 
        else {
            $openGB = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('GB18030'))
            $contentGB = $openGB.ReadToEnd()
            $openGB.Close()
            [System.IO.File]::WriteAllLines($i, $contentGB, $Utf8NoBomEncoding)
            $i.FullName
        }
    }
}
Write-Host -NoNewLine 'Press any key to continue...';
$null = $Host.UI.RawUI.ReadKey('NoEcho,IncludeKeyDown');

回复收藏 0 原文

烟花肆意 2024-10-02 14:52:17

这似乎运作良好。

首先创建一个辅助方法：

  private static Encoding TestCodePage(Encoding testCode, byte[] byteArray)
    {
      try
      {
        var encoding = Encoding.GetEncoding(testCode.CodePage, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        var a = encoding.GetCharCount(byteArray);
        return testCode;
      }
      catch (Exception e)
      {
        return null;
      }
    }

然后创建代码来测试源代码。在这种情况下，我有一个字节数组，我需要获取以下内容的编码：

 public static Encoding DetectCodePage(byte[] contents)
    {
      if (contents == null || contents.Length == 0)
      {
        return Encoding.Default;
      }

      return TestCodePage(Encoding.UTF8, contents)
             ?? TestCodePage(Encoding.Unicode, contents)
             ?? TestCodePage(Encoding.BigEndianUnicode, contents)
             ?? TestCodePage(Encoding.GetEncoding(1252), contents) // Western European
             ?? TestCodePage(Encoding.GetEncoding(28591), contents) // ISO Western European
             ?? TestCodePage(Encoding.ASCII, contents)
             ?? TestCodePage(Encoding.Default, contents); // likely Unicode
    }

This seems to work well.

First create a helper method:

  private static Encoding TestCodePage(Encoding testCode, byte[] byteArray)
    {
      try
      {
        var encoding = Encoding.GetEncoding(testCode.CodePage, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        var a = encoding.GetCharCount(byteArray);
        return testCode;
      }
      catch (Exception e)
      {
        return null;
      }
    }

Then create code to test the source. In this case, I've got a byte array I need to get the encoding of:

 public static Encoding DetectCodePage(byte[] contents)
    {
      if (contents == null || contents.Length == 0)
      {
        return Encoding.Default;
      }

      return TestCodePage(Encoding.UTF8, contents)
             ?? TestCodePage(Encoding.Unicode, contents)
             ?? TestCodePage(Encoding.BigEndianUnicode, contents)
             ?? TestCodePage(Encoding.GetEncoding(1252), contents) // Western European
             ?? TestCodePage(Encoding.GetEncoding(28591), contents) // ISO Western European
             ?? TestCodePage(Encoding.ASCII, contents)
             ?? TestCodePage(Encoding.Default, contents); // likely Unicode
    }

回复收藏 0 原文

孤单情人 2024-10-02 14:52:17

我尝试了几种不同的方法来检测编码并遇到大多数问题。

我利用 Microsoft Nuget 包制作了以下内容，到目前为止它似乎对我有用，但需要进行更多测试。
我的大部分测试都是针对 UTF8、带有 BOM 的 UTF8 和 ANSI。

static void Main(string[] args)
{
    var path = Directory.GetCurrentDirectory() + "\\TextFile2.txt";
    List<string> contents = File.ReadLines(path, GetEncoding(path)).Where(w => !string.IsNullOrWhiteSpace(w)).ToList();

    int i = 0;
    foreach (var line in contents)
    {
        i++;
        Console.WriteLine(line);
        if (i > 100)
            break;
    }

}


public static Encoding GetEncoding(string filename)
{
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        var detectedEncoding = Microsoft.ProgramSynthesis.Detection.Encoding.EncodingIdentifier.IdentifyEncoding(file);
        switch (detectedEncoding)
        {
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf8:
                return Encoding.UTF8;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf16Be:
                return Encoding.BigEndianUnicode;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf16Le:
                return Encoding.Unicode;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf32Le:
                return Encoding.UTF32;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Ascii:
                return Encoding.ASCII;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Iso88591:
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Unknown:
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Windows1252:
            default:
            return Encoding.Default;
        }
    }
}

I have tried a few different ways to detect encoding and hit issues with most of them.

I made the following leveraging a Microsoft Nuget Package and it seems to work for me so far but needs tested a lot more.
Most of my testing has been on UTF8, UTF8 with BOM and ANSI.

static void Main(string[] args)
{
    var path = Directory.GetCurrentDirectory() + "\\TextFile2.txt";
    List<string> contents = File.ReadLines(path, GetEncoding(path)).Where(w => !string.IsNullOrWhiteSpace(w)).ToList();

    int i = 0;
    foreach (var line in contents)
    {
        i++;
        Console.WriteLine(line);
        if (i > 100)
            break;
    }

}


public static Encoding GetEncoding(string filename)
{
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        var detectedEncoding = Microsoft.ProgramSynthesis.Detection.Encoding.EncodingIdentifier.IdentifyEncoding(file);
        switch (detectedEncoding)
        {
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf8:
                return Encoding.UTF8;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf16Be:
                return Encoding.BigEndianUnicode;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf16Le:
                return Encoding.Unicode;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf32Le:
                return Encoding.UTF32;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Ascii:
                return Encoding.ASCII;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Iso88591:
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Unknown:
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Windows1252:
            default:
            return Encoding.Default;
        }
    }
}

回复收藏 0 原文

寄居者 2024-10-02 14:52:17

可能有用

string path = @"address/to/the/file.extension";

using (StreamReader sr = new StreamReader(path))
{ 
    Console.WriteLine(sr.CurrentEncoding);                        
}

It may be useful

string path = @"address/to/the/file.extension";

using (StreamReader sr = new StreamReader(path))
{ 
    Console.WriteLine(sr.CurrentEncoding);                        
}

回复收藏 0 原文

~没有更多了~