How can I detect the encoding/codepage of a text file?

Posted 2024-07-05 07:27:56

In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When read, these files sometimes contain garbage, because they were created in a different/unknown codepage.

Is there a way to (automatically) detect the codepage of a text file?

The detectEncodingFromByteOrderMarks, on the StreamReader constructor, works for UTF8 and other Unicode-marked files, but I'm looking for a way to detect code pages like ibm850 and windows1252.


Thanks for your answers, this is what I've done.

The files we receive are from end-users, they do not have a clue about codepages. The receivers are also end-users, by now this is what they know about codepages: Codepages exist, and are annoying.

Solution:

  • Open the received file in Notepad, look at a garbled piece of text. If somebody is called François or something, with your human intelligence you can guess this.
  • I've created a small app that the user can use to open the file and enter a piece of text that the user knows will appear in the file when the correct codepage is used.
  • Loop through all codepages, and display the ones that give a solution with the user provided text.
  • If more than one codepage pops up, ask the user to specify more text.
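The brute-force search in the steps above can be sketched roughly as follows. This is a minimal sketch, not the asker's actual app; the class and method names are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class CodepageGuesser
{
    // Return every installed codepage whose decoding of the raw bytes
    // contains the text the user says must appear in the file.
    public static List<EncodingInfo> FindCandidates(byte[] raw, string knownText)
    {
        var candidates = new List<EncodingInfo>();
        foreach (EncodingInfo info in Encoding.GetEncodings())
        {
            try
            {
                string decoded = info.GetEncoding().GetString(raw);
                if (decoded.Contains(knownText))
                    candidates.Add(info);
            }
            catch (Exception)
            {
                // Some codepages may fail to decode these bytes; skip them.
            }
        }
        return candidates;
    }
}
```

If the list contains more than one entry, you would ask the user for a longer or more distinctive piece of text and filter again, as described above.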

Comments (21)

朦胧时间 2024-07-12 07:27:56

You can't detect the codepage, you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.

Anyway, this is what you need to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Specifically Joel says:

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.
There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

扬花落满肩 2024-07-12 07:27:56

If you're looking to detect non-UTF encodings (i.e. no BOM), you're basically down to heuristics and statistical analysis of the text. You might want to take a look at the Mozilla paper on universal charset detection (same link, with better formatting via Wayback Machine).

十雾 2024-07-12 07:27:56

I know it's very late for this question and this solution won't appeal to some (because of its English-centric bias and its lack of statistical/empirical testing), but it's worked very well for me, especially for processing uploaded CSV data:

http://www.architectshack.com/TextFileEncodingDetector.ashx

Advantages:

  • BOM detection built-in
  • Default/fallback encoding customizable
  • Pretty reliable (in my experience) for western-European-based files containing some exotic data (e.g. French names) with a mixture of UTF-8 and Latin-1-style files - basically the bulk of US and western European environments.

Note: I'm the one who wrote this class, so obviously take it with a grain of salt! :)

终难愈 2024-07-12 07:27:56

You can't detect the codepage

This is clearly false. Every web browser has some kind of universal charset detector to deal with pages which have no indication whatsoever of an encoding. Firefox has one. You can download the code and see how it does it. See some documentation here. Basically, it is a heuristic, but one that works really well.

Given a reasonable amount of text, it is even possible to detect the language.

Here's another one I just found using Google:

无畏 2024-07-12 07:27:56

Have you tried the C# port of the Mozilla Universal Charset Detector?

Example from http://code.google.com/p/ude/

public static void Main(String[] args)
{
    string filename = args[0];
    using (FileStream fs = File.OpenRead(filename)) {
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null) {
            Console.WriteLine("Charset: {0}, confidence: {1}", 
                 cdet.Charset, cdet.Confidence);
        } else {
            Console.WriteLine("Detection failed.");
        }
    }
}    
枯叶蝶 2024-07-12 07:27:56

If someone is looking for a 93.9% solution, this works for me:

public static class StreamExtension
{
    /// <summary>
    /// Convert the content to a string.
    /// </summary>
    /// <param name="stream">The stream.</param>
    /// <returns></returns>
    public static string ReadAsString(this Stream stream)
    {
        var startPosition = stream.Position;
        try
        {
            // 1. Check for a BOM
            // 2. or try with UTF-8. The most (86.3%) used encoding. Visit: http://w3techs.com/technologies/overview/character_encoding/all/
            var streamReader = new StreamReader(stream, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true), detectEncodingFromByteOrderMarks: true);
            return streamReader.ReadToEnd();
        }
        catch (DecoderFallbackException)
        {
            stream.Position = startPosition;

            // 3. The second most (6.7%) used encoding is ISO-8859-1. So use Windows-1252 (0.9%, also known as ANSI), which is a superset of ISO-8859-1.
            var streamReader = new StreamReader(stream, Encoding.GetEncoding(1252));
            return streamReader.ReadToEnd();
        }
    }
}
宫墨修音 2024-07-12 07:27:56

Notepad++ has this feature out-of-the-box. It also supports changing the encoding.

沫离伤花 2024-07-12 07:27:56

Looking for a different solution, I found that

https://code.google.com/p/ude/

is kinda heavy.

I needed some basic encoding detection, based on the first 4 bytes and probably XML charset detection - so I took some sample source code from the internet and added a slightly modified version of

http://lists.w3.org/Archives/Public/www-validator/2002Aug/0084.html

written for Java.

    public static Encoding DetectEncoding(byte[] fileContent)
    {
        if (fileContent == null)
            throw new ArgumentNullException();

        if (fileContent.Length < 2)
            return Encoding.ASCII;      // Default fallback

        if (fileContent[0] == 0xff
            && fileContent[1] == 0xfe
            && (fileContent.Length < 4
                || fileContent[2] != 0
                || fileContent[3] != 0
                )
            )
            return Encoding.Unicode;

        if (fileContent[0] == 0xfe
            && fileContent[1] == 0xff
            )
            return Encoding.BigEndianUnicode;

        if (fileContent.Length < 3)
            return null;

        if (fileContent[0] == 0xef && fileContent[1] == 0xbb && fileContent[2] == 0xbf)
            return Encoding.UTF8;

        if (fileContent[0] == 0x2b && fileContent[1] == 0x2f && fileContent[2] == 0x76)
            return Encoding.UTF7;

        if (fileContent.Length < 4)
            return null;

        if (fileContent[0] == 0xff && fileContent[1] == 0xfe && fileContent[2] == 0 && fileContent[3] == 0)
            return Encoding.UTF32;

        if (fileContent[0] == 0 && fileContent[1] == 0 && fileContent[2] == 0xfe && fileContent[3] == 0xff)
            return Encoding.GetEncoding(12001);

        String probe;
        int len = fileContent.Length;

        if( fileContent.Length >= 128 ) len = 128;
        probe = Encoding.ASCII.GetString(fileContent, 0, len);

        MatchCollection mc = Regex.Matches(probe, "^<\\?xml[^<>]*encoding[ \\t\\n\\r]?=[\\t\\n\\r]?['\"]([A-Za-z]([A-Za-z0-9._]|-)*)", RegexOptions.Singleline);
        // Add '[0].Groups[1].Value' to the end to test regex

        if( mc.Count == 1 && mc[0].Groups.Count >= 2 )
        {
            // Typically picks up 'UTF-8' string
            Encoding enc = null;

            try {
                enc = Encoding.GetEncoding( mc[0].Groups[1].Value );
            }catch (Exception ) { }

            if( enc != null )
                return enc;
        }

        return Encoding.ASCII;      // Default fallback
    }

It's probably enough to read the first 1024 bytes from the file, but I'm loading the whole file.

静谧 2024-07-12 07:27:56

I've done something similar in Python. Basically, you need lots of sample data from various encodings, broken down by a sliding two-byte window and stored in a dictionary (hash), keyed on byte-pairs, with values being lists of encodings.

Given that dictionary (hash), you take your input text and:

  • if it starts with any BOM character ('\xfe\xff' for UTF-16-BE, '\xff\xfe' for UTF-16-LE, '\xef\xbb\xbf' for UTF-8 etc), I treat it as suggested
  • if not, then take a large enough sample of the text, take all byte-pairs of the sample and choose the encoding that is the least common suggested from the dictionary.

If you've also sampled UTF encoded texts that do not start with any BOM, the second step will cover those that slipped from the first step.

So far, it works for me (the sample data and subsequent input data are subtitles in various languages) with diminishing error rates.
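One possible reading of the byte-pair model described above, sketched in C# for consistency with the rest of this thread (the class and its member names are invented; the original was Python, and its exact scoring rule isn't shown):

```csharp
using System;
using System.Collections.Generic;

class BytePairModel
{
    // byte-pair -> set of encodings in which that pair was seen during training
    private readonly Dictionary<(byte, byte), HashSet<string>> _pairs = new();

    // Feed a sample that is known to be in the given encoding.
    public void Train(byte[] sample, string encodingName)
    {
        for (int i = 0; i + 1 < sample.Length; i++)
        {
            var key = (sample[i], sample[i + 1]);
            if (!_pairs.TryGetValue(key, out var set))
                _pairs[key] = set = new HashSet<string>();
            set.Add(encodingName);
        }
    }

    // Score each trained encoding by how many of the input's byte pairs
    // it explains, and return the best match (null if nothing matched).
    public string Guess(byte[] input)
    {
        var score = new Dictionary<string, int>();
        for (int i = 0; i + 1 < input.Length; i++)
        {
            if (_pairs.TryGetValue((input[i], input[i + 1]), out var encodings))
                foreach (var enc in encodings)
                    score[enc] = score.TryGetValue(enc, out var n) ? n + 1 : 1;
        }
        string best = null;
        int bestScore = -1;
        foreach (var kv in score)
            if (kv.Value > bestScore) { best = kv.Key; bestScore = kv.Value; }
        return best;
    }
}
```

As the answer notes, the quality of the guess depends heavily on how much and how representative the training data is.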

咆哮 2024-07-12 07:27:56

The tool "uchardet" does this well using character frequency distribution models for each charset. Larger files and more "typical" files have more confidence (obviously).

On Ubuntu, you just apt-get install uchardet.

On other systems, get the source, usage & docs here: https://github.com/BYVoid/uchardet

下壹個目標 2024-07-12 07:27:56

If you can link to a C library, you can use libenca. See http://cihar.com/software/enca/. From the man page:

Enca reads given text files, or standard input when none are given,
and uses knowledge about their language (must be supported by you) and
a mixture of parsing, statistical analysis, guessing and black magic
to determine their encodings.

It's GPL v2.

梦毁影碎の 2024-07-12 07:27:56

Got the same problem but didn't find a good solution yet for detecting it automatically.
Now I'm using PsPad (www.pspad.com) for that ;) Works fine.

乱世争霸 2024-07-12 07:27:56

The StreamReader class's constructor takes a 'detect encoding' parameter.

孤云独去闲 2024-07-12 07:27:56

Open file in AkelPad(or just copy/paste a garbled text), go to Edit -> Selection -> Recode... -> check "Autodetect".

无声无音无过去 2024-07-12 07:27:56

Since it basically comes down to heuristics, it may help to use the encoding of previously received files from the same source as a first hint.

Most people (or applications) do stuff in pretty much the same order every time, often on the same machine, so it's quite likely that when Bob creates a .csv file and sends it to Mary, it'll always be using Windows-1252 or whatever his machine defaults to.

Where possible a bit of customer training never hurts either :-)
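The "remember what worked last time for this sender" hint could be cached with something as small as this (a sketch only; the class and member names are made up):

```csharp
using System.Collections.Generic;
using System.Text;

// Remember which encoding worked last time for each source/sender,
// and try that one first before falling back to guessing.
class EncodingHintCache
{
    private readonly Dictionary<string, Encoding> _lastKnown = new();

    // Return the last known good encoding for this sender, or the fallback.
    public Encoding GetHint(string sender, Encoding fallback) =>
        _lastKnown.TryGetValue(sender, out var enc) ? enc : fallback;

    // Record the encoding that turned out to be correct for this sender.
    public void Remember(string sender, Encoding encoding) =>
        _lastKnown[sender] = encoding;
}
```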

单身情人 2024-07-12 07:27:56

I was actually looking for a generic, not programming way of detecting the file encoding, but I didn't find that yet.
What I did find by testing with different encodings was that my text was UTF-7.

So where I first was doing:
StreamReader file = File.OpenText(fullfilename);

I had to change it to:
StreamReader file = new StreamReader(fullfilename, System.Text.Encoding.UTF7);

OpenText assumes it's UTF-8.

You can also create the StreamReader like this:
new StreamReader(fullfilename, true); the second parameter means that it should try to detect the encoding from the byte-order mark of the file, but that didn't work in my case.

静谧 2024-07-12 07:27:56

Thanks @Erik Aronesty for mentioning uchardet.

Meanwhile the (same?) tool exists for Linux: chardet.
Or, on Cygwin you may want to use: chardetect.

See: chardet man page: https://www.commandlinux.com/man-page/man1/chardetect.1.html

This will heuristically detect (guess) the character encoding for each given file and will report the name and confidence level for each file's detected character encoding.

七禾 2024-07-12 07:27:56

As an add-on to ITmeze's post, I've used this function to convert the output of the C# port of the Mozilla Universal Charset Detector:

    private Encoding GetEncodingFromString(string codePageName)
    {
        try
        {
            return Encoding.GetEncoding(codePageName);
        }
        catch
        {
            return Encoding.ASCII;
        }
    }

MSDN

优雅的叶子 2024-07-12 07:27:56

Try to install the perl module Text::Unaccent::PurePerl by typing cpanm Text::Unaccent. This generates a build.log file that displays as Chinese in some applications and as English in others (cpanm is the initial text). A plausible attempt, should you be lucky enough to have spaces in the language, is to compare the distribution frequency of words via a statistical test.

抹茶夏天i‖ 2024-07-12 07:27:56

I use this code to detect Unicode and the Windows default ANSI codepage when reading a file. For other codings a check of the content is necessary, manually or by programming. This can be used to save the text with the same encoding as when it was opened. (I use VB.NET)

'Works for Default and unicode (auto detect)
Dim mystreamreader As New StreamReader(LocalFileName, Encoding.Default) 
MyEditTextBox.Text = mystreamreader.ReadToEnd()
Debug.Print(mystreamreader.CurrentEncoding.CodePage) 'Autodetected encoding
mystreamreader.Close()
柠檬 2024-07-12 07:27:56

10 years (!) have passed since this was asked, and still I see no mention of MS's good, non-GPL'ed solution: the IMultiLanguage2 API.

Most libraries already mentioned are based on Mozilla's UDE - and it seems reasonable that browsers have already tackled similar problems. I don't know what Chrome's solution is, but since IE 5.0 MS have released theirs, and it is:

  1. Free of GPL-and-the-like licensing issues,
  2. Backed and maintained probably forever,
  3. Gives rich output - all valid candidates for encoding/codepages along with confidence scores,
  4. Surprisingly easy to use (it is a single function call).

It is a native COM call, but here's some very nice work by Carsten Zeumer that handles the interop mess for .net usage. There are some others around, but by and large this library doesn't get the attention it deserves.
