How can I detect the encoding/codepage of a text file?
In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When reading, these files sometimes contain garbage, because the files were created in a different/unknown codepage.
Is there a way to (automatically) detect the codepage of a text file?
The detectEncodingFromByteOrderMarks parameter on the StreamReader constructor works for UTF8 and other Unicode-marked files, but I'm looking for a way to detect codepages like ibm850 and windows1252.
Thanks for your answers, this is what I've done.
The files we receive are from end-users; they do not have a clue about codepages. The receivers are also end-users, and by now this is what they know about codepages: codepages exist, and they are annoying.
Solution:
- Open the received file in Notepad, look at a garbled piece of text. If somebody is called François or something, with your human intelligence you can guess this.
- I've created a small app that the user can use to open the file with, and enter a piece of text that the user knows will appear in the file when the correct codepage is used.
- Loop through all codepages, and display the ones that give a solution with the user-provided text (see the sketch after this list).
- If more than one codepage pops up, ask the user to specify more text.
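A minimal sketch of that brute-force loop in C#, assuming the whole file fits in memory (the class name, file name, and sample text below are made up for illustration):

using System.Collections.Generic;
using System.IO;
using System.Text;

static class CodepageGuesser
{
    // Return every installed codepage whose decoding of the raw bytes
    // contains the text the user expects to see.
    public static IEnumerable<EncodingInfo> FindCandidates(string path, string knownText)
    {
        byte[] raw = File.ReadAllBytes(path);
        foreach (EncodingInfo info in Encoding.GetEncodings())
        {
            string decoded = info.GetEncoding().GetString(raw);
            if (decoded.Contains(knownText))
                yield return info; // this codepage reproduces the expected text
        }
    }
}

Calling CodepageGuesser.FindCandidates("received.csv", "François") would then list the candidate codepages. Note that on .NET Core / .NET 5+ you would first need Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) (from the System.Text.Encoding.CodePages package), otherwise most legacy codepages are not returned by Encoding.GetEncodings() at all.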
You can't detect the codepage, you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results. I can't find it now, but I'm sure Notepad can be tricked into displaying English text in Chinese.
Anyway, this is what you need to read:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Specifically, Joel says:
If you're looking to detect non-UTF encodings (i.e. no BOM), you're basically down to heuristics and statistical analysis of the text. You might want to take a look at the Mozilla paper on universal charset detection (same link, with better formatting via Wayback Machine).
I know it's very late for this question and this solution won't appeal to some (because of its English-centric bias and its lack of statistical/empirical testing), but it's worked very well for me, especially for processing uploaded CSV data:
http://www.architectshack.com/TextFileEncodingDetector.ashx
Advantages:
Note: I'm the one who wrote this class, so obviously take it with a grain of salt! :)
This is clearly false. Every web browser has some kind of universal charset detector to deal with pages which have no indication whatsoever of an encoding. Firefox has one. You can download the code and see how it does it. See some documentation here. Basically, it is a heuristic, but one that works really well.
Given a reasonable amount of text, it is even possible to detect the language.
Here's another one I just found using Google:
Have you tried the C# port of the Mozilla Universal Charset Detector?
Example from http://code.google.com/p/ude/:
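The sample itself was lost in extraction; reconstructed from the ude project's documented API (Feed/DataEnd/Charset/Confidence are what the library exposes, but treat the exact snippet as a sketch):

using System;
using System.IO;
using Ude;

class UdeExample
{
    public static void Main(string[] args)
    {
        string filename = args[0];
        using (FileStream fs = File.OpenRead(filename))
        {
            CharsetDetector cdet = new CharsetDetector();
            cdet.Feed(fs);   // hand the detector the raw bytes
            cdet.DataEnd();  // signal end of input so it finalizes its guess
            if (cdet.Charset != null)
                Console.WriteLine("Charset: {0}, confidence: {1}", cdet.Charset, cdet.Confidence);
            else
                Console.WriteLine("Detection failed.");
        }
    }
}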
If someone is looking for a 93.9% solution, this works for me:
Notepad++ has this feature out-of-the-box. It also supports changing the encoding.
Looking for a different solution, I found that https://code.google.com/p/ude/ is kinda heavy.
I needed some basic encoding detection, based on the first 4 bytes and probably XML charset detection, so I took some sample source code from the internet and added a slightly modified version of http://lists.w3.org/Archives/Public/www-validator/2002Aug/0084.html (written for Java).
It's probably enough to read the first 1024 bytes from the file, but I'm loading the whole file.
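A minimal C# sketch of that first-4-bytes (BOM) check; this is my own reconstruction of the idea, not the poster's actual code:

using System.IO;
using System.Text;

static Encoding DetectFromBom(string path)
{
    byte[] bom = new byte[4]; // zero-filled, so short files are handled too
    using (FileStream fs = File.OpenRead(path))
        fs.Read(bom, 0, 4);

    if (bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF)
        return Encoding.UTF8;
    if (bom[0] == 0xFF && bom[1] == 0xFE && bom[2] == 0x00 && bom[3] == 0x00)
        return Encoding.UTF32;                    // UTF-32 little-endian
    if (bom[0] == 0xFF && bom[1] == 0xFE)
        return Encoding.Unicode;                  // UTF-16 little-endian
    if (bom[0] == 0xFE && bom[1] == 0xFF)
        return Encoding.BigEndianUnicode;         // UTF-16 big-endian
    if (bom[0] == 0x00 && bom[1] == 0x00 && bom[2] == 0xFE && bom[3] == 0xFF)
        return new UTF32Encoding(bigEndian: true, byteOrderMark: true); // UTF-32 BE
    return Encoding.Default;                      // no BOM: fall back to system ANSI
}

Note the UTF-32 LE check has to come before the UTF-16 LE one, because both start with FF FE.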
I've done something similar in Python. Basically, you need lots of sample data from various encodings, which you break down with a sliding two-byte window and store in a dictionary (hash), keyed on byte pairs, with lists of encodings as values.
Given that dictionary (hash), you take your input text and:
If you've also sampled UTF encoded texts that do not start with any BOM, the second step will cover those that slipped from the first step.
So far, it works for me (the sample data and subsequent input data are subtitles in various languages) with diminishing error rates.
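For illustration, here is how that byte-pair index could look; this is a fresh sketch in C# (the thread's main language) rather than the poster's Python, and all names are made up:

using System.Collections.Generic;
using System.Linq;

class BigramDetector
{
    // byte pair -> encodings whose training samples contained that pair
    private readonly Dictionary<(byte, byte), HashSet<string>> index =
        new Dictionary<(byte, byte), HashSet<string>>();

    public void Train(string encodingName, byte[] sample)
    {
        for (int i = 0; i + 1 < sample.Length; i++)
        {
            var pair = (sample[i], sample[i + 1]);
            if (!index.TryGetValue(pair, out var encodings))
                index[pair] = encodings = new HashSet<string>();
            encodings.Add(encodingName);
        }
    }

    // Score each encoding by how many of the input's byte pairs it has seen.
    public string Detect(byte[] input)
    {
        var scores = new Dictionary<string, int>();
        for (int i = 0; i + 1 < input.Length; i++)
            if (index.TryGetValue((input[i], input[i + 1]), out var encodings))
                foreach (string enc in encodings)
                    scores[enc] = scores.GetValueOrDefault(enc) + 1;
        return scores.Count == 0
            ? null
            : scores.OrderByDescending(kv => kv.Value).First().Key;
    }
}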
The tool "uchardet" does this well using character frequency distribution models for each charset. Larger files and more "typical" files have more confidence (obviously).
On Ubuntu, you just apt-get install uchardet. On other systems, get the source, usage & docs here: https://github.com/BYVoid/uchardet
If you can link to a C library, you can use libenca. See http://cihar.com/software/enca/. From the man page:
It's GPL v2.
Got the same problem, but I haven't found a good solution yet for detecting it automatically.
Now I'm using PsPad (www.pspad.com) for that ;) Works fine
The StreamReader class's constructor takes a 'detect encoding' parameter.
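For example, a minimal use of that parameter (the file name is hypothetical; this is BOM-based, so it only helps for Unicode files that actually start with a byte-order mark):

using System;
using System.IO;

using (var reader = new StreamReader("input.txt", detectEncodingFromByteOrderMarks: true))
{
    string text = reader.ReadToEnd();
    Console.WriteLine(reader.CurrentEncoding); // resolved once reading has started
}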
Open the file in AkelPad (or just copy/paste the garbled text), go to Edit -> Selection -> Recode... -> check "Autodetect".
Since it basically comes down to heuristics, it may help to use the encoding of previously received files from the same source as a first hint.
Most people (or applications) do stuff in pretty much the same order every time, often on the same machine, so it's quite likely that when Bob creates a .csv file and sends it to Mary, it'll always use Windows-1252 or whatever his machine defaults to.
Where possible, a bit of customer training never hurts either :-)
I was actually looking for a generic, non-programming way of detecting the file encoding, but I haven't found one yet.
What I did find by testing with different encodings was that my text was UTF-7.
So where I first was doing:
StreamReader file = File.OpenText(fullfilename);
I had to change it to:
StreamReader file = new StreamReader(fullfilename, System.Text.Encoding.UTF7);
OpenText assumes it's UTF-8.
You can also create the StreamReader like this:
new StreamReader(fullfilename, true); the second parameter means it should try to detect the encoding from the byte-order mark of the file, but that didn't work in my case.
Thanks @Erik Aronesty for mentioning uchardet. Meanwhile the (same?) tool exists for Linux: chardet. Or, on Cygwin, you may want to use chardetect. See the chardet man page: https://www.commandlinux.com/man-page/man1/chardetect.1.html
This will heuristically detect (guess) the character encoding for each given file and will report the name and confidence level for each file's detected character encoding.
As an add-on to ITmeze's post, I've used this function to convert the output of the C# port of the Mozilla Universal Charset Detector:
MSDN
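The function itself was not preserved; a hypothetical reconstruction of the idea (mapping the detector's reported charset name to a .NET Encoding; the helper name is made up):

using System;
using System.Text;

static Encoding ToDotNetEncoding(string detectedCharsetName)
{
    try
    {
        // e.g. "windows-1252", "ISO-8859-1", "UTF-8"
        return Encoding.GetEncoding(detectedCharsetName);
    }
    catch (ArgumentException)
    {
        return Encoding.Default; // .NET doesn't recognize the reported name
    }
}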
Try to install the Perl module Text::Unaccent::PurePerl by typing cpanm Text::Unaccent. This generates a build.log file that displays as Chinese in some applications and as English in others ("cpanm" is the initial text). A plausible attempt, should you be lucky enough to have spaces in the language, is to compare the distribution frequency of words via a statistical test.
I use this code to detect Unicode and the Windows default ANSI codepage when reading a file. For other encodings, a check of the content is necessary, manually or by programming. This can be used to save the text with the same encoding as when it was opened. (I use VB.NET)
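The poster's VB.NET code wasn't preserved; here is a C# sketch of the same round-trip idea (detect via BOM with an ANSI fallback, then save with whatever encoding the file was opened with; the file name is made up):

using System.IO;
using System.Text;

string path = "input.txt";
string text;
Encoding detected;

// Encoding.Default is the system ANSI codepage on .NET Framework (UTF-8 on .NET Core).
using (var reader = new StreamReader(path, Encoding.Default,
                                     detectEncodingFromByteOrderMarks: true))
{
    text = reader.ReadToEnd();         // reading triggers BOM detection
    detected = reader.CurrentEncoding; // a Unicode encoding if a BOM was found, else the fallback
}

File.WriteAllText(path, text, detected); // write back with the same encoding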
10 years (!) have passed since this was asked, and still I see no mention of MS's good, non-GPL'ed solution: the IMultiLanguage2 API.
Most libraries already mentioned are based on Mozilla's UDE, and it seems reasonable that browsers have already tackled similar problems. I don't know what Chrome's solution is, but MS released theirs back in IE 5.0, and it is the IMultiLanguage2 API mentioned above.
It is a native COM call, but here's some very nice work by Carsten Zeumer that handles the interop mess for .NET usage. There are some others around, but by and large this library doesn't get the attention it deserves.