How to determine the encoding table of a text file
I have .txt and .java files, and I don't know how to determine the encoding table of the files (Unicode, UTF-8, ISO-8859, …). Is there a program that can detect or display a file's encoding?
Comments (7)
If you're on Linux, try
file -i filename.txt
Some versions of the file command (e.g. file-5.04 on OS X/macOS) have slightly different command-line switches; there the equivalent is
file -I filename.txt
Open the file with Notepad++ and you will see the name of the encoding in the lower-right corner. In the Encoding menu you can change the encoding and save the file.
You can't reliably detect the encoding of a text file. What you can do is make an educated guess: search for a non-ASCII character and try to determine whether it is a Unicode combination that makes sense in the languages you are parsing.
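A minimal sketch of that educated guess, in Python. The function names and the three labels it returns are illustrative assumptions, not part of any library. It only separates pure ASCII, valid UTF-8, and "some other 8-bit encoding"; it does not try to judge whether the decoded text makes sense in a given language.

```python
def first_non_ascii(data: bytes) -> int:
    """Return the offset of the first non-ASCII byte, or -1 if there is none."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            return offset
    return -1


def classify_bytes(data: bytes) -> str:
    """Return a rough guess: 'ascii', 'utf-8', or 'unknown-8bit'."""
    # Pure ASCII contains no byte >= 0x80, so there is nothing to guess from.
    if first_non_ascii(data) == -1:
        return "ascii"
    # Non-ASCII bytes are present. UTF-8 multi-byte sequences follow a strict
    # pattern, so text in a legacy 8-bit encoding rarely decodes cleanly.
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "unknown-8bit"
    return "utf-8"
```

A result of "unknown-8bit" only rules UTF-8 out; deciding which 8-bit encoding it is still needs knowledge of the expected language.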
See this question and the selected answer. There's no sure-fire way of doing it. At most, you can rule things out. You're unlikely to get false positives with the UTF encodings, but the 8-bit encodings are tough, especially if you don't know the language of the text. No tool out there currently handles all the common 8-bit encodings from Macs, Windows, and Unix, but the selected answer provides an algorithmic approach that should work adequately for a certain subset of encodings.
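To illustrate why 8-bit encodings are hard to tell apart, here is a small Python sketch (this is not the algorithm from the linked answer). The sample bytes and the candidate list are assumptions chosen for illustration. Every single-byte candidate accepts the same bytes without error and simply yields different characters, while strict UTF-8 decoding rejects them.

```python
# One non-ASCII byte, 0xE9, after three ASCII letters.
SAMPLE = b"caf\xe9"

# A few common single-byte encodings; illustrative, not exhaustive.
CANDIDATES = ["latin-1", "cp1252", "cp1251", "mac_roman"]


def decode_with_each(data: bytes, encodings: list) -> dict:
    """Map each encoding that accepts the data to the text it produces."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding is ruled out for the given bytes
    return results


if __name__ == "__main__":
    for name, text in decode_with_each(SAMPLE, CANDIDATES).items():
        print(f"{name:10} {text!r}")
    # UTF-8 can be ruled out: the result contains no entry for it.
    print(decode_with_each(SAMPLE, ["utf-8"]))
```

Because none of the 8-bit decodings fail, only a human reader or a language model of some kind can say which result is the intended text.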
If you are using Python, the chardet package is a good option: it reports its best guess for the encoding of the bytes you give it, together with a confidence value.
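A minimal sketch of how chardet might be called. The helper name guess_encoding and the file name in the usage note are hypothetical; chardet is a third-party package (pip install chardet), and chardet.detect takes raw bytes and returns a dictionary that includes 'encoding' and 'confidence' keys. The helper returns None when chardet is not installed.

```python
try:
    import chardet  # third-party package, not in the standard library
except ImportError:
    chardet = None


def guess_encoding(path):
    """Return chardet's guess for the file at path, or None if chardet is missing."""
    if chardet is None:
        return None
    # Read raw bytes; opening in text mode would already assume an encoding.
    with open(path, "rb") as handle:
        raw = handle.read()
    return chardet.detect(raw)
```

Calling guess_encoding("myfile.txt") returns a dictionary whose 'encoding' entry is a guess, not a guarantee, so a low 'confidence' value is a signal to check the result by eye.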
A text file has no header that records its encoding. You can try the Linux/Unix command file, which tries to guess the encoding, using file -i (or file -I on some systems). But that often gives you
text/plain; charset=iso-8859-1
although the file is unreadable (cryptic glyphs).
This is what I did, after installing iconv, to find the correct encoding of an unreadable file and convert it to UTF-8. First I tried all encodings, using grep to display a line that contained the string www. (a website address). That command line shows each tested encoding followed by the transcoded line.
Some lines showed readable and consistent (one language at a time) results. I tried some of those encodings manually.
In my case it was a Chinese Windows encoding, and the file is now readable (if you know Chinese).
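The same try-everything idea can be sketched in Python using the built-in codecs instead of iconv and grep; this is a substitute technique, not the answer's original command line. The function name, the candidate list, and the sample text are assumptions; gbk is used here as a stand-in for the unnamed Chinese Windows encoding.

```python
# Candidate encodings to try; illustrative, extend as needed.
CANDIDATES = ["utf-8", "gbk", "big5", "shift_jis", "cp1252", "latin-1"]


def try_encodings(data: bytes, encodings, marker: str = "www"):
    """Yield (encoding, line) for each candidate that decodes the data
    without error and produces a line containing the marker."""
    for name in encodings:
        try:
            text = data.decode(name)
        except (UnicodeDecodeError, LookupError):
            continue  # undecodable bytes or unknown codec name: skip it
        for line in text.splitlines():
            if marker in line:
                yield name, line
                break  # one matching line per encoding is enough to compare


if __name__ == "__main__":
    # Stand-in for reading the unreadable file in binary mode.
    data = "网站 www.example.com\n".encode("gbk")
    for name, line in try_encodings(data, CANDIDATES):
        print(f"{name:10} {line}")
```

As in the answer, several encodings will usually decode without error; the readable, single-language line is the one to pick by eye.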
This question is 10 years old as I write this, and the answer is still "No", at least not reliably. Unfortunately, there has not been much improvement. My recent experience suggests the
file -I
command is very much hit-or-miss. For example, I checked a text file, somefile.asc, on macOS 10.15.6. All characters in it were encoded in UTF-16 Little Endian. How did I know this? I used BBEdit, a competent text editor. Determining the encoding used in a file is certainly a tough problem, but...?
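For UTF-16 in particular, two cheap checks often work: a byte order mark at the start of the file, and, for mostly-ASCII text, a NUL byte in every other position. The sketch below is a heuristic under those assumptions, with a hypothetical function name and an arbitrary 80% threshold; it is not how file or BBEdit decide.

```python
import codecs


def sniff_utf16(data: bytes):
    """Return 'utf-16-le', 'utf-16-be', or None, based on a BOM or NUL layout."""
    # A byte order mark is the strongest hint. Note that the UTF-32 LE BOM
    # begins with the same two bytes, which this sketch does not distinguish.
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    # No BOM: mostly-ASCII text in UTF-16 has a NUL in every other byte.
    sample = data[:4096]
    if len(sample) < 4:
        return None
    even_nuls = sample[0::2].count(0)
    odd_nuls = sample[1::2].count(0)
    half = len(sample) // 2
    if odd_nuls > 0.8 * half and even_nuls == 0:
        return "utf-16-le"  # low byte first, high (zero) byte second
    if even_nuls > 0.8 * half and odd_nuls == 0:
        return "utf-16-be"  # high (zero) byte first
    return None
```

The NUL heuristic fails for UTF-16 text that is mostly non-Latin, since those code units rarely contain a zero byte.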