How to determine the encoding of a text file

Published 2024-10-04 16:09:36 · 119 characters · 4 views

I have .txt and .java files and I don't know how to determine the encoding table of the files (Unicode, UTF-8, ISO-8859, …). Does there exist any program to determine the file encoding or to see the encoding?

Comments (7)

彩虹直至黑白 2024-10-11 16:09:36

If you're on Linux, try file -i filename.txt.

$ file -i vol34.tex 
vol34.tex: text/x-tex; charset=us-ascii

For reference, here is my environment:

$ which file
/usr/bin/file
$ file --version
file-5.09
magic file from /etc/magic:/usr/share/misc/magic

Some file versions (e.g. file-5.04 on OS X/macOS) have slightly different command-line switches:

$ file -I vol34.tex 
vol34.tex: text/x-tex; charset=us-ascii
$ file --mime vol34.tex
vol34.tex: text/x-tex; charset=us-ascii

Also, have a look here.

尐籹人 2024-10-11 16:09:36

Open the file with Notepad++; you will see the name of the encoding in the bottom-right corner. In the Encoding menu you can change the encoding and save the file.

呢古 2024-10-11 16:09:36

You can't reliably detect the encoding of a text file. What you can do is make an educated guess: search for a non-ASCII character and try to determine whether it is a Unicode combination that makes sense in the language you are parsing.
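
A minimal Python sketch of that educated-guess approach (the candidate list here is an illustrative assumption, not part of the answer):

```python
# Locate non-ASCII content and see which candidate encodings decode it
# cleanly. Note that ISO-8859-1 maps every byte to a character, so it
# always survives; the result is a shortlist to inspect, not an answer.
CANDIDATES = ["utf-8", "utf-16", "cp1252", "iso-8859-1"]  # illustrative

def plausible_encodings(data: bytes) -> list:
    if all(b < 0x80 for b in data):
        return ["ascii"]  # pure ASCII: any ASCII-compatible encoding fits
    hits = []
    for enc in CANDIDATES:
        try:
            data.decode(enc)
            hits.append(enc)
        except UnicodeDecodeError:
            pass
    return hits

print(plausible_encodings("café".encode("utf-8")))
# -> ['utf-8', 'cp1252', 'iso-8859-1']
```

Several encodings typically remain plausible, which is exactly why this can only ever be a guess.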

夏了南城 2024-10-11 16:09:36

See this question and the selected answer. There's no sure-fire way of doing it; at most, you can rule things out. You're unlikely to get false positives on the UTF encodings, but the 8-bit encodings are tough, especially if you don't know the starting language. No tool out there currently handles all the common 8-bit encodings from Mac, Windows, and Unix systems, but the selected answer provides an algorithmic approach that should work adequately for a certain subset of encodings.
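
The "rule things out" point can be illustrated with a short sketch (mine, not from the selected answer): a strict UTF-8 decode failure definitively excludes UTF-8, while an 8-bit encoding such as ISO-8859-1 accepts every byte sequence and so can never be excluded this way.

```python
def could_be_utf8(data: bytes) -> bool:
    # A strict decode fails on most non-UTF-8 byte sequences, so a
    # failure rules UTF-8 out; success only means "consistent with UTF-8".
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(could_be_utf8("naïve".encode("cp1252")))  # False: lone 0xEF is invalid UTF-8
print(could_be_utf8("naïve".encode("utf-8")))   # True
```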

凡间太子 2024-10-11 16:09:36

If you are using Python, the chardet package is a good option. For example:

from chardet.universaldetector import UniversalDetector

files = ['a-1.txt','a-2.txt']

detector = UniversalDetector()
for filename in files:
    print(filename.ljust(20), end='')
    detector.reset()
    with open(filename, 'rb') as f:  # binary mode; close the file when done
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    print(detector.result)

gives me as a result:

a-1.txt   {'encoding': 'Windows-1252', 'confidence': 0.7255358182877111, 'language': ''}
a-2.txt   {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

怎樣才叫好 2024-10-11 16:09:36

A text file carries no header that records its encoding. You can try the Linux/Unix command file, which tries to guess the encoding:

file -i unreadablefile.txt

or on some systems

file -I unreadablefile.txt

But that often gives you text/plain; charset=iso-8859-1 even though the file is unreadable (cryptic glyphs).

This is what I did, after installing iconv, to find the correct encoding for an unreadable file and then convert it to UTF-8. First I tried all encodings, displaying (grep) a line that contained the word www. (a website address):

for ENCODING in $(iconv -l); do echo -n "$ENCODING "; iconv -f $ENCODING -t utf-8 unreadablefile.txt 2>/dev/null| grep 'www'; done | less

This command line shows the tested file encoding followed by the transcoded line.

Some lines showed readable and consistent (one language at a time) results. I tried some of them manually, for example:

ENCODING=WINDOWS-936; iconv -f $ENCODING -t utf-8 unreadablefile.txt -o test_with_${ENCODING}.txt

In my case it was a Chinese Windows encoding (code page 936), and the file is now readable (if you know Chinese).
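
The same brute-force loop can be sketched in Python (the sample bytes and candidate list below are illustrative, not from the answer):

```python
# Decode the raw bytes with each candidate codec and keep those under
# which a known substring ("www", as in the iconv loop above) appears.
def candidates_containing(data: bytes, needle: str, encodings) -> list:
    hits = []
    for enc in encodings:
        try:
            if needle in data.decode(enc):
                hits.append(enc)
        except (UnicodeDecodeError, LookupError):
            pass
    return hits

raw = "见 www.example.com".encode("gbk")  # illustrative: code page 936 text
print(candidates_containing(raw, "www", ["utf-8", "gbk", "big5", "iso-8859-1"]))
```

Several encodings usually survive (ISO-8859-1 always does), so, as with the iconv loop, you still have to eyeball which decoding is actually readable.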

煞人兵器 2024-10-11 16:09:36

Does there exist any program to determine the file encoding or to see the encoding?

This question is 10 years old as I write this, and the answer is still "No" - at least not reliably. Unfortunately, there hasn't been much improvement. My recent experience suggests the file -I command is very much hit-or-miss. For example, when checking a text file on macOS 10.15.6:

% file -i somefile.asc
somefile.asc: application/octet-stream; charset=binary

somefile.asc was a text file. All characters in it were encoded in UTF-16 Little Endian. How did I know this? I used BBEdit, a competent text editor. Determining the encoding used in a file is certainly a tough problem, but...?
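
One signal that is reliable, when present, is a byte-order mark. A minimal sketch (the BOM table is standard; the sample data is illustrative, and BOM-less files like many UTF-16 ones still need guessing):

```python
import codecs

# Recognise an encoding from its byte-order mark (BOM), one of the few
# unambiguous encoding markers a text file can carry.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None  # no BOM: fall back to heuristics

sample = codecs.BOM_UTF16_LE + "some text".encode("utf-16-le")
print(sniff_bom(sample))  # -> utf-16-le
```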
