How to determine the encoding of a text file

Published 2024-10-04 16:09:36 · 119 characters · 4 views

I have .txt and .java files and I don't know how to determine the encoding table of the files (Unicode, UTF-8, ISO-8859, …). Does there exist any program to determine the file encoding or to see the encoding?

Comments (7)

彩虹直至黑白 2024-10-11 16:09:36

If you're on Linux, try file -i filename.txt.

$ file -i vol34.tex 
vol34.tex: text/x-tex; charset=us-ascii

For reference, here is my environment:

$ which file
/usr/bin/file
$ file --version
file-5.09
magic file from /etc/magic:/usr/share/misc/magic

Some file versions (e.g. file-5.04 on OS X/macOS) have slightly different command-line switches:

$ file -I vol34.tex 
vol34.tex: text/x-tex; charset=us-ascii
$ file --mime vol34.tex
vol34.tex: text/x-tex; charset=us-ascii

Also, have a look here.

尐籹人 2024-10-11 16:09:36

Open the file with Notepad++; you will see the name of the encoding in the bottom-right corner. In the Encoding menu you can change the encoding and save the file.

呢古 2024-10-11 16:09:36

You can't reliably detect the encoding of a text file. What you can do is make an educated guess: search for a non-ASCII character and try to determine whether it is a Unicode combination that makes sense in the language you are parsing.
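
A minimal Python sketch of that educated-guess approach (the candidate list here is an illustrative assumption, not part of the answer):

```python
# Locate non-ASCII content and see which candidate encodings decode it
# cleanly. Note that ISO-8859-1 maps every byte to a character, so it
# always survives; the result is a shortlist to inspect, not an answer.
CANDIDATES = ["utf-8", "utf-16", "cp1252", "iso-8859-1"]  # illustrative

def plausible_encodings(data: bytes) -> list:
    if all(b < 0x80 for b in data):
        return ["ascii"]  # pure ASCII: any ASCII-compatible encoding fits
    hits = []
    for enc in CANDIDATES:
        try:
            data.decode(enc)
            hits.append(enc)
        except UnicodeDecodeError:
            pass
    return hits

print(plausible_encodings("café".encode("utf-8")))
# -> ['utf-8', 'cp1252', 'iso-8859-1']
```

Several encodings typically remain plausible, which is exactly why this can only ever be a guess.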

夏了南城 2024-10-11 16:09:36

See this question and the selected answer. There's no sure-fire way of doing it; at most, you can rule things out. You're unlikely to get false positives on the UTF encodings, but the 8-bit encodings are tough, especially if you don't know the starting language. No tool out there currently handles all the common 8-bit encodings from Mac, Windows, and Unix systems, but the selected answer provides an algorithmic approach that should work adequately for a certain subset of encodings.
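
The "rule things out" point can be illustrated with a short sketch (mine, not from the selected answer): a strict UTF-8 decode failure definitively excludes UTF-8, while an 8-bit encoding such as ISO-8859-1 accepts every byte sequence and so can never be excluded this way.

```python
def could_be_utf8(data: bytes) -> bool:
    # A strict decode fails on most non-UTF-8 byte sequences, so a
    # failure rules UTF-8 out; success only means "consistent with UTF-8".
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(could_be_utf8("naïve".encode("cp1252")))  # False: lone 0xEF is invalid UTF-8
print(could_be_utf8("naïve".encode("utf-8")))   # True
```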

凡间太子 2024-10-11 16:09:36

If you are using Python, the chardet package is a good option. For example:

from chardet.universaldetector import UniversalDetector

files = ['a-1.txt','a-2.txt']

detector = UniversalDetector()
for filename in files:
    print(filename.ljust(20), end='')
    detector.reset()
    with open(filename, 'rb') as f:  # binary mode; close the file when done
        for line in f:
            detector.feed(line)
            if detector.done:
                break
    detector.close()
    print(detector.result)

gives me as a result:

a-1.txt   {'encoding': 'Windows-1252', 'confidence': 0.7255358182877111, 'language': ''}
a-2.txt   {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

怎樣才叫好 2024-10-11 16:09:36

A text file carries no header that records its encoding. You can try the Linux/Unix command file, which tries to guess the encoding:

file -i unreadablefile.txt

or on some systems

file -I unreadablefile.txt

But that often gives you text/plain; charset=iso-8859-1 even though the file is unreadable (cryptic glyphs).

This is what I did, after installing iconv, to find the correct encoding for an unreadable file and then convert it to UTF-8. First I tried all encodings, displaying (grep) a line that contained the word www. (a website address):

for ENCODING in $(iconv -l); do echo -n "$ENCODING "; iconv -f $ENCODING -t utf-8 unreadablefile.txt 2>/dev/null| grep 'www'; done | less

This command line shows the tested file encoding followed by the transcoded line.

Some lines showed readable and consistent (one language at a time) results. I tried some of them manually, for example:

ENCODING=WINDOWS-936; iconv -f $ENCODING -t utf-8 unreadablefile.txt -o test_with_${ENCODING}.txt

In my case it was a Chinese Windows encoding (code page 936), and the file is now readable (if you know Chinese).
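
The same brute-force loop can be sketched in Python (the sample bytes and candidate list below are illustrative, not from the answer):

```python
# Decode the raw bytes with each candidate codec and keep those under
# which a known substring ("www", as in the iconv loop above) appears.
def candidates_containing(data: bytes, needle: str, encodings) -> list:
    hits = []
    for enc in encodings:
        try:
            if needle in data.decode(enc):
                hits.append(enc)
        except (UnicodeDecodeError, LookupError):
            pass
    return hits

raw = "见 www.example.com".encode("gbk")  # illustrative: code page 936 text
print(candidates_containing(raw, "www", ["utf-8", "gbk", "big5", "iso-8859-1"]))
```

Several encodings usually survive (ISO-8859-1 always does), so, as with the iconv loop, you still have to eyeball which decoding is actually readable.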

煞人兵器 2024-10-11 16:09:36

Does there exist any program to determine the file encoding or to see the encoding?

This question is 10 years old as I write this, and the answer is still "No" - at least not reliably. Unfortunately, there hasn't been much improvement. My recent experience suggests the file -I command is very much hit-or-miss. For example, when checking a text file on macOS 10.15.6:

% file -i somefile.asc
somefile.asc: application/octet-stream; charset=binary

somefile.asc was a text file. All characters in it were encoded in UTF-16 Little Endian. How did I know this? I used BBEdit, a competent text editor. Determining the encoding used in a file is certainly a tough problem, but...?
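
One signal that is reliable, when present, is a byte-order mark. A minimal sketch (the BOM table is standard; the sample data is illustrative, and BOM-less files like many UTF-16 ones still need guessing):

```python
import codecs

# Recognise an encoding from its byte-order mark (BOM), one of the few
# unambiguous encoding markers a text file can carry.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None  # no BOM: fall back to heuristics

sample = codecs.BOM_UTF16_LE + "some text".encode("utf-16-le")
print(sniff_bom(sample))  # -> utf-16-le
```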
