What does ISO-8859 mean in the output of "file"?

Posted 2025-01-03 01:30:03

I ran the following command in a software repository I have access to:

find . -not -name ".svn" -type f -exec file "{}" \;

and saw many output lines like

./File.java: ISO-8859 C++ program text

What does that mean? ISO-8859 is a family of encodings, not one specific encoding. I expected all files to be UTF-8, but most are reported as ISO-8859. Is ISO-8859 a proper subset of UTF-8, too?

Can I safely convert all those files to UTF-8, for example with iconv, using ISO-8859-1 as the source encoding?

Comments (2)

笑咖 2025-01-10 01:30:03

I am afraid that the Unix file program is rather bad at this. The result just means the file is in some single-byte encoding. It does not mean that it is ISO-8859-1. It might even be in a non-ISO byte encoding, although file usually figures that out.

I have a system that does much better than file, but it is trained on an English-language corpus, so it might not do as well on German.

The short answer is that the result of file is not reliable. You have to know the real encoding to up-convert it.
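
To illustrate why the real encoding matters, here is a minimal Python sketch (not part of the original answer). The helper name to_utf8 and the sample bytes are made up for illustration. Byte 0xA4 is the currency sign in ISO-8859-1 but the euro sign in ISO-8859-15, so the same input converts to different UTF-8 output depending on which source encoding you assume. Both conversions succeed without any error, which is why a wrong guess goes unnoticed.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode raw bytes from an assumed source encoding to UTF-8.

    Hypothetical helper for illustration; it does the same job as
    running iconv with an explicit source encoding.
    """
    return raw.decode(source_encoding).encode("utf-8")


# Made-up sample: byte 0xA4 differs between ISO-8859-1 and ISO-8859-15.
sample = b"price: 5 \xa4"

as_latin1 = to_utf8(sample, "iso-8859-1")
as_latin9 = to_utf8(sample, "iso-8859-15")

print(as_latin1.decode("utf-8"))  # ends with the currency sign U+00A4
print(as_latin9.decode("utf-8"))  # ends with the euro sign U+20AC
print(as_latin1 == as_latin9)     # the two conversions disagree
```

Decoding as ISO-8859-1 never fails because every byte value maps to some character, so a clean run says nothing about whether the guess was right.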

短暂陪伴 2025-01-10 01:30:03

The charset detection used by file is rather simplistic. It recognizes UTF-8. It distinguishes between "ISO-8859" and "non-ISO extended-ASCII" by looking for bytes in the 0x80-0x9F range, where the ISO 8859 encodings have "holes". But it makes no attempt to determine which ISO 8859 encoding is in use, which is why it just says ISO-8859 instead of ISO-8859-1 or ISO-8859-15.
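
The heuristic described above can be sketched roughly as follows in Python. This is an approximation under stated assumptions, not the actual implementation in file: the function name classify and the exact order of the checks are my own, and the real program has more cases. It also shows why ISO-8859 is not a subset of UTF-8: a lone byte such as 0xE9 (é in ISO-8859-1) is typically not a valid UTF-8 sequence.

```python
def classify(raw: bytes) -> str:
    """Rough approximation of the charset labels described above.

    Hypothetical sketch; the real `file` program is more involved.
    """
    # Only 7-bit bytes: plain ASCII, which is valid in every encoding here.
    if all(b < 0x80 for b in raw):
        return "ASCII"

    # Bytes that form well-formed UTF-8 sequences are reported as UTF-8.
    try:
        raw.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        pass

    # ISO 8859 encodings define no printable characters in 0x80-0x9F,
    # so a byte in that range suggests some other 8-bit encoding.
    if any(0x80 <= b <= 0x9F for b in raw):
        return "non-ISO extended-ASCII"

    # High bytes only in 0xA0-0xFF: some ISO 8859 variant, but which
    # one cannot be determined from the byte values alone.
    return "ISO-8859"


print(classify(b"int x;"))              # ASCII
print(classify("é".encode("utf-8")))    # UTF-8
print(classify(b"caf\xe9"))             # ISO-8859
print(classify(b"\x93quoted\x94"))      # non-ISO extended-ASCII
```

The last branch is the point: nothing in the bytes tells ISO-8859-1 apart from ISO-8859-15 or any other part of the standard, so the label stays generic.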
