What does ISO-8859 mean in the output of file?
I ran the following command in a software repository I have access to:
find . -not -name ".svn" -type f -exec file "{}" \;
and saw many output lines like
./File.java: ISO-8859 C++ program text
What does that mean? ISO-8859 is a family of encodings, not one specific encoding. I expected all files to be UTF-8, but most are reported with this encoding. Is ISO-8859 a proper subset of UTF-8, too?
Would it be safe to convert all those files to UTF-8, for example with iconv, using ISO-8859-1 as the source encoding?
I am afraid that the Unix file program is rather bad at this. The result just means the file is in some single-byte encoding. It does not mean that it is ISO-8859-1. It might even be in a non-ISO byte encoding, although file usually figures that out. I have a system that does much better than file, but it is trained on an English-language corpus, so it might not do as well on German.
The short answer is that the result of file is not reliable. You have to know the real encoding to up-convert it.
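To illustrate why the real encoding has to be known, here is a minimal Python sketch. The helper name to_utf8 and the sample bytes are made up for this example; they are not from the question. It leaves input that is already valid UTF-8 untouched and otherwise decodes with an encoding the caller supplies.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode data as UTF-8, given an encoding the caller already knows.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8 (this includes plain ASCII)
    except UnicodeDecodeError:
        # Not UTF-8: trust the caller's claim about the source encoding.
        return data.decode(source_encoding).encode("utf-8")


# The byte 0xA4 is the currency sign in ISO-8859-1
# but the euro sign in ISO-8859-15.
raw = b"price: 5 \xa4"
as_latin1 = to_utf8(raw, "iso-8859-1").decode("utf-8")
as_latin9 = to_utf8(raw, "iso-8859-15").decode("utf-8")
print(as_latin1 == as_latin9)  # the two guesses disagree
```

Decoding as ISO-8859-1 succeeds for any byte sequence, so a wrong guess produces no error, only wrong characters. That is why the conversion is "safe" only in the sense that it never fails.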
The charset detection used by file is rather simplistic. It recognizes UTF-8. And it distinguishes between "ISO-8859" and "non-ISO extended-ASCII" by looking for bytes in the 0x80-0x9F range, where the ISO 8859 encodings have "holes". But it makes no attempt to determine which ISO 8859 encoding is in use. That is why it just says ISO-8859 instead of ISO-8859-1 or ISO-8859-15.
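The distinction described above can be sketched roughly in Python. This is a simplified imitation for illustration, not the actual source code of file: the function name classify is made up, and the real tool applies further checks (control characters, for instance) that are omitted here.

```python
def classify(data: bytes) -> str:
    """Rough imitation of the text classification described above."""
    if all(b < 0x80 for b in data):
        return "ASCII"
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        pass
    # The ISO 8859 encodings assign no printable characters to 0x80-0x9F,
    # so a byte in that range suggests some other 8-bit encoding.
    if any(0x80 <= b <= 0x9F for b in data):
        return "non-ISO extended-ASCII"
    # Only bytes in 0xA0-0xFF remain: consistent with every ISO 8859 part,
    # so the specific part cannot be told apart from the bytes alone.
    return "ISO-8859"


print(classify("Grüße".encode("utf-8")))
print(classify("Grüße".encode("iso-8859-1")))
print(classify("€uro".encode("cp1252")))
```

Note that "Grüße" encodes to the same bytes in ISO-8859-1 and ISO-8859-15, so a byte-level check like this has no basis for choosing between them.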