Why does VIM ignore my file's BOM?

Posted on 2024-12-02 02:20:45

I need to make sure that a file is encoded as UTF-8.

So, I create the file

c:\> gvim umlaute.txt

In VIM I type the umlauts:

äöü

I check the encoding ...

:set enc

(VIM echoes encoding=latin1)

and then I check the file encoding ...

:set fenc

(VIM echoes fileencoding=)

Then I write the file

:w

And check the file's size on the hard disk:

:!dir umlaute.txt

(The size is 5 bytes.) That is of course expected: 3 bytes for the text and 2 for the \x0d \x0a.

Ok, so I now set the encoding to

:set enc=utf8

The buffer gets weird

<e4><f6><fc>

I guess this is the hex representation of the ASCII characters I previously typed in. So I retype them

äöü

Writing, checking size:

:w
:!dir umlaute.txt

This time, it's 8 bytes. I guess that makes sense: 2 bytes for every character plus \x0d \x0a.

OK, so I want to make sure that the next time I open the file, it will be opened with encoding=utf8.

:set bomb
:w

:!dir umlaute.txt

11 bytes. This is of course the 8 previous bytes + 3 bytes for the BOM (ef bb bf).
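The three sizes reported in this walkthrough (5, 8, and 11 bytes) can be reproduced outside Vim. The following is a minimal Python sketch, assuming the file holds exactly the three umlauts followed by one CR LF line ending; the variable names are illustrative only.

```python
# The text typed in the question, followed by a Windows line ending (CR LF).
# "\u00e4\u00f6\u00fc" is the escaped form of the three umlauts.
text = "\u00e4\u00f6\u00fc\r\n"

latin1_bytes = text.encode("latin-1")      # one byte per umlaut
utf8_bytes = text.encode("utf-8")          # two bytes per umlaut
utf8_bom_bytes = text.encode("utf-8-sig")  # UTF-8 preceded by the BOM EF BB BF

sizes = (len(latin1_bytes), len(utf8_bytes), len(utf8_bom_bytes))
print(sizes)
```

The `utf-8-sig` codec is Python's name for "UTF-8 with a BOM", which roughly corresponds to writing a file from Vim with fileencoding=utf-8 and bomb set.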

So I

:quit

Vim, open the file again,

and check whether the encoding is set:

:set enc

But VIM insists on encoding=latin1.

So, why is that? I would have expected the BOM to tell VIM that this is a UTF-8 file.

Comments (3)

郁金香雨 2024-12-09 02:20:45

You are confusing 'encoding' which is a Vim global setting, and 'fileencoding', which is a local setting to each buffer.

When opening a file, the variable 'fileencodings' (note the final s) determines what encodings Vim will try to open the file with. If it starts with ucs-bom then any file with a BOM will be properly opened if it parses correctly.
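As a rough analogy for what a 'fileencodings' list starting with ucs-bom does, here is a minimal Python sketch. It is not Vim's implementation: the function name `detect`, the table `BOMS`, and the `fallbacks` parameter are hypothetical names chosen for illustration. It checks for a BOM first and, if none is found, tries each fallback encoding in order until one decodes cleanly.

```python
import codecs

# BOM signatures; UTF-32 LE is listed before UTF-16 LE because its BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect(data, fallbacks=("utf-8", "latin-1")):
    """Return (encoding, has_bom, text) for the raw bytes of a file."""
    # Step 1: the "ucs-bom" entry. A BOM decides the encoding, provided
    # the rest of the data actually decodes in that encoding.
    for bom, name in BOMS:
        if data.startswith(bom):
            try:
                return name, True, data[len(bom):].decode(name)
            except UnicodeDecodeError:
                break  # BOM present but content invalid: use the fallbacks
    # Step 2: the remaining entries, tried in order.
    for name in fallbacks:
        try:
            return name, False, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits the data")


with_bom = detect(b"\xef\xbb\xbf\xc3\xa4\xc3\xb6\xc3\xbc\r\n")
without_bom = detect(b"\xe4\xf6\xfc\r\n")
print(with_bom[:2], without_bom[:2])
```

The second call shows why order matters: the latin1 bytes are not valid UTF-8, so the sketch falls through to the next candidate, much as Vim moves on to the next 'fileencodings' entry.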

If you want to change the encoding of a file, use :set fenc=<foo>. To add or remove the BOM, use :set bomb or :set nobomb. Then use :w to save.

Avoid changing enc after a buffer has been opened; it can mess things up. enc determines which characters Vim can work with internally, and it has nothing to do with the files you are working on.

Details

c:\> gvim umlaute.txt

You are opening vim, with a nonexistent file name. Vim creates a buffer, gives it that name, and sets fenc to an empty value since there is no file associated with it.

:set enc

(VIM echoes encoding=latin1)

This means that Vim stores the buffer contents in ISO-8859-1 (or possibly another ISO-8859 variant).

and then I check the file encoding ...

:set fenc

(VIM echoes fileencoding=)

This is normal; there is no file yet.

Then I write the file

:w

Since 'fileencoding' is empty, it will write it to the disk using the internal encoding, latin1.

And check the file's size on the hard disk:

:!dir umlaute.txt

(The size is 5 bytes.) That is of course expected: 3 bytes for the text and 2 for the \x0d \x0a.

Ok, so I now set the encoding to

:set enc=utf8

WRONG! You are telling Vim that it must interpret the buffer contents as UTF-8. The buffer contains, in hexadecimal, e4 f6 fc 0d 0a, and the first three bytes are not a valid UTF-8 sequence. You should have typed :set fenc=utf-8, which would have converted the buffer.
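The difference between reinterpreting bytes and converting them can be illustrated with a short Python sketch. It assumes the three bytes named above and only mirrors the idea; it says nothing about how Vim performs the conversion internally.

```python
latin1_bytes = b"\xe4\xf6\xfc"  # the three umlauts as stored with latin1

# Reinterpreting the same bytes as UTF-8 fails: E4 announces a multi-byte
# sequence, but F6 is not a valid continuation byte.
try:
    latin1_bytes.decode("utf-8")
    reinterpret_ok = True
except UnicodeDecodeError:
    reinterpret_ok = False

# Converting keeps the characters and changes the bytes instead.
converted = latin1_bytes.decode("latin-1").encode("utf-8")
print(reinterpret_ok, converted.hex())
```

Reinterpretation is roughly what changing 'encoding' on a live buffer amounts to, while conversion is what changing 'fileencoding' asks for at write time.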

The buffer gets weird

That's what happens when you force Vim to interpret bytes that are not valid UTF-8 as UTF-8.

I guess this is the hex representation of the ASCII characters I previously typed in. So I retype them

äöü

Writing, checking size:

:w
:!dir umlaute.txt

This time, it's 8 bytes. I guess that makes sense: 2 bytes for every character plus \x0d \x0a.

OK, so I want to make sure that the next time I open the file, it will be opened with encoding=utf8.

:set bomb
:w

:!dir umlaute.txt

11 bytes. This is of course the 8 previous bytes + 3 bytes for the BOM (ef bb bf).

So I

:quit

Vim, open the file again,

and check whether the encoding is set:

:set enc

But VIM insists on encoding=latin1.

You should run :set fenc? to see which encoding was detected for your file. And if you want Vim to be able to work with Unicode files, set 'enc' to utf-8 in your vimrc.

太阳男子 2024-12-09 02:20:45

After many attempts, here is a working example:

    setglobal bomb 
    set fileencodings=ucs-bom,utf-8,cp1251,koi8-r,cp866
    set nobin
    set fileencoding=utf-8 bomb

And if you want to create a new file with a BOM:

    c:\> gvim umlaute.txt

It is working now!

沧笙踏歌 2024-12-09 02:20:45

:help bomb reveals the following information:

When writing a file and the following conditions are met, a BOM (Byte Order Mark) is prepended to the file:

  • this option is on (edit: i.e. ':set bomb')
  • the 'binary' option is off
  • 'fileencoding' is "utf-8", "ucs-2", "ucs-4" or one of the little/big endian variants.

Some applications use the BOM to recognize the encoding of the file.
Often used for UCS-2 files on MS-Windows. For other applications it
causes trouble, for example: "cat file1 file2" makes the BOM of file2
appear halfway the resulting file. Gcc doesn't accept a BOM.
When Vim reads a file and 'fileencodings' starts with "ucs-bom", a
check for the presence of the BOM is done and 'bomb' set accordingly.
Unless 'binary' is set, it is removed from the first line, so that you
don't see it when editing. When you don't change the options, the BOM
will be restored when writing the file.
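The three write-time conditions quoted above can be restated as a single predicate. This is a Python sketch of the rule as the help text words it, not of Vim's source; the function name `writes_bom` is hypothetical, and the encoding check only covers the names the help text mentions.

```python
def writes_bom(bomb, binary, fileencoding):
    """Return True if, per the quoted help text, a BOM would be written."""
    # "utf-8", "ucs-2", "ucs-4" or a little/big endian variant such as "ucs-2le".
    is_unicode = fileencoding == "utf-8" or fileencoding.startswith(("ucs-2", "ucs-4"))
    return bomb and not binary and is_unicode


cases = [
    writes_bom(True, False, "utf-8"),    # all three conditions hold
    writes_bom(True, False, "latin1"),   # not a Unicode file encoding
    writes_bom(True, True, "utf-8"),     # 'binary' is set
    writes_bom(False, False, "ucs-2le"), # 'bomb' is off
]
print(cases)
```

Read this way, setting bomb alone is not enough: the file must also be written with a Unicode 'fileencoding'.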

So try setting this in your .vimrc:

set fileencodings=ucs-bom,utf-8,latin1
set nobin
setglobal fileencoding=utf-8