File encoding generating a blank char in Ruby: why?

Posted on 2024-09-25 19:48:20

I'm using this little bit of ruby:

File.open(ARGV[0], "r").each_line do |line|
  puts "encoding: #{line.encoding}"
  line.chomp.split(//).each do |char|
    puts "[#{char}]"
  end
end

The sample file I'm feeding in contains just three periods and a newline.

When I save this file with a fileencoding of utf-8 (in vim: set fileencoding=utf-8) and run this script on it I get this output:

encoding: UTF-8
[]
[.]
[.]
[.]

And then if I change the fileencoding to latin1 (in vim: set fileencoding=latin1) and run the script, I don't get that first blank char:

encoding: UTF-8
[.]
[.]
[.]

What's going on here? I understand that the UTF-8 encoding puts some bytes at the start of the file to mark it as UTF-8 encoded, but I thought they were supposed to be invisible when processing the text (i.e., the Ruby runtime was supposed to handle them). What am I missing?

btw:

ubuntu:~$ ruby --version
ruby 1.9.2p0 (2010-08-18 revision 29034) [i686-linux]

Thanks!

Update:

Hex dump of the file with the extra char (the BOM):

ubuntu:~$ hexdump new.board
0000000 bbef 2ebf 2e2e 0a0d 0a0d
000000a
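
A note on reading the dump above: by default, hexdump prints 16-bit words in the host's byte order, so on a little-endian machine each pair of bytes appears swapped (`hexdump -C` prints them byte by byte instead). Below is a minimal sketch of undoing that swap, assuming the little-endian layout that the i686 build suggests. The helper name `words_to_bytes` is made up for this illustration.

```ruby
# Hypothetical helper: convert hexdump's default 16-bit little-endian
# words back into the byte sequence as it is stored in the file.
def words_to_bytes(words)
  words.flat_map do |word|
    value = word.to_i(16)
    [value & 0xFF, value >> 8] # the low byte comes first on disk
  end
end

UTF8_BOM = [0xEF, 0xBB, 0xBF].freeze

bytes = words_to_bytes(%w[bbef 2ebf 2e2e 0a0d 0a0d])
puts bytes.map { |b| format("%02x", b) }.join(" ")
puts "starts with UTF-8 BOM: #{bytes.first(3) == UTF8_BOM}"
```

Read this way, the file begins with EF BB BF, followed by the three periods and two CR LF pairs.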

心作怪 2024-10-02 19:48:20

Try running

data = IO.read(ARGV[0])
puts data.dump

and see what you get. This will print the escape codes of any nonprinting characters.

It doesn't look like the UTF-8 byte order mark. If I set the BOM on the file using :set bomb in vim and try your code, I get

[?]
[?]
[?]
[.]
[.]
[.]

while dump gives me

"\357\273\277...\n"

which is the octal representation of the BOM (EF BB BF in hex).
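
If the goal is to keep the BOM out of the processed text, Ruby's open mode accepts a "BOM|UTF-8" external encoding, which skips a leading BOM when one is present. Below is a minimal sketch, assuming a reasonably modern Ruby (it uses String#b and Tempfile.create, which the asker's 1.9.2 does not have). The helper name `read_without_bom` and the temp file are illustrative only.

```ruby
require "tempfile"

# Hypothetical helper: read a whole file as UTF-8, skipping a leading BOM.
def read_without_bom(path)
  File.open(path, "r:BOM|UTF-8") { |f| f.read }
end

Tempfile.create("bom_demo") do |sample|
  # Build a sample file: UTF-8 BOM, three periods, newline.
  sample.binmode
  sample.write("\xEF\xBB\xBF...\n".b)
  sample.flush

  # Plain "r:UTF-8" keeps the BOM as the first character (U+FEFF).
  plain = File.open(sample.path, "r:UTF-8") { |f| f.read }
  stripped = read_without_bom(sample.path)

  puts plain.dump
  puts stripped.dump
  puts stripped.chomp.split(//).map { |c| "[#{c}]" }.join(" ")
end
```

With the BOM-aware mode, the per-character loop from the question should see only the three periods. A file without a BOM is read unchanged.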

About the author: 浅笑轻吟梦一曲
