File encoding producing a blank character in Ruby: why?
I'm using this little bit of Ruby:
File.open(ARGV[0], "r").each_line do |line|
  puts "encoding: #{line.encoding}"
  line.chomp.split(//).each do |char|
    puts "[#{char}]"
  end
end
I have a sample file that I'm feeding in; it contains just three periods and a newline.
When I save this file with a fileencoding of utf-8 (in vim, :set fileencoding=utf-8) and run this script on it, I get this output:
encoding: UTF-8
[]
[.]
[.]
[.]
Then, if I change the fileencoding to latin1 (in vim, :set fileencoding=latin1) and run the script, I don't get that first blank char:
encoding: UTF-8
[.]
[.]
[.]
What's going on here? I understand that UTF-8 files can start with a few bytes that mark the file as UTF-8 encoded, but I thought they were supposed to be invisible when processing the text (i.e., the Ruby runtime was supposed to handle them). What am I missing?
btw:
ubuntu:~$ ruby --version
ruby 1.9.2p0 (2010-08-18 revision 29034) [i686-linux]
Thanks!
Update:
Hex dump of the file with the extra char (the BOM):
ubuntu:~$ hexdump new.board
0000000 bbef 2ebf 2e2e 0a0d 0a0d
000000a
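Plain hexdump prints 16-bit words in host byte order, so on this little-endian i686 machine each pair of bytes appears swapped. A small sketch that undoes the swap for the row above (words_to_bytes is a hypothetical helper written for illustration, not a library call):

```ruby
# Hypothetical helper: undo hexdump's default 16-bit little-endian grouping
# by swapping the two bytes inside each four-digit word.
def words_to_bytes(words)
  words.flat_map { |w| [w[2, 2], w[0, 2]] }
end

words = %w[bbef 2ebf 2e2e 0a0d 0a0d] # first row of the dump above
bytes = words_to_bytes(words)
puts bytes.join(" ") # → ef bb bf 2e 2e 2e 0d 0a 0d 0a
```

The first three bytes, EF BB BF, are the UTF-8 BOM; the rest is three periods followed by two CRLF line endings.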
Comments (1)
Try running your script with dump called on each character and see what you get. This will print the escape codes of any nonprinting characters.
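As a minimal sketch of that suggestion, assuming the line read from the file matches the question's hex dump (BOM, three periods, CRLF), String#dump and the raw bytes make the invisible first character visible. The variable names here are illustrative only:

```ruby
# Assumed input: the first line of the file from the question's hex dump.
line = "\uFEFF...\r\n" # BOM + three periods + CRLF

first = line.chomp.split(//).first

puts "[#{first}]" # looks empty in a terminal: U+FEFF has zero width
puts first.dump   # escaped form, so the character is no longer invisible

hex   = first.bytes.map { |b| "%02X" % b }.join(" ")
octal = first.bytes.map { |b| "\\%o" % b }.join
puts hex   # → EF BB BF
puts octal # → \357\273\277
```

The exact text printed by dump depends on the Ruby version, which is why only the byte values are shown as expected output.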
It doesn't look like the UTF-8 byte order mark. If I set the BOM using :set bomb in vim on the file and try your code, dump gives me the octal representation of the BOM (EF BB BF in hex).
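One way to keep a BOM out of the split characters, assuming a Ruby version that accepts the BOM|UTF-8 prefix in the open mode, is to let the IO layer consume it. This sketch recreates the question's ten bytes in a scratch file (the Tempfile name is arbitrary) and compares the two modes:

```ruby
require "tempfile"

# Recreate the file from the question's hex dump: BOM, "...", CRLF, CRLF.
tmp = Tempfile.new("board")
tmp.binmode
tmp.write([0xEF, 0xBB, 0xBF].pack("C*") + "...\r\n\r\n")
tmp.close

# Plain UTF-8 read: the BOM arrives as the first character of the line.
plain = File.open(tmp.path, "r:UTF-8") { |f| f.gets }
# BOM-aware read: a leading BOM, if present, is skipped on open.
bom_aware = File.open(tmp.path, "r:BOM|UTF-8") { |f| f.gets }

plain_chars = plain.chomp.split(//)
bom_chars   = bom_aware.chomp.split(//)

puts plain_chars.length # → 4 (BOM plus three periods)
puts bom_chars.length   # → 3 (three periods only)

tmp.unlink
```

With the plain mode, the extra first element is the U+FEFF character, which matches the blank `[]` line in the question's output.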