Ruby 1.9, YAML and string encodings: how to lead a life of sanity?
It seems to me that the YAML library that ships with Ruby 1.9 is encoding-deaf.
What this means is that when generating YAML, it will take any string of bytes and escape any byte sequence that doesn't output as clean ASCII. That's lame, but acceptable.
My problem is the other way around: loading content back from said YAML dump.
In the example that follows I create a UTF-8 string and dump it; it is dumped with the type !binary. When I load it back, it has the encoding ASCII-8BIT. At the end of the example I try to concatenate both the original and the reloaded string with another UTF-8 string. The latter fails with an Encoding::CompatibilityError.
require 'yaml'
s0 = "Iñtërnâtiônàlizætiøn"
y = s0.to_yaml
s1 = YAML::load y
puts s0 # => Iñtërnâtiônàlizætiøn
puts s0.encoding # => UTF-8
puts s1 # => Iñtërnâtiônàlizætiøn
puts s1.encoding # => ASCII-8BIT
puts y # => --- !binary |
# ScOxdMOrcm7DonRpw7Ruw6BsaXrDpnRpw7hu
puts "ñårƒ" + s0 # => ñårƒIñtërnâtiônàlizætiøn
puts "ñårƒ" + s1 # => Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT
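The !binary payload in the dump above is the Base64 encoding of the string's UTF-8 bytes, so nothing is lost in the round trip; only the encoding label is. A minimal sketch of that, using String#unpack and force_encoding (the helper name decode_binary_scalar is made up for illustration and is not part of YAML):

```ruby
# Hypothetical helper: decodes the Base64 text of a !binary scalar.
# "m" is the Base64 directive of String#unpack; the result is tagged ASCII-8BIT.
def decode_binary_scalar(base64_text)
  base64_text.unpack("m").first
end

# Payload copied from the dump shown above.
bytes = decode_binary_scalar("ScOxdMOrcm7DonRpw7Ruw6BsaXrDpnRpw7hu")
puts bytes.encoding            # ASCII-8BIT
puts bytes.bytesize            # 27

# Relabel (not transcode) the same bytes as UTF-8.
text = bytes.dup.force_encoding("UTF-8")
puts text.valid_encoding?      # true
puts text
```

The bytes are already valid UTF-8, which is why force_encoding (a relabel) is the right tool here rather than encode (a conversion).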
I think it's clear how this will quickly lead to trouble when you're dealing with some YAML source containing nested hashes and arrays with leaf strings.
Currently I have some code that traverses all hashes and arrays and calls force_encoding on each string. That, to say the least, is unsightly.
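For reference, the kind of traversal described here can be sketched roughly as follows. This is not the code from the question, just one possible shape for it; the name utf8_copy is hypothetical, and it assumes every string in the structure really holds UTF-8 bytes. It builds a copy instead of mutating in place, because Hash freezes its string keys and force_encoding cannot be called on a frozen string.

```ruby
# Hypothetical helper (not part of YAML): returns a copy of a nested
# structure in which every String is relabeled as UTF-8.
# Assumption: the underlying bytes are already UTF-8.
def utf8_copy(obj)
  case obj
  when String
    # dup first: hash keys are frozen, and the caller's data stays untouched.
    obj.dup.force_encoding("UTF-8")
  when Array
    obj.map { |item| utf8_copy(item) }
  when Hash
    obj.each_with_object({}) do |(key, value), out|
      out[utf8_copy(key)] = utf8_copy(value)
    end
  else
    obj # numbers, nil, symbols and so on pass through unchanged
  end
end

# Simulate what YAML.load hands back for a !binary scalar.
raw = "I\xC3\xB1t\xC3\xABrn".dup.force_encoding("ASCII-8BIT")
loaded = { "title" => raw, "tags" => [raw.dup], "count" => 3 }

fixed = utf8_copy(loaded)
puts fixed["title"].encoding     # UTF-8
puts "ñårƒ" + fixed["title"]     # concatenation no longer raises
```

Calling it as utf8_copy(YAML.load(y)) keeps the ugliness in one place, though it remains a workaround rather than a way of making the loader itself encoding-aware.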
What I'm looking for right now is a way to tell YAML::load that any string that comes in should be treated as UTF-8, and therefore have its encoding set to UTF-8.
Ideally, Ruby's YAML should just annotate the strings it dumps with the proper encoding. There's a Ya2YAML project that attempts to dump UTF-8 safe YAML. I'm not sure how far along it is. If anyone has played with it, I welcome any thoughts.
Regardless of that, I still have to deal with these dumps, which carry no encoding information, although I know they are all UTF-8.
Comments (4)
Consider upgrading your Ruby to the latest 1.9.2.
I found that bug in 1.9.1 but not in 1.9.2.
Firstly, the text file that you're attempting to read must be UTF-8 encoded (this should be your YAML file).
Then add this line to the top of your Ruby file, hash and all
This will mean that the default encoding for all strings will be UTF-8, and should mean that any text you dump with YAML.dump('text'), or even string literals 'like this', should also be encoded as UTF-8, and all should work well from here on in.
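The line itself is missing from the answer as captured here. "Hash and all" suggests a magic encoding comment, so the sketch below assumes that is what was meant; treat it as a guess, not as the answerer's exact text. The magic comment declares the source encoding of the file, which governs the string literals written in it. It does not by itself relabel a string that arrives tagged as ASCII-8BIT.

```ruby
# encoding: utf-8
# Assumption: the comment on the first line is the "line" the answer means.
# It has to be the first line of the file (or the second, after a shebang).

# Hypothetical helper, only here to show what a literal in this file looks like.
def sample_literal
  "Iñtërnâtiônàlizætiøn"
end

puts sample_literal.encoding         # UTF-8
puts sample_literal.valid_encoding?  # true
```

On Ruby 2.0 and later the source encoding already defaults to UTF-8, so the comment mainly matters on 1.9.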
Evgeny's answer still shows binary for me, but this works ('syck' instead of 'psych'):
I'm using Ruby 1.9. Note that for my purposes having the special characters escaped is fine; I just needed it not to show !binary... for normal words. Thank god .to_yaml is functional for me again; I used to use it all the time. How to lead a life of sanity indeed :)