Why are two strings with the same bytes and encoding not identical in Ruby 1.9?
In Ruby 1.9.2, I found a way to make two strings that have the same bytes and the same encoding, and that compare as equal, yet have a different length and return different characters from [].
Is this a bug? If it is not a bug, then I'd like to fully understand it. What kind of information is stored inside Ruby 1.9.2 String objects that allows these two strings to behave differently?
Below is the code that reproduces this behavior. The comments that start with #=> show the output I am getting from this script, and the parenthetical words give my judgment of that output.
#!/usr/bin/ruby1.9
# coding: utf-8
string1 = "\xC2\xA2" # A well-behaved string with one character (¢)
string2 = "".concat(0xA2) # A bizarre string very similar to string1.
p string1.bytes.to_a #=> [194, 162] (good)
p string2.bytes.to_a #=> [194, 162] (good)
puts string1.encoding.name #=> UTF-8 (good)
puts string2.encoding.name #=> UTF-8 (good)
puts string1 == string2 #=> true (good)
puts string1.length #=> 1 (good)
puts string2.length #=> 2 (weird!)
p string1[0] #=> "¢" (good)
p string2[0] #=> "\xC2" (weird!)
I am running Ubuntu and compiled Ruby from source. My Ruby version is:
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]
Comments (3)
This is a bug in Ruby, and it was fixed in r29848.
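To check whether a particular Ruby build still shows the inconsistency, the comparison from the question can be condensed into a small script. This is only a sketch: `describe` is a hypothetical helper introduced here for illustration, not something from the thread, and `"".dup` is used so the example also runs on Rubies where string literals are frozen.

```ruby
# coding: utf-8

# Hypothetical helper (not from the thread): collects the properties
# that the question compares for a given string.
def describe(str)
  {
    bytes: str.bytes.to_a,
    encoding: str.encoding.name,
    length: str.length,
    first: str[0]
  }
end

string1 = "\xC2\xA2"            # the two UTF-8 bytes of ¢, written literally
string2 = "".dup.concat(0xA2)   # append code point U+00A2 to an empty string

report1 = describe(string1)
report2 = describe(string2)

p report1
p report2

# On a build with the fix, both reports should match in every field.
puts(report1 == report2 ? "consistent" : "inconsistent (bug present)")
```

If the two reports differ only in `length` and `first`, the build behaves like the 1.9.2p0 described in the question.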
Matz mentioned this question via Twitter:
http://twitter.com/matz_translator/status/6597021662187520
http://twitter.com/matz_translator/status/6597055132733440
"It's hard to determine as a bug but, it's not acceptable to leave it as is. I'd prefer to fix this issue."
I think the problem is in the string's encoding. Check out James Gray's article "Ruby 1.9's String" on his blog Shades of Gray, which covers Unicode encoding.
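As general background on how the encoding tag affects length and indexing, here is a minimal sketch. It retags a copy of the same bytes as binary, which is a different situation from the question, where both strings already report UTF-8.

```ruby
# coding: utf-8

utf8   = "\xC2\xA2"                          # two bytes, interpreted as UTF-8
binary = utf8.dup.force_encoding("BINARY")   # same bytes, interpreted as raw bytes

p utf8.bytes.to_a == binary.bytes.to_a  #=> true
p utf8.length                           #=> 1
p binary.length                         #=> 2
p utf8[0].bytes.to_a                    #=> [194, 162]
p binary[0].bytes.to_a                  #=> [194]
```

Here the difference in length and in [] is expected, because the two strings carry different encodings.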
Additional odd behavior: