Why does Rails 3 think \xE2\x80\x99 means â \x80 \x99

Posted on 2024-11-19 03:20:01

I have a field scraped from a utf-8 page:

"O’Reilly"

And saved in a yml file:

:name: "O\xE2\x80\x99Reilly"

(\xE2\x80\x99 is the correct UTF-8 representation of this apostrophe)

However when I load the value into a hash and yield it to a page tagged as utf-8, I get:

OâReilly

I looked up the character â, which is encoded in UTF-16 as 0x00E2, and the characters \x80 and \x99 were invisible but present after the â when I pasted the string. I assume this means my app is outputting three UTF-16 characters instead of one UTF-8 character.

How do I make Rails interpret a 3-byte UTF-8 sequence as a single character?

Comments (3)

丢了幸福的猪 2024-11-26 03:20:01

Ruby strings (in Ruby 1.8, at least) are sequences of bytes rather than characters:

$ irb
>> "O\xE2\x80\x99Reilly"
=> "O\342\200\231Reilly"

Your string is a sequence of 10 bytes but 8 characters (as you know). The safest way to make sure you output the correct string in HTML (I assume you want HTML since you mentioned Rails) is to convert non-ASCII characters to HTML entities; in your case to

O&#8217;Reilly

This takes some work, but it should help in cases where you send your HTML as UTF-8 but your end user has set his or her browser to override that and show Latin-1 or some other silly restricted charset.
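
The byte/character counts and the entity conversion described above can be sketched as follows. This is a minimal illustration for Ruby 1.9 or later, not the answerer's own code; the helper name `to_ascii_html` is hypothetical, and it only handles non-ASCII characters (it does not escape `&`, `<` or `>`).

```ruby
# The name from the question, written with a \u escape so the example
# does not depend on the encoding of this source file.
name = "O\u2019Reilly"

byte_count = name.bytesize  # UTF-8 bytes: 7 ASCII letters + 3 for the apostrophe
char_count = name.length    # characters

# Hypothetical helper: replace every non-ASCII character with a decimal
# numeric character reference, leaving ASCII characters untouched.
def to_ascii_html(text)
  text.encode(Encoding::UTF_8).gsub(/[^\x00-\x7F]/) do |ch|
    format("&#%d;", ch.ord)
  end
end

puts byte_count
puts char_count
puts to_ascii_html(name)
```

Because the result is pure ASCII, it renders the same way whatever charset the browser ends up assuming.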

天冷不及心凉 2024-11-26 03:20:01

Ultimately this was caused by loading a file written by Syck (generated by an external script) with Psych (in Rails). Loading it with Syck solved the issue:

# In a plain Ruby environment
puts YAML::ENGINE.yamler  # => syck

# In Rails
puts YAML::ENGINE.yamler  # => psych

# In the web app: switch to Syck for this one load, then switch back
YAML::ENGINE.yamler = 'syck'
a = YAML::load(file_saved_with_syck)
a[index][:name]           # => "O’Reilly"
YAML::ENGINE.yamler = 'psych'
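
One plausible explanation for the difference, offered here as an assumption rather than something stated in the answer: the YAML specification defines `\xNN` inside a double-quoted scalar as an escaped 8-bit Unicode character, so a spec-following parser such as Psych turns `\xE2\x80\x99` into three separate code points rather than three raw bytes. The sketch below uses a hypothetical one-line document with a plain `name` key (instead of the `:name` symbol key from the question) and runs on a current Ruby, where Psych is the only YAML engine.

```ruby
require "yaml"

# Single quotes keep the backslashes literal, so the YAML parser
# (not Ruby) is the one interpreting the \x escapes.
yaml_text = 'name: "O\xE2\x80\x99Reilly"'

loaded_name = YAML.load(yaml_text)["name"]
code_points = loaded_name.codepoints.map { |cp| format("U+%04X", cp) }

# Each character is now U+00E2, U+0080, U+0099. Mapping them back to single
# bytes via ISO-8859-1 and relabelling the bytes as UTF-8 recovers the name.
repaired = loaded_name.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)

# Writing the escape as a Unicode code point avoids the ambiguity entirely.
unambiguous = YAML.load('name: "O\u2019Reilly"')["name"]

puts code_points.join(" ")
puts repaired
puts unambiguous
```

If the external script can be changed, emitting `\u2019` (or the raw UTF-8 character) instead of `\xE2\x80\x99` would sidestep the engine switch.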

如日中天 2024-11-26 03:20:01

> I assume this means my app is outputting three UTF-16 characters instead of one UTF-8.

It's not really UTF-16, which is rarely used on the web (and largely breaks there). Your app is outputting three Unicode characters (including the two invisible control codes), but that's not the same thing as the UTF-16 encoding.
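
To make the distinction between a Unicode character and its encodings concrete, here is a small sketch (Ruby 1.9 or later assumed; the variable names are illustrative only). The character â has a single code point, U+00E2, but its byte representation depends on the encoding chosen.

```ruby
a_circumflex = "\u00E2"  # the character â

code_point  = a_circumflex.ord                              # the Unicode code point
utf8_bytes  = a_circumflex.encode(Encoding::UTF_8).bytes    # two bytes in UTF-8
utf16_bytes = a_circumflex.encode(Encoding::UTF_16BE).bytes # two bytes in UTF-16 (big-endian)

puts format("U+%04X", code_point)
puts utf8_bytes.map { |b| format("%02X", b) }.join(" ")
puts utf16_bytes.map { |b| format("%02X", b) }.join(" ")
```

The value 00E2 looks like "UTF-16" only because UTF-16 happens to store this code point as its own number; the page in question would still be sending it as the UTF-8 bytes C3 A2.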

The problem would seem to be that the YAML file is being read in as if it were ISO-8859-1-encoded, so that the \xE2 byte maps to character U+00E2 and so on. I am guessing you are using Ruby 1.9 and the YAML is being parsed into byte strings with an associated ASCII-8BIT encoding instead of UTF-8, causing the strings to undergo a round of transcoding (mangling) later.

If this is the case you might have to force_encoding the read strings back to what they should have been, or set Encoding.default_internal so that the strings are read back in as UTF-8. Bit of a mess, this.
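
A minimal sketch of that diagnosis and of the `force_encoding` repair, assuming Ruby 1.9 or later. The variable names are illustrative, and the binary string stands in for whatever the YAML loader actually returned in the asker's app.

```ruby
# Bytes as they might come back from a loader that does not tag them as UTF-8.
raw = "O\xE2\x80\x99Reilly".b  # ASCII-8BIT (binary) string, 10 bytes

# What goes wrong: treating each byte as an ISO-8859-1 character and then
# transcoding to UTF-8 produces â plus two invisible C1 control characters.
mangled = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

# The repair: relabel the same bytes as UTF-8 without changing them.
fixed = raw.dup.force_encoding(Encoding::UTF_8)

# If only the mangled string is available, reverse the bad transcoding first.
recovered = mangled.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)

puts mangled.length    # number of characters after the bad transcoding
puts fixed.length      # number of characters after relabelling as UTF-8
puts fixed == recovered
```

`force_encoding` only changes the label on the bytes, whereas `encode` rewrites the bytes themselves, which is why the two calls are combined in the recovery step.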
