Why does Rails 3 think \xE2\x80\x99 means â \x80 \x99?
I have a field scraped from a utf-8 page:
"O’Reilly"
And saved in a yml file:
:name: "O\xE2\x80\x99Reilly"
(\xE2\x80\x99 is the correct UTF-8 representation of this apostrophe)
However when I load the value into a hash and yield it to a page tagged as utf-8, I get:
OâReilly
I looked up the character â, which is encoded in UTF-16 as 0x00E2, and the characters \x80 and \x99 were invisible but present after the â when I pasted the string. I assume this means my app is outputting three UTF-16 characters instead of one UTF-8 character.
How do I make Rails interpret a 3-byte UTF-8 sequence as a single character?
Comments (3)
Ruby strings are sequences of bytes instead of characters:
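The byte and character counts can be checked directly. The following is a minimal sketch, assuming Ruby 1.9 or later (where a string carries an encoding tag alongside its bytes) and a UTF-8 source file; the constant name is only illustrative.

```ruby
# Assumes this file is saved as UTF-8 (Ruby's default source encoding since 2.0),
# so the \x escapes below produce a UTF-8 string.
SCRAPED_NAME = "O\xE2\x80\x99Reilly"

puts SCRAPED_NAME.encoding   # encoding tag attached to the bytes
puts SCRAPED_NAME.bytesize   # number of bytes: 10
puts SCRAPED_NAME.length     # number of characters: 8

# The apostrophe alone occupies three bytes.
p SCRAPED_NAME.bytes[1, 3].map { |b| format("%02X", b) }  # ["E2", "80", "99"]
```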
Your string is a sequence of 10 bytes but 8 characters (as you know). The safest way to make sure you output the correct string in HTML (I assume you want HTML, since you mentioned Rails) is to convert non-ASCII characters to HTML entities; in your case, to O&#8217;Reilly.
This takes some work, but it should help in cases where you send your HTML as UTF-8 but your end user has set his or her browser to override that and show Latin-1 or some other silly restricted charset.
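One way to do that conversion is sketched below. The helper name to_numeric_entities is hypothetical (it is not a Rails or Ruby API), and the sketch assumes the input is a valid UTF-8 string; it does not escape &, < or >, which would still need the usual HTML escaping.

```ruby
# Hypothetical helper: replace every non-ASCII character with a decimal
# numeric character reference, leaving ASCII characters untouched.
# Raises ArgumentError if the string contains invalid byte sequences.
def to_numeric_entities(text)
  text.gsub(/[^\x00-\x7F]/) { |ch| format("&#%d;", ch.ord) }
end

puts to_numeric_entities("O\xE2\x80\x99Reilly")  # O&#8217;Reilly
```

Because the result is pure ASCII, it displays the same way whatever charset the browser ends up using.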
Ultimately this was caused by loading a YAML file written by syck (generated by an external script) with psych (in Rails). Loading it with syck instead solved the issue.
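Syck is no longer bundled with current Ruby versions, so the sketch below is not the fix described above. It only illustrates, under stated assumptions, why Psych produces the mangled value: in a double-quoted YAML scalar, the spec defines \xNN as an escape for the code point U+00NN, not for a raw byte. The helper repair_escaped_bytes is hypothetical and assumes every character in the loaded string originally stood for a single byte.

```ruby
require "yaml"

# A one-line document using the same escapes as the question's file.
# A plain string key is used instead of :name to keep the sketch independent
# of how a given Psych version treats symbol keys.
SYCK_STYLE_DOC = <<~'YAML_DOC'
  name: "O\xE2\x80\x99Reilly"
YAML_DOC

# Psych reads each \xNN escape as one code point, so the apostrophe
# becomes three separate characters.
LOADED_NAME = YAML.load(SYCK_STYLE_DOC)["name"]
p LOADED_NAME.codepoints[1, 3].map { |c| format("U+%04X", c) }
# ["U+00E2", "U+0080", "U+0099"]

# Hypothetical repair: turn each code point back into a single byte
# (ISO-8859-1 maps U+0000..U+00FF to bytes 0x00..0xFF), then re-tag the
# bytes as UTF-8. Raises if the string holds characters above U+00FF.
def repair_escaped_bytes(text)
  text.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)
end

puts repair_escaped_bytes(LOADED_NAME).length  # 8
```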
It's not really UTF-16, which is rarely used on the web (and largely breaks there). Your app is outputting three Unicode characters (including the two invisible control codes), but that's not the same thing as the UTF-16 encoding.
The problem would seem to be that the YAML file is being read in as if it were ISO-8859-1-encoded, so that the \xE2 byte maps to character U+00E2, and so on. I am guessing you are using Ruby 1.9 and the YAML is being parsed into byte strings with an associated ASCII-8BIT encoding instead of UTF-8, causing the strings to undergo a round of transcoding (mangling) later. If this is the case, you might have to force_encoding the read strings back to what they should have been, or set default_internal to cause the strings to be read back into UTF-8. Bit of a mess, this.
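The difference between transcoding and re-tagging can be sketched as follows. This assumes the bytes really do arrive tagged as ASCII-8BIT, which the answer above only guesses at; both helper names are hypothetical.

```ruby
# The three UTF-8 bytes of the apostrophe, tagged as binary (ASCII-8BIT).
raw = "O\xE2\x80\x99Reilly".b

# What goes wrong: treating each byte as an ISO-8859-1 character and
# transcoding to UTF-8 turns three bytes into three characters.
def mangle_as_latin1(bytes)
  bytes.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# The fix suggested above: force_encoding changes only the encoding tag,
# leaving the bytes alone. The validity check guards against bytes that
# were never UTF-8 in the first place.
def retag_as_utf8(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "not valid UTF-8" unless text.valid_encoding?
  text
end

puts mangle_as_latin1(raw).length  # 10 characters: the mangled form
puts retag_as_utf8(raw).length     # 8 characters: the intended form
```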