Ruby: Converting encoded characters to actual UTF-8 characters
Ruby will not play nicely with UTF-8 strings. I am passing data in an XML file and, although the XML document is specified as UTF-8, Ruby treats the ASCII encoding (two bytes per character) as individual characters.
I have started encoding the input strings in the '\uXXXX' format; however, I cannot figure out how to convert this to an actual UTF-8 character. I have been searching all over this site and Google to no avail, and my frustration is pretty high right now. I am using Ruby 1.8.6.
Basically, I want to convert the string '\u03a3' -> "Σ".
What I have is:
data.gsub /\\u([a-zA-Z0-9]{4})/, $1.hex.to_i.chr
which of course gives a "931 out of char range" error.
Thank you
Tim
Comments (4)
Try this:

where 0x50 is the hex code of the UTF-8 character.
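The code sample for this answer is missing from this copy. One approach that fits the description is Array#pack with the "U" directive, which encodes an integer code point as UTF-8 bytes. The sketch below is an assumption, not necessarily what the answerer posted, and decode_unicode_escapes is a hypothetical helper name. It uses the block form of gsub so that $1 is read for each match; in the question's one-line form, $1 is evaluated before gsub runs at all. It handles only four-digit escapes and ignores surrogate pairs.

```ruby
# Hypothetical helper: replace each literal \uXXXX escape with the
# UTF-8 encoding of that code point, using Array#pack("U").
def decode_unicode_escapes(data)
  # Block form: $1 is set for every match before the block runs.
  data.gsub(/\\u([0-9a-fA-F]{4})/) { [$1.hex].pack("U") }
end

# 0x50 is the code point of "P"; 0x3a3 is the code point of the sigma
# character from the question.
puts([0x50].pack("U"))
puts decode_unicode_escapes('\u03a3')
```

Array#pack("U") does not depend on the Encoding class, so it should not need anything newer than the asker's Ruby version.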
Does something break because Ruby strings treat UTF-8 encoded code points as two characters? If not, then you should not worry too much about it. If something does break, please add a comment to let us know. It is probably better to solve that problem than to look for a workaround.
If you need to do conversions, look at the Iconv library.
In any case, the XML character reference &#x3a3; (Σ) could be a better alternative to \u03a3. \uXXXX is used in JSON, but not in XML. If you want to parse the \uXXXX format, look at how some JSON library does it.
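As a rough illustration of the JSON suggestion: the json library that ships with Ruby 1.9 and later (it was not part of the standard library in 1.8.6, so this assumes a newer Ruby or a separately installed gem) decodes \uXXXX escapes while parsing. The helper name decode_via_json is hypothetical, and the sketch assumes the input contains no unescaped double quotes or other stray backslashes.

```ruby
require "json"

# Hypothetical helper: let the JSON parser decode \uXXXX escapes.
# The text is wrapped in a JSON array so that any version of the
# json library accepts it as a document.
def decode_via_json(escaped)
  JSON.parse(%(["#{escaped}"])).first
end

puts decode_via_json('\u03a3 is sigma')
```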
Ruby (at least 1.8.6) doesn't have full Unicode support. Integer#chr only supports ASCII characters and otherwise only values up to 255 ('\377' in octal notation).
To demonstrate:

You might try upgrading to Ruby 1.9. The chr docs don't explicitly state ASCII, so support may have expanded, though the examples stop at 255.
Or, you might try investigating ruby-unicode. I've never tried it myself, so I don't know how well it'll help.
Otherwise, I don't think you can do quite what you want in Ruby, currently.
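The demonstration for this answer is also missing from this copy. A minimal sketch of the limit it describes, assuming default encoding settings (on Ruby 1.9 and later, Encoding.default_internal left unset); try_chr is a hypothetical helper name:

```ruby
# Hypothetical helper: call Integer#chr without an encoding argument
# and return the RangeError instead of raising it.
def try_chr(code_point)
  code_point.chr
rescue RangeError => e
  e
end

puts try_chr(80).inspect    # within ASCII
puts try_chr(255).inspect   # largest value accepted without an encoding
result = try_chr(931)       # the sigma code point from the question
puts result.message if result.is_a?(RangeError)
```

The last call is the one that produces the "931 out of char range" error quoted in the question.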
You can pass an encoding to Integer#chr. So instead of using .chr, use .chr(Encoding::UTF_8).
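A short sketch of that suggestion applied to the question's substitution. The Encoding class exists only in Ruby 1.9 and later, so this does not apply to 1.8.6. The helper name decode_with_chr is hypothetical, and the input is assumed to be a UTF-8 string containing four-digit escapes.

```ruby
# Hypothetical helper: decode \uXXXX escapes with Integer#chr and an
# explicit target encoding (requires Ruby 1.9 or later).
def decode_with_chr(data)
  data.gsub(/\\u([0-9a-fA-F]{4})/) { $1.hex.chr(Encoding::UTF_8) }
end

puts decode_with_chr('\u03a3')
```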