Cleaning up weird encodings in Ruby

Posted 2024-08-14 17:05:42

I'm currently playing around with CouchDB.
I'm trying to migrate some blog data from Redis (key value store) to CouchDB (key value store).
Seeing as I have probably migrated this data a gazillion times between different blogging engines (everybody has got to have a hobby :) ), there seem to be some encoding snafus.
I'm using CouchREST to access CouchDB from Ruby and I'm getting this:

<JSON::GeneratorError: source sequence is illegal/malformed>

The problem seems to be the body_html part of the object:

<Post:0x00000000e9ee18 @body_html="[.....]Wie Sie bereits wissen, m\xF6chte EUserv k\xFCnftig seine  [...]

Those are supposed to be Umlauts ("möchte" and "künftig").

Any idea how to get rid of those problems? I tried some conversions using the Ruby 1.9 encoding features or iconv before inserting, but haven't had any luck yet :(

If I try to e.g. convert that stuff to ISO-8859-1 using the .encode() method of Ruby 1.9, this is what happens (different text, same problem):

#<Encoding::UndefinedConversionError: "\xC6\x92" from UTF-8 to ISO-8859-1>

吹梦到西洲 2024-08-21 17:05:42

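"I try to e.g. convert that stuff to ISO-8859-1"

Close. You actually want to do it the other way around: you've got ISO-8859-1 (*), and you want UTF-8 (**). So str.encode('utf-8', 'iso-8859-1') is more likely to do the trick.

*: Actually, you may well have Windows code page 1252, which is like ISO-8859-1 but has extra smart quotes and other characters in the range 0x80-0x9F, which ISO-8859-1 uses for control codes. If so, use 'cp1252' instead.

**: Well, you probably do. Working with UTF-8 is the best way forward, since it lets you store all possible characters. If you really want to keep working in ISO-8859-1/cp1252, then the problem is probably just that Ruby has mis-guessed the character set in use, and you could fix it by calling str.force_encoding('iso-8859-1').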
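
As a rough sketch of the two options above, the following assumes the stored bytes really are ISO-8859-1/cp1252 that Ruby has tagged as UTF-8. The helper name to_utf8 and the sample string are made up for illustration; they are not part of CouchREST or of the original post.

```ruby
# Sample bytes like the ones in the question: Latin-1 data that Ruby
# has tagged as UTF-8, so the string is not valid UTF-8.
raw = "m\xF6chte k\xFCnftig".dup.force_encoding('UTF-8')
raw.valid_encoding?   # => false

# Hypothetical helper: leave valid UTF-8 alone; otherwise treat the bytes
# as `source` (cp1252 assumed by default) and transcode them to UTF-8.
# Bytes with no mapping in the source encoding are replaced, not raised on.
def to_utf8(str, source = 'Windows-1252')
  return str if str.encoding == Encoding::UTF_8 && str.valid_encoding?
  str.encode('UTF-8', source, invalid: :replace, undef: :replace)
end

fixed = to_utf8(raw)
fixed.valid_encoding?  # => true

# The alternative from the second footnote: keep the bytes as they are and
# only correct the label Ruby attaches to them.
latin1 = raw.dup.force_encoding('ISO-8859-1')
latin1.valid_encoding? # => true
```

Running something like to_utf8 over body_html before saving should give the JSON generator well-formed UTF-8, provided the cp1252 assumption holds for the data. Text that has been converted more than once in the past may need closer inspection.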