How do I safely parse multibyte feeds in Ruby/Rails?

Posted 2024-07-27 01:57:07

(Sorry if a newb question...I've done quite a bit of research, honestly...)

I'm writing some Ruby on Rails code to parse RSS/ATOM feeds. My code is throwing up on a pesky '£' symbol.

I've been trying the approach of normalizing the description and title fields of the feeds before doing anything else:

descr = self.description.mb_chars.normalize(:kc)

However, when it hits the string with the '£', I'm guessing that mb_chars hits a problem and returns a regular Ruby String object. I get the error:

undefined method `normalize' for #<String:0x5ef8490>
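A minimal sketch of a more defensive normalization step, assuming plain Ruby 2.2 or later (String#scrub and String#unicode_normalize) instead of mb_chars, and assuming the text is meant to be UTF-8. The helper name normalize_text is hypothetical. It replaces bytes that are not valid UTF-8 before normalizing, so a stray byte does not make the later call fail:

```ruby
# Hypothetical helper: returns an NFKC-normalized UTF-8 copy of +text+.
# Assumption: the bytes are supposed to be UTF-8; anything invalid becomes '?'.
def normalize_text(text)
  utf8 = text.dup.force_encoding('UTF-8')
  # Replace invalid byte sequences up front instead of failing later.
  utf8 = utf8.scrub('?') unless utf8.valid_encoding?
  utf8.unicode_normalize(:nfkc)
end

puts normalize_text("\uFB01le")    # the "fi" ligature is folded by NFKC
puts normalize_text("\xA35m".b)    # a lone 0xA3 byte is not valid UTF-8
```

This only hides the bad byte. If the feed is really in another encoding, transcoding it from that encoding keeps the '£' intact.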

So what is the best process to defensively prep these strings for insertion into the database? (I need to do a bunch of string processing on them as well)

My problem is compounded in that I don't know the format of the feed I'm processing. For instance, I've had some luck with the following line:

descr = Iconv.new('UTF-8//IGNORE', 'UTF-8').iconv descr

However, when it encounters the '£' it simply truncates everything after that point.

When I display the '£' symbol with the String.inspect function, it displays as '\243'. Failing a method to 'correctly' deal with this symbol, I'd be happy enough to replace it with another value (like 'GBP'). So help with that code would be appreciated as well.
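For reference, '\243' is octal for the single byte 0xA3, which is '£' in ISO-8859-1 but is not a valid UTF-8 sequence on its own. The sketch below assumes the feed bytes are ISO-8859-1 (an assumption, not something the feed has confirmed) and uses String#encode, available since Ruby 1.9. The helper names are hypothetical:

```ruby
# Assumption: +bytes+ holds ISO-8859-1 (Latin-1) data, where 0xA3 is the pound sign.
def latin1_to_utf8(bytes)
  bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

# Optional fallback from the question: swap the pound sign for the text 'GBP'.
def pounds_to_gbp(utf8_text)
  utf8_text.gsub("\u00A3", 'GBP')
end

raw   = "Fee: \xA35m".b          # hypothetical raw bytes as read from the feed
descr = latin1_to_utf8(raw)      # now a valid UTF-8 string containing the pound sign
puts pounds_to_gbp(descr)
```

Once the string is valid UTF-8 the substitution is optional; the '£' can be stored as-is if the database column is UTF-8.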

The feed in question is http://www.dailymail.co.uk/sport/football/index.rss

孤凫 2024-08-03 01:57:07

I've found one solution:

To fix it, I had to define the $KCODE (encoding) for the document:

require 'rubygems'
require 'active_support/all'

$KCODE = 'UTF8'  # only has an effect on Ruby 1.8; Ruby 1.9 and later ignore $KCODE

str = "test ščž"
puts str.parameterize.inspect
puts str.parameterize.to_s

=> #
=> test-scz

Original post: https://rails.lighthouseapp.com/projects/8994/tickets/3504-string-parameterize-normalize-bug
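On Ruby 1.9 and later, where each string carries its own encoding and $KCODE has no effect, a similar result can be sketched without ActiveSupport. This is only an approximation of what parameterize does, not its actual implementation, and the helper name ascii_slug is hypothetical:

```ruby
# Hypothetical helper: rough, ASCII-only slug built with core Ruby (2.2+).
def ascii_slug(text)
  text.unicode_normalize(:nfkd)   # split "š" into "s" plus a combining caron
      .gsub(/\p{Mn}/, '')         # drop the combining marks
      .downcase
      .gsub(/[^a-z0-9]+/, '-')    # collapse everything else into hyphens
      .gsub(/\A-+|-+\z/, '')      # trim leading and trailing hyphens
end

puts ascii_slug("test ščž")
```

The input must already be valid UTF-8 for the normalization step to work, so this belongs after any transcoding of the feed.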

牵你手 2024-08-03 01:57:07

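I was missing something very basic: I was guessing at the encoding of the incoming feed.

So now I look at (a) the charset in the HTTP response header, then (b) the encoding in the XML declaration of the feed itself.

Once I have the encoding, I use iconv to convert it to UTF-8.

So far, so good.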
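The lookup order described above can be sketched roughly as follows. This is not the poster's code: the helper names are hypothetical, it uses String#encode rather than iconv (Iconv is no longer part of Ruby's standard library as of 2.0), and it assumes an ASCII-compatible feed encoding so the XML declaration can be read from the raw bytes:

```ruby
# (a) charset from an HTTP Content-Type header value, e.g. "text/xml; charset=ISO-8859-1"
def charset_from_content_type(content_type)
  return nil if content_type.nil?
  match = content_type.match(/charset\s*=\s*["']?([^\s;"']+)/i)
  match && match[1]
end

# (b) encoding from the XML declaration at the start of the feed body
def charset_from_xml_declaration(body)
  head = body.byteslice(0, 200).b   # inspect only the first bytes, as binary
  match = head.match(/<\?xml[^>]*encoding\s*=\s*["']([A-Za-z0-9._-]+)["']/i)
  match && match[1]
end

def feed_to_utf8(body, content_type = nil)
  name = charset_from_content_type(content_type) ||
         charset_from_xml_declaration(body) ||
         'UTF-8'                     # XML's default when nothing is declared
  encoding = begin
    Encoding.find(name)
  rescue ArgumentError
    Encoding::UTF_8                  # assumption: treat unknown labels as UTF-8
  end
  utf8 = body.dup.force_encoding(encoding)
             .encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
  utf8.scrub('?')                    # covers the UTF-8 to UTF-8 case as well
end
```

A caller would pass the raw response body and the Content-Type header value, for example feed_to_utf8(response.body, response['Content-Type']) with Net::HTTP. Declared encodings can be wrong in practice, so invalid or undefined bytes are replaced with '?' rather than raising.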
