How can I safely parse multibyte feeds in Ruby/Rails?

Posted 2024-07-27 01:57:07

(Sorry if a newb question...I've done quite a bit of research, honestly...)

I'm writing some Ruby on Rails code to parse RSS/ATOM feeds. My code is choking on a pesky '£' symbol.

I've been trying the approach of normalizing the description and title fields of the feeds before doing anything else:

descr = self.description.mb_chars.normalize(:kc)

However, when it hits the string with the '£', I'm guessing that mb_chars hits a problem and returns a regular Ruby String object. I get the error:

undefined method `normalize' for #<String:0x5ef8490>

So what is the best process to defensively prep these strings for insertion into the database? (I need to do a bunch of string processing on them as well)

My problem is compounded in that I don't know the format of the feed I'm processing. For instance, I've had some luck with the following line:

descr = Iconv.new('UTF-8//IGNORE', 'UTF-8').iconv descr

However, when it encounters the '£' it simply truncates everything after that point.
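For comparison, here is a minimal sketch of the same "drop what is invalid" idea on a newer Ruby, where Iconv is no longer in the standard library (it was removed in 2.0). It assumes Ruby 2.1 or later for `String#scrub`; the helper name `scrub_to_utf8` is made up for illustration. Unlike the truncation described above, only the offending byte is replaced and the rest of the string survives.

```ruby
# Hypothetical helper (not from the original post): treat the bytes as UTF-8
# and replace any invalid byte sequence with a marker instead of truncating.
# Assumes Ruby 2.1+ for String#scrub.
def scrub_to_utf8(raw, replacement = "?")
  text = raw.dup.force_encoding(Encoding::UTF_8)
  text.valid_encoding? ? text : text.scrub(replacement)
end

# 0xA3 is the pound sign in ISO-8859-1, but it is not valid UTF-8 on its own.
raw = "Fee of \xA35m agreed".b
puts scrub_to_utf8(raw)   # prints "Fee of ?5m agreed"
```

This keeps the surrounding text, but the pound sign itself is still lost, so it is only a fallback for when the real source encoding cannot be determined.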

When I display the '£' symbol with String#inspect, it shows up as '\243'. Failing a way to deal with this symbol 'correctly', I'd be happy enough to replace it with another value (like 'GBP'). So help with that code would be appreciated as well.
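The `\243` is a useful clue: it is octal for the single byte 0xA3, which is how ISO-8859-1 (and Windows-1252) encode '£', whereas UTF-8 uses two bytes for it. A minimal sketch of both options mentioned above, assuming a Ruby with per-string encodings (1.9 or later) and assuming the feed really is ISO-8859-1; the helper names `latin1_to_utf8` and `pound_to_gbp` are invented for this example.

```ruby
# Assumption: the incoming bytes are ISO-8859-1. Both helper names are
# hypothetical and only illustrate the idea.

# Reinterpret the raw bytes as ISO-8859-1, then transcode to UTF-8.
def latin1_to_utf8(raw)
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Fallback: convert first, then swap the pound sign (U+00A3) for "GBP".
def pound_to_gbp(raw)
  latin1_to_utf8(raw).gsub("\u00A3", "GBP")
end

raw = "\2435m".b            # the byte 0xA3 followed by "5m"
puts latin1_to_utf8(raw)    # prints "£5m"
puts pound_to_gbp(raw)      # prints "GBP5m"
```

Converting before substituting avoids touching a 0xA3 byte that happens to be part of a multi-byte character in some other encoding.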

The feed in question is http://www.dailymail.co.uk/sport/football/index.rss

Comments (2)

孤凫 2024-08-03 01:57:07

I've found one solution:

To fix it, I had to define the $KCODE (encoding) for the document:

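require 'rubygems'
require 'active_support/all'

# Ruby 1.8: tell the interpreter to treat strings as UTF-8.
# (Ruby 1.9 and later no longer honor $KCODE; each string carries its own encoding.)
$KCODE = 'UTF8'

str = "test ščž"
puts str.parameterize.inspect
puts str.parameterize.to_s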

=> #
=> test-scz

Original post: https://rails.lighthouseapp.com/projects/8994/tickets/3504-string-parameterize-normalize-bug

牵你手 2024-08-03 01:57:07

I was missing something pretty basic - I was guessing at the encoding of the feed that was coming in.

So now I'm looking at (a) the charset in the HTTP response headers, then (b) the encoding in the XML declaration in the feed itself.

Once I have the encoding I use iconv to move it into UTF-8.

So far so good.
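The lookup order described above might be sketched roughly as follows. This is not the poster's actual code: it assumes a newer Ruby (2.1+), uses `String#encode` in place of iconv, and every helper name (`charset_from_content_type`, `charset_from_xml_declaration`, `feed_to_utf8`) is hypothetical. It also assumes an ASCII-compatible feed, so a UTF-16 document or one starting with a byte-order mark would need extra handling.

```ruby
# (a) Pull the charset out of an HTTP Content-Type header value, if any.
def charset_from_content_type(header)
  return nil if header.nil?
  header[/charset\s*=\s*["']?([^\s;"']+)/i, 1]
end

# (b) Pull the encoding out of the XML declaration at the top of the body.
def charset_from_xml_declaration(body)
  head = body.b[0, 200] # look at the raw leading bytes only
  head[/\A\s*<\?xml[^>]*encoding\s*=\s*["']([A-Za-z0-9._\-]+)["']/i, 1]
end

# Convert the feed body to UTF-8 using the header first, then the XML
# declaration, then a fallback. Unknown charset names use the fallback too.
def feed_to_utf8(body, content_type = nil, fallback = "UTF-8")
  name = charset_from_content_type(content_type) ||
         charset_from_xml_declaration(body) ||
         fallback
  encoding = begin
    Encoding.find(name)
  rescue ArgumentError
    Encoding.find(fallback)
  end

  text = body.dup.force_encoding(encoding)
  unless encoding == Encoding::UTF_8
    text = text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
  end
  text.scrub("?") # replace anything still invalid rather than raising later
end

body = '<?xml version="1.0" encoding="ISO-8859-1"?>'.b + "<title>\xA35m deal</title>".b
puts feed_to_utf8(body)
```

Here the HTTP header takes priority over the XML declaration, matching the order in the answer. Feeds whose header and declaration disagree may need a different policy.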
