How can I safely parse multibyte feeds in Ruby/Rails?
(Sorry if a newb question...I've done quite a bit of research, honestly...)
I'm writing some Ruby on Rails code to parse RSS/ATOM feeds. My code is choking on a pesky '£' symbol.
I've been trying the approach of normalizing the description and title fields of the feeds before doing anything else:
descr = self.description.mb_chars.normalize(:kc)
However, when it hits the string with the '£', I'm guessing that mb_chars hits a problem and returns a regular Ruby String object. I get the error:
undefined method `normalize' for #<String:0x5ef8490>
So what is the best process to defensively prep these strings for insertion into the database? (I need to do a bunch of string processing on them as well)
My problem is compounded in that I don't know the format of the feed I'm processing. For instance, I've had some luck with the following line:
descr = Iconv.new('UTF-8//IGNORE', 'UTF-8').iconv descr
However, when it encounters the '£' it simply truncates everything after that point.
When I display the '£' symbol with String#inspect, it shows up as '\243'. Failing a method to 'correctly' deal with this symbol, I'd be happy enough to replace it with another value (like 'GBP'). So help with that code would be appreciated as well.
The feed in question is http://www.dailymail.co.uk/sport/football/index.rss
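For reference, '\243' is the octal escape for the single byte 0xA3. That byte is '£' in ISO-8859-1 (Latin-1), but on its own it is not a valid UTF-8 sequence, which would be consistent with the symptoms above. The following is a minimal sketch, assuming Ruby 1.9 or later (where String#encode is available) and assuming the feed really is Latin-1, which the post does not confirm. The helper names latin1_to_utf8 and replace_pound_byte are hypothetical, made up for illustration.

```ruby
# Assumption: the raw bytes are ISO-8859-1. Reinterpret them and
# transcode to UTF-8 so 0xA3 becomes a proper '£'.
def latin1_to_utf8(raw)
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Fallback: swap the lone 0xA3 byte for a plain ASCII substitute.
# Only reasonable when the data is known to be single-byte; in real
# UTF-8 text 0xA3 can be part of a multibyte character. Any other
# high bytes are left untouched and may still be invalid UTF-8.
def replace_pound_byte(raw, substitute = "GBP")
  raw.b.gsub("\xA3".b, substitute.b).force_encoding(Encoding::UTF_8)
end

raw = "Fee: \243".b + "20m".b   # "\243" (octal) is the byte 0xA3

puts raw.bytes[5] == 0xA3                                    # the pound byte
puts raw.dup.force_encoding(Encoding::UTF_8).valid_encoding? # not valid UTF-8
puts latin1_to_utf8(raw)
puts replace_pound_byte(raw)
```

The transcoding route keeps the '£' intact; the byte replacement is only the 'GBP' fallback asked about above.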
Comments (2)
I've found one solution:
=> #
=> test-scz
Original post: https://rails.lighthouseapp.com/projects/8994/tickets/3504-string-parameterize-normalize-bug
I was missing something pretty basic - I was guessing at the encoding of the feed that was coming in.
So now I'm looking at (a) the charset in the HTTP response headers, then (b) the encoding in the XML declaration in the feed itself.
Once I have the encoding, I use iconv to convert the feed to UTF-8.
So far so good.
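The steps above (HTTP charset first, then the XML declaration, then conversion to UTF-8) might look roughly like this. It is a sketch, not the poster's actual code: it assumes Ruby 2.1 or later and uses String#encode and String#scrub in place of iconv, which is no longer part of the standard library. The method names detect_feed_encoding and feed_to_utf8, and the UTF-8 fallback, are assumptions made for illustration.

```ruby
# Hypothetical helper: pick an encoding name from (a) the Content-Type
# header, then (b) the XML declaration, else a fallback.
def detect_feed_encoding(content_type, body, fallback = "UTF-8")
  if content_type && content_type =~ /charset\s*=\s*["']?([\w.:-]+)/i
    return Regexp.last_match(1)
  end

  # Look only at the first bytes, treated as binary, so the match
  # cannot fail on bytes that are invalid in the string's encoding.
  head = body.byteslice(0, 200).b
  if head =~ /<\?xml[^>]*encoding\s*=\s*["']([\w.:-]+)["']/i
    return Regexp.last_match(1)
  end

  fallback
end

# Hypothetical helper: reinterpret the raw bytes in the detected
# encoding and transcode to UTF-8, replacing anything unconvertible.
def feed_to_utf8(content_type, body)
  name = detect_feed_encoding(content_type, body)
  source = begin
    Encoding.find(name)
  rescue ArgumentError
    Encoding::UTF_8 # unknown charset name: assume UTF-8
  end

  text = body.dup.force_encoding(source)
  text = text.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  text.scrub("?") # catches invalid bytes when the source was already UTF-8
end

body = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><title>\xA3" + "5m deal</title>"
puts feed_to_utf8(nil, body.b)
```

In a Rails app, content_type would come from the HTTP response (for example response["Content-Type"] with Net::HTTP) and body from the response body.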