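Nokogiri, open-uri, and Unicode characters
I'm using Nokogiri and open-uri to grab the contents of the title tag on a web page, but I'm having trouble with accented characters. What is the best way to deal with these? Here's what I'm doing: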
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open(link))
title = doc.at_css("title")
At this point, the title looks like this:
Rag\303\271
instead of:
Ragù
How can I get Nokogiri to return the proper character (e.g. ù in this case)?
Here's an example URL:
http://www.epicurious.com/recipes/food/views/Tagliatelle-with-Duck-Ragu-242037
Comments (8)
Summary: When feeding UTF-8 to Nokogiri through open-uri, use
open(...).read
and pass the resulting string to Nokogiri.
Analysis: If I fetch the page using curl, the headers properly show
Content-Type: text/html; charset=UTF-8
and the file content includes valid UTF-8, e.g. "Genealogía de Jesucristo". But even with a magic comment on the Ruby file and the doc encoding set, the result is no good. open-uri does not appear to be at fault; this seems to be a Nokogiri issue when dealing with open-uri. It can be worked around by passing the HTML to Nokogiri as a raw string.
I was having the same problem, and the Iconv approach wasn't working. Nokogiri::HTML is an alias for Nokogiri::HTML.parse(thing, url, encoding, options). So you just need to do:
doc = Nokogiri::HTML(open(link).read, nil, 'utf-8')
and it will convert the page encoding properly to UTF-8. You'll see Ragù instead of Rag\303\271.
When you say "looks like this," are you viewing this value in IRB? IRB escapes characters outside the ASCII range with C-style escapes of the byte sequences that represent them.
If you print them with puts, you'll get them back as you expect, presuming your shell console uses the same encoding as the string in question (apparently UTF-8 in this case, based on the two bytes returned for that character). If you are storing the values in a text file, printing to a file handle should also produce UTF-8 sequences.
If you need to translate between UTF-8 and other encodings, the specifics depend on whether you're on Ruby 1.9 or 1.8.6.
For 1.9: http://blog.grayproductions.net/articles/ruby_19s_string
For 1.8, you probably need to look at Iconv.
Also, if you need to interact with COM components on Windows, you'll need to tell Ruby to use the correct encoding.
If you're interacting with MySQL, you'll need to set the collation on the table to one that supports the encoding you're working with. In general, it's best to set the collation to UTF-8, even if some of your content comes back in other encodings; you'll just need to convert as necessary.
Nokogiri has some features for dealing with different encodings (probably through Iconv), but I'm a little out of practice with that, so I'll leave the explanation to someone else.
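A small sketch of the point about escaped output, using only core Ruby (1.9 or later); the variable names are illustrative. It suggests that \303\271 is not corruption but the two-byte UTF-8 encoding of ù, displayed in escaped form because the value was inspected rather than printed.

```ruby
# The same bytes the question shows, written with octal escapes and
# tagged as binary so Ruby treats them as raw bytes.
raw = "Rag\303\271".b

# inspect (which IRB uses to echo values) escapes the non-ASCII bytes.
# Modern Ruby shows them in hex; Ruby 1.8 showed them in octal.
escaped = raw.inspect

# Reinterpreting the very same bytes as UTF-8 yields four characters.
text = raw.dup.force_encoding(Encoding::UTF_8)

puts escaped        # escaped form, as IRB would echo it
puts text           # the actual characters, if the terminal is UTF-8
puts text.length    # number of characters
puts text.bytesize  # number of bytes
```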
Try setting Nokogiri's encoding option.
Changing Nokogiri::HTML(...) to Nokogiri::HTML5(...) fixed issues I was having with parsing certain special characters, specifically em-dashes.
(The accented characters in your link came through fine in both, so I don't know whether this would help you with that.)
You need to convert the response from the website being scraped (here epicurious.com) into UTF-8 encoding.
According to the HTML of the page being scraped, its encoding is currently ISO-8859-1, so you need to convert it from that.
Read more about it here: http://www.quarkruby.com/2009/9/22/rails-utf-8-and-html-screen-scraping
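A sketch of such a conversion using String#encode from Ruby 1.9 and later (on 1.8 the equivalent would go through Iconv). The sample bytes are made up, and ISO-8859-1 is taken from the answer above rather than detected from the response.

```ruby
# Hypothetical response body in ISO-8859-1: "\xF9" is the Latin-1 byte for "ù".
body = "<title>Duck Rag\xF9</title>".b

# Label the bytes with the encoding they are really in ...
body.force_encoding(Encoding::ISO_8859_1)

# ... then transcode to UTF-8. Unlike force_encoding, encode rewrites the bytes.
utf8_body = body.encode(Encoding::UTF_8)

puts utf8_body
puts utf8_body.valid_encoding?
```

force_encoding only changes the label on the bytes, while encode produces a new byte sequence, which is why both steps appear here.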
Just to add a cross-reference, this SO page gives some related information:
How to make Nokogiri transparently return un/encoded Html entities untouched?
Tip: you could also use the Scrapifier gem to get metadata, such as the page title, from URIs in a very simple way. The data is all encoded in UTF-8.
Check it out: https://github.com/tiagopog/scrapifier
Hope it's useful for you.