Nokogiri, open-uri, and Unicode characters

Published 2024-08-27

I'm using Nokogiri and open-uri to grab the contents of the title tag on a webpage, but am having trouble with accented characters. What's the best way to deal with these? Here's what I'm doing:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open(link))
title = doc.at_css("title")

At this point, the title looks like this:

Rag\303\271

Instead of:

Ragù

How can I have Nokogiri return the proper character (e.g. ù in this case)?

Here's an example URL:

http://www.epicurious.com/recipes/food/views/Tagliatelle-with-Duck-Ragu-242037

Comments (8)

哭了丶谁疼 2024-09-03 23:41:54

Summary: When feeding UTF-8 to Nokogiri through open-uri, use open(...).read and pass the resulting string to Nokogiri.

Analysis:
If I fetch the page using curl, the headers properly show Content-Type: text/html; charset=UTF-8 and the file content includes valid UTF-8, e.g. "Genealogía de Jesucristo". But even with a magic comment on the Ruby file and setting the doc encoding, it's no good:

# encoding: UTF-8
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI'))
doc.encoding = 'utf-8'
h52 = doc.css('h5')[1]
puts h52.text, h52.text.encoding
#=> Genealogà a de Jesucristo
#=> UTF-8

We can see that this is not the fault of open-uri:

html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
gene = html.read[/Gene\S+/]
puts gene, gene.encoding
#=> Genealogía
#=> UTF-8

This is a Nokogiri issue when dealing with open-uri, it seems. This can be worked around by passing the HTML as a raw string to Nokogiri:

# encoding: UTF-8
require 'nokogiri'
require 'open-uri'

html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
doc = Nokogiri::HTML(html.read)
doc.encoding = 'utf-8'
h52 = doc.css('h5')[1].text
puts h52, h52.encoding, h52 == "Genealogía de Jesucristo"
#=> Genealogía de Jesucristo
#=> UTF-8
#=> true

囍笑 2024-09-03 23:41:54

I was having the same problem and the Iconv approach wasn't working. Nokogiri::HTML is an alias for Nokogiri::HTML.parse(thing, url, encoding, options).

So, you just need to do:

doc = Nokogiri::HTML(open(link).read, nil, 'utf-8')

and it'll convert the page encoding properly to utf-8. You'll see Ragù instead of Rag\303\271.

久夏青 2024-09-03 23:41:54

When you say "looks like this," are you viewing this value in IRB? It's going to escape non-ASCII range characters with C-style escaping of the byte sequences that represent the characters.

If you print them with puts, you'll get them back as you expect, presuming your shell console is using the same encoding as the string in question (Apparently UTF-8 in this case, based on the two bytes returned for that character). If you are storing the values in a text file, printing to a handle should also result in UTF-8 sequences.
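A small sketch of that distinction (the variable names are made up, and the octal rendering mimics what Ruby 1.8's inspect showed):

```ruby
title = "Ragù"

# Ruby 1.8's String#inspect rendered each non-ASCII byte as an octal
# escape, which is exactly the "Rag\303\271" the question saw: the two
# bytes 0xC3 0xB9 (octal 303 271) are the UTF-8 encoding of ù.
octal = title.bytes.map { |b| b < 0x80 ? b.chr : format('\%03o', b) }.join
puts octal  # => Rag\303\271

# puts writes the raw bytes, so a UTF-8 terminal shows the character:
puts title  # => Ragù
```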

If you need to translate between UTF-8 and other encodings, the specifics depend on whether you're in Ruby 1.9 or 1.8.6.

For 1.9: http://blog.grayproductions.net/articles/ruby_19s_string
For 1.8, you probably need to look at Iconv.
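On 1.9 and later, most such conversions are one call to String#encode; a minimal round trip between UTF-8 and ISO-8859-1 (the example values are invented):

```ruby
utf8 = "Ragù"

# Transcode to Latin-1: ù (U+00F9) becomes the single byte 0xF9 (249).
latin = utf8.encode('ISO-8859-1')
puts latin.bytes.inspect  # => [82, 97, 103, 249]

# And back again, losslessly for any character Latin-1 can represent.
back = latin.encode('UTF-8')
puts back == utf8         # => true
```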

Also, if you need to interact with COM components in Windows, you'll need to tell ruby to use the correct encoding with something like the following:

require 'win32ole'

WIN32OLE.codepage = WIN32OLE::CP_UTF8

If you're interacting with MySQL, you'll need to set the collation on the table to one that supports the encoding that you're working with. In general, it's best to set the collation to UTF-8, even if some of your content is coming back in other encodings; you'll just need to convert as necessary.

Nokogiri has some features for dealing with different encodings (probably through Iconv), but I'm a little out of practice with that, so I'll leave explanation of that to someone else.

征棹 2024-09-03 23:41:54

Try setting the encoding option of Nokogiri, like so:

require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open(link))
doc.encoding = 'utf-8'
title = doc.at_css("title")

尤怨 2024-09-03 23:41:54

Changing Nokogiri::HTML(...) to Nokogiri::HTML5(...) fixed issues I was having with parsing certain special characters, specifically em-dashes.

(The accented characters in your link came through fine in both, so don't know if this would help you with that.)

EXAMPLE:

url = 'https://www.youtube.com/watch?v=4r6gr7uytQA'

doc = Nokogiri::HTML(open(url))
doc.title
=> "Josh Waitzkin â\u0080\u0094 How to Cram 2 Months of Learning into 1 Day | The Tim Ferriss Show - YouTube"

doc = Nokogiri::HTML5(open(url))
doc.title
=> "Josh Waitzkin — How to Cram 2 Months of Learning into 1 Day | The Tim Ferriss Show - YouTube"

违心° 2024-09-03 23:41:54

You need to convert the response from the website being scraped (here epicurious.com) into utf-8 encoding.

As per the HTML content of the page being scraped, it's "ISO-8859-1" at the moment. So, you need to do something like this:

require 'iconv'
doc = Nokogiri::HTML(Iconv.conv('utf-8//IGNORE', 'ISO-8859-1', open(link).read))

Read more about it here: http://www.quarkruby.com/2009/9/22/rails-utf-8-and-html-screen-scraping

于我来说 2024-09-03 23:41:54

Just to add a cross-reference, this SO page gives some related information:

How to make Nokogiri transparently return un/encoded Html entities untouched?

无所的.畏惧 2024-09-03 23:41:54

Tip: you could also use the Scrapifier gem to fetch metadata, such as the page title, from URIs in a very simple way. The data is all UTF-8 encoded.

Check it out: https://github.com/tiagopog/scrapifier

Hope it's useful for you.
