Nokogiri and Mechanize problem

Posted on 2024-10-20 16:25:48

I am working through one of the examples on the Mechanize documentation site, and I want to parse the results using Nokogiri.

My problem is that when the following line gets executed:

doc = Nokogiri::HTML(search_results, 'UTF-8' )

the following error occurs:

C:/Ruby192/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.4.1-x86-mingw32/lib/nokogiri/html/document.rb:71:in `parse': undefined method `name' for "UTF-8":String (NoMethodError)
    from C:/Ruby192/lib/ruby/gems/1.9.1/gems/nokogiri-1.4.4.1-x86-mingw32/lib/nokogiri/html.rb:13:in `HTML'
    from mechanize_test.rb:16:in `<main>'

I have installed Ruby 1.9 on a Windows Vista machine.

The results returned by Mechanize are non-Latin (UTF-8).

The code sample follows.

# encoding: UTF-8

 require 'rubygems'
 require 'mechanize'
 require 'nokogiri'

 agent = Mechanize.new
 agent.user_agent_alias = 'Mac Safari'
 page = agent.get("http://www.google.com/")
 search_form = page.form_with(:name => "f")
 search_form.field_with(:name => "q").value = "invitations"
 search_results = agent.submit(search_form)
 puts search_results.body

 doc = Nokogiri::HTML(search_results, 'UTF-8')


Comments (2)

我也只是我 2024-10-27 16:25:48

@Douglas Drouillard

Thanks for looking into this. I found out I made a mistake. The call to Nokogiri should have been:

doc = Nokogiri::HTML(search_results.body, 'UTF-8')

Note that search_results is different from search_results.body.

search_results contains the info coming right out of the Mechanize call,
while search_results.body contains the UTF-8 HTML that Nokogiri can parse with no problem.

十年九夏 2024-10-27 16:25:48

This appears to be an issue with what Nokogiri expects as parameters to the parse method that is being called. The first issue I see is that you are passing the encoding option in the wrong parameter slot.

A parsing example from the Nokogiri project page that specifies the encoding:

Nokogiri.XML('<foo><bar /></foo>', nil, 'EUC-JP')

Notice the encoding is the third parameter, not the second. But that still does not fully explain the behavior you are seeing, as the encoding should simply be ignored.

Per the Nokogiri documentation, a call to Nokogiri::HTML() is a convenience wrapper around the parse method.

Code for Nokogiri::HTML::parse

   def parse thing, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML, &block
      Document.parse(thing, url, encoding, options, &block)
   end
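To make the parameter-slot point concrete, here is a minimal sketch. The method parse_sketch is hypothetical: it only mirrors the order of the first three parameters in the signature above and does no parsing, so it is an illustration of Ruby's positional argument binding rather than Nokogiri code.

```ruby
# Hypothetical stand-in that mirrors only the parameter order of the
# parse signature quoted above; it does no parsing at all.
def parse_sketch(thing, url = nil, encoding = nil)
  { thing: thing, url: url, encoding: encoding }
end

# 'UTF-8' in the second slot is bound to url, so encoding stays nil.
wrong = parse_sketch('<p>hi</p>', 'UTF-8')

# Passing nil for url puts 'UTF-8' into the encoding slot.
right = parse_sketch('<p>hi</p>', nil, 'UTF-8')

puts wrong.inspect
puts right.inspect
```

Under that assumption, a call written as Nokogiri::HTML(html, 'UTF-8') would be handing 'UTF-8' over as the URL, which is why the example from the project page passes nil in the second position.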

The source for the Nokogiri::HTML::Document parse method is a bit long, but here is the relevant part:

 if string_or_io.respond_to?(:encoding)
   unless string_or_io.encoding.name == "ASCII-8BIT"
      encoding ||= string_or_io.encoding.name
   end
 end

Notice string_or_io.encoding.name; this matches the error you saw, undefined method 'name' for "UTF-8":String (NoMethodError).

Does your search_results object have an attribute with a key-value pair of {:encoding => 'UTF-8'}? It appears Nokogiri expects encoding to return an object that in turn has a name attribute of 'UTF-8'.
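The failure can be reproduced without Nokogiri or Mechanize. In the sketch below, FakePage and detect_encoding are hypothetical names: FakePage stands in for an object whose encoding method returns a plain String (an assumption about what the Mechanize result does), and detect_encoding mirrors the check quoted above rather than copying the library source.

```ruby
# Hypothetical stand-in for an object whose #encoding returns a plain
# String (an assumption about how the Mechanize result behaves).
class FakePage
  def encoding
    'UTF-8'
  end
end

# Mirrors the check quoted above; this is not the actual Nokogiri source.
def detect_encoding(string_or_io, encoding = nil)
  if string_or_io.respond_to?(:encoding)
    unless string_or_io.encoding.name == 'ASCII-8BIT'
      encoding ||= string_or_io.encoding.name
    end
  end
  encoding
end

# A real String returns an Encoding object, which responds to #name.
puts detect_encoding('<p>hi</p>'.encode('UTF-8'))

# The stand-in returns a String, and String has no #name method.
begin
  detect_encoding(FakePage.new)
rescue NoMethodError => e
  puts "#{e.class}: #{e.message}"
end
```

If the assumption about the result object holds, this would explain why passing search_results.body (a String) works while passing search_results itself raises NoMethodError.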
