Why doesn't Nokogiri load the whole page?


I'm using Nokogiri to open Wikipedia pages about various countries, and then extracting the names of these countries in other languages from the interwiki links (links to foreign-language wikipedias). However, when I try to open the page for France, Nokogiri does not download the full page. Maybe the page is too large; in any case, it doesn't contain the interwiki links that I need. How can I force it to download the whole page?

Here's my code:

url = "http://en.wikipedia.org/wiki/" + country_name
page = nil
begin
  page = Nokogiri::HTML(open(url))
rescue   OpenURI::HTTPError=>e
  puts "No article found for " + country_name
end

language_part = page.css('div#p-lang')

Test:

with country_name = "France"
=> []

with country_name = "Thailand"
=> really long array that I don't want to quote here,
   but containing all the right data

Maybe this issue goes beyond Nokogiri and into OpenURI; in any case, I need to find a solution.
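(For reference, once language_part is non-empty, pulling the names out would look something like the sketch below, assuming the Wikipedia markup of the time, where div#p-lang holds one li per language with a link inside:)

language_part.css('li a').each do |link|
  # The link text is the article title in the target language.
  puts link.text
end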

3 Answers

情独悲 2024-11-25 11:09:25


Nokogiri does not retrieve the page; it asks OpenURI to do that, with an internal read on the StringIO object that Open::URI returns.

require 'open-uri'
require 'zlib'

stream = open('http://en.wikipedia.org/wiki/France')
if stream.content_encoding.empty?
  body = stream.read
else
  # The server replied with gzip even though we didn't ask for it; inflate.
  body = Zlib::GzipReader.new(stream).read
end

p body

Here's what you can key off of:

>> require 'open-uri' #=> true
>> open('http://en.wikipedia.org/wiki/France').content_encoding #=> ["gzip"]
>> open('http://en.wikipedia.org/wiki/Thailand').content_encoding #=> []

In this case, if content_encoding is [] the body arrived as plain text/html and can be read directly. If it's ["gzip"] it needs decoding first.

Doing all the stuff above and tossing the result to:

require 'nokogiri'
page = Nokogiri::HTML(body)
language_part = page.css('div#p-lang')

should get you back on track.

Do this after all the above to confirm visually you're getting something usable:

p language_part.text.gsub("\t", '')

See Casper's answer and comments about why you saw two different results. Originally it looked like Open-URI was inconsistent in its processing of the returned data, but based on what Casper said, and what I saw using curl, Wikipedia isn't honoring the "Accept-Encoding" header for large documents and returns gzip. That is fairly safe with today's browsers but clients like Open-URI that don't automatically sense the encoding will have problems. That's what the code above should help fix.
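Putting the pieces together, the fetch-and-decode logic can live in one helper. This is only a sketch combining the snippets above; the fetch_page name is made up for illustration:

require 'open-uri'
require 'zlib'
require 'nokogiri'

# Illustrative helper: fetch a URL, inflate the body if the server
# gzipped it, and hand the result to Nokogiri.
def fetch_page(url)
  stream = open(url)
  body = if stream.content_encoding.include?('gzip')
           Zlib::GzipReader.new(stream).read
         else
           stream.read
         end
  Nokogiri::HTML(body)
end

page = fetch_page('http://en.wikipedia.org/wiki/France')
puts page.css('div#p-lang li').size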

呢古 2024-11-25 11:09:25


After quite a bit of head-scratching, the problem is here:

> wget -S 'http://en.wikipedia.org/wiki/France'
Resolving en.wikipedia.org... 91.198.174.232
Connecting to en.wikipedia.org|91.198.174.232|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.0 200 OK
  Content-Language: en
  Last-Modified: Fri, 01 Jul 2011 23:31:36 GMT
  Content-Encoding: gzip <<<<------ BINGO!
  ...

You need to unpack the gzipped data, which open-uri does not do automatically.
Solution:

require 'net/http'
require 'uri'
require 'zlib'
require 'stringio'

def http_get(uri)
  url = URI.parse(uri)

  res = Net::HTTP.start(url.host, url.port) { |h|
    h.get(url.path)
  }

  # Wikipedia may gzip the body even though we sent no Accept-Encoding;
  # check the response header and inflate when necessary.
  headers = res.to_hash
  gzipped = headers['content-encoding'] && headers['content-encoding'][0] == "gzip"
  gzipped ? Zlib::GzipReader.new(StringIO.new(res.body)).read : res.body
end

And then:

page = Nokogiri::HTML(http_get("http://en.wikipedia.org/wiki/France"))
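To tie this back to the original loop, usage might look like the following; the two country names are just the ones from the question:

require 'nokogiri'

%w[France Thailand].each do |country|
  page = Nokogiri::HTML(http_get("http://en.wikipedia.org/wiki/" + country))
  puts "#{country}: #{page.css('div#p-lang li').size} interwiki links"
end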
热血少△年 2024-11-25 11:09:25
require 'open-uri'
require 'zlib'

# Wrapper method (the open_gzipped name is illustrative): request gzip
# explicitly, then hand back something that still behaves like the
# open-uri response.
def open_gzipped(url)
  open(url, 'Accept-Encoding' => 'gzip, deflate') do |response|
    if response.content_encoding.include?('gzip')
      response = Zlib::GzipReader.new(response)
      # GzipReader lacks parts of open-uri's interface; forward unknown
      # calls to the underlying stream.
      response.define_singleton_method(:method_missing) do |name, *args|
        to_io.public_send(name, *args)
      end
    end

    yield response if block_given?

    response
  end
end
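Usage, reading inside the block before open-uri closes the underlying stream (open_gzipped is just the wrapper name used above, not a library method):

require 'nokogiri'

open_gzipped('http://en.wikipedia.org/wiki/France') do |response|
  page = Nokogiri::HTML(response.read)
  puts page.css('div#p-lang li').size
end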