Why doesn't Nokogiri load the whole page?
I'm using Nokogiri to open Wikipedia pages about various countries and then extract the names of those countries in other languages from the interwiki links (links to the foreign-language Wikipedias). However, when I try to open the page for France, Nokogiri does not download the full page. Maybe it's too large; in any case, it doesn't contain the interwiki links that I need. How can I force it to download everything?
Here's my code:
require 'open-uri'
require 'nokogiri'

url = "http://en.wikipedia.org/wiki/" + country_name
page = nil
begin
  page = Nokogiri::HTML(open(url))
rescue OpenURI::HTTPError => e
  puts "No article found for " + country_name
end
language_part = page.css('div#p-lang')
Test:
with country_name = "France"
=> []
with country_name = "Thailand"
=> really long array that I don't want to quote here,
but containing all the right data
Maybe this issue goes beyond Nokogiri and into OpenURI; either way, I need to find a solution.
Comments (3)
Nokogiri does not retrieve the page; it asks OpenURI to do it, with an internal read on the StringIO object that Open::URI returns.

Here's what you can key off of:
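A minimal sketch of that check, assuming open-uri and Ruby's standard Zlib (variable names are illustrative):

require 'open-uri'
require 'zlib'

stream = open('http://en.wikipedia.org/wiki/France')
# content_encoding is the list of Content-Encoding values the server sent,
# e.g. [] for a plain response or ["gzip"] for a compressed one
body = if stream.content_encoding.empty?
         stream.read
       else
         Zlib::GzipReader.new(stream).read
       end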
In this case, if it's [], AKA "text/html", it reads; if it's ["gzip"], it decodes.

Doing all the stuff above and tossing it to:
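For example, a sketch with body being the decoded string from above:

page = Nokogiri::HTML(body)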
should get you back on track.
Do this after all the above to confirm visually you're getting something usable:
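For example, a sketch reusing the selector from the question:

# eyeball the interwiki-links block to make sure it parsed as real HTML
puts page.css('div#p-lang').to_html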
See Casper's answer and comments about why you saw two different results. Originally it looked like Open-URI was inconsistent in its processing of the returned data, but based on what Casper said, and what I saw using curl, Wikipedia isn't honoring the "Accept-Encoding" header for large documents and returns gzip. That is fairly safe with today's browsers but clients like Open-URI that don't automatically sense the encoding will have problems. That's what the code above should help fix.
After quite a bit of head scratching, the problem is here:
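One way to see it, as a sketch using the metadata open-uri exposes (the France URL from the question is the one that triggers it):

require 'open-uri'

stream = open("http://en.wikipedia.org/wiki/France")
puts stream.content_encoding.inspect   # => ["gzip"] -- the body comes back compressed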
You need to unpack the gzipped data, which open-uri does not do automatically.
Solution:
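A minimal sketch, assuming the url variable from the question and Ruby's standard Zlib:

require 'open-uri'
require 'zlib'
require 'nokogiri'

# run the raw response through GzipReader before handing it to Nokogiri
# (this assumes the response really is gzip-encoded, as it is here)
page = Nokogiri::HTML(Zlib::GzipReader.new(open(url)).read)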
And then:
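Presumably the same lookup as in the question:

language_part = page.css('div#p-lang')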