open-uri + hpricot / nokogiri doesn't parse HTML correctly

Posted on 08-03 05:51

I'm trying to parse a web page using open-uri + hpricot, but something seems to go wrong in the parsing process, because the gems don't return what I want.

Specifically, I want to get this div (whose id is 'pasajes') at this URL:

http://www.despegar.com.ar

I wrote this code:

require 'nokogiri'
require 'hpricot'
require 'open-uri'

document = Hpricot(open('http://www.despegar.com.ar/')) # WITH HPRICOT
document2 = Nokogiri::HTML(open('http://www.despegar.com.ar/')) # WITH NOKOGIRI

pasajes = document.search("//div[@id='pasajes']")
pasajes2 = document2.xpath("//div[@id='pasajes']")

But it returns NOTHING! I've tried a lot of things in both hpricot and nokogiri:

I tried giving the absolute path to that div
I tried a CSS path with selectors
I tried the hpricot search shortcut (doc//"div#pasajes")
I tried almost every possible relative path to reach the 'pasajes' div

Finally I found a horrible solution. I used the watir library and, after opening a web browser, passed the HTML to hpricot. This way hpricot DOES RECOGNIZE the 'pasajes' div. But I don't want to open a web browser just for parsing purposes...

What am I doing wrong? Is open-uri misbehaving? Is hpricot?

染年凉城似染瑾2024-08-10 05:51:54

There's no DIV with the id pasajes in the static HTML page. If you are running *nix you can see that by doing:

curl http://www.despegar.com.ar/ | grep pasajes

My guess is that it's JavaScript-generated.

If you are using MacRuby you could try Lyndon.
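
The same check can be done from Ruby. This is only a minimal sketch using the standard library: the helper names `fetch_html` and `mentions_id?` are made up for illustration, and `URI.open` is the open-uri entry point on current Ruby versions (older versions used plain `open`, as in the question).

```ruby
require 'open-uri'

# Hypothetical helper: download the raw, static HTML. No JavaScript is run,
# so this is what hpricot or nokogiri would receive from open-uri.
def fetch_html(url)
  URI.open(url, &:read)
end

# Hypothetical helper: true if the raw markup contains id="..." for the given id.
# This is a plain text search, like grep; it does not parse the HTML.
def mentions_id?(html, id)
  html.match?(/id\s*=\s*["']#{Regexp.escape(id)}["']/i)
end

# Usage (needs network access):
#   html = fetch_html('http://www.despegar.com.ar/')
#   puts mentions_id?(html, 'pasajes')
```

If `mentions_id?` returns false for the downloaded page, no selector will find the div, because it is not in the markup the parser receives.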

萌能量女王2024-08-10 05:51:54

There is no div with the id 'pasajes' in that page. That's the problem.

若无相欠,怎会相见2024-08-10 05:51:54

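This would be better as an additional comment on Jonas's answer above than as an answer in itself... but I'm new to SO and don't have "comment rights" yet :)

You could use Selenium RC to download the full HTML and then use nokogiri on the downloaded file. Note that this only works if the content is generated/modified by JavaScript. If the web page relies on cookies to set up the content, your options would be Selenium (in a browser) or watir, as you pointed out.

I would love to hear a better solution (for parsing a web page with nokogiri when the page is modified by JS).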

菊凝晚露2024-08-10 05:51:54

I ran into a similar issue with Nokogiri, but on OS X 10.5. However, I first tried open-uri on its own to open the pages in question, which have lots of HTML (div, p, whatever). I found that by using:

urldoc = open('http://hivelogic.com/articles/using_usr_local')
urldoc.each_line{|line| puts line}

I would see lots of wonderful HTML. I also found that by reading the "file" into a string and passing that to Nokogiri, I could get it to work fine. I even had to modify the very demo they use on rubyforge to teach you about Nokogiri.

Using their own example I get this:

>> doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove'))
=> <!DOCTYPE html>

>> doc.children
=>

YUCK!

If I tweak it to read the URL into a string, I get good stuff:

>> doc = Nokogiri::HTML(open('http://www.google.com/search?q=tenderlove').read)
=> <!DOCTYPE html>
<html>
<head>
..... TONS OF HTML HERE ........
</div>
</body>
</html>

Note
I do see this lovely warning when I play around in irb:

HI. You're using libxml2 version 2.6.16 which is over 4 years old and has
plenty of bugs. We suggest that for maximum HTML/XML parsing pleasure, you
upgrade your version of libxml2 and re-install nokogiri. If you like using
libxml2 version 2.6.16, but don't like this warning, please define the constant
I_KNOW_I_AM_USING_AN_OLD_AND_BUGGY_VERSION_OF_LIBXML2 before requring nokogiri.

But I am not in the mood to deal with the horrors and the various expert but contradictory advice on fixing libxml in /usr/local, blah blah. A post on link text has a great explanation of it, but then another *nix wizard attacks the very concept with some sound warnings and concerns. So I say, "no way".

Why do I write this? Because I think there might be a link between my Nokogiri blues and the libxml warning. OS X 10.5 is on old stuff, and there may be issues with that.

QUESTION

Do any other OS X 10.5 users have this issue with Nokogiri?
