open-uri + hpricot/nokogiri doesn't parse HTML correctly
I'm trying to parse a web page using open-uri + hpricot, but something seems to go wrong in the parsing process: the gems don't return the elements I want.
Specifically, I want to get this div (whose id is 'pasajes') at this URL:
I wrote this code:
require 'nokogiri'
require 'hpricot'
require 'open-uri'
document = Hpricot(open('http://www.despegar.com.ar/')) # WITH HPRICOT
document2 = Nokogiri::HTML(open('http://www.despegar.com.ar/')) # WITH NOKOGIRI
pasajes = document.search("//div[@id='pasajes']")
pasajes2 = document2.xpath("//div[@id='pasajes']")
But it returns NOTHING! I've tried lots of things in both hpricot and nokogiri:
- I tried giving the absolute path to that div
- I tried a CSS path with selectors
- I tried the hpricot search shortcut (doc//"div#pasajes")
- Almost every possible relative path to reach the 'pasajes' div
Finally I found a horrible solution. I used the watir library and, after opening a web browser, passed the HTML to hpricot. This way hpricot DOES RECOGNIZE the 'pasajes' div. But I don't want to open a web browser just for parsing purposes...
What am I doing wrong? Is open-uri misbehaving? Is hpricot?
Comments (4)
There's no DIV with the id pasajes in the static HTML page. If you are running *nix you can see that by doing:
My guess is that it's JavaScript-generated.
If you are using MacRuby you could try Lyndon.
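One way to run the same kind of check from Ruby, as a rough sketch: look for the id in the raw markup exactly as the server sends it, before any JavaScript runs. The helper names `fetch_static_html` and `mentions_id?` are made up for illustration, and the regex is only a crude textual test, not an HTML parser.

```ruby
require 'open-uri'

# Hypothetical helper: download the page as the server sends it,
# with no JavaScript executed. URI.open needs Ruby 2.5 or newer.
def fetch_static_html(url)
  URI.open(url).read
end

# Hypothetical helper: crude textual check for an id="..." attribute
# with exactly this value somewhere in the raw markup.
def mentions_id?(html, id)
  html.match?(/(?<![\w-])(?i:id)\s*=\s*["']?#{Regexp.escape(id)}(?=["'\s>])/)
end

# Usage (needs network access):
#   html = fetch_static_html('http://www.despegar.com.ar/')
#   puts mentions_id?(html, 'pasajes')
```

If this prints `false` for the downloaded page while the div is visible in a browser, the element is most likely being added after load, which no static parser will see.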
There's no div with id 'pasajes' in that page. That's the problem.
This fits better as a comment on Jonas' answer above than as an answer in itself... But I am new to SO and do not have "commenting powers" yet :)
You can use Selenium RC to download the full HTML and then run nokogiri on the downloaded file. Note that this works only if the content is generated/modified by JavaScript. If the web page depends on cookies to set up the content, your options would be Selenium (in the browser) or watir, as you have noted.
I would love to hear a better solution to this (I want to parse a web page with nokogiri, but the page is modified by JS).
I ran into a similar issue with Nokogiri, but on OS X 10.5. However, I first tried open-uri to open the pages in question, which have lots of HTML: div, p, whatever. I found that by using:
I would see lots of wonderful HTML. I also found that by reading the "file" into a string and passing that string to Nokogiri, I could get it to work fine. I even had to modify the very demo they use on rubyforge to teach you about Nokogiri.
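A minimal sketch of that workaround, with the parser handed in as a callable so the idea stays independent of any one library: read the response fully into a String, then parse the String rather than the IO object. `parse_via_string` is a hypothetical name, not part of Nokogiri or open-uri, and the URL in the usage comment is a placeholder.

```ruby
require 'stringio'

# Hypothetical helper: drain the IO-like object into a String first,
# then pass that String (not the IO) to the parser.
def parse_via_string(io, parser)
  html = io.read
  parser.call(html)
end

# Usage with Nokogiri (assumes the nokogiri gem and network access):
#   require 'open-uri'
#   require 'nokogiri'
#   doc = parse_via_string(URI.open('http://example.com/'),
#                          Nokogiri::HTML.method(:parse))
```

Whether this changes the result depends on the setup; it only guarantees that the parser receives the complete document as a String.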
Using their own example I get this:
YUCK!
If I tweak it to read the URL into a string, I get good stuff:
Note
I do see this lovely warning when I play in irb:
But I am not in the mood to deal with the horrors and the various expert but contradictory advice on fixing libxml in /usr/local, blah blah. A post on link text has a great explanation of it, but then another *nix wizard attacks the very concept with some sound warnings and concerns. So I say, "no way".
Why do I write this? Because I think there might be a link between my Nokogiri blues and the libxml warning. OS X 10.5 ships old stuff, and that may be causing issues.
QUESTION
Do any other OS X 10.5 users have this issue with Nokogiri?