HPricot css search: how to select a parent/ancestor of a specific element using a string selector?
I'm using HPricot's css search to identify a table within a web page. Here's a sample html snippet I'm parsing:
<table height=61 width=700>
<tbody>
<tr>
<td><font size=3pt color = 'Blue'><b><A NAME=a1>Some header text</A></b></font></td></tr>
...
</tbody></table>
There are lots of tables in the page. I want to find the table which contains the A Name=a1 reference.
Right now, the way I'm doing it is
(page/"a[@name=a1]")[0].parent.parent.parent.parent.parent
I don't like this because
- It is ugly
- It is error prone (what if the folks who maintain the web page remove the tbody?)
Is there a way to tell hpricot to get me the table ancestor of the specified element?
Edit: Here's the full blown page I'm parsing: http://www.blonnet.com/businessline/scoboard/a.htm
The bits I'm interested in are the two tables, one with quarterly results and another with the annual results. Right now, the way I'm extracting those tables is by finding the anchors and moving up from there.
Comments (2)
Rohith is right. It is ugly and it is error prone (more than it needs to be). As he says, the intent is much clearer if you can say "find the closest parent that is a table", and this could go for any child/parent relationship.
If it's "not possible" to do that with hpricot then just say so. But don't just say "it's hopeless to try to do that anyway what's the point". That's a bogus answer. It also doesn't help the next person who comes along (myself) looking for the answer to the same question but for different reasons, which is parsing many pages where differences are ASSUMED and not just feared.
To actually answer the question... I don't know, yet. And I don't have much hope of finding out with hpricot. The documentation is absolutely horridly nonexistent.
But here's a workaround that does about the same thing.
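One way such a workaround might look, as a minimal sketch rather than the answerer's original code: walk up the parent chain until a node with the wanted tag name turns up. The helper name closest_ancestor and the Node struct are hypothetical; the sketch only assumes that each element responds to name and parent, which is how Hpricot elements are used in the question.

```ruby
# Hypothetical stand-in for a parsed element: anything with #name and #parent works.
Node = Struct.new(:name, :parent)

# Walk up the parent chain and return the nearest ancestor whose tag name
# matches `tag` (case-insensitive), or nil if the chain runs out first.
def closest_ancestor(node, tag)
  current = node.parent
  while current
    if current.respond_to?(:name) && current.name.to_s.downcase == tag.downcase
      return current
    end
    current = current.respond_to?(:parent) ? current.parent : nil
  end
  nil
end

# Build the chain from the question: table > tbody > tr > td > font > b > A
table = Node.new('table', nil)
tbody = Node.new('tbody', table)
tr    = Node.new('tr', tbody)
td    = Node.new('td', tr)
font  = Node.new('font', td)
bold  = Node.new('b', font)
a     = Node.new('A', bold)

found = closest_ancestor(a, 'table')
puts found.name
```

Because the loop stops at the first table it meets, it does not matter whether a tbody sits in between, which addresses the "what if they remove the tbody" worry.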
Without seeing the whole page it's hard to give a definitive answer, but often the way you're going about it is the right answer. You have to find a decent landmark, then navigate from there, and if it involves backing up the chain then that's what you do.
You might be able to use XPath to find the table then look inside it for the link, but that doesn't really improve things, it only changes them. Firebug, the Firefox plugin, makes it easy to get the XPath to an element in the page, so you could find the table in question and have Firebug show you the path, or just copy it by right-clicking on the node in the XPath display, and paste that into your lookup.
"It is ugly", well, maybe, but not all code is beautiful or elegant because not all problems lend themselves to beautiful and/or elegant solutions. Sometimes we have to be happy with "it works". As long as it works reliably and you know why then you're ahead of many other coders.
"... what if the folks who maintain the web page remove the tbody?", almost all parsing of HTML or XML suffers from the same concern because we're not in control of the source. You write your code as best as you can, comment the spots that are likely to fail if content changes, then cross your fingers and move on. Even if you were parsing tabular data from a TPS report you could run into the same problem.
The only thing I'd suggest doing differently is to use % (AKA "at") instead of / (AKA search). % returns only the first occurrence, so you can drop the [0] index. Or use the XPath engine to step back up the chain, which should be a little faster if speed is a consideration.
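As a rough illustration of stepping back up the chain with XPath, here is a sketch that uses REXML from Ruby's standard library as a stand-in parser, so it runs without Hpricot installed. The fixture is a hypothetical, well-formed version of the question's snippet without tbody. Whether Hpricot's own XPath support accepts the same expressions, in particular the ancestor axis, is not confirmed here.

```ruby
require 'rexml/document'

# Hypothetical well-formed stand-in for the question's snippet (no tbody).
snippet = <<~XML
  <body>
    <table height="10" width="100"><tr><td>another table</td></tr></table>
    <table height="61" width="700">
      <tr>
        <td><font size="3pt" color="Blue"><b><a name="a1">Some header text</a></b></font></td>
      </tr>
    </table>
  </body>
XML

doc = REXML::Document.new(snippet)

# Find the anchor once, then climb a fixed number of levels with "..".
anchor = REXML::XPath.first(doc, "//a[@name='a1']")
by_steps = REXML::XPath.first(anchor, "../../../../..")

# Or ask for the nearest table ancestor, however deep the anchor sits.
by_axis = REXML::XPath.first(doc, "//a[@name='a1']/ancestor::table[1]")

puts by_steps.attributes['width']
puts by_axis.attributes['width']
```

The fixed chain of ".." steps breaks as soon as a level is added or removed, while the ancestor axis does not depend on the depth.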
If you know that the target table is the only one with that width and height, you can use a more specific xpath:
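A more specific XPath of that kind might look like the following sketch. It again uses REXML from the standard library as a stand-in parser and a hypothetical well-formed fixture; the width and height values are the ones from the question's snippet.

```ruby
require 'rexml/document'

# Hypothetical well-formed fixture with two tables of different sizes.
snippet = <<~XML
  <body>
    <table height="10" width="100"><tr><td><a name="other">Other</a></td></tr></table>
    <table height="61" width="700">
      <tr><td><b><a name="a1">Some header text</a></b></td></tr>
    </table>
  </body>
XML

doc = REXML::Document.new(snippet)

# Select the table directly by its attributes instead of climbing from the anchor.
target = REXML::XPath.first(doc, "//table[@width='700' and @height='61']")

# Confirm it is the right one by listing the anchors inside it.
anchor_names = REXML::XPath.match(target, './/a').map { |a| a.attributes['name'] }
puts anchor_names.inspect
```

This only helps while the page keeps those exact dimensions, so it trades one fragile assumption for another.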
I recommend Nokogiri over Hpricot.
You can also use XPath from the top of the document down:
Basically the XPath pattern means:
Note: Firefox automatically adds the <tbody> tag to the source, even if it wasn't there in the HTML file received. That can really mess you up when you use Firefox to view the source while developing your own XPaths.
The other table you are after is /html/body/table[2]/tbody/tr/td[2]/table[3] according to Firefox, so you have to strip the tbody. Also, you don't need to anchor at /html.
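Stripping tbody from a path copied out of Firefox can be done mechanically. The sketch below is plain string handling; the helper names strip_tbody and unanchor are hypothetical, and replacing the /html anchor with a leading // is just one assumed way to relax the path.

```ruby
# Remove every tbody step (with or without a positional predicate) from an
# XPath copied out of Firefox, since the raw HTML may not contain tbody at all.
def strip_tbody(xpath)
  xpath.gsub(%r{/tbody(\[\d+\])?(?=/|\z)}, '')
end

# Replace the leading /html anchor with a descendant search starting at body.
def unanchor(xpath)
  xpath.sub(%r{\A/html(?=/)}, '/')
end

firefox_path = '/html/body/table[2]/tbody/tr/td[2]/table[3]'
cleaned = strip_tbody(firefox_path)
relaxed = unanchor(cleaned)

puts cleaned  # /html/body/table[2]/tr/td[2]/table[3]
puts relaxed  # //body/table[2]/tr/td[2]/table[3]
```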