Unable to extract HTML table rows
I'm trying to extract all five rows listed in the table above.
I'm using the Ruby Hpricot library to extract the table rows with an XPath expression.
In my example, the XPath expression is /html/body/center/table/tr. Note that I've removed the tbody tag from the expression, which is usually what it takes for the extraction to succeed.
The weird thing is that I get the first three rows in the result, but the last two are missing. I have no idea what's going on.
EDIT: There's nothing magic about the code; I'm just attaching it on request.
require 'open-uri'
require 'hpricot'

# Fetch the page and parse it with Hpricot
faculty = Hpricot(open("http://www.utm.utoronto.ca/7800.0.html"))

# Print every <tr> matched by the XPath expression
(faculty/"/html/body/center/table/tr").each do |row|
  puts row.to_s
end
Comments (2)
The HTML document in question is invalid. (See http://validator.w3.org/check?uri=http%3A%2F%2Fwww.utm.utoronto.ca%2F7800.0.html.) Hpricot parses it differently from your browser, hence the different results, but it can't really be blamed: until HTML5, there was no standard for how to parse invalid HTML documents.
I tried replacing Hpricot with Nokogiri, and it seems to give the expected parse.
Maybe you should switch?
The path table/tr does not exist. It's table/tbody/tr or table//tr. When you use table/tr, you're specifically looking for a <tr> that is a direct child of <table>, but from your image, this isn't how the markup is structured.
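To make the difference between the three paths concrete, here is a small sketch using REXML from Ruby's standard distribution rather than Hpricot. The markup is a hypothetical, well-formed stand-in in which every row sits inside a tbody; it is not the actual page.

```ruby
require 'rexml/document'

# Hypothetical sample: all five rows are children of <tbody>,
# so none of them is a direct child of <table>.
MARKUP = <<~XML
  <html><body><center>
    <table>
      <tbody>
        <tr><td>row 1</td></tr>
        <tr><td>row 2</td></tr>
        <tr><td>row 3</td></tr>
        <tr><td>row 4</td></tr>
        <tr><td>row 5</td></tr>
      </tbody>
    </table>
  </center></body></html>
XML

doc = REXML::Document.new(MARKUP)

paths = [
  '/html/body/center/table/tr',        # direct children of <table> only
  '/html/body/center/table/tbody/tr',  # rows inside <tbody>
  '/html/body/center/table//tr'        # rows at any depth below <table>
]

# Count how many elements each XPath expression matches
counts = paths.map { |path| [path, REXML::XPath.match(doc, path).size] }.to_h

counts.each { |path, n| puts "#{path} -> #{n} row(s)" }
```

With this sample, the first path matches nothing, while the other two match all five rows. Which of the three works on a real page depends on the tree the parser actually builds.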