Unable to extract HTML table rows

Posted on 2024-12-17 06:31:50

enter image description here

I'm trying to extract all five rows listed in the table above.

I'm using the Ruby Hpricot library to extract the table rows with an XPath expression.

In my example, the XPath expression I use is /html/body/center/table/tr. Note that I've removed the tbody tag from the expression, which is usually what it takes for the extraction to succeed.

The weird thing is that the result contains only the first three rows; the last two are missing. I have no idea what's going on there.

EDIT: Nothing magic about the code, just attaching it upon request.

require 'open-uri'
require 'hpricot'

faculty = Hpricot(open("http://www.utm.utoronto.ca/7800.0.html"))
(faculty/"/html/body/center/table/tr").each do |text|
  puts text.to_s
end

旧城空念 2024-12-24 06:31:50

The HTML document in question is invalid. (See http://validator.w3.org/check?uri=http%3A%2F%2Fwww.utm.utoronto.ca%2F7800.0.html.) Hpricot parses it differently from your browser, hence the different results, but it can't really be blamed. Until HTML5, there was no standard for how to parse invalid HTML documents.

I tried replacing Hpricot with Nokogiri and it seems to give the expected parse. Code:

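require 'open-uri'
require 'nokogiri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs.
faculty = Nokogiri::HTML(URI.open("http://www.utm.utoronto.ca/7800.0.html"))

# Print every row directly under the table.
faculty.search("/html/body/center/table/tr").each do |row|
  puts row
end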

Maybe you should switch?
