Parsing inner tags with Nokogiri

Posted 2024-11-09 05:33:56

I'm stuck trying to parse irregularly nested HTML tags. Is there a way to strip all HTML tags from a node and keep all of its text?

I'm using this code:

rows = doc.search('//table[@id="table_1"]/tbody/tr')

details = rows.collect do |row|
  detail = {}
  [
    [:word, 'td[1]/text()'],
    [:meaning, 'td[6]/font'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end

Using the XPath:

[:meaning, 'td[6]/font']

generates

:meaning: ! '<font size="3">asking for information specifying <font
    color="#CC0000" size="3">what is your name?</font> /what/ as in, <font color="#CC0000" size="3">I'm not sure what you mean</font>
    /what/ as in <a style="text-decoration: none;" href="http://somesecretlink.com">what</a></font>

On the other hand, using the XPath:

'td/font/text()'

generates

:meaning: asking for information specifying

thus ignoring all child elements of the node. What I want to get is this:

:meaning: asking for information specifying what is your name? /what/ as in, I'm not sure what you mean /what/ as in what? I can't hear you


Comments (2)

心是晴朗的。 2024-11-16 05:33:57

This depends on what you need to extract. If you want all the text inside the font elements, you can get it with the following XPath:

'td/font//text()'

It extracts all text nodes in font tags. If you want all text nodes in the cell, then:

'td//text()'

You can also call the text method on a Nokogiri node:

row.at_xpath(xpath).text

无敌元气妹 2024-11-16 05:33:57

I added an answer for this same sort of question the other day. It's a very easy process.

Take a look at: Convert HTML to plain text and maintain structure/formatting, with ruby
