Parsing inner tags using Nokogiri

Posted 2024-11-09 05:33:56, 1183 characters, 0 views, 0 comments

I'm stuck trying to parse irregularly nested HTML tags. Is there a way to strip all HTML tags from a node and keep all of its text?

I'm using the code:

rows = doc.search('//table[@id="table_1"]/tbody/tr')

details = rows.collect do |row|
  detail = {}
  [
    [:word, 'td[1]/text()'],
    [:meaning, 'td[6]/font'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end

Using the XPath:

[:meaning, 'td[6]/font']

generates

:meaning: ! '<font size="3">asking for information specifying <font
    color="#CC0000" size="3">what is your name?</font> /what/ as in, <font color="#CC0000" size="3">I'm not sure what you mean</font>
    /what/ as in <a style="text-decoration: none;" href="http://somesecretlink.com">what</a></font>

On the other hand, using the XPath:

'td/font/text()'

generates

:meaning: asking for information specifying

so only the first direct text node is returned and the text inside the node's child elements is ignored. What I want to achieve is this:

:meaning: asking for information specifying what is your name? /what/ as in, I'm not sure what you mean /what/ as in what? I can't hear you

心是晴朗的。 2024-11-16 05:33:57

It depends on what you need to extract. If you want all the text inside the font elements, you can use the following XPath:

'td/font//text()'

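It selects every text node inside the font tags, at any depth. Note that at_xpath returns only the first match, so to collect all of them call row.xpath(...) and join the results. If you want all text nodes in the cell, use: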

'td//text()'

You can also call the text method on a Nokogiri node, which returns the concatenated text of the node and all of its descendants:

row.at_xpath(xpath).text

无敌元气妹 2024-11-16 05:33:57

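I added an answer to the same kind of question a few days ago. It's a very simple process.

Have a look: Convert HTML to plain text and maintain structure/formatting, with Ruby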

Author: 乙白