Parsing the contents of paragraph elements with Nokogiri

Posted on 2024-12-11 17:31:21

I'd like to know the proper way to parse a block of contents with Nokogiri:

I have some documents to parse in which each main container is a <p>. Oddly, the main pieces of information within each one are divided up with <font> tags.

A typical <p> contains the following (some have a lot more content, some a lot less):

<p>
  <font size="5" face="Arial, Helvetica, sans-serif" color="#00CCAA" class="">
    <font color="#AAFF33" class="">
      October 10, 1990 - Maybe a Title
    </font>- 
    <font size="4" class="">
      Some long text here.         
      <font color="#66CC00" class="">
        <a href="SourceTitle/date.pdf">[Blah Blah, October 27, 1982 p. 2</a>
        ]
      </font>. 
      More content. 
      <font color="#00FF33" class="">[Another Source, 1971, issue 01/4]
      </font>. 
    </font>
    <font size="5" face="Arial, Helvetica, sans-serif" color="#00CCAA" class="">
      <font color="#AAFF33" class=""><font size="4" color="#00CCAA" class="">
        Another fantastic article. 
        <a href="SourceTitle/Date.pdf">[Some Source, October 4, p.6]</a>
      </font>
    </font>
  </font>
</font>
</p>

Essentially, the font size attribute is what sets each component of the article apart. The main points to extract are the first <font size="5" ...> tag (the article date and main title, if a title is given), then the actual content.

At present I get all the paragraph chunks with: doc.xpath('//p').each do |node|

However, I am not sure whether I should pass each chunk through Nokogiri again to parse out its contents, or just run it all through a regex. I was hoping for a small example of doing this "properly", which I assume means running an embedded XPath query inside the initial block to pull the elements out. I assume there is a way to pull out the subcomponents based on the font size demarcation, but I have not seen a specific example of this yet.


提笔落墨 2024-12-18 17:31:21

Does that help you get started?

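doc.xpath('//p').each do |node|
  # XPath is evaluated relative to the current <p>: take the first <font>
  # nested directly inside the size-5 <font> (the date and title).
  puts node.xpath("font[@size='5']/font").first.content.strip
end
# prints: October 10, 1990 - Maybe a Title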

Build similar expressions for the other parts you need and you are done :-)
