I'd like to know the proper way to parse a block of contents with Nokogiri:
I have some documents to parse that were originally formatted so that each main container is a <p>. Oddly, the main pieces of information within each one are divided up with <font> tags.
Effectively, a stock sample of the <p> contents looks like the following and is a typical example (some have a lot more content, some a lot less):
<p>
<font size="5" face="Arial, Helvetica, sans-serif" color="#00CCAA" class="">
<font color="#AAFF33" class="">
October 10, 1990 - Maybe a Title
</font>-
<font size="4" class="">
Some long text here.
<font color="#66CC00" class="">
<a href="SourceTitle/date.pdf">[Blah Blah, October 27, 1982 p. 2</a>
]
</font>.
More content.
<font color="#00FF33" class="">[Another Source, 1971, issue 01/4]
</font>.
</font>
<font size="5" face="Arial, Helvetica, sans-serif" color="#00CCAA" class="">
<font color="#AAFF33" class=""><font size="4" color="#00CCAA" class="">
Another fantastic article.
<a href="SourceTitle/Date.pdf">[Some Source, October 4, p.6]</a>
</font>
</font>
</font>
</font>
</p>
Essentially, the font size attribute is what sets each component of an article apart. The main things to extract are the FIRST <font size="5" ...> tag (that is, the article date and main title, if a title is given) and then the actual content.
Presently I have all paragraph chunks coming out with: doc.xpath('//p').each do |node|
However, I am not sure whether I should pass each node through Nokogiri again to parse out its contents or just run it all through a regex. I was hoping for a small example of doing this "properly", presumably with an embedded XPath query inside the initial block that pulls the elements out. I assume there is a way to pull out the subcomponents based on the font size demarcation, but I have not seen a specific example of this yet.
Comments (1)
Does that help you get started?
Build similar expressions for the other parts you need and you are done :-)