Extracting text between HTML tags with Nokogiri
I have HTML like this:
<h1> Header is here</h1>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
<h2> Next Header 2</h2>
<p>not interested</p>
<p>not interested</p>
<h2>Header 2 is here</h2>
<p> Extract me!</p>
<p> Extract me too!</p>
I have a basic Nokogiri CSS node search that returns the <p> content, but I can't find examples of how to target all text between the Nth closing </h2> and the next opening <h2>. I'm creating a CSV from the output, so I would also like to read in a list of files and put the URL as the first result.
You can sometimes use NodeSet's & operator to get information between nodes:
If the start and stop elements have the same parent, this is as simple as a single XPath. First I'll show it with a simplified document for clarity, and then with your sample document:
Now, here it is with your use case (finding by index):
Instead of an XPath solution, here's a simple (naïve) implementation that assumes that the start and stop elements share the same parent and allows the XPaths for start and stop to be specified independently:
This code may help you, but it still needs more information about where the tags are located (it works better if the information to be extracted sits between known tags).