How do I select chunks of HTML with Nokogiri using XPath or CSS selectors?

Posted 2024-12-20 21:28:13

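In my Rails app I have HTML like the following, parsed with Nokogiri.

I want to be able to select chunks of HTML. For example, how can I use XPath or CSS to select the block of HTML that belongs to <sup id="21">? Assume that the lines marked with ******** do not exist in the real HTML.

I want to split the HTML on each <sup id=*>, but the problem is that all the nodes are siblings.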

<sup class="v" id="20">
1
</sup> 
this is some random text
<p></p>   
more random text
<sup class="footnote" value='fn1'>
[v]
</sup>

# ****************************** starting here
<sup class="v" id="21">
2
</sup> 
now this is a different section
<p></p>   
how do we keep this separate
<sup class="footnote" value='fn2'>
[x]
</sup>
# ****************************** ending here

<sup class="v" id="23">
3
</sup> 
this is yet another different section
<p></p>   
how do we keep this separate too
<sup class="footnote" value='fn3'>
[r]
</sup>


Comments (3)

贱贱哒 2024-12-27 21:28:13

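Here's a simple solution that gives you NodeSets containing all the nodes between the <sup … class="v"> delimiters, stored in a hash keyed by the delimiter's id.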

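require 'nokogiri'

doc = Nokogiri::HTML(your_html)

# Missing keys default to an empty NodeSet tied to this document
nodes_by_vsup_id = Hash.new { |hash, key| hash[key] = Nokogiri::XML::NodeSet.new(doc) }
last_id = nil
doc.at('body').children.each do |n|
  # Each <sup class="v"> starts a new group; every other node joins the current one
  last_id = n['id'] if n['class'] == 'v'
  nodes_by_vsup_id[last_id] << n
end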

puts nodes_by_vsup_id['21']
#=> <sup class="v" id="21">
#=> 2
#=> </sup>
#=>  
#=> now this is a different section
#=> <p></p>
#=>    
#=> how do we keep this separate
#=> <sup class="footnote" value="fn2">
#=> [x]
#=> </sup>

Alternatively, if you don't want the delimiting 'sup' to be part of the collection, do this instead:

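# Same loop, but the delimiting <sup class="v"> is left out of each group.
# Note: `elements` yields only element children, so bare text nodes are skipped;
# iterate `children` instead if the text between the tags should be kept.
doc.at('body').elements.each do |n|
  if n['class'] == 'v'
    last_id = n['id']
  else
    nodes_by_vsup_id[last_id] << n
  end
end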

Here's an alternative, even more generic solution:

class Nokogiri::XML::NodeSet
  # Yields each node in the set to your block.
  # Returns a hash of NodeSets keyed by whatever your block returns.
  # A node for which the block returns nil/false joins the group of the previous valid key.
  def group_chunks
    Hash.new { |hash, key| hash[key] = self.class.new(document) }.tap do |result|
      key = nil
      each { |n| result[key = yield(n) || key] << n }
    end
  end
end

root_items = doc.at('body').children
separated = root_items.group_chunks { |node| node['class'] == 'v' && node['id'] }
puts separated['21']

姐不稀罕 2024-12-27 21:28:13

It looks like you want to select everything between the sup with @id='21' and the sup with @id='23'. Use the following ad-hoc expression:

//sup[@id='21']|(//sup[@id='21']/following-sibling::node()[
    not(self::sup[@id='23'] or preceding-sibling::sup[@id='23'])])

Or apply the Kayessian formula for node-set intersection:

//sup[@id='21']|(//sup[@id='21']/following-sibling::node()[
    count(.|//sup[@id='23']/preceding-sibling::node())
     =
    count(//sup[@id='23']/preceding-sibling::node())])

许仙没带伞 2024-12-27 21:28:13
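require 'open-uri'
require 'nokogiri'

# URI.open replaces the bare open(url), which no longer handles URLs on Ruby 3.0+
doc = Nokogiri::HTML(URI.open("http://www.yoururl"))
# Attributes need the @ prefix in XPath; sup[id="21"] would test a child <id> element instead
doc.xpath('//sup[@id="21"]').each do |node|
  # Prints only the text inside the matching <sup> itself
  puts node.text
end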