Search all elements before an h2 element in hpricot/nokogiri

Posted on 2024-08-05 11:01:57

I am attempting to parse a Wiktionary entry to retrieve all English definitions. I am able to retrieve all definitions; the problem is that some of them are in other languages. What I would like to do is retrieve only the HTML block containing the English definitions. I have found that, when there are entries for other languages, the header following the English definitions can be retrieved with:

header = (doc/"h2")[3]

So I would like to search only the elements before this header element. I thought that might be possible with header.preceding_siblings(), but that does not seem to work. Any suggestions?

Comments (3)

面犯桃花 2024-08-12 11:01:57

You can make use of the visitor pattern with Nokogiri. This code will remove everything starting from the other language definition's h2:

require 'nokogiri'
require 'open-uri'

class Visitor
  def initialize(node)
    @node = node
  end

  def visit(node)
    if @remove || @node == node
      node.remove
      @remove = true
      return
    end
    node.children.each do |child|
      child.accept(self)
    end
  end
end

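# URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs.
# The page is HTML, so the HTML parser is used here instead of Nokogiri::XML.
doc = Nokogiri::HTML(URI.open('http://en.wiktionary.org/wiki/pony'))
node = doc.search("h2")[2]  # In this case, the Italian h2 is at index 2. Your page may differ.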

doc.root.accept(Visitor.new(node))  #Removes all page contents starting from node
情话难免假 2024-08-12 11:01:57

The following code uses Hpricot.
It gets the text from the header for the English language (h2) until the next header (h2), or until the footer if there are no further languages:

require 'hpricot'
require 'open-uri'

def get_english_definition(url)
  doc = Hpricot(open(url))

  span = doc.at('h2/span[@class="mw-headline"][text()=English]')
  english_header = span && span.parent
  return nil unless english_header

  next_header_or_footer =
    Hpricot::Elements[*english_header.following_siblings].at('h2') ||
    doc.at('[@class="printfooter"]')

  Hpricot::Elements.expand(english_header.next_node,
                           next_header_or_footer.previous_node).to_s
end

Example:

get_english_definition "http://en.wiktionary.org/wiki/gift"
流殇 2024-08-12 11:01:57

For Nokogiri:

doc = Nokogiri::HTML(code)
stop_node = doc.css('h2')[3]
doc.traverse do |node|
  break if node == stop_node
  # else, do whatever, e.g. `puts node.name`
end

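This iterates over the nodes that come before whatever node you designate as stop_node in line 2. Note that traverse yields each node's children before the node itself, so the descendants of stop_node (for example, its text) are visited before the loop breaks.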
