libxml-ruby parsing help

Posted 2024-08-02 20:46:18


Alright, I'm switching from a working Hpricot setup to libxml-ruby, for speed and, well, because of the disappearance of _why. I looked at Nokogiri for a second but decided to go with libxml-ruby for speed and longevity. I must be missing something basic, but what I'm trying to do isn't working. Here's my XML string:

<?xml version="1.0" encoding="utf-8" ?>
<feed>
  <title type="xhtml"></title>
  <entry xmlns="http://www.w3.org/2005/Atom">
    <id>urn:publicid:xx.xxx:xxxxxx</id>
    <title>US--xxx-xxxxx</title>
    <updated>2009-08-19T15:49:51.103Z</updated>
    <published>2009-08-19T15:44:48Z</published>
    <author>
      <name>XX</name>
    </author>
    <rights>blehh</rights>
    <content type="text/xml">
      <nitf>
        <head>
          <docdata>
            <doc-id regsrc="XX" />
            <date.issue norm="20090819T154448Z" />
            <ed-msg info="Eds:" />
            <doc.rights owner="xx" agent="hxx" type="none" />
            <doc.copyright holder="xx" year="2009" />
          </docdata>
        </head>
        <body>
          <body.head>
            <hedline>
              <hl1 id="headline">headline</hl1>
              <hl2 id="originalHeadline">blah blah</hl2>
            </hedline>
            <byline>john doe<byttl>staffer</byttl></byline>
            <distributor>xyz</distributor>
            <dateline>
              <location>foo</location>
            </dateline>
          </body.head>
          <body.content>
            <block id="Main">
              story content here
            </block>
          </body.content>
          <body.end />
        </body>
      </nitf>
    </content>
  </entry>  
</feed>

There are about 150 such entries in the complete feed.

I just want to loop through the 150 entries and grab out the content and attributes, but I'm having a hell of a time with libxml-ruby. I had it working fine with Hpricot.

This little snippet shows that I'm not even getting the entries:

parser = XML::Parser.string(file)
doc = parser.parse
entries = doc.find('//entry')
puts entries.size
entries.each do |node|
  puts node.inspect
end

Any ideas? I looked through the docs and couldn't find a simple "here's an XML file, and here are samples of getting out x, y, z". This should be pretty simple.


败给现实 2024-08-09 20:46:18

Nokogiri has proved to have some speed and longevity, so here are some samples of how to deal with the namespaces in the sample XML. I used Nokogiri for a big RDF/RSS/Atom aggregator that was processing thousands of feeds daily, using something similar to this to grab the fields I wanted before pushing them into a backend database.

require 'nokogiri'

# `file` is the XML string (or IO) from the question.
doc = Nokogiri::XML(file)
namespace = {'xmlns' => 'http://www.w3.org/2005/Atom'}

entries = []
doc.search('//xmlns:entry', namespace).each do |_entry|

  entry_hash = {}

  # Use paths relative to the current entry ("./..."). A leading "//" would
  # search from the document root and return the first entry's values every time.
  %w[title updated published author].each do |_attr|
    entry_hash[_attr.to_sym] = _entry.at("./xmlns:#{_attr}", namespace).text.strip
  end

  entry_hash[:headlines] = _entry.search('xmlns|hedline > hl1, xmlns|hedline > hl2', namespace).map{ |n| n.text.strip }
  entry_hash[:body]      = _entry.at('.//xmlns:body.content', namespace).text.strip

  entries << entry_hash
end

require 'pp'
pp entries
# >> [{:title=>"US--xxx-xxxxx",
# >>   :updated=>"2009-08-19T15:49:51.103Z",
# >>   :published=>"2009-08-19T15:44:48Z",
# >>   :author=>"XX",
# >>   :headlines=>["headline", "blah blah"],
# >>   :body=>"story content here"}]

Both CSS and XPath in Nokogiri can handle namespaces. Nokogiri would simplify using them by grabbing all namespaces defined in the root node, but, in this XML sample, the namespace is defined in the entry node, making us do it manually.

I switched to CSS notation for the headlines, just to show how to do them. For convenience, Nokogiri would normally allow a wildcarded namespace for CSS if it had been able to find the namespace declaration, which would have simplified the accessor to '|hedline > hl1' for the hl1 node.


夜血缘 2024-08-09 20:46:18

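I suspect you're running into problems because you're skipping the namespace in your find. If you look at the XPath documentation for libxml-ruby, it has some very pertinent examples. Specifically, your find should probably be something like entries = doc.find('//atom:entry', 'atom:http://www.w3.org/2005/Atom') for it to work correctly.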
