Hpricot XML text search

Posted 2024-10-16 23:04:12


Hpricot + Ruby XML parsing and logical selection.

Objective: Find all titles written by the author Bob.

My XML file:

<rss>
<channel>
<item>
<title>Book1</title>
<pubDate>march 1 2010</pubDate>
<author>Bob</author>
</item>

<item>
<title>book2</title>
<pubDate>october 4 2009</pubDate>
<author>Bill</author>
</item>

<item>
<title>book3</title>
<pubDate>June 5 2010</pubDate>
<author>Steve</author>
</item>
</channel>
</rss>

# My Hpricot code. Running it returns no output, although the search pattern works on its own.
(doc % :rss % :channel / :item).each do |item|
  a = item.search("author[text()*='Bob']")

  # puts "FOUND" if a.include?"Bob"
  puts item.at("title") if a.include?"Bob"
end


笨死的猪 2024-10-23 23:04:12

If you're not set on Hpricot, here's one way to do this with XPath in Nokogiri:

require 'nokogiri'
doc = Nokogiri::XML( my_rss_string )
bobs_titles = doc.xpath("//title[parent::item/author[text()='Bob']]")
p bobs_titles.map{ |node| node.text }
#=> ["Book1"]

Edit: @theTinMan's XPath also works well, is more readable, and may very well be faster:

bobs_titles = doc.xpath("//author[text()='Bob']/../title")


少女的英雄梦 2024-10-23 23:04:12

One of the ideas behind XPath is that it lets us navigate a DOM much like a disk directory:

require 'hpricot'

xml = <<EOT
<rss>
    <channel>
        <item>
            <title>Book1</title>
            <pubDate>march 1 2010</pubDate>
            <author>Bob</author>
        </item>

        <item>
            <title>book2</title>
            <pubDate>october 4 2009</pubDate>
            <author>Bill</author>
        </item>

        <item>
            <title>book3</title>
            <pubDate>June 5 2010</pubDate>
            <author>Steve</author>
        </item>

        <item>
            <title>Book4</title>
            <pubDate>march 1 2010</pubDate>
            <author>Bob</author>
        </item>

    </channel>
</rss>
EOT

doc = Hpricot(xml)

titles = (doc / '//author[text()="Bob"]/../title' )
titles # => #<Hpricot::Elements[{elem <title> "Book1" </title>}, {elem <title> "Book4" </title>}]>

That means: "find every author element whose text is Bob, then go up one level and find the title tag".

I added an extra book by "Bob" to test getting all occurrences.
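As a rough sketch of how that path reads step by step, the same navigation can be spelled out by hand. This uses Ruby's bundled REXML rather than Hpricot (an assumption made only so the snippet runs without extra gems), and the variable names and the trimmed-down XML are illustrative:

```ruby
require 'rexml/document'

# Trimmed-down version of the feed; REXML stands in for Hpricot here.
xml = <<~XML
  <rss>
    <channel>
      <item><title>Book1</title><author>Bob</author></item>
      <item><title>book2</title><author>Bill</author></item>
      <item><title>Book4</title><author>Bob</author></item>
    </channel>
  </rss>
XML

doc = REXML::Document.new(xml)

# Step 1, //author[text()="Bob"]: every <author> whose text is "Bob".
bob_authors = REXML::XPath.match(doc, '//author').select { |a| a.text == 'Bob' }

# Step 2, /..: move up one level to the enclosing <item>.
bob_items = bob_authors.map(&:parent)

# Step 3, /title: step back down into the <title> child.
bob_titles = bob_items.map { |item| item.elements['title'].text }

p bob_titles # => ["Book1", "Book4"]
```

The single XPath string does all three steps at once; the long form is only meant to show what each path segment contributes.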

To get the item containing a book by Bob, just move back up a level:

items = (doc / '//author[text()="Bob"]/..' )
puts items # => nil
# >> <item>
# >>             <title>Book1</title>
# >>             <pubdate>march 1 2010</pubdate>
# >>             <author>Bob</author>
# >>         </item>
# >> <item>
# >>             <title>Book4</title>
# >>             <pubdate>march 1 2010</pubdate>
# >>             <author>Bob</author>
# >>         </item>

I also figured out what (doc % :rss % :channel / :item) is doing. It's equivalent to nesting the searches, minus the wrapping parentheses, and these should all be the same in Hpricot-ese:

(doc % :rss % :channel / :item).size # => 4
(((doc % :rss) % :channel) / :item).size # => 4
(doc / '//rss/channel/item').size # => 4
(doc / 'rss channel item').size # => 4

Because '//rss/channel/item' is how you'd normally see an XPath accessor written, and 'rss channel item' is the CSS form, I'd recommend using those formats for maintainability and clarity.
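As for why the loop in the question printed nothing: a likely cause is that the search returns a collection of element objects, so include?"Bob" compares each element to a String and never matches. The sketch below reproduces that pitfall with Ruby's bundled REXML instead of Hpricot (an assumption, chosen so it runs without extra gems; Hpricot's collection should behave the same way, but that is not verified here), and the variable names are illustrative:

```ruby
require 'rexml/document'

# A single <item>, parsed with REXML as a stand-in for Hpricot.
item = REXML::Document.new(
  '<item><title>Book1</title><author>Bob</author></item>'
).root

# The search yields an Array of element objects, not strings.
authors = REXML::XPath.match(item, 'author')

# Array#include? compares each element to the String, so this is false.
string_match = authors.include?('Bob')

# Comparing the text of each element gives the intended answer.
text_match = authors.any? { |a| a.text == 'Bob' }

p string_match # => false
p text_match   # => true
puts item.elements['title'].text if text_match
```

In the original loop, testing whether the search result is non-empty (or comparing the author's text, as above) would be the equivalent fix.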
