Using XmlSlurper: How to select sub-elements while iterating over a GPathResult

Posted 2024-08-10 05:16:55

I am writing an HTML parser that uses TagSoup to pass a well-formed structure to XmlSlurper.

Here's the generalised code:

def htmlText = """
<html>
<body>
<div id="divId" class="divclass">
<h2>Heading 2</h2>
<ol>
<li><h3><a class="box" href="#href1">href1 link text</a> <span>extra stuff</span></h3><address>Here is the address<span>Telephone number: <strong>telephone</strong></span></address></li>
<li><h3><a class="box" href="#href2">href2 link text</a> <span>extra stuff</span></h3><address>Here is another address<span>Another telephone: <strong>0845 1111111</strong></span></address></li>
</ol>
</div>
</body>
</html>
"""     

def html = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText( htmlText );

html.'**'.grep { it.@class == 'divclass' }.ol.li.each { linkItem ->
    def link = linkItem.h3.a.@href
    def address = linkItem.address.text()
    println "$link: $address\n"
}

I would expect each() to let me select each 'li' in turn, so that I can retrieve the corresponding href and address details. Instead, I am getting this output:

#href1#href2: Here is the addressTelephone number: telephoneHere is another addressAnother telephone: 0845 1111111

I've checked various examples on the web, and they either deal with XML or are one-liners like "retrieve all links from this file". It seems that the linkItem.h3.a.@href expression is collecting all hrefs in the text, even though I'm passing it a reference to the parent 'li' node.

Can you let me know:

Why I'm getting the output shown
How I can retrieve the href/address pairs for each 'li' item

Thanks.

极度宠爱 2024-08-17 05:16:55

Replace grep with find:

html.'**'.find { it.@class == 'divclass' }.ol.li.each { linkItem ->
    def link = linkItem.h3.a.@href
    def address = linkItem.address.text()
    println "$link: $address\n"
}

Then you'll get:

#href1: Here is the addressTelephone number: telephone

#href2: Here is another addressAnother telephone: 0845 1111111

grep returns an ArrayList, whereas find returns a single NodeChild:

println html.'**'.grep { it.@class == 'divclass' }.getClass()
println html.'**'.find { it.@class == 'divclass' }.getClass()

results in:

class java.util.ArrayList
class groovy.util.slurpersupport.NodeChild

Thus, if you want to keep grep, you can nest another each inside the outer one to make it work:

html.'**'.grep { it.@class == 'divclass' }.ol.li.each {
    it.each { linkItem ->
        def link = linkItem.h3.a.@href
        def address = linkItem.address.text()
        println "$link: $address\n"
    }
}

Long story short, in your case, use find rather than grep.
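
As a minimal sketch of the find approach, the href/address pairs can also be gathered into a list instead of printed. This assumes Groovy 3 or later (hence the groovy.xml import; on Groovy 2.x XmlSlurper lives in groovy.util and needs no import) and uses a plain XmlSlurper without TagSoup, which only works because this shortened sample markup is already well-formed. The names xml and pairs are illustrative.

```groovy
import groovy.xml.XmlSlurper  // assumption: Groovy 3+; drop this import on Groovy 2.x

// Shortened, well-formed stand-in for the markup in the question
def xml = '''<html><body><div class="divclass"><ol>
<li><h3><a href="#href1">one</a></h3><address>Address 1</address></li>
<li><h3><a href="#href2">two</a></h3><address>Address 2</address></li>
</ol></div></body></html>'''

def html = new XmlSlurper().parseText(xml)

// find returns a single node, so .ol.li is a GPathResult that iterates per 'li'
def div = html.'**'.find { it.@class == 'divclass' }

// Build one map per 'li' with its own href and address text
def pairs = div.ol.li.collect { li ->
    [href: li.h3.a.@href.text(), address: li.address.text()]
}

pairs.each { println "${it.href}: ${it.address}" }
```

Collecting into maps keeps each href tied to its own address, which makes the result easy to check or pass on.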

失而复得 2024-08-17 05:16:55

This is a tricky one. When there is just one element with class='divclass', the previous answer is fine. If grep returns multiple results, however, a find() that yields a single result is not the answer. Pointing out that the result is an ArrayList is correct. Inserting an outer .each() loop provides a GPathResult in the closure parameter div. From there, the drill-down can continue with the expected result.

html."**".grep { it.@class == 'divclass' }.each { div -> div.ol.li.each { linkItem ->
   def link = linkItem.h3.a.@href
   def address = linkItem.address.text()
   println "$link: $address\n"
}}
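
To illustrate the multiple-match case, here is a small sketch with two matching div elements and one non-matching one. It assumes Groovy 3 or later for the import and a plain XmlSlurper (no TagSoup), since the made-up sample markup is well-formed. The markup and the names xml and found are invented for the example.

```groovy
import groovy.xml.XmlSlurper  // assumption: Groovy 3+; drop this import on Groovy 2.x

// Invented sample: two divs with class 'divclass', one without
def xml = '''<html><body>
<div class="divclass"><ol><li><h3><a href="#a1">a1</a></h3><address>First</address></li></ol></div>
<div class="other"><ol><li><h3><a href="#x">x</a></h3><address>Ignored</address></li></ol></div>
<div class="divclass"><ol><li><h3><a href="#b1">b1</a></h3><address>Second</address></li></ol></div>
</body></html>'''

def html = new XmlSlurper().parseText(xml)
def found = []

// The outer each walks the List returned by grep; each div is a single node,
// so the inner each iterates over that div's 'li' elements one at a time
html.'**'.grep { it.@class == 'divclass' }.each { div ->
    div.ol.li.each { li ->
        found << "${li.h3.a.@href.text()}: ${li.address.text()}".toString()
    }
}

found.each { println it }
```

With find in place of grep, only the first matching div would be visited.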

The behavior of the original code deserves a bit more explanation as well. When a property is accessed on a List in Groovy, you get a new list (of the same size) containing that property of each element in the list. The list found by grep() has just one entry. Then we get one entry for property ol, which is fine. Next we get the result of ol.li for that entry. It is again a list of size() == 1, but this time its single entry has size() == 2. We could apply the outer loop there and get the same result, if we wanted to:

html."**".grep { it.@class == 'divclass' }.ol.li.each { it.each { linkItem ->
   def link = linkItem.h3.a.@href
   def address = linkItem.address
   println "$link: $address\n"
}}

On any GPathResult representing multiple nodes, we get the concatenation of all text. That is the original result, first for @href, then for address.
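
The sizes described above can be checked step by step. This is a sketch under the same assumptions as before: Groovy 3 or later for the import, and a plain XmlSlurper on a shortened, well-formed stand-in for the question's markup. The variable names are illustrative.

```groovy
import groovy.xml.XmlSlurper  // assumption: Groovy 3+; drop this import on Groovy 2.x

def xml = '''<html><body><div class="divclass"><ol>
<li><h3><a href="#href1">one</a></h3><address>Address 1</address></li>
<li><h3><a href="#href2">two</a></h3><address>Address 2</address></li>
</ol></div></body></html>'''

def html = new XmlSlurper().parseText(xml)

def divs = html.'**'.grep { it.@class == 'divclass' }  // a List with one div
def ols  = divs.ol     // property access on a List: one 'ol' result per list element
def lis  = divs.ol.li  // still a List of size 1; its entry holds both 'li' nodes

println "divs: ${divs.size()}, ols: ${ols.size()}, lis: ${lis.size()}, lis[0]: ${lis[0].size()}"

// The single entry represents both 'li' nodes, so text is concatenated
def allHrefs = lis[0].h3.a.@href.text()
println allHrefs
```

Because each() on lis runs exactly once with a GPathResult covering both 'li' nodes, the original loop body sees the joined text.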

甜点 2024-08-17 05:16:55

I believe the previous answers were all correct at the time of writing, for the versions used. But I am using HTTPBuilder 0.7.1 and Grails 2.4.4 with Groovy 2.3.7, and there is a big issue: HTML element names are converted to uppercase. It appears this is due to NekoHTML being used under the hood:

http://nekohtml.sourceforge.net/faq.html#uppercase

Because of this, the solution in the accepted answer must be written as:

html.'**'.find { it.@class == 'divclass' }.OL.LI.each { linkItem ->
    def link = linkItem.H3.A.@href
    def address = linkItem.ADDRESS.text()
    println "$link: $address\n"
}

This was very frustrating to debug, hope it helps someone.
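
One possible way to avoid depending on the parser's casing is to match element names case-insensitively through name(). This is only a sketch and not part of the answer above. It simulates the uppercase names with hand-written markup parsed by a plain XmlSlurper, rather than running NekoHTML or HTTPBuilder, and assumes Groovy 3 or later for the import. The helper firstNamed and the other names are hypothetical.

```groovy
import groovy.xml.XmlSlurper  // assumption: Groovy 3+; drop this import on Groovy 2.x

// Uppercase element names stand in for what the answer reports from NekoHTML
def xml = '''<HTML><BODY><DIV class="divclass"><OL>
<LI><H3><A href="#href1">one</A></H3><ADDRESS>Address 1</ADDRESS></LI>
<LI><H3><A href="#href2">two</A></H3><ADDRESS>Address 2</ADDRESS></LI>
</OL></DIV></BODY></HTML>'''

def html = new XmlSlurper().parseText(xml)

// Hypothetical helper: first node at or below 'node' whose name matches, ignoring case
def firstNamed = { node, String name ->
    node.'**'.find { it.name().equalsIgnoreCase(name) }
}

def div = html.'**'.find { it.@class == 'divclass' }

// All 'li' descendants of the div, whatever their case
def items = div.'**'.findAll { it.name().equalsIgnoreCase('li') }

def pairsIgnoringCase = items.collect { li ->
    def a = firstNamed(li, 'a')
    [href: a.@href.text(), address: firstNamed(li, 'address').text()]
}

pairsIgnoringCase.each { println "${it.href}: ${it.address}" }
```

The trade-off is that the depth-first search matches at any depth, so it is less strict than the explicit OL.LI.H3.A path.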
