Using XmlSlurper: How to select sub-elements while iterating over a GPathResult
I am writing an HTML parser, which uses TagSoup to pass a well-formed structure to XmlSlurper.
Here's the generalised code:
def htmlText = """
<html>
<body>
<div id="divId" class="divclass">
<h2>Heading 2</h2>
<ol>
<li><h3><a class="box" href="#href1">href1 link text</a> <span>extra stuff</span></h3><address>Here is the address<span>Telephone number: <strong>telephone</strong></span></address></li>
<li><h3><a class="box" href="#href2">href2 link text</a> <span>extra stuff</span></h3><address>Here is another address<span>Another telephone: <strong>0845 1111111</strong></span></address></li>
</ol>
</div>
</body>
</html>
"""
def html = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText( htmlText );
html.'**'.grep { it.@class == 'divclass' }.ol.li.each { linkItem ->
def link = linkItem.h3.a.@href
def address = linkItem.address.text()
println "$link: $address\n"
}
I would expect each to let me select each 'li' in turn, so that I can retrieve the corresponding href and address details. Instead, I am getting this output:
#href1#href2: Here is the addressTelephone number: telephoneHere is another addressAnother telephone: 0845 1111111
I've checked various examples on the web, and these either deal with XML or are one-liner examples like "retrieve all links from this file". It seems that the linkItem.h3.a.@href expression is collecting all hrefs in the text, even though I'm passing it a reference to the parent 'li' node.
Can you let me know:
- Why I'm getting the output shown
- How I can retrieve the href/address pairs for each 'li' item
Thanks.
Comments (3)
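Replace grep with find in the original expression, i.e. html.'**'.find { it.@class == 'divclass' }.ol.li.each { ... }. Each 'li' is then visited in turn, so you'll get one href/address pair per item.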
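A minimal sketch of that change, not the answerer's exact code. It assumes a simplified copy of the question's markup and, because that copy is already well-formed, parses it with a plain XmlSlurper rather than the TagSoup parser; the same GPath expression is expected to apply to the TagSoup result. The `pairs` list is only there to collect the results for illustration.

```groovy
import groovy.xml.XmlSlurper // Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default)

// Simplified, well-formed copy of the question's markup (assumption: TagSoup not needed here)
def htmlText = '''<html><body><div id="divId" class="divclass"><ol>
<li><h3><a class="box" href="#href1">href1 link text</a></h3><address>Address one</address></li>
<li><h3><a class="box" href="#href2">href2 link text</a></h3><address>Address two</address></li>
</ol></div></body></html>'''

def html = new XmlSlurper().parseText(htmlText)

def pairs = []
// find returns the single matching div node, so .ol.li iterates the li elements one by one
html.'**'.find { it.@class == 'divclass' }.ol.li.each { linkItem ->
    def link = linkItem.h3.a.@href.text()
    def address = linkItem.address.text()
    pairs << [link, address]
    println "$link: $address"
}
```

Note that find stops at the first match, so this only fits pages where a single element carries the class.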
grep returns an ArrayList, whereas find returns a single NodeChild; you can confirm this by printing the class of each result.
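One way to check this is to print the runtime class of both results. The sketch below assumes a reduced version of the question's markup and a plain XmlSlurper instead of TagSoup; the exact package of NodeChild depends on the Groovy version, so only the simple class name is relied on.

```groovy
import groovy.xml.XmlSlurper // Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default)

// Reduced sample markup (assumption, not the full page from the question)
def html = new XmlSlurper().parseText('''<html><body><div id="divId" class="divclass"><ol>
<li><h3><a href="#href1">href1 link text</a></h3><address>Address one</address></li>
<li><h3><a href="#href2">href2 link text</a></h3><address>Address two</address></li>
</ol></div></body></html>''')

// grep collects every match into a list, even when there is only one
def byGrep = html.'**'.grep { it.@class == 'divclass' }
// find returns the first matching node itself
def byFind = html.'**'.find { it.@class == 'divclass' }

println byGrep.getClass().name
println byFind.getClass().name
```

Because byGrep is a list, any property access on it (such as .ol) is applied to every element and collected into a new list, which is what derails the original each loop.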
Thus, if you wanted to keep grep, you could wrap the traversal in another each, so that every matching div is handled as a single node before drilling down to ol.li.
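A sketch of that nested loop follows. It is an illustration rather than the answerer's original snippet: the markup is a made-up sample with two matching divs, parsed with a plain XmlSlurper instead of TagSoup, and the names `div` and `pairs` are chosen here.

```groovy
import groovy.xml.XmlSlurper // Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default)

// Hypothetical sample with two divs sharing the class, to show why grep can still be useful
def html = new XmlSlurper().parseText('''<html><body>
<div class="divclass"><ol>
<li><h3><a href="#href1">one</a></h3><address>Address one</address></li>
<li><h3><a href="#href2">two</a></h3><address>Address two</address></li>
</ol></div>
<div class="divclass"><ol>
<li><h3><a href="#href3">three</a></h3><address>Address three</address></li>
</ol></div>
</body></html>''')

def pairs = []
// Outer each: one pass per matching div (each div is a single node)
html.'**'.grep { it.@class == 'divclass' }.each { div ->
    // Inner each: one pass per li inside that div
    div.ol.li.each { linkItem ->
        pairs << [linkItem.h3.a.@href.text(), linkItem.address.text()]
    }
}
pairs.each { println "${it[0]}: ${it[1]}" }
```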
Long story short, in your case, use find rather than grep.
This is a tricky one. When there is just one element with class='divclass', the previous answer is fine. If grep returned multiple results, however, a find() for a single result would not be the answer. Pointing out that the result is an ArrayList is correct. Inserting an outer .each() loop provides a GPathResult in the closure parameter (div, say). From there the drill-down can continue with the expected result.
The behavior of the original code deserves a bit more explanation as well. When a property is accessed on a List in Groovy, you get a new list (of the same size) containing that property of each element. The list found by grep() has just one entry. Then we get one entry for the property ol, which is fine. Next we get the result of ol.li for that entry. It is again a list of size() == 1, but this time its single entry has size() == 2. We could apply the extra loop there instead and get the same result, if we wanted to.
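The sizes described above, and the alternative placement of the extra loop, can be sketched as follows. This assumes a reduced copy of the question's markup and a plain XmlSlurper in place of TagSoup; `lists`, `items` and `pairs` are illustrative names.

```groovy
import groovy.xml.XmlSlurper // Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default)

// Reduced sample markup (assumption, not the full page from the question)
def html = new XmlSlurper().parseText('''<html><body><div id="divId" class="divclass"><ol>
<li><h3><a href="#href1">href1 link text</a></h3><address>Address one</address></li>
<li><h3><a href="#href2">href2 link text</a></h3><address>Address two</address></li>
</ol></div></body></html>''')

// Property access on the grep list is applied per element and collected into a new list
def lists = html.'**'.grep { it.@class == 'divclass' }.ol.li
println lists.size()      // number of entries in the outer list (one per matching div)
println lists[0].size()   // number of li nodes held by that single entry

def pairs = []
// Extra loop placed here: first over the outer list, then over the li nodes of each entry
lists.each { items ->
    items.each { li ->
        pairs << [li.h3.a.@href.text(), li.address.text()]
    }
}
pairs.each { println "${it[0]}: ${it[1]}" }
```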
On any GPathResult representing multiple nodes, asking for the text gives the concatenation of the text of all those nodes. That explains the original output: first the concatenated @href values, then the concatenated addresses.
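A small illustration of that concatenation, again assuming reduced sample markup and a plain XmlSlurper rather than TagSoup. The variable `both` is a name chosen here for a GPathResult that covers the two li nodes at once.

```groovy
import groovy.xml.XmlSlurper // Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default)

// Reduced sample markup (assumption, not the full page from the question)
def html = new XmlSlurper().parseText('''<html><body><div id="divId" class="divclass"><ol>
<li><h3><a href="#href1">href1 link text</a></h3><address>Address one</address></li>
<li><h3><a href="#href2">href2 link text</a></h3><address>Address two</address></li>
</ol></div></body></html>''')

// One GPathResult that represents both li nodes
def both = html.body.div.ol.li

// Drilling down from it still covers both nodes, so the text is run together
def hrefs = both.h3.a.@href.text()
def addresses = both.address.text()
println "$hrefs: $addresses"

// Indexing (or iterating) narrows it to a single node again
def firstHref = both[0].h3.a.@href.text()
println firstHref
```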
I believe the previous answers are all correct at the time of writing, for the version used. But I am using HTTPBuilder 0.7.1 and Grails 2.4.4 with Groovy 2.3.7 and there is a big issue - HTML elements are transformed to uppercase. It appears this is due to NekoHTML used under the hood:
http://nekohtml.sourceforge.net/faq.html#uppercase
Because of this, the solution in the accepted answer must be written with uppercase element names (for example OL, LI, H3 and A in place of ol, li, h3 and a).
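The effect can be reproduced without NekoHTML or HTTPBuilder. In the sketch below the markup is hand-written with uppercase element names to mimic what the answer describes; that is an assumption standing in for the real parser output, and keeping the attribute names lowercase is likewise an assumption. It only shows that GPath element names are case-sensitive.

```groovy
import groovy.xml.XmlSlurper // Groovy 3+; on Groovy 2.x drop this import (groovy.util.XmlSlurper is imported by default)

// Hand-written stand-in for a tree whose element names were uppercased by the HTML parser
def html = new XmlSlurper().parseText('''<HTML><BODY><DIV id="divId" class="divclass"><OL>
<LI><H3><A href="#href1">href1 link text</A></H3><ADDRESS>Address one</ADDRESS></LI>
<LI><H3><A href="#href2">href2 link text</A></H3><ADDRESS>Address two</ADDRESS></LI>
</OL></DIV></BODY></HTML>''')

def div = html.'**'.find { it.@class == 'divclass' }

// Lowercase element names match nothing in this tree
def lowerCount = div.ol.li.size()
println lowerCount

// Uppercase element names reach the nodes
def pairs = div.OL.LI.collect { li -> [li.H3.A.@href.text(), li.ADDRESS.text()] }
pairs.each { println "${it[0]}: ${it[1]}" }
```

If a lowercase path silently yields nothing, printing the name() of a few nodes from the parsed tree is a quick way to see which case the parser produced.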
This was very frustrating to debug, hope it helps someone.