使用 groovy 提取 HTML 部分

发布于 2024-11-03 12:37:20 字数 631 浏览 8 评论 0原文

我需要从给定的 HTML 页面中提取 HTML 的一部分。到目前为止，我使用带有 tagoup 的 XmlSlurper 来解析 HTML 页面，然后尝试使用 StreamingMarkupBuilder 获取所需的部分：

import groovy.xml.StreamingMarkupBuilder
def html = "<html><body>a <b>test</b></body></html>"
def dom = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText(html)
println    new StreamingMarkupBuilder().bindNode(dom.body)

但是，我得到的结果

<html:body xmlns:html='http://www.w3.org/1999/xhtml'>a <html:b>test</html:b></html:body>

看起来很棒，但我想在没有 html-namespace 的情况下获取它。

如何避免命名空间？

原文

I need to extract a part of HTML from a given HTML page. So far, I use the XmlSlurper with tagsoup to parse the HTML page and then try to get the needed part by using the StreamingMarkupBuilder:

import groovy.xml.StreamingMarkupBuilder
def html = "<html><body>a <b>test</b></body></html>"
def dom = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText(html)
println    new StreamingMarkupBuilder().bindNode(dom.body)

However, the result I get is

<html:body xmlns:html='http://www.w3.org/1999/xhtml'>a <html:b>test</html:b></html:body>

which looks great, but I would like to get it without the html-namespace.

How do I avoid the namespace?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

唠甜嗑 2024-11-10 12:37:20

关闭 TagSoup 解析器上的命名空间功能。例子：

import groovy.xml.StreamingMarkupBuilder
def html = "<html><body>a <b>test</b></body></html>"
def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature(parser.namespacesFeature, false)
def dom = new XmlSlurper(parser).parseText(html)
println new StreamingMarkupBuilder().bindNode(dom.body)

Turn off the namespace feature on the TagSoup parser. Example:

import groovy.xml.StreamingMarkupBuilder
def html = "<html><body>a <b>test</b></body></html>"
def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature(parser.namespacesFeature, false)
def dom = new XmlSlurper(parser).parseText(html)
println new StreamingMarkupBuilder().bindNode(dom.body)

回复收藏 0 原文

~没有更多了~