How do I get the full text of an Element in xml.minidom?

Posted on 2024-07-14 22:54:12

I want to get the whole text of an Element, markup included, so that I can parse some XHTML:

<div id='asd'>
  <pre>skdsk</pre>
</div>

With E being the div element in the example above, I want to get

<pre>skdsk</pre>

How?


笑看君怀她人 2024-07-21 22:54:12

Strictly speaking:

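from xml.dom.minidom import parseString

# The markup is on a single line, so no whitespace text nodes are created.
tree = parseString("<div id='asd'><pre>skdsk</pre></div>")
root = tree.firstChild      # the <div> element
node = root.childNodes[0]   # the <pre> element
print(node.toxml())         # prints <pre>skdsk</pre>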

In practice, though, I'd recommend looking at the BeautifulSoup library (http://www.crummy.com/software/BeautifulSoup/). Finding the right child node in an XHTML document while skipping "whitespace nodes" is a pain. BeautifulSoup is a robust HTML/XHTML parser with fantastic tree-search capabilities.

Edit: The example above puts the HTML on a single line. If you use the HTML as formatted in the question, the line breaks and indentation produce whitespace-only text nodes, so the node you want won't be at childNodes[0].
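
If you would rather stay with minidom, one possible workaround for the whitespace nodes is sketched below. It is only a sketch under a few assumptions: Python 3, the markup exactly as written in the question, and two helper names (inner_xml and element_children) that are made up for this illustration and are not part of minidom.

```python
from xml.dom.minidom import parseString

# The markup as formatted in the question, including line breaks and indentation.
XHTML = """<div id='asd'>
  <pre>skdsk</pre>
</div>"""


def inner_xml(element):
    """Serialize every child node of `element`, i.e. its content without the outer tag.

    Hypothetical helper for this example; whitespace text nodes are included.
    """
    return "".join(child.toxml() for child in element.childNodes)


def element_children(element):
    """Return only the child nodes that are elements, skipping text nodes.

    Hypothetical helper for this example.
    """
    return [c for c in element.childNodes if c.nodeType == c.ELEMENT_NODE]


div = parseString(XHTML).documentElement

# All content of the div; strip() drops the surrounding whitespace text.
print(inner_xml(div).strip())

# Only the first element child, regardless of where the whitespace nodes sit.
print(element_children(div)[0].toxml())
```

Filtering on nodeType avoids hard-coding an index such as childNodes[1], which would break as soon as the formatting of the markup changes.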
