lxml.etree: element.text doesn't return the entire text of an element

Posted on 2024-10-13 09:38:15 · 233 characters · 11 views · 0 comments

I scraped some HTML via XPath and then converted it into an etree. Something similar to this:

<td> text1 <a> link </a> text2 </td>

But when I call element.text, I only get text1. The rest must be there: when I check my query in FireBug, the text of the element is highlighted, both the text before and after the embedded anchor element...

情绪少女 2024-10-20 09:38:15

Use element.xpath("string()") or lxml.etree.tostring(element, method="text"); see the documentation.
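A minimal sketch of both calls, run on the markup from the question. The encoding="unicode" and with_tail=False arguments are additions not mentioned in the answer: the first makes tostring() return str instead of bytes on Python 3, the second leaves out any text that follows the element's own closing tag.

```python
from lxml import etree

root = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# XPath string(): the string value of the element, i.e. all descendant text.
via_xpath = root.xpath("string()")

# tostring(method="text") serializes only the text content.
# encoding="unicode" returns str instead of bytes;
# with_tail=False skips text that follows the element itself.
via_tostring = etree.tostring(
    root, method="text", encoding="unicode", with_tail=False)

print(repr(via_xpath))
print(repr(via_tostring))
```

Both calls should give the same string here, with the link text included and the original whitespace preserved.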

失去的东西太少 2024-10-20 09:38:15

As a public service to people out there who may be as lazy as I am, here is some of the code from the answers above in a form that you can run.

from lxml import etree

def get_text1(node):
    # Own text only: node.text plus the tail of each child.
    result = node.text or ""
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result

def get_text2(node):
    # All text, recursively (also includes node.tail, if there is one).
    return ((node.text or '') +
            ''.join(map(get_text2, node)) +
            (node.tail or ''))

def get_text3(node):
    # Own text plus the serialized markup of each child, including its tail.
    # encoding="unicode" makes tostring() return str rather than bytes.
    return (node.text or "") + "".join(
        [etree.tostring(child, encoding="unicode") for child in node.iterchildren()])


root = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

print(root.xpath("text()"))
print(get_text1(root))
print(get_text2(root))
print(root.xpath("string()"))
print(etree.tostring(root, method="text", encoding="unicode"))
print(etree.tostring(root, method="xml", encoding="unicode"))
print(get_text3(root))

Output is:

snowy:rpg$ python test.py 
[' text1 ', ' text2 ']
 text1  text2 
 text1  link  text2 
 text1  link  text2 
 text1  link  text2 
<td> text1 <a> link </a> text2 </td>
 text1 <a> link </a> text2

梦里°也失望 2024-10-20 09:38:15

This looks like an lxml bug to me, but according to the documentation it is by design. I solved it like this:

def node_text(node):
    if node.text:
        result = node.text
    else:
        result = ''
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result

丿*梦醉红颜 2024-10-20 09:38:15

Another thing that seems to work well for getting the text out of an element is "".join(element.itertext())
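A short sketch of what itertext() yields on the markup from the question. The whitespace cleanup at the end is an optional extra, not part of the answer.

```python
from lxml import etree

root = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# itertext() yields each text chunk in document order:
# td.text, then a.text, then a.tail.
chunks = list(root.itertext())
full_text = "".join(chunks)

# Optional: collapse runs of whitespace and trim the ends.
tidy = " ".join(full_text.split())

print(chunks)
print(repr(full_text))
print(repr(tidy))
```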

你没皮卡萌 2024-10-20 09:38:15

<td> text1 <a> link </a> text2 </td>

Here's how it is stored (ignoring whitespace):

td.text == 'text1'
a.text == 'link'
a.tail == 'text2'

If you don't want the text that is inside child elements, you can collect only their tails. Both text and tail are None when empty, hence the "or ''" guards:

text = (td.text or '') + ''.join(el.tail or '' for el in td)
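The text/tail layout described above can be checked directly. The sketch below also wraps the tail-collecting idea in a helper; own_text is a hypothetical name chosen for illustration, and the "or" guards are there because lxml stores missing text as None.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")
a = td[0]

# Text before the first child belongs to the parent's .text;
# text after a child's closing tag belongs to that child's .tail.
print(repr(td.text), repr(a.text), repr(a.tail))

def own_text(node):
    """Text directly inside node, skipping text inside child elements."""
    return (node.text or "") + "".join(child.tail or "" for child in node)

own = own_text(td)

# An element with no text and a child with no tail yields an empty string.
empty = own_text(etree.fromstring("<td><a/></td>"))

print(repr(own))
print(repr(empty))
```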

鲜血染红嫁衣 2024-10-20 09:38:15

def get_text_recursive(node):
    return (node.text or '') + ''.join(map(get_text_recursive, node)) + (node.tail or '')

囚你心 2024-10-20 09:38:15

If element is the <td>, you can do the following.

element.xpath('.//text()')

It gives you a list of all text nodes under the element. The dot means the element itself, // means that descendants at any depth are searched, and text() selects the text nodes.
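To make the difference between text() and .//text() concrete, here is a small sketch on the markup from the question. Joining the list at the end is an extra step for getting a single string.

```python
from lxml import etree

element = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

own_nodes = element.xpath("text()")     # direct text children only
all_nodes = element.xpath(".//text()")  # text nodes at any depth

# The result is a list of strings; join it to get one string.
joined = "".join(all_nodes)

print(own_nodes)
print(all_nodes)
print(repr(joined))
```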

韵柒 2024-10-20 09:38:15

element.xpath('normalize-space()') also works.
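A brief comparison with string(), as a sketch on the markup from the question. Note that normalize-space() changes the whitespace: it trims both ends and collapses internal runs to single spaces, so it is not a drop-in replacement when the exact spacing matters.

```python
from lxml import etree

element = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

raw = element.xpath("string()")                  # whitespace preserved
normalized = element.xpath("normalize-space()")  # trimmed and collapsed

print(repr(raw))
print(repr(normalized))
```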

About the author

零時差
