lxml.etree: element.text doesn't return the entire text of an element

Posted on 2024-10-13 09:38:15

I scraped some HTML via XPath and then converted it into an etree. Something similar to this:

<td> text1 <a> link </a> text2 </td>

But when I call element.text, I only get text1. The rest must be there: when I check my query in FireBug, the whole text of the element is highlighted, both the text before and the text after the embedded anchor element.

Comments (8)

情绪少女 2024-10-20 09:38:15

Use element.xpath("string()") or lxml.etree.tostring(element, method="text") - see the documentation.
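
The two calls are not quite interchangeable once the element has trailing text of its own. A minimal sketch, assuming a made-up <tr> wrapper and " after " text so that the <td> has a tail; with_tail and encoding are keyword arguments of lxml.etree.tostring:

```python
from lxml import etree

# Hypothetical markup: the <td> from the question, wrapped in a <tr>
# so that the <td> itself carries a tail (" after ").
row = etree.fromstring("<tr><td> text1 <a> link </a> text2 </td> after </tr>")
td = row[0]

# XPath string() returns the element's string value:
# all descendant text, without the element's own tail.
via_xpath = td.xpath("string()")

# tostring() appends the element's tail unless with_tail=False is passed;
# encoding="unicode" makes it return str instead of bytes.
with_tail = etree.tostring(td, method="text", encoding="unicode")
without_tail = etree.tostring(td, method="text", encoding="unicode",
                              with_tail=False)

print(repr(via_xpath))
print(repr(with_tail))
print(repr(without_tail))
```

So if the element is not the root, passing with_tail=False is probably what you want when going through tostring().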

失去的东西太少 2024-10-20 09:38:15

As a public service to people out there who may be as lazy as I am, here's some of the code from the other answers in a form you can run.

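from lxml import etree

def get_text1(node):
    # Text directly inside node: its own text plus the tails of its children.
    result = node.text or ""
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result

def get_text2(node):
    # All text under node, recursively (this also includes node's own tail).
    return ((node.text or '') +
            ''.join(map(get_text2, node)) +
            (node.tail or ''))

def get_text3(node):
    # Inner markup: node's text plus each child serialized together with its tail.
    # encoding="unicode" makes tostring() return str rather than bytes.
    return (node.text or "") + "".join(
        [etree.tostring(child, encoding="unicode") for child in node.iterchildren()])


root = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

print(root.xpath("text()"))
print(get_text1(root))
print(get_text2(root))
print(root.xpath("string()"))
print(etree.tostring(root, method="text", encoding="unicode"))
print(etree.tostring(root, method="xml", encoding="unicode"))
print(get_text3(root))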

Output is:

snowy:rpg$ python test.py 
[' text1 ', ' text2 ']
 text1  text2 
 text1  link  text2 
 text1  link  text2 
 text1  link  text2 
<td> text1 <a> link </a> text2 </td>
 text1 <a> link </a> text2 
梦里°也失望 2024-10-20 09:38:15

This looks like an lxml bug to me, but according to the documentation it is by design. I've worked around it like this:

def node_text(node):
    if node.text:
        result = node.text
    else:
        result = ''
    for child in node:
        if child.tail is not None:
            result += child.tail
    return result
丿*梦醉红颜 2024-10-20 09:38:15

Another thing that seems to be working well to get the text out of an element is "".join(element.itertext())
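
A small sketch of the itertext() approach on the markup from the question. The cleanup step at the end is an optional extra, not something the thread asks for:

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# itertext() yields every text and tail string inside the element,
# in document order.
pieces = list(td.itertext())
full_text = "".join(pieces)

# Optional cleanup: strip each piece and drop whitespace-only ones.
words = [piece.strip() for piece in pieces if piece.strip()]

print(pieces)
print(repr(full_text))
print(words)
```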

你没皮卡萌 2024-10-20 09:38:15
<td> text1 <a> link </a> text2 </td>

Here's how it is (ignoring whitespace):

td.text == 'text1'
a.text == 'link'
a.tail == 'text2'

If you don't want the text that is inside child elements, you can collect only their tails (text and tail are None when empty, hence the "or ''"):

text = (td.text or '') + ''.join(el.tail or '' for el in td)
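
The text/tail split described above can be checked directly. A minimal sketch on the markup from the question, this time keeping the whitespace; own_text is just an illustrative variable name:

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")
a = td[0]

print(repr(td.text))  # text before the first child
print(repr(a.text))   # text inside <a>
print(repr(a.tail))   # text after </a>, stored on <a>, not on <td>
print(repr(td.tail))  # nothing follows </td> here, so this is None

# Text that belongs to <td> itself: its own text plus its children's tails.
own_text = (td.text or "") + "".join(child.tail or "" for child in td)
print(repr(own_text))
```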
鲜血染红嫁衣 2024-10-20 09:38:15
def get_text_recursive(node):
    return (node.text or '') + ''.join(map(get_text_recursive, node)) + (node.tail or '')
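
One thing to watch with this recursion: it also appends the tail of the node you start from, which is text located after that element's closing tag. A hedged sketch of the difference, where the <tr> wrapper, the " trailing " text and the inner_text helper are all made up for illustration:

```python
from lxml import etree

def get_text_recursive(node):
    return (node.text or '') + ''.join(map(get_text_recursive, node)) + (node.tail or '')

def inner_text(node):
    # Hypothetical helper: same recursion for the children,
    # but the starting node's own tail is left out.
    return (node.text or '') + ''.join(map(get_text_recursive, node))

row = etree.fromstring("<tr><td> text1 <a> link </a> text2 </td> trailing </tr>")
td = row[0]

with_own_tail = get_text_recursive(td)
without_own_tail = inner_text(td)

print(repr(with_own_tail))
print(repr(without_own_tail))
```

On a root element the two give the same result, since the root has no tail.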
囚你心 2024-10-20 09:38:15

If element is the <td>, you can do the following:

element.xpath('.//text()')

It gives you a list of all the text nodes under the element itself (that is what the dot means). The // selects descendants at any depth, and text() is the node test that matches text nodes.
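
A short sketch of what that list looks like for the markup from the question. In lxml the strings returned by text() are "smart strings": they provide getparent() and the is_text / is_tail flags, so you can tell where each piece came from. The origins variable is just for illustration:

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# One string per text node, in document order.
pieces = td.xpath(".//text()")
joined = "".join(pieces)

# For each piece, record the element that holds it and whether it is
# that element's .text or its .tail.
origins = [(piece.getparent().tag, "tail" if piece.is_tail else "text")
           for piece in pieces]

print(pieces)
print(repr(joined))
print(origins)
```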

韵柒 2024-10-20 09:38:15
element.xpath('normalize-space()') also works. Note that it strips leading and trailing whitespace and collapses each internal run of whitespace into a single space.
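
A minimal sketch comparing normalize-space() with string() on the markup from the question. The manual variable is only there to show a rough Python equivalent; the two agree for ordinary spaces, tabs and newlines:

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# string() keeps the whitespace exactly as it appears in the document.
raw = td.xpath("string()")

# normalize-space() trims both ends and collapses internal whitespace runs.
normalized = td.xpath("normalize-space()")

# Rough pure-Python equivalent for ordinary spaces, tabs and newlines.
manual = " ".join(raw.split())

print(repr(raw))
print(repr(normalized))
print(repr(manual))
```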