lxml.etree, element.text doesn't return the entire text of an element
I scraped some HTML via XPath, which I then converted into an etree. Something similar to this:
<td> text1 <a> link </a> text2 </td>
But when I call element.text, I only get text1. (The rest must be there: when I check my query in FireBug, the text of the elements is highlighted, both the text before and after the embedded anchor elements...)
Use
element.xpath("string()")
or
lxml.etree.tostring(element, method="text")
- see the documentation.
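A minimal sketch of both calls, assuming lxml is installed and using the markup from the question as sample input. The variable names are only for illustration. Note that in Python 3 tostring() returns bytes unless encoding="unicode" is passed.

```python
from lxml import etree

# Sample markup taken from the question.
td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# XPath string() returns the concatenated text of the element and all its descendants.
full_text = td.xpath("string()")

# tostring(method="text") serializes only the text content.
as_bytes = etree.tostring(td, method="text")                     # bytes by default
as_str = etree.tostring(td, method="text", encoding="unicode")   # str

print(repr(full_text))
print(repr(as_str))
```

Both approaches keep the original whitespace, so a call such as " ".join(full_text.split()) may be useful afterwards to normalize it.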
As a public service to people out there who may be as lazy as I am, here's some code from above that you can run.
Output is:
This looks like an lxml bug to me, but it is by design if you read the documentation. I've solved it like this:
Another thing that seems to be working well to get the text out of an element is
"".join(element.itertext())
Here's how it is (ignoring whitespace):
If you don't want the text that is inside child elements, then you could collect only their tails:
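A short sketch of how lxml splits the text of the sample element across .text and .tail, and of collecting only the element's own text. It assumes lxml is installed; names such as own_text are chosen here for illustration.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")
a = td[0]  # the only child element, <a>

# .text holds the text before the first child; .tail holds the text
# that follows an element's closing tag.
parts = {"td.text": td.text, "a.text": a.text, "a.tail": a.tail}
print(parts)

# itertext() walks all of these pieces in document order.
all_text = "".join(td.itertext())

# Only the <td>'s own text: its .text plus the tails of its children.
# .text and .tail may be None, hence the `or ""` guards.
own_text = (td.text or "") + "".join(child.tail or "" for child in td)

print(" ".join(all_text.split()))
print(" ".join(own_text.split()))
```

This is why element.text alone returns only text1: text2 is stored as the tail of the <a> element rather than on the <td> itself.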
If element is equal to <td>, you can do the following.
It will give you a list of all text nodes from self (the meaning of the dot). // means that it will take all descendant elements, and finally text() is the function to extract text.
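The pieces described above (the dot, // and text()) combine into the XPath expression .//text(). A minimal sketch, assuming lxml is installed and reusing the markup from the question; the comparison with ./text() is added here only to show the contrast.

```python
from lxml import etree

td = etree.fromstring("<td> text1 <a> link </a> text2 </td>")

# .//text() selects every text node under the current element, at any depth.
all_nodes = td.xpath(".//text()")

# ./text() selects only the text nodes that are direct children of <td>,
# which skips the text inside <a>.
direct_nodes = td.xpath("./text()")

print([t.strip() for t in all_nodes])
print([t.strip() for t in direct_nodes])
```

The result is a list of strings, so "".join(all_nodes) gives the full text if a single string is wanted.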