I asked a question about how to use lxml to parse a URL and get the <p> elements back, and it has been resolved. However, to fully achieve my goal, I need to account for other tags inside a <p>.
The accepted answer, provided by Acorn, for parsing a URL and getting the <p> elements back is:
import lxml.html
htmltree = lxml.html.parse('http://www.google.com/intl/en/about/corporate/index.html')
print(htmltree.xpath('//p/text()'))
However, if there are other tags inside the <p> paragraph, htmltree.xpath('//p/text()') returns the text in pieces, and the text inside those nested tags is dropped.
E.g. for <p>Text1... <a href="/link.../">hyperlinked text..</a> Text2....</p>
htmltree.xpath('//p/text()') currently parses it into ['Text1...','Text2...'].
More intuitively, the expected result should be ['Text1... hyperlinked text.. Text2...'].
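The behaviour described above can be reproduced without fetching a URL. The following is a minimal sketch, assuming the example fragment is parsed from a string (the variable names are made up for illustration):

```python
import lxml.html

# Hypothetical fragment modelled on the example in the question.
snippet = '<p>Text1... <a href="/link.../">hyperlinked text..</a> Text2....</p>'
htmltree = lxml.html.document_fromstring(snippet)

# The text() step selects only text nodes that are direct children of <p>.
# The link text is a child of <a>, not of <p>, so it is not selected.
direct_text = htmltree.xpath('//p/text()')
print(direct_text)

# The link text is still in the tree, one level further down.
link_text = htmltree.xpath('//p/a/text()')
print(link_text)
```

So the text is not lost by the parser; the XPath expression simply does not descend into the nested tag.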
Hence I would like to know what other method I should use to parse the paragraph as a whole, so that other tags such as <a> no longer interrupt the text.
I have looked further into the lxml XPath documentation, and I suspect the cause is the /text() in //p/text(). But I am stuck here and have no clue what to change.
Comments (2)
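Yes, /text() gets only the immediate text nodes of that tag. Instead, get all the p tags and use .text_content() to get all the text in them. The lxml.html documentation describes .text_content() as returning the text content of the element, including the text content of its children, with no markup. So you would select the p elements with //p and call .text_content() on each one.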
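A minimal sketch of that approach, assuming the example fragment from the question is parsed from a string rather than fetched from the URL (the variable names are illustrative):

```python
import lxml.html

# Hypothetical fragment modelled on the example in the question.
snippet = '<p>Text1... <a href="/link.../">hyperlinked text..</a> Text2....</p>'
htmltree = lxml.html.document_fromstring(snippet)

# Select the <p> elements themselves rather than their text nodes,
# then ask each element for all of its text, nested tags included.
paragraphs = [p.text_content() for p in htmltree.xpath('//p')]
print(paragraphs)  # ['Text1... hyperlinked text.. Text2....']
```

The same list comprehension should work on the tree returned by lxml.html.parse(...) in the question, since only the XPath expression and the per-element call change.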
If your XML is not valid, try installing lxml and changing 'xml.etree' to 'lxml.etree'.
Hope this helps.
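The code this comment refers to is not preserved here. As a rough sketch only, and assuming it used the standard library's xml.etree.ElementTree with itertext() (an assumption, not something the comment confirms), it might have looked like this:

```python
# Assumption: standard-library parser; the comment suggests switching this
# import to lxml.etree if the input is not well-formed.
import xml.etree.ElementTree as ET

# Hypothetical well-formed fragment modelled on the example in the question.
snippet = '<p>Text1... <a href="/link.../">hyperlinked text..</a> Text2....</p>'
root = ET.fromstring(snippet)

# itertext() walks the element and its descendants in document order,
# yielding every piece of text, so joining the pieces gives the whole paragraph.
full_text = ''.join(root.itertext())
print(full_text)  # Text1... hyperlinked text.. Text2....
```

Note that xml.etree expects well-formed XML, which real-world HTML pages often are not.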