Correct XPath syntax with Python's lxml library for parsing all text inside arbitrarily nested HTML tags
Using lxml in Python, I created this XPath expression:
htmlPage.xpath("/html/body//a/text()")
It gets me all <a>-tags in the HTML scopes I want. Now I have found that the <a>-tags can look like this:
<a>This is a sentence with some <italic>italic text</italic>-formatting I want to parse.</a>
The XPath returns a list with one more element than I expect. I checked and found that it splits the <a>-tag mentioned above into two list elements instead of one. Instead of the string
"This is a sentence with some italic text-formatting I want to parse."
I get the two strings
"This is a sentence with some" # and
"-formatting I want to parse."
Is there a way to correct that?
Comments (1)
I solved my problem by first getting all <a>-tags, then iterating over the returned list and calling text_content() on each list element.
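For anyone landing here later, this is a minimal sketch of that approach, not necessarily the exact code the poster used. The SAMPLE string and the variable names are made up for illustration and are built around the <a> element from the question. It also shows why the original expression splits the text: text() selects only the text nodes that are direct children of <a>, so the content of the nested <italic> element is skipped.

```python
from lxml import html

# Hypothetical sample page built around the <a> element from the question.
SAMPLE = (
    "<html><body>"
    "<a>This is a sentence with some <italic>italic text</italic>"
    "-formatting I want to parse.</a>"
    "</body></html>"
)

html_page = html.fromstring(SAMPLE)

# Original expression: only the direct child text nodes of <a> are returned,
# so the text arrives in two pieces and the <italic> content is missing.
direct_text = html_page.xpath("/html/body//a/text()")

# Select the <a> elements themselves, then flatten each one to a single string.
anchors = html_page.xpath("/html/body//a")
full_text = [a.text_content() for a in anchors]

# Alternative: the XPath string() function returns the string value of an
# element, which includes the text of all its descendants.
full_text_xpath = [a.xpath("string()") for a in anchors]

print(direct_text)
print(full_text)
```

Note that text_content() is available on elements parsed with lxml.html; for trees parsed with lxml.etree, the string() variant or "".join(element.itertext()) should give the same result.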