Correct XPath syntax with Python's lxml library for parsing all text inside arbitrarily nested HTML tags

Posted 2024-11-10 10:51:30

Using lxml in Python, I wrote this XPath expression:

htmlPage.xpath("/html/body//a/text()")

It returns the text of all <a> tags within the part of the HTML I am interested in. I have now found that some <a> tags look like this:

<a>This is a sentence with some <italic>italic text</italic>-formatting I want to parse.</a>

The XPath query returns a list with one more element than I expect. I checked and found that it splits the <a> tag above into two list elements instead of one. Instead of the string

"This is a sentence with some italic text-formatting I want to parse."

I get the two strings

"This is a sentence with some" # and
"-formatting I want to parse."

Is there a way to correct that?

好多鱼好多余 2024-11-17 10:51:30

I solved my problem by first getting all <a>-tags

results = htmlPage.xpath("/html/body//a")

and then iterating over the returned list and calling text_content() on each element

for a_tag in results:
    print(a_tag.text_content())  # prints the whole string: "This is a sentence with some italic text-formatting I want to parse."
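
For anyone who wants to reproduce this, here is a minimal sketch comparing the two approaches. It assumes the page was parsed with lxml.html (text_content() is a method of lxml.html elements, not of plain lxml.etree elements). The SAMPLE document and the variable names are made up for illustration and are not from the original question.

```python
from lxml import html

# Hypothetical sample document built around the <a> tag from the question.
SAMPLE = (
    "<html><body><div>"
    "<a>This is a sentence with some <italic>italic text</italic>"
    "-formatting I want to parse.</a>"
    "</div></body></html>"
)

html_page = html.fromstring(SAMPLE)

# text() selects only the text nodes that are direct children of <a>,
# so the text inside <italic> is skipped and the sentence comes back in pieces.
direct_text = html_page.xpath("/html/body//a/text()")

# text_content() joins the text of the element and all of its descendants.
full_text = [a.text_content() for a in html_page.xpath("/html/body//a")]

# The XPath string() function, evaluated per element, should give the same text.
same_via_xpath = [a.xpath("string()") for a in html_page.xpath("/html/body//a")]

print(direct_text)
print(full_text)
```

The split happens because /text() only matches text nodes directly under <a>; the nested element's text belongs to <italic>, not to <a>.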
