Parsing HTML to get the whole paragraph, without interference from other tags

Posted on 2024-12-10 09:00:52

I asked a question about how to use lxml to parse a URL and get the <p> elements back, and it has been resolved. However, to fully achieve my goal, I need to account for the effect of other tags inside a <p>.

The accepted answer provided by Acorn to parse a URL and get the <p> elements back is:

import lxml.html

htmltree = lxml.html.parse('http://www.google.com/intl/en/about/corporate/index.html')

print(htmltree.xpath('//p/text()'))

However, if there are other tags inside the <p> paragraph, htmltree.xpath('//p/text()') returns the text in pieces, and the text inside those other tags is ignored.

E.g. for <p>Text1... <a href="/link.../">hyperlinked text..</a> Text2....</p>

Currently, by using htmltree.xpath('//p/text()'), it is parsed into ['Text1...','Text2...'].
More intuitively, the expected result should be ['Text1... hyperlinked text.. Text2...'].

Hence I would like to know what other method I should use to parse the paragraph as a whole and avoid the interruptions caused by other tags, e.g. <a>.

I have further looked into the lxml xpath documentation, and I suspect it is because of the /text() in //p/text(). But I am stuck here and have no clue what to change.

十秒萌定你 2024-12-17 09:00:52

Yes, /text() selects only the text nodes that are direct children of that tag. Instead, get all the p tags and call .text_content() on each one to get all the text inside it. From the lxml.html doc:

.text_content():

Returns the text content of the element, including
the text content of its children, with no markup.

So you will have something like this:

import lxml.html

htmltree = lxml.html.parse('http://www.google.com/intl/en/about/corporate/index.html')

p_tags = htmltree.xpath('//p')
p_content = [p.text_content() for p in p_tags]

print(p_content)
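
To make the difference concrete without fetching a URL, here is a minimal sketch that runs both approaches on the paragraph from the question. The surrounding <html><body> wrapper and the variable names are my own assumptions for illustration; only xpath() and text_content() come from the answer above.

```python
import lxml.html

# Hypothetical document built around the example paragraph from the question.
doc = lxml.html.fromstring(
    '<html><body>'
    '<p>Text1... <a href="/link.../">hyperlinked text..</a> Text2....</p>'
    '</body></html>'
)

# /text() selects only the text nodes that are direct children of <p>,
# so the text inside <a> is skipped and the paragraph comes back in pieces.
pieces = doc.xpath('//p/text()')

# text_content() concatenates the text of <p> and of all its descendants.
whole = [p.text_content() for p in doc.xpath('//p')]

print(pieces)
print(whole)
```

Note that whitespace is kept exactly as it appears in the markup, so you may still want to call .strip() on each result.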

野侃 2024-12-17 09:00:52

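from io import BytesIO
from xml.etree import ElementTree

# Parse incrementally; by default iterparse yields an 'end' event per element.
source = BytesIO(b'<html><p>hello <a href="">world</a></p>...</html>')
for event, elem in ElementTree.iterparse(source):
    print('------------- DUMPING --------------')
    ElementTree.dump(elem)
    print('text: ', elem.text)  # text before the element's first child
    print('tail: ', elem.tail)  # text after the element's closing tag
    print('tag: ', elem.tag)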

If your XML is not well-formed, try installing lxml and using lxml.etree in place of xml.etree.ElementTree (for example, 'from lxml import etree as ElementTree').

Hope this helps.
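
The text and tail values printed by the loop above are the pieces a paragraph is split into. As a rough sketch of how they fit back together, the standard library's itertext() walks them in document order. The sample string here is the one from this answer with a trailing ' again' added by me so that the tail is not empty; the variable names are hypothetical too.

```python
from xml.etree import ElementTree

# Sample paragraph; the trailing ' again' is an added assumption so that
# the <a> element has a non-empty tail.
p = ElementTree.fromstring('<p>hello <a href="">world</a> again</p>')
link = p[0]

# p.text is the text before the first child, link.text is the text inside
# <a>, and link.tail is the text that follows </a> within <p>.
parts = [p.text, link.text, link.tail]

# itertext() yields text and tail strings in document order, so joining
# them rebuilds the paragraph without markup.
joined = ''.join(p.itertext())

print(parts)
print(joined)
```

This needs well-formed XML; for real-world HTML the lxml.html approach from the other answer is probably the more forgiving choice.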
