Parsing HTML with lxml (h3 tag)
I'm trying to parse some HTML with lxml, and I'm having a problem with this small snippet.
HTML:
<div>
<p><span><a href="../url"></a></span></p>
<h3 class="header"><a href="../url">Other</a></h3>
<a href="../url">Other</a><br>
<a class="aaaaa" href="../url">Indice</a>
<p></p>
</div>
code:
import urllib
from lxml import etree
import StringIO
resultado=urllib.urlopen('trozo.html')
html = resultado.read()
parser= etree.HTMLParser()
tree=etree.parse(StringIO.StringIO(html),parser)
xpath='/div/h3'
html_filtrado=tree.xpath(xpath)
print html_filtrado
When I print the result I get [], but I expected a list containing this element:
<h3 class="header"><a href="../url">Other</a></h3>
If I had that list, I would call etree.tostring(html_filtrado) to see the markup above.
So how can I get this code?
<h3 class="header"><a href="../url">Other</a></h3>
Or only ../url, which is the part I actually want?
Thank you
Comments (2)
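The XPath query in your example is not quite right.
To get a list of all h3 tags within div tags, the query has to match a div anywhere in the parsed tree (for example with a leading //) rather than a div at the document root, which is what '/div/h3' asks for.
To get a list of the href attributes of a tags within h3 tags, select the attribute itself with @href at the end of the path instead of selecting the element.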
The point is that when etree.HTMLParser() receives HTML, it builds a full HTML DOM tree.
So if you call etree.tostring(tree), you get your fragment wrapped in html and body elements rather than the bare div you intended.
So the correct XPath would be '/html/body/div/h3'
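A small sketch of this behaviour, assuming Python 3 and a recent lxml. The shortened fragment and the variable names are illustrative and are not taken from the original post.

```python
from io import StringIO
from lxml import etree

# A shortened version of the fragment from the question (illustrative).
fragment = '<div><h3 class="header"><a href="../url">Other</a></h3></div>'

# Same parsing approach as in the question, with Python 3's io.StringIO.
tree = etree.parse(StringIO(fragment), etree.HTMLParser())

# Serializing the root shows the html/body wrapper added by the parser.
serialized = etree.tostring(tree.getroot(), encoding='unicode')

# The absolute path from the question finds nothing, because the root
# element is html, not div.
absolute_wrong = tree.xpath('/div/h3')

# Spelling out the wrapper elements makes the absolute path match.
absolute_right = tree.xpath('/html/body/div/h3')

print(serialized)
print(absolute_wrong)
print(absolute_right)
```

An absolute path like this depends on the exact wrapper the parser produces, so a relative query starting with // tends to be less brittle when the surrounding markup changes.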