How do I use lxml, XPath, and Python to extract links from a web page?
I've got this XPath query:
    /html/body//tbody/tr[*]/td[*]/a[@title]/@href
It selects every link that has a title attribute and returns its href in Firefox's XPath Checker add-on.
However, I cannot seem to use it with lxml.
    from lxml import etree
    parsedPage = etree.HTML(page)  # Build a parse tree from the page source.
    # XPath query
    hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href")
    for x in hyperlinks:
        print(x)  # Print the href of each <a> tag that has a title attribute
This produces no result from lxml (an empty list).
How would one grab the href text (the link) of a hyperlink that has a title attribute, using lxml under Python?
Comments (2)
I was able to make it work with the following code:
Firefox adds extra HTML tags (such as <tbody>) to the page when it renders it, so the XPath returned by the Firebug tool is inconsistent with the actual HTML returned by the server (and with what urllib/urllib2 will return).
Removing the <tbody> step from the XPath expression generally does the trick.
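
The effect described above can be reproduced with a small, self-contained sketch. The `page` string here is a hypothetical stand-in for the server's raw HTML (a table written without `<tbody>`), not the asker's actual page; the variable names are illustrative only. It assumes lxml's HTML parser leaves the table as written rather than inserting `<tbody>` the way a browser does.

```python
from lxml import etree

# Hypothetical markup as a server might send it: the table has no <tbody>.
page = """
<html><body>
  <table>
    <tr>
      <td><a href="/first" title="First">First</a></td>
      <td><a href="/untitled">No title</a></td>
    </tr>
    <tr>
      <td><a href="/second" title="Second">Second</a></td>
    </tr>
  </table>
</body></html>
"""

parsed_page = etree.HTML(page)

# Browser-derived query: requires a <tbody> that is absent from the raw markup.
with_tbody = parsed_page.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href")

# Same query with the tbody step removed.
without_tbody = parsed_page.xpath("/html/body//tr[*]/td[*]/a[@title]/@href")

print(with_tbody)     # []
print(without_tbody)  # ['/first', '/second']
```

If the real page does contain an explicit `<tbody>` in some tables and not in others, a looser query such as `//table//tr/td/a[@title]/@href` should match both cases.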