How do I extract links from a web page using lxml, XPath, and Python?

Posted 2024-08-17 20:03:18

I've got this XPath query:

/html/body//tbody/tr[*]/td[*]/a[@title]/@href

It extracts every link that has a title attribute, and it returns the href values when I run it in Firefox's XPath Checker add-on.

However, I cannot seem to use it with lxml.

from lxml import etree
parsedPage = etree.HTML(page)  # Create parse tree from valid page.

# XPath query
hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href")
for x in hyperlinks:
    print(x)  # Print the href of each <a> tag that has a title attribute

With lxml, this returns an empty list.

How can I get the href value (the link) of every hyperlink that has a title attribute, using lxml in Python?

注定孤独终老 2024-08-24 20:03:18

I was able to make it work with the following code:

from io import StringIO

from lxml import etree

html_string = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">

<html lang="en">
<head/>
<body>
    <table border="1">
      <tbody>
        <tr>
          <td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td>
        </tr>
        <tr>
          <td><a href="http://stackoverflow.com/baz" title="Baz">Another link</a></td>
        </tr>
      </tbody>
    </table>
</body>
</html>'''

# The sample is well-formed XML, so the default XML parser accepts it.
tree = etree.parse(StringIO(html_string))
print(tree.xpath('/html/body//tbody/tr/td/a[@title]/@href'))

# Output: ['http://stackoverflow.com/foobar', 'http://stackoverflow.com/baz']
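If the query still comes back empty on a real page, one likely cause is that the page source has no literal <tbody> tags. Browsers such as Firefox insert them when building the DOM, so an XPath tested in a browser add-on can fail to match what lxml parses from the raw source. The sketch below assumes that situation. The `page` string is made-up sample markup, and `extract_titled_links` is a hypothetical helper name, not part of lxml.

```python
from lxml import etree

# Made-up sample markup with no <tbody>, which is how many pages are written.
page = """
<html>
  <body>
    <table>
      <tr><td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td></tr>
      <tr><td><a href="http://stackoverflow.com/baz">No title here</a></td></tr>
    </table>
  </body>
</html>
"""


def extract_titled_links(markup):
    """Return the href of every <a> inside a table cell that has a title attribute."""
    root = etree.HTML(markup)  # lxml's lenient HTML parser
    # No tbody step: //td//a matches whether or not <tbody> is in the source.
    return root.xpath("//td//a[@title]/@href")


# The browser-style query requires a tbody element, which this source lacks.
with_tbody_query = etree.HTML(page).xpath("/html/body//tbody/tr/td/a[@title]/@href")
links = extract_titled_links(page)

print(with_tbody_query)
print(links)
```

Dropping the tbody step keeps the query working for both kinds of source. If the exact table structure matters, checking the raw HTML (rather than the browser's DOM view) before writing the XPath is probably the safer route.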
