How to write a Python script that searches a website's HTML for matching links

Posted 2024-08-23 22:38:00

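I'm not very familiar with Python and have to write a script that performs a number of functions. The piece I still need is a way to check a website's HTML for links that match ones supplied in advance.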

恰似旧人归 2024-08-30 22:38:00

Matching links on what? Their href attribute? The link's display text?
Perhaps something like this:

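# Python 3, with the beautifulsoup4 (bs4) package installed
from bs4 import BeautifulSoup, SoupStrainer
import re
from urllib.request import urlopen

# Download the page's HTML
doc = urlopen("http://somesite.com").read()
# Parse only <a> tags whose href starts with "test"
links = SoupStrainer("a", href=re.compile(r"^test"))
soup = BeautifulSoup(doc, "html.parser", parse_only=links)
for elm in soup.find_all("a"):
    print(elm)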

That grabs the HTML content of somesite.com and parses it with BeautifulSoup, looking only for links whose href attribute starts with "test". It then collects those links and prints each one.

You can adapt this to whatever you need; the BeautifulSoup documentation covers the other ways to search and filter.
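
If the links are supplied in advance as full URLs rather than as a pattern, a check along the following lines might be closer to what the question asks. This is only a sketch using the standard library's html.parser instead of BeautifulSoup. The names LinkCollector and find_expected_links, the sample HTML, and the link list are all made up for illustration, and it assumes "matching" means an exact match on the resolved href.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative hrefs against the page URL
                self.links.add(urljoin(self.base_url, value))


def find_expected_links(html, base_url, expected):
    """Return the subset of `expected` URLs that the page links to."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links & set(expected)


# Hypothetical sample page and link list, for illustration only
sample_html = """
<html><body>
  <a href="/docs/intro.html">Intro</a>
  <a href="http://other.example/page">Elsewhere</a>
  <a name="anchor-without-href">No link</a>
</body></html>
"""
expected_links = [
    "http://somesite.com/docs/intro.html",
    "http://somesite.com/missing.html",
]
found = find_expected_links(sample_html, "http://somesite.com/", expected_links)
print(sorted(found))  # ['http://somesite.com/docs/intro.html']
```

For a live site, the html argument would be the downloaded page decoded to text, and base_url the address it was fetched from. Matching on link text instead of href would need a different handler.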
