A web crawler in Python?
I'd like to write a crawler using Python. This means: I've got the URLs of a few websites' home pages, and I'd like my program to crawl through each whole site, following only the links that stay within that site. How can I do this easily and FAST? I already tried BeautifulSoup, but it is really CPU-intensive and quite slow on my machine.
4 Answers
I'd recommend using mechanize in combination with lxml.html. As robert king suggested, mechanize is probably best for navigating through the site. For extracting elements I'd use lxml: it is much faster than BeautifulSoup and probably the fastest parser available for Python, and performance comparisons of the different HTML parsers for Python bear that out. Personally, I'd refrain from using the scrapy wrapper.
I haven't tested it, but the sketch below is probably what you're looking for; the basic usage follows the mechanize documentation, and the lxml documentation is also quite helpful.
You can also get elements via root.xpath(). A simple wget might even be the easiest solution.
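A minimal, untested sketch (example.com stands in for your start URL):

```python
import mechanize
import lxml.html

# Navigate with mechanize
br = mechanize.Browser()
br.set_handle_robots(False)                 # skip robots.txt handling in this sketch
response = br.open("http://example.com/")   # stand-in start URL
html = response.read()

# Parse with lxml.html (much faster than BeautifulSoup)
root = lxml.html.fromstring(html)
root.make_links_absolute("http://example.com/")

# Pull out all link targets via XPath
for href in root.xpath("//a/@href"):
    print(href)
```

From there you would queue the hrefs that stay on the same host and repeat.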
Hope I could be helpful.
I like using mechanize. It's fairly simple: you download it and create a browser object. With this object you can open a URL, use "back" and "forward" as in a normal browser, and iterate through the forms on the page and fill them out if need be.
You can iterate through all the links on the page too. Each link object has the URL etc. that you could click on.
Here is an example:
Download all the links (related documents) on a webpage using Python
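A short sketch of that workflow (untested; example.com is just a placeholder URL):

```python
import mechanize

# Create a browser object and open a URL
br = mechanize.Browser()
br.set_handle_robots(False)        # skip robots.txt handling in this sketch
br.open("http://example.com/")     # placeholder URL

# Iterate through all the links on the page; each Link has .url, .text, etc.
for link in br.links():
    print(link.url)

# Iterate through the forms on the page (and fill them out if need be)
for form in br.forms():
    print(form.name)

# "Click" a link, then go back as in a normal browser
br.follow_link(nr=0)               # follow the first link on the page
br.back()
```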
Here's an example of a very fast (concurrent) recursive web scraper using eventlet. It only prints the URLs it finds, but you can modify it to do whatever you want. Perhaps you'd want to parse the HTML with lxml (fast), pyquery (slower but still fast) or BeautifulSoup (slow) to get the data you're after.
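A rough sketch of such a scraper (a Python 2-era version using eventlet's green urllib2, a crude regex for link extraction, and example.com as a placeholder start URL) might look like this:

```python
import re
from urlparse import urljoin

import eventlet
from eventlet.green import urllib2   # cooperative (non-blocking) urllib2

START = "http://example.com/"        # placeholder start URL
seen = set([START])
pool = eventlet.GreenPool(20)        # up to 20 concurrent fetches

def crawl(url):
    """Fetch a page, print its URL, and spawn fetches for unseen same-site links."""
    try:
        body = urllib2.urlopen(url, timeout=10).read()
    except Exception:
        return
    print(url)
    for href in re.findall(r'href="([^"]+)"', body):
        link = urljoin(url, href)
        if link.startswith(START) and link not in seen:
            seen.add(link)
            pool.spawn_n(crawl, link)

pool.spawn_n(crawl, START)
pool.waitall()                       # wait until no more pages are queued
```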
Have a look at scrapy (and the related questions). As for performance... it's very difficult to make any useful suggestions without seeing the code.
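For reference, a minimal spider in scrapy's current API (which has changed since this answer was written; example.com is a placeholder) could look like this:

```python
import scrapy

class SiteSpider(scrapy.Spider):
    """Crawl one site, following only links that stay on its domain."""
    name = "site"
    allowed_domains = ["example.com"]      # placeholder domain; off-site links are dropped
    start_urls = ["http://example.com/"]   # placeholder home page

    def parse(self, response):
        yield {"url": response.url}        # record (or scrape) the page here
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Run it with `scrapy runspider spider.py`.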