A web crawler in Python?
I'd like to write a crawler using Python. This means: I've got the URLs of a few websites' home pages, and I'd like my program to crawl through each whole site, following only the links that stay within that site. How can I do this easily and FAST? I already tried BeautifulSoup, but it is really CPU-intensive and quite slow on my machine.
4 Answers
I'd recommend using mechanize in combination with lxml.html. As robert king suggested, mechanize is probably best for navigating through the site. For extracting elements I'd use lxml: it is much faster than BeautifulSoup and probably the fastest parser available for Python, and performance comparisons of the different HTML parsers for Python bear that out. Personally, I'd refrain from using the scrapy wrapper.
I haven't tested it, but the sketch below is probably what you're looking for; the basic usage follows the mechanize documentation, and the lxml documentation is also quite helpful.
You can also get elements via root.xpath(). A simple wget might even be the easiest solution.
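A minimal, untested sketch (example.com stands in for your start URL):

```python
import mechanize
import lxml.html

# Navigate with mechanize
br = mechanize.Browser()
br.set_handle_robots(False)                 # skip robots.txt handling in this sketch
response = br.open("http://example.com/")   # stand-in start URL
html = response.read()

# Parse with lxml.html (much faster than BeautifulSoup)
root = lxml.html.fromstring(html)
root.make_links_absolute("http://example.com/")

# Pull out all link targets via XPath
for href in root.xpath("//a/@href"):
    print(href)
```

From there you would queue the hrefs that stay on the same host and repeat.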
Hope I could be helpful.
I like using mechanize. It's fairly simple: you download it and create a browser object. With this object you can open a URL, use "back" and "forward" as in a normal browser, and iterate through the forms on the page and fill them out if need be.
You can iterate through all the links on the page too. Each link object has the URL etc. that you could click on.
Here is an example:
Download all the links (related documents) on a webpage using Python
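A short sketch of that workflow (untested; example.com is just a placeholder URL):

```python
import mechanize

# Create a browser object and open a URL
br = mechanize.Browser()
br.set_handle_robots(False)        # skip robots.txt handling in this sketch
br.open("http://example.com/")     # placeholder URL

# Iterate through all the links on the page; each Link has .url, .text, etc.
for link in br.links():
    print(link.url)

# Iterate through the forms on the page (and fill them out if need be)
for form in br.forms():
    print(form.name)

# "Click" a link, then go back as in a normal browser
br.follow_link(nr=0)               # follow the first link on the page
br.back()
```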
Here's an example of a very fast (concurrent) recursive web scraper using eventlet. It only prints the URLs it finds, but you can modify it to do whatever you want. Perhaps you'd want to parse the HTML with lxml (fast), pyquery (slower but still fast) or BeautifulSoup (slow) to get the data you're after.
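A rough sketch of such a scraper (a Python 2-era version using eventlet's green urllib2, a crude regex for link extraction, and example.com as a placeholder start URL) might look like this:

```python
import re
from urlparse import urljoin

import eventlet
from eventlet.green import urllib2   # cooperative (non-blocking) urllib2

START = "http://example.com/"        # placeholder start URL
seen = set([START])
pool = eventlet.GreenPool(20)        # up to 20 concurrent fetches

def crawl(url):
    """Fetch a page, print its URL, and spawn fetches for unseen same-site links."""
    try:
        body = urllib2.urlopen(url, timeout=10).read()
    except Exception:
        return
    print(url)
    for href in re.findall(r'href="([^"]+)"', body):
        link = urljoin(url, href)
        if link.startswith(START) and link not in seen:
            seen.add(link)
            pool.spawn_n(crawl, link)

pool.spawn_n(crawl, START)
pool.waitall()                       # wait until no more pages are queued
```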
Have a look at scrapy (and the related questions). As for performance... it's very difficult to make any useful suggestions without seeing the code.
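For reference, a minimal spider in scrapy's current API (which has changed since this answer was written; example.com is a placeholder) could look like this:

```python
import scrapy

class SiteSpider(scrapy.Spider):
    """Crawl one site, following only links that stay on its domain."""
    name = "site"
    allowed_domains = ["example.com"]      # placeholder domain; off-site links are dropped
    start_urls = ["http://example.com/"]   # placeholder home page

    def parse(self, response):
        yield {"url": response.url}        # record (or scrape) the page here
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Run it with `scrapy runspider spider.py`.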