Does a multithreaded crawler in Python really speed things up?

Posted 2024-09-01 08:04:06


Was looking to write a little web crawler in python. I was starting to investigate writing it as a multithreaded script, one pool of threads downloading and one pool processing results. Due to the GIL, would it actually do simultaneous downloading? How does the GIL affect a web crawler? Would each thread pick some data off the socket, then move on to the next thread, let it pick some data off the socket, etc.?

Basically I'm asking: is doing a multi-threaded crawler in python really going to buy me much performance vs. single-threaded?

thanks!


Comments (5)

心碎的声音 2024-09-08 08:04:06

The GIL is not held by the Python interpreter when doing network operations. If you are doing work that is network-bound (like a crawler), you can safely ignore the effects of the GIL.

On the other hand, you may want to measure your performance if you create lots of threads doing processing (after downloading). Limiting the number of threads there will reduce the effects of the GIL on your performance.
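A minimal sketch of the point above (the URLs are hypothetical, and `time.sleep` stands in for a blocking socket read — both release the GIL while waiting, which is why the downloads overlap):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Stand-in for a blocking download: time.sleep releases the
    GIL while waiting, just as a blocking socket read does."""
    time.sleep(0.2)
    return f"<html for {url}>"

urls = [f"http://example.com/page{i}" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch, urls))
elapsed = time.perf_counter() - start

# Eight 0.2 s "downloads" overlap, so the wall time is close to
# 0.2 s rather than the 1.6 s a serial loop would need.
print(f"{len(pages)} pages in {elapsed:.2f}s")
```

If the work were CPU-bound instead of a blocking wait, the threads would contend for the GIL and the speedup would largely disappear — which is exactly the distinction this answer draws.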

秋叶绚丽 2024-09-08 08:04:06

Look at how scrapy works. It can help you a lot. It doesn't use threads, but can do multiple "simultaneous" downloading, all in the same thread.

If you think about it, you have only a single network card, so parallel processing can't really help by definition.

What scrapy does is just not wait around for the response of one request before sending another. All in a single thread.
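Scrapy itself is built on Twisted, but the same "many requests in flight, one thread" idea can be sketched with the standard library's asyncio; here `asyncio.sleep` is a stand-in for a real non-blocking HTTP request, and the URLs are placeholders:

```python
import asyncio
import time

async def fetch(url):
    # Stand-in for a non-blocking HTTP request: while this "request"
    # is waiting, the event loop services the other requests.
    await asyncio.sleep(0.2)
    return f"<html for {url}>"

async def crawl(urls):
    # All requests are in flight at once, in a single thread.
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = [f"http://example.com/page{i}" for i in range(10)]
start = time.perf_counter()
pages = asyncio.run(crawl(urls))
elapsed = time.perf_counter() - start

# Ten 0.2 s "requests" complete in roughly 0.2 s of wall time,
# not 2 s, without a single extra thread.
print(f"{len(pages)} pages in {elapsed:.2f}s")
```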

望笑 2024-09-08 08:04:06

When it comes to crawling you might be better off using something event-based such as Twisted that uses non-blocking asynchronous socket operations to fetch and return data as it comes, rather than blocking on each one.

Asynchronous network operations can easily be and usually are single-threaded. Network I/O almost always has higher latency than that of CPU because you really have no idea how long a page is going to take to return, and this is where async shines because an async operation is much lighter weight than a thread.

Edit: Here is a simple example of how to use Twisted's getPage to create a simple web crawler.
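The "much lighter weight than a thread" claim is easy to check: a single-threaded event loop can juggle tens of thousands of concurrent waits, a load that would be prohibitively expensive with one OS thread each. A sketch with simulated requests (not Twisted's actual API):

```python
import asyncio
import time

async def fake_request(i):
    # Each task "waits on the network" without tying up a thread.
    await asyncio.sleep(0.5)
    return i

async def main():
    # 10,000 concurrent waits in one thread -- spawning 10,000 OS
    # threads for the same job would cost far more memory and time.
    return await asyncio.gather(*(fake_request(i) for i in range(10_000)))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(f"{len(results)} tasks in {elapsed:.2f}s")
```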

孤寂小茶 2024-09-08 08:04:06

Another consideration: if you're scraping a single website and the server places limits on the frequency of requests you can send from your IP address, adding multiple threads may make no difference.
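One way to respect such a server-side limit, whatever the thread count, is a shared throttle that every worker thread passes through — a hypothetical sketch, not part of the original answer:

```python
import threading
import time

class Throttle:
    """Shared rate limiter: at most one request every `interval`
    seconds, no matter how many crawler threads call wait()."""
    def __init__(self, interval):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_ok = 0.0

    def wait(self):
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_ok - now)
            # Reserve the next slot before releasing the lock.
            self._next_ok = max(now, self._next_ok) + self.interval
        if delay:
            time.sleep(delay)

throttle = Throttle(interval=0.1)   # server allows ~10 requests/second
timestamps = []
ts_lock = threading.Lock()

def crawl_one(url):
    throttle.wait()                  # every thread queues up here
    # ... the actual request would go here ...
    with ts_lock:
        timestamps.append(time.monotonic())

threads = [threading.Thread(target=crawl_one, args=(f"/page{i}",))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Despite five concurrent threads, requests go out ~0.1 s apart:
ordered = sorted(timestamps)
gaps = [b - a for a, b in zip(ordered, ordered[1:])]
print([round(g, 2) for g in gaps])
```

The throttle is the bottleneck, so past a handful of workers, extra threads simply wait in line — which is the answer's point.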

倾`听者〃 2024-09-08 08:04:06

Yes, multithreaded scraping increases the process speed significantly. This is not a case where the GIL is an issue: you are losing a lot of idle CPU and unused bandwidth waiting for each request to finish. If the web pages you are scraping are on your local network (a rare scraping case), the difference between multithreaded and single-threaded scraping can be smaller.

You can try the benchmark yourself, playing with one to "n" threads. I have written a simple multithreaded crawler, Discovering Web Resources, and a related article, Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website. You can select how many threads to use by changing the NWORKERS class variable in FocusedWebCrawler.
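The linked crawler is not reproduced here, but the suggested one-to-"n"-thread benchmark can be sketched with simulated downloads (`time.sleep` stands in for waiting on a remote server; the URLs are placeholders):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(url):
    time.sleep(0.1)   # stand-in for waiting on a remote server
    return url

urls = [f"http://example.com/{i}" for i in range(10)]

def benchmark(nworkers):
    """Time a crawl of all URLs with the given number of workers."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        list(pool.map(fake_download, urls))
    return time.perf_counter() - start

t1 = benchmark(1)     # serial: roughly 10 x 0.1 s
t10 = benchmark(10)   # all waits overlap: roughly 0.1 s
print(f"1 worker: {t1:.2f}s, 10 workers: {t10:.2f}s")
```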
