Asynchronous/parallel HTTP requests with PHP curl_multi
I recently looked into the possibility of making multiple requests with curl. I may not be understanding it fully, so I am just hoping to clarify some concepts.
It's definitely a good option if you are fetching content from multiple sources. That way, you can start processing the results from faster servers while still waiting for slower ones. Does it still make sense to use it if you are requesting multiple pages from the same server? Would the server still serve multiple pages at a time to the same client?
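For context, this is the rough pattern I have in mind; a minimal, untested curl_multi sketch with placeholder URLs that starts every request at once and collects the bodies after all transfers finish:

```php
<?php
// Minimal curl_multi sketch with placeholder URLs: start every request
// at once, then collect the bodies after all transfers have finished.
$urls = [
    'https://example.com/page1',
    'https://example.org/page2',
    'https://example.net/page3',
];

$mh = curl_multi_init();
$handles = [];

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // per-request timeout in seconds
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers until none are still running.
do {
    curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running);

// All requests are done (or timed out); collect the results.
$results = [];
foreach ($handles as $url => $ch) {
    $results[$url] = curl_multi_getcontent($ch);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
```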
Comments (4)
You can't do multi-threading in PHP, so you won't be able to start processing one page while the others are still being retrieved. Multi-curl won't return control until all pages are retrieved or time out, so the batch takes as long as the slowest page. You are still going from serial (curl) to parallel (multi_curl), which will give you a big boost.
Servers will serve multiple pages to the same client up to a certain configured limit. Requesting 5-10 pages from a server would be fine.
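To illustrate the 5-10 pages point, here is a rough, untested sketch (the URLs are made up and the limit of 5 is just illustrative) that keeps only a few transfers to the same server in flight at once and starts the next queued page whenever one finishes:

```php
<?php
// Untested sketch: request many pages from the same server while keeping
// only a few transfers in flight at once. URLs are placeholders and the
// concurrency limit of 5 is just illustrative.
$queue = [];
for ($i = 1; $i <= 20; $i++) {
    $queue[] = "https://example.com/page/$i";
}

$maxConcurrent = 5;
$mh = curl_multi_init();
$inFlight = 0;
$results = [];

// Helper: start one request and remember which URL the handle belongs to.
function add_request($mh, $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_setopt($ch, CURLOPT_PRIVATE, $url);
    curl_multi_add_handle($mh, $ch);
}

// Prime the window with the first few requests.
while ($inFlight < $maxConcurrent && $queue) {
    add_request($mh, array_shift($queue));
    $inFlight++;
}

do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);

    // Whenever a transfer finishes, record its body and start the next queued URL.
    while ($info = curl_multi_info_read($mh)) {
        $ch  = $info['handle'];
        $url = curl_getinfo($ch, CURLINFO_PRIVATE);
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        $inFlight--;

        if ($queue) {
            add_request($mh, array_shift($queue));
            $inFlight++;
        }
    }
} while ($running || $inFlight > 0);

curl_multi_close($mh);
```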
Check this out: this guy made a script that works asynchronously with curl_multi. I have been playing with it for a couple of hours, and it works fine.
I think most or all servers will serve more than one page at a time to the same client. You could set a reasonable timeout for your connections, then if one fails to connect, push it back onto your connection array to be retried after all the others have been gone through. That way you'll be getting at least one at a time, even though it will always be trying to get several. Does that make sense? :)
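Roughly like this, as an untested sketch with placeholder URLs and limits: each connection gets a reasonable timeout, and any URL that fails gets pushed back onto the queue to be retried once the others have gone through:

```php
<?php
// Untested sketch of the retry idea: placeholder URLs, illustrative limits.
// Each connection gets a timeout; failed URLs go back on the queue so they
// are retried after all the others have been attempted.
$pending = [
    'https://example.com/a',
    'https://example.com/b',
    'https://example.com/c',
];
$maxAttempts = 3;   // illustrative cap so a dead URL can't loop forever
$attempts = [];
$results  = [];

while ($pending) {
    $mh = curl_multi_init();
    foreach ($pending as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5); // connect timeout in seconds
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);       // overall per-request timeout
        curl_setopt($ch, CURLOPT_PRIVATE, $url);     // remember the URL for this handle
        curl_multi_add_handle($mh, $ch);
        $attempts[$url] = ($attempts[$url] ?? 0) + 1;
    }
    $pending = []; // failures below get pushed back on for the next round

    do {
        curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);
        }
        // Inspect transfers as curl reports them finished.
        while ($info = curl_multi_info_read($mh)) {
            $ch  = $info['handle'];
            $url = curl_getinfo($ch, CURLINFO_PRIVATE);
            if ($info['result'] === CURLE_OK) {
                $results[$url] = curl_multi_getcontent($ch);
            } elseif ($attempts[$url] < $maxAttempts) {
                $pending[] = $url; // retry after everything else has had its turn
            }
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
    } while ($running);

    curl_multi_close($mh);
}
```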
Some servers might be configured to behave defensively if too many connections or requests are made from what they believe is the same client. They might drop or reject connections, limit the aggregate bandwidth shared across all of your connections, or do other things.
Regardless, be considerate, the way you would want a web crawler to be considerate to your site, and try not to bombard a single server with too much at once.
If you need to fetch 5 pages from each of 5 different servers, you're much more likely to finish sooner using one connection to each server until done than by opening 5 connections to a single server at a time.
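As a rough, untested sketch of that one-connection-per-server idea (placeholder URLs, illustrative timeouts): group the pages by host, keep at most one request in flight per host, and start a host's next page only when its current one finishes:

```php
<?php
// Untested sketch: at most one in-flight request per host. URLs are placeholders.
$urls = [
    'https://a.example/1', 'https://a.example/2',
    'https://b.example/1', 'https://b.example/2',
    'https://c.example/1', 'https://c.example/2',
];

// Build a per-host queue of pages.
$queues = [];
foreach ($urls as $url) {
    $queues[parse_url($url, PHP_URL_HOST)][] = $url;
}

$mh = curl_multi_init();
$results = [];
$inFlight = 0;

// Start the next queued page for a host, if it has any left.
$startNext = function ($host) use (&$queues, &$inFlight, $mh) {
    if (empty($queues[$host])) {
        return;
    }
    $url = array_shift($queues[$host]);
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_setopt($ch, CURLOPT_PRIVATE, $url); // remember the URL for this handle
    curl_multi_add_handle($mh, $ch);
    $inFlight++;
};

// Kick off exactly one request per host.
foreach (array_keys($queues) as $host) {
    $startNext($host);
}

do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);

    while ($info = curl_multi_info_read($mh)) {
        $ch  = $info['handle'];
        $url = curl_getinfo($ch, CURLINFO_PRIVATE);
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
        $inFlight--;

        // That host is free again, so queue its next page (if any remain).
        $startNext(parse_url($url, PHP_URL_HOST));
    }
} while ($running || $inFlight > 0);

curl_multi_close($mh);
```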