Opening thousands of cURL handles without running into problems? (PHP)

Posted 2024-10-05 05:24:41

I need to use cURL in PHP to make thousands of cURL requests to an API. My current plan is to do these in parallel with the curl_multi_*() functions, basically executing all of the thousands of cURL requests at once.

I've heard that you can run into memory problems opening too many handles, which can lead to fatal errors. How can I avoid that and still make my cURL requests as fast as possible?

If I need to limit the number of cURL requests made at a time, what's a good number to set the limit at?

Background: I'm on shared hosting with GoDaddy right now, which does fine with cURL requests, though I haven't tested it with thousands of parallel requests. In the future I'll be on a Rackspace Cloud Site, which can handle a modest load.

This huge number of cURL requests is a once-per-year thing, not part of daily site operations.
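
For reference, here's a rough, untested sketch of the batched approach I have in mind. The $urls array and the batch size of 50 are just placeholders, since that limit is exactly what I'm asking about:

    <?php
        // Rough sketch: process the URLs in fixed-size batches so that only
        // $batch_size handles are ever open at the same time.
        // $urls and $batch_size are placeholders.
        $urls = array(/* ... thousands of API URLs ... */);
        $batch_size = 50;

        foreach (array_chunk($urls, $batch_size) as $batch) {
            $mh = curl_multi_init();
            $handles = array();

            foreach ($batch as $url) {
                $ch = curl_init($url);
                curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
                curl_setopt($ch, CURLOPT_TIMEOUT, 30);
                curl_multi_add_handle($mh, $ch);
                $handles[] = $ch;
            }

            // run every handle in this batch to completion
            do {
                $status = curl_multi_exec($mh, $active);
                if ($active) {
                    curl_multi_select($mh); // wait for activity instead of busy-looping
                }
            } while ($active && $status == CURLM_OK);

            // collect the responses and free the handles before the next batch
            foreach ($handles as $ch) {
                $body = curl_multi_getcontent($ch);
                // ... store or process $body here ...
                curl_multi_remove_handle($mh, $ch);
                curl_close($ch);
            }

            curl_multi_close($mh);
        }
    ?>

The obvious drawback I see with batching is that the whole batch has to finish before the next one starts, so a single slow response holds everything up.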

Comments (4)

温柔戏命师 2024-10-12 05:24:41

This sounds like an architectural problem. Why do you need to make thousands of requests at the same time? Is that sort of parallelism going to do any good, or are you just going to accidentally DOS (Denial-of-Service) some poor unsuspecting web service/API?

Assuming you're not pounding a single remote server, you still need to worry about how many connections your local box can handle. There are only so many ports you can use outgoing, and they're measured in the low tens of thousands. It's not hard to hit that limit if you're going nuts opening connections. Anyone who's overdone load testing with apachebench knows this.

PHP is not a great tool for this kind of thing -- and I'm a guy who does 90% PHP. There's no threading, and it's memory intensive. If you want 1000 PHP processes in parallel, you're going to need more than one machine. Your typical PHP process is going to consume around 10-20 megs of memory, unless you tune the hell out of it (probably at compile time).

You say this happens once a year. That makes me think it might not need to be all that parallel. What if you only had 24 or 36 parallel processes?

That said, here's how I'd probably approach this. PHP will probably work fine, and if you run into the memory inefficiency issues, you can swap out just one part. You want two, more-or-less asynchronous queues, and a pair of processes that work on them:

  • A "fetch queue" - a work queue of HTTP requests that need to get made. They perform the request and stick the data in the processing queue (see next bullet).

  • A "processing queue" a work queue that works through whatever the HTTP responses contain. As is queue gets processed, it can add new items to "fetch queue"

  • Some processes (or a couple dozen) that run in parallel, working on the fetch queue. Parallelism is nice here, since you have so much latency due to the network.

  • Some process that chews on the "processing queue" - it's not clear that parallelism will help here. All this processing happens locally, and can probably be a simple loop.
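
To make that concrete, here's a very rough sketch of one fetch-queue worker. The flat files fetch_queue.txt and processing_queue.txt and the flock()-based locking are placeholders I made up to keep the example self-contained; in practice a database table or a real queue server (beanstalkd, Gearman, etc.) would be a better fit. You'd launch a couple dozen copies of this script in parallel:

    <?php
        // Rough sketch of a single fetch-queue worker. Run a couple dozen of
        // these side by side. The flat-file queues and flock() locking are
        // placeholders; use a database table or a proper queue server instead.
        define('FETCH_QUEUE',   'fetch_queue.txt');      // one URL per line
        define('PROCESS_QUEUE', 'processing_queue.txt'); // one JSON record per line

        // Take (and remove) the first URL from the fetch queue, under a lock.
        function pop_url($queue_file) {
            $fp = fopen($queue_file, 'c+');
            if (!$fp || !flock($fp, LOCK_EX)) {
                return null;
            }
            $lines = array();
            while (($line = fgets($fp)) !== false) {
                $line = trim($line);
                if ($line !== '') {
                    $lines[] = $line;
                }
            }
            $url = array_shift($lines);
            ftruncate($fp, 0);
            rewind($fp);
            if ($lines) {
                fwrite($fp, implode("\n", $lines) . "\n");
            }
            flock($fp, LOCK_UN);
            fclose($fp);
            return $url;
        }

        while (($url = pop_url(FETCH_QUEUE)) !== null) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 30);
            $body = curl_exec($ch);
            $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            curl_close($ch);

            // hand the response off to the processing queue
            file_put_contents(
                PROCESS_QUEUE,
                json_encode(array('url' => $url, 'status' => $code, 'body' => $body)) . "\n",
                FILE_APPEND | LOCK_EX
            );
        }
    ?>

The processing side can then be a single loop that reads processing_queue.txt line by line and appends any follow-up URLs back onto the fetch queue.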

梦里寻她 2024-10-12 05:24:41

Check out Rolling Curl. I used it to extract links and web page content from multiple web pages. I've no idea how this will work on a server, as I only have experience on local machines.

娇柔作态 2024-10-12 05:24:41

All the things suggested by timdev are wrapped up in Zebra cURL, https://github.com/stefangabos/Zebra_cURL. You can pass an array of URLs and it will queue up some (10 by default) in parallel, then call them and pass a result object into a callback. From the GitHub docs:

    <?php
        function callback($result) {
            // remember, the "body" property of $result is run through
            // "htmlentities()", so you may need to "html_entity_decode" it
            // show everything
            print_r('<pre>');
            print_r($result->info);
        }
        require 'path/to/Zebra_cURL.php';
        // instantiate the Zebra_cURL class
        $curl = new Zebra_cURL();
        // cache results 60 seconds
        $curl->cache('cache', 60);
        // get RSS feeds of some popular tech websites
        $curl->get(array(
            'http://rss1.smashingmagazine.com/feed/',
            'http://allthingsd.com/feed/',
            'http://feeds.feedburner.com/nettuts',
            'http://www.webmonkey.com/feed/',
            'http://feeds.feedburner.com/alistapart/main',
        ), 'callback');
    ?>

It's really fast and light on memory usage.

第几種人 2024-10-12 05:24:41

Not enough info, really. How much bandwidth will be used for each connection? Unless it's a couple of bytes, you will choke most connections by opening that many sockets at once. Even if your account is capped, your 1000-socket idea will be bottlenecked and rendered pointless. Why can't you open 100 sockets and loop as each one completes? That is very fast.
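
Something like this untested sketch is what I mean. The window of 100 and the $urls array are placeholders; the idea is to keep about 100 transfers running and start a new one every time one completes:

    <?php
        // Untested sketch of a rolling window: keep at most $window transfers
        // running, and start a new one each time one completes.
        // $urls and $window are placeholders.
        $urls   = array(/* ... thousands of API URLs ... */);
        $window = 100;

        $queue = $urls;
        $mh    = curl_multi_init();

        function add_handle($mh, $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 30);
            curl_multi_add_handle($mh, $ch);
        }

        // prime the window
        for ($i = 0; $i < $window && $queue; $i++) {
            add_handle($mh, array_shift($queue));
        }

        do {
            curl_multi_exec($mh, $active);
            if (curl_multi_select($mh) === -1) {
                usleep(100000); // avoid busy-looping if select() fails
            }

            // harvest finished transfers and immediately top the window back up
            while ($done = curl_multi_info_read($mh)) {
                $ch   = $done['handle'];
                $body = curl_multi_getcontent($ch);
                // ... process $body here ...
                curl_multi_remove_handle($mh, $ch);
                curl_close($ch);

                if ($queue) {
                    add_handle($mh, array_shift($queue));
                    $active = 1; // new work was just added
                }
            }
        } while ($active || $queue);

        curl_multi_close($mh);
    ?>

That keeps the socket count fixed at 100 while never leaving the window idle.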
