Why does PHP multi_curl return error pages for valid URLs?

Posted on 2025-01-31 14:35:43

I have a large number of URLs (around 50,000) that I want to fetch data from. To speed up execution, I decided to use multi_curl in PHP. Each multi_curl batch contains 30 URLs, because I noticed that whenever I add more URLs to a batch, some HTML elements are lost. With batches of 30 URLs I noticed another problem: when fetching the HTML from the handles, a few URLs return just a 404 error page, which shouldn't happen because they are valid URLs. These URLs do work when I make a single curl request for each of them. Is this a known problem with multi_curl, and is there a fix?

Setting up the URLs and adding them to the array:

foreach ($urls as $id => $url) {
    $ch = curl_init();

    // Set all your options for each connection here
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7");

    $curls[$url] = $ch;
    curl_multi_add_handle($cmh, $ch);
    $urlCtr++;
}
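The batching described in the question (30 URLs per multi handle) is not shown in the snippets. A minimal sketch of one way to do it, assuming the URLs live in an `$id => $url` array; the helper name `makeBatches` and the example.com URLs are hypothetical and not part of the original code:

```php
<?php
// Hypothetical helper: split the full URL list into batches of $size,
// preserving the original keys (the $id => $url pairs used in the loop above).
function makeBatches(array $urls, int $size = 30): array
{
    return array_chunk($urls, $size, true);
}

// Placeholder URLs for illustration only.
$urls = [];
for ($i = 0; $i < 100; $i++) {
    $urls[$i] = "https://example.com/page/$i";
}

$batches = makeBatches($urls);
echo count($batches), " batches\n"; // prints "4 batches"
```

Each element of `$batches` can then be passed through the setup loop with its own multi handle, which keeps the number of simultaneous requests against the target server bounded.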

Executing the multi_curl handles:

do {
    $mrc = curl_multi_exec($cmh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);

while ($active && $mrc == CURLM_OK) {
    if (curl_multi_select($cmh) != -1) {
        do {
            $mrc = curl_multi_exec($cmh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
    }
}
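One thing worth noting about the loop above: when `curl_multi_select` returns -1, the outer `while` iterates again without calling `curl_multi_exec`, so it can spin at full CPU. The following is a sketch of an alternative loop shape that sleeps briefly in that case. The function name `runMulti` and the timeout value are assumptions for illustration, and this change does not by itself explain the 404 responses:

```php
<?php
// Hypothetical helper: drive every transfer attached to $mh until none are active.
// Returns the last status code reported by curl_multi_exec().
function runMulti($mh, float $timeout = 1.0): int
{
    $active = 0;
    do {
        // Older libcurl versions may ask to be called again immediately.
        do {
            $status = curl_multi_exec($mh, $active);
        } while ($status === CURLM_CALL_MULTI_PERFORM);

        if ($status !== CURLM_OK) {
            break; // the multi handle itself reported an error
        }

        // Wait for activity; on -1, pause briefly instead of busy-looping.
        if ($active > 0 && curl_multi_select($mh, $timeout) === -1) {
            usleep(1000);
        }
    } while ($active > 0);

    return $status;
}
```

With the variables from the question, this would be called as `runMulti($cmh)` after all handles have been added.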

Fetching the HTML data from the handles:

foreach ($curls as $url => $ch) {
    if ($updateCtr == 7) {
        $f = 0;
    }
    $response = curl_multi_getcontent($ch); // get the content
    // do what you want with the HTML

    file_put_contents("sqlCommands.txt", $response);
    $html = new simple_html_dom($response);
    $list = $html->find('table[class="ta"]', 0);
    curl_multi_remove_handle($cmh, $ch);
}
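The fetch loop above parses every response without looking at the transfer result, so a 404 page is treated like real content. A hedged sketch of how the status could be inspected first, using `curl_multi_info_read` and `curl_getinfo`; the helper names `classifyTransfer` and `collectRetries` are hypothetical, and treating 404 as retryable is an assumption based only on the observation that these URLs succeed on a single request:

```php
<?php
// Hypothetical helper: decide what to do with one finished transfer.
function classifyTransfer(int $curlErrno, int $httpCode): string
{
    if ($curlErrno !== CURLE_OK) {
        return 'retry'; // transport-level failure (timeout, reset, ...)
    }
    if ($httpCode >= 200 && $httpCode < 300) {
        return 'ok';
    }
    // Assumption: spurious 404s, 429s and 5xx responses are worth retrying.
    if ($httpCode === 404 || $httpCode === 429 || $httpCode >= 500) {
        return 'retry';
    }
    return 'fail';
}

// Hypothetical helper: return the URLs whose transfers should be retried.
// $curls maps URL => easy handle, as in the question.
// Call this before curl_multi_remove_handle().
function collectRetries($mh, array $curls): array
{
    $retry = [];
    while (($info = curl_multi_info_read($mh)) !== false) {
        $url = array_search($info['handle'], $curls, true);
        if ($url === false) {
            continue; // handle not tracked in $curls
        }
        $code = (int) curl_getinfo($info['handle'], CURLINFO_HTTP_CODE);
        if (classifyTransfer($info['result'], $code) === 'retry') {
            $retry[] = $url;
        }
    }
    return $retry;
}
```

The returned URLs could be fed into a later batch or fetched one at a time, which matches the observation that single requests for the same URLs succeed.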

I really appreciate your help, thanks in advance!
