Why does PHP multi_curl return error pages for valid URLs?
I need to fetch data from a large number of URLs (around 50,000). To speed things up I decided to use multi_curl in PHP. Each multi_curl batch contains 30 URLs, because I noticed that whenever I add more URLs to a batch, some HTML elements are lost. With batches of 30 I noticed another problem: when fetching the HTML from the handles, a few URLs return just a 404 error page, which shouldn't happen because they are valid URLs. These URLs do work when I make a single cURL request for each of them. Is this a known problem with multi_curl, and is there a fix?
Setting up the URLs and adding them to an array:
// Create the multi handle and the array that maps each URL to its easy handle
$cmh = curl_multi_init();
$curls = [];
foreach ($urls as $id => $url) {
    $ch = curl_init();
    // Set all your options for each connection here
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7");
    $curls[$url] = $ch;
    curl_multi_add_handle($cmh, $ch);
    $urlCtr++;
}
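For reference, the batching of 30 URLs per multi handle described in the question could be sketched roughly as follows. The helper name `batchUrls` and the `example.com` URLs are placeholders that do not appear in the original code; this is only one possible way to split the list.

```php
<?php
// Hypothetical helper (not part of the original code): split the URL list
// into batches of at most $size entries, keeping the original $id => $url keys.
function batchUrls(array $urls, int $size = 30): array
{
    return array_chunk($urls, $size, true);
}

// Placeholder URLs, for illustration only.
$urls = [];
for ($i = 0; $i < 50000; $i++) {
    $urls[$i] = "https://example.com/page/$i";
}

// Each element of $batches would then be fed to its own multi handle.
$batches = batchUrls($urls);
```

Passing `true` as the third argument of `array_chunk` preserves the keys, so each batch can still be traced back to the original `$id`.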
Executing the multi_curl handles:
do {
    $mrc = curl_multi_exec($cmh, $active);
} while ($mrc == CURLM_CALL_MULTI_PERFORM);
while ($active && $mrc == CURLM_OK) {
    if (curl_multi_select($cmh) != -1) {
        do {
            $mrc = curl_multi_exec($cmh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
    }
}
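One detail of the loop above that may be worth checking: if `curl_multi_select` returns -1, the inner `curl_multi_exec` is skipped and the outer loop spins without making progress. A minimal sketch of a variant that always calls `curl_multi_exec` and sleeps briefly when `curl_multi_select` reports -1 is shown below. The function name `runMultiHandle` and the 100 ms pause are assumptions for illustration, and whether this is related to the 404 pages is not established.

```php
<?php
// Hypothetical helper (not part of the original code): drive a multi handle
// until no transfers are active, and return the last curl_multi_exec() status.
function runMultiHandle($cmh): int
{
    $active = 0;
    do {
        $mrc = curl_multi_exec($cmh, $active);
        // Wait for socket activity; if select reports -1, pause briefly
        // instead of spinning in a tight loop.
        if ($active && $mrc === CURLM_OK && curl_multi_select($cmh, 1.0) === -1) {
            usleep(100000); // 100 ms, an arbitrary value chosen for this sketch
        }
    } while ($mrc === CURLM_CALL_MULTI_PERFORM || ($active && $mrc === CURLM_OK));

    return $mrc;
}
```

It would be called as `runMultiHandle($cmh);` after all handles have been added, in place of the two loops above.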
Fetching the HTML data from the handles:
foreach ($curls as $url => $ch) {
    if ($updateCtr == 7) {
        $f = 0;
    }
    $response = curl_multi_getcontent($ch); // get the content
    // do what you want with the HTML
    file_put_contents("sqlCommands.txt", $response);
    $html = new simple_html_dom($response);
    $list = $html->find('table[class="ta"]', 0);
    curl_multi_remove_handle($cmh, $ch);
}
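To see which URLs actually came back with a 404 (rather than inferring it from the HTML), the HTTP status and cURL error of each transfer can be inspected before parsing. The sketch below is one way to do that, assuming the `$curls` array maps URL to handle as above. The helper names `shouldRetry` and `collectResults` are hypothetical, and treating every non-2xx status as "retry" is an assumption that may need adjusting.

```php
<?php
// Hypothetical helper: decide whether a finished transfer should be retried.
function shouldRetry(int $errno, int $httpCode): bool
{
    return $errno !== 0 || $httpCode < 200 || $httpCode >= 300;
}

// Hypothetical helper: split finished transfers into successful bodies and
// URLs to retry (for example with a single curl request each).
function collectResults($cmh, array $curls): array
{
    // Per-transfer cURL result codes are reported via curl_multi_info_read().
    $errors = [];
    while (($info = curl_multi_info_read($cmh)) !== false) {
        $url = array_search($info['handle'], $curls, true);
        if ($url !== false) {
            $errors[$url] = $info['result'];
        }
    }

    $ok = [];
    $retry = [];
    foreach ($curls as $url => $ch) {
        $errno = $errors[$url] ?? CURLE_OK;
        $code = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
        if (shouldRetry($errno, $code)) {
            $retry[$url] = ['errno' => $errno, 'http_code' => $code];
        } else {
            $ok[$url] = curl_multi_getcontent($ch);
        }
        curl_multi_remove_handle($cmh, $ch);
    }

    return [$ok, $retry];
}
```

A call such as `[$ok, $retry] = collectResults($cmh, $curls);` would leave the parseable responses in `$ok` and the URLs that returned an error status in `$retry`, which makes it easier to log how often the 404 responses occur per batch.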
I really appreciate your help, thanks in advance!