PHP stalls in a loop of curl calls when it hits a bad URL

Posted on 2024-12-08 13:01:21

I have a database of a few thousand URLs whose pages I am checking for links (ultimately looking for specific links), so I run the function below in a loop. Every once in a while one of the URLs is bad, and then the entire program just stalls, stops running, and starts building up memory. I thought adding CURLOPT_TIMEOUT would fix this, but it didn't. Any ideas?

$options = array(
    CURLOPT_RETURNTRANSFER => true,         // return the web page as a string
    CURLOPT_HEADER         => false,        // don't return headers
    CURLOPT_FOLLOWLOCATION => true,         // follow redirects
    CURLOPT_ENCODING       => "",           // handle all supported encodings
    CURLOPT_USERAGENT      => "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13", // user agent sent with the request
    CURLOPT_AUTOREFERER    => true,         // set the referer on redirect
    CURLOPT_TIMEOUT        => 2,            // timeout on the response, in seconds
    CURLOPT_MAXREDIRS      => 10,           // stop after 10 redirects
    CURLOPT_POST           => 0,            // request method flag (0 = not forced to POST)
    CURLOPT_POSTFIELDS     => $curl_data,   // post vars supplied by the caller
    CURLOPT_SSL_VERIFYHOST => 0,            // don't verify the SSL host name
    CURLOPT_SSL_VERIFYPEER => false,        // don't verify the SSL certificate
    CURLOPT_VERBOSE        => 1             // verbose output for debugging
);

$ch      = curl_init($url);
curl_setopt_array($ch,$options);
$content = curl_exec($ch);
$err     = curl_errno($ch);
$errmsg  = curl_error($ch) ;
$header  = curl_getinfo($ch);
curl_close($ch);

//  $header['errno']   = $err;
//  $header['errmsg']  = $errmsg;
$header['content'] = $content;

#Extract the raw URL from the current one
$scheme = parse_url($url, PHP_URL_SCHEME); //Ex: http
$host = parse_url($url, PHP_URL_HOST); //Ex: www.google.com
$raw_url = $scheme . '://' . $host; //Ex: http://www.google.com

#Replace relative links with absolute ones
$relative = array();
$absolute = array();

#Strings to search for
$relative[0] = '/src="\//';
$relative[1] = '/href="\//';

#Strings to replace them with
$absolute[0] = 'src="' . $raw_url . '/';
$absolute[1] = 'href="' . $raw_url . '/';

$source = preg_replace($relative, $absolute, $content); //Ex: src="/image/google.png" to src="http://www.google.com/image/google.png"

return $source;
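The surrounding loop is not shown in the question; the sketch below is only a guess at how the snippet above might be driven, assuming it is wrapped in a hypothetical check_page($url, $curl_data) function. The function stub, the example URLs, and the skip-on-failure logic are placeholders for illustration, not part of the original code:

// Hypothetical driver loop. check_page() is assumed to contain the
// curl/preg_replace snippet shown above; a stub is defined here only so
// the sketch is self-contained.
function check_page($url, $curl_data) {
    // ... the snippet from the question goes here ...
    return '';
}

$urls      = array('http://www.example.com/', 'http://bad.example/'); // would come from the database
$curl_data = '';                                                      // whatever POST vars are actually used

foreach ($urls as $url) {
    $source = check_page($url, $curl_data);
    if ($source === '' || $source === false) {
        // A bad URL is logged and skipped instead of stopping the whole run.
        error_log("Skipping bad URL: $url");
        continue;
    }
    // ... scan $source for the specific links being searched for ...
}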

Comments (2)

浸婚纱 2024-12-15 13:01:21

curl_exec will return false if it cannot find the URL, and the HTTP status code will be zero. Check the result of curl_exec and check the HTTP status code as well.

$content = curl_exec($ch);
$httpStatus = curl_getinfo($ch, CURLINFO_HTTP_CODE);
if ($content === false) {
    if ($httpStatus == 0) {
        $content = "link was not found";
    }
}
// ...

The way you have it currently, the line of code

$header['content'] = $content;

will get the value false. This is not what you want.
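A minimal sketch of how that check could be folded into the snippet from the question, assuming a failed request should simply be recorded and skipped rather than stopping the loop (variable names follow the question's code):

$content    = curl_exec($ch);
$err        = curl_errno($ch);
$errmsg     = curl_error($ch);
$header     = curl_getinfo($ch);
$httpStatus = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($content === false || $httpStatus == 0) {
    // Store the curl error instead of boolean false so the caller can
    // log it and move on to the next URL.
    $header['errno']   = $err;
    $header['errmsg']  = $errmsg;
    $header['content'] = '';
    return '';
}

$header['content'] = $content;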

I am using curl_exec and my code does not stall if it cannot find the URL; the code keeps running. You may end up with nothing in your browser, though, and a message in the Firebug Console like "500 Internal Server Error". Maybe that's what you mean by a stall.

魂归处 2024-12-15 13:01:21

So basically you don't know and just guess that the curl request is stalling.

For this answer I can only guess as well, then. You might also need to set the following curl option: CURLOPT_CONNECTTIMEOUT.

If the connect phase itself already stalls, the other timeout setting might not be taken into account. I'm not entirely sure, but see Why would CURL time out in 1000ms when I have set up timeout upto 3000ms?
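For illustration, this is roughly how a connect-phase timeout could be added to the question's options array; the 10-second value and the millisecond variants are only example values, not something given in the answer:

// Abort if establishing the connection itself takes too long.
$options[CURLOPT_CONNECTTIMEOUT] = 10;

// Millisecond-resolution variants exist as well:
// $options[CURLOPT_CONNECTTIMEOUT_MS] = 10000;
// $options[CURLOPT_TIMEOUT_MS]        = 2000;

// With very short timeouts, CURLOPT_NOSIGNAL can help avoid premature
// "name lookup timed out" failures from signal-based DNS resolution.
$options[CURLOPT_NOSIGNAL] = 1;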
