PHP stalls in a loop of curl calls when it hits a bad URL

Posted on 2024-12-08 13:01:21

I have a database of a few thousand URLs whose pages I am checking for links (ultimately looking for specific links), so I run the function below in a loop. Every once in a while one of the URLs is bad, and then the entire program just stalls, stops running, and starts building up memory. I thought adding CURLOPT_TIMEOUT would fix this, but it didn't. Any ideas?

$options = array(
    CURLOPT_RETURNTRANSFER => true,         // return the web page as a string
    CURLOPT_HEADER         => false,        // don't return headers
    CURLOPT_FOLLOWLOCATION => true,         // follow redirects
    CURLOPT_ENCODING       => "",           // handle all supported encodings
    CURLOPT_USERAGENT      => "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13", // user agent sent with the request
    CURLOPT_AUTOREFERER    => true,         // set the referer on redirect
    CURLOPT_TIMEOUT        => 2,            // timeout on the response, in seconds
    CURLOPT_MAXREDIRS      => 10,           // stop after 10 redirects
    CURLOPT_POST           => 0,            // request method flag (0 = not forced to POST)
    CURLOPT_POSTFIELDS     => $curl_data,   // post vars supplied by the caller
    CURLOPT_SSL_VERIFYHOST => 0,            // don't verify the SSL host name
    CURLOPT_SSL_VERIFYPEER => false,        // don't verify the SSL certificate
    CURLOPT_VERBOSE        => 1             // verbose output for debugging
);

$ch      = curl_init($url);
curl_setopt_array($ch,$options);
$content = curl_exec($ch);
$err     = curl_errno($ch);
$errmsg  = curl_error($ch) ;
$header  = curl_getinfo($ch);
curl_close($ch);

//  $header['errno']   = $err;
//  $header['errmsg']  = $errmsg;
$header['content'] = $content;

#Extract the raw URL from the current one
$scheme = parse_url($url, PHP_URL_SCHEME); //Ex: http
$host = parse_url($url, PHP_URL_HOST); //Ex: www.google.com
$raw_url = $scheme . '://' . $host; //Ex: http://www.google.com

#Replace relative links with absolute ones
$relative = array();
$absolute = array();

#Strings to search for
$relative[0] = '/src="\//';
$relative[1] = '/href="\//';

#Strings to replace them with
$absolute[0] = 'src="' . $raw_url . '/';
$absolute[1] = 'href="' . $raw_url . '/';

$source = preg_replace($relative, $absolute, $content); //Ex: src="/image/google.png" to src="http://www.google.com/image/google.png"

return $source;
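The surrounding loop is not shown in the question; the sketch below is only a guess at how the snippet above might be driven, assuming it is wrapped in a hypothetical check_page($url, $curl_data) function. The function stub, the example URLs, and the skip-on-failure logic are placeholders for illustration, not part of the original code:

// Hypothetical driver loop. check_page() is assumed to contain the
// curl/preg_replace snippet shown above; a stub is defined here only so
// the sketch is self-contained.
function check_page($url, $curl_data) {
    // ... the snippet from the question goes here ...
    return '';
}

$urls      = array('http://www.example.com/', 'http://bad.example/'); // would come from the database
$curl_data = '';                                                      // whatever POST vars are actually used

foreach ($urls as $url) {
    $source = check_page($url, $curl_data);
    if ($source === '' || $source === false) {
        // A bad URL is logged and skipped instead of stopping the whole run.
        error_log("Skipping bad URL: $url");
        continue;
    }
    // ... scan $source for the specific links being searched for ...
}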

Comments (2)

浸婚纱 2024-12-15 13:01:21

curl_exec will return false if it cannot find the URL, and the HTTP status code will be zero. Check the result of curl_exec and check the HTTP status code as well.

$content = curl_exec($ch);
$httpStatus = curl_getinfo($ch, CURLINFO_HTTP_CODE);
if ($content === false) {
    if ($httpStatus == 0) {
        $content = "link was not found";
    }
}
// ...

The way you have it currently, the line of code

$header['content'] = $content;

will get the value false. This is not what you want.
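A minimal sketch of how that check could be folded into the snippet from the question, assuming a failed request should simply be recorded and skipped rather than stopping the loop (variable names follow the question's code):

$content    = curl_exec($ch);
$err        = curl_errno($ch);
$errmsg     = curl_error($ch);
$header     = curl_getinfo($ch);
$httpStatus = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($content === false || $httpStatus == 0) {
    // Store the curl error instead of boolean false so the caller can
    // log it and move on to the next URL.
    $header['errno']   = $err;
    $header['errmsg']  = $errmsg;
    $header['content'] = '';
    return '';
}

$header['content'] = $content;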

I am using curl_exec and my code does not stall if it cannot find the URL; the code keeps running. You may end up with nothing in your browser, though, and a message in the Firebug Console like "500 Internal Server Error". Maybe that's what you mean by a stall.

魂归处 2024-12-15 13:01:21

So basically you don't know and just guess that the curl request is stalling.

For this answer I can only guess as well, then. You might also need to set the following curl option: CURLOPT_CONNECTTIMEOUT.

If the connect phase itself already stalls, the other timeout setting might not be taken into account. I'm not entirely sure, but see Why would CURL time out in 1000ms when I have set up timeout upto 3000ms?
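For illustration, this is roughly how a connect-phase timeout could be added to the question's options array; the 10-second value and the millisecond variants are only example values, not something given in the answer:

// Abort if establishing the connection itself takes too long.
$options[CURLOPT_CONNECTTIMEOUT] = 10;

// Millisecond-resolution variants exist as well:
// $options[CURLOPT_CONNECTTIMEOUT_MS] = 10000;
// $options[CURLOPT_TIMEOUT_MS]        = 2000;

// With very short timeouts, CURLOPT_NOSIGNAL can help avoid premature
// "name lookup timed out" failures from signal-based DNS resolution.
$options[CURLOPT_NOSIGNAL] = 1;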
