CurlMulti Plugin
Multi-threaded scraping with cURL.
php-curlmulti: https://github.com/ares333/php-curlmulti
Installation
composer require jaeger/querylist-curl-multi
API
CurlMulti curlMulti($urls = []): sets the collection of URLs to scrape
class CurlMulti
- CurlMulti add($urls): adds URL tasks
- array getUrls(): returns all URLs
- CurlMulti success(Closure $callback): called when a task succeeds
- CurlMulti error(Closure $callback): called when a task fails
- CurlMulti start(array $opt = []): starts the scraping tasks; this method is blocking.
Installation options
QueryList::use(CurlMulti::class,$opt1)
- $opt1: alias for the curlMulti function name.
Usage
- Install the plugin
use QL\QueryList;
use QL\Ext\CurlMulti;
$ql = QueryList::getInstance();
$ql->use(CurlMulti::class);
// or register the plugin under a custom function name
$ql->use(CurlMulti::class,'curlMulti');
- Example-1
Scrape the GitHub trending lists:
$ql->rules([
    'title' => ['h3 a','text'],
    'link' => ['h3 a','href']
])->curlMulti([
    'https://github.com/trending/php',
    'https://github.com/trending/go'
])->success(function (QueryList $ql,CurlMulti $curl,$r){
    echo "Current url:{$r['info']['url']} \r\n";
    $data = $ql->query()->getData();
    print_r($data->all());
})->start();
Out:
Current url:https://github.com/trending/php
Array
(
[0] => Array
(
[title] => jupeter / clean-code-php
[link] => /jupeter/clean-code-php
)
[1] => Array
(
[title] => laravel / laravel
[link] => /laravel/laravel
)
[2] => Array
(
[title] => spatie / browsershot
[link] => /spatie/browsershot
)
//....
)
Current url:https://github.com/trending/go
Array
(
[0] => Array
(
[title] => DarthSim / imgproxy
[link] => /DarthSim/imgproxy
)
[1] => Array
(
[title] => jaegertracing / jaeger
[link] => /jaegertracing/jaeger
)
[2] => Array
(
[title] => jdkato / prose
[link] => /jdkato/prose
)
//...
)
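The link values in the output above are root-relative paths (for example /laravel/laravel). If absolute URLs are needed, they can be resolved against the URL of the current task, which the callback receives as $r['info']['url']. The following is a minimal sketch under that assumption; toAbsoluteUrl is a hypothetical helper, not part of QueryList or the CurlMulti plugin, and it only covers absolute, protocol-relative, and root-relative links.

```php
<?php
// Hypothetical helper (not part of QueryList or the CurlMulti plugin).
// Resolves a scraped link against the URL of the page it came from.
// Handles absolute, protocol-relative and root-relative links only;
// any other form is returned unchanged.
function toAbsoluteUrl(string $pageUrl, string $link): string
{
    // Already absolute (starts with a scheme such as "https:").
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link) === 1) {
        return $link;
    }

    $parts = parse_url($pageUrl);
    if (!is_array($parts) || !isset($parts['scheme'], $parts['host'])) {
        return $link; // The page URL cannot serve as a base.
    }

    // Protocol-relative link: reuse the scheme of the page.
    if (strncmp($link, '//', 2) === 0) {
        return $parts['scheme'] . ':' . $link;
    }

    // Root-relative link: prepend scheme, host and optional port.
    if (strncmp($link, '/', 1) === 0) {
        $port = isset($parts['port']) ? ':' . $parts['port'] : '';
        return $parts['scheme'] . '://' . $parts['host'] . $port . $link;
    }

    return $link;
}

echo toAbsoluteUrl('https://github.com/trending/php', '/laravel/laravel'), "\n";
// prints "https://github.com/laravel/laravel"
```

Inside a success callback, the helper could be applied to each scraped row, for example toAbsoluteUrl($r['info']['url'], $item['link']).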
- Example-2
$ql->curlMulti('https://github.com/trending/php')
    ->success(function (QueryList $ql,CurlMulti $curl,$r){
        echo "Current url:{$r['info']['url']} \r\n";
        if($r['info']['url'] == 'https://github.com/trending/php'){
            // append a task
            $curl->add('https://github.com/trending/go');
        }
        $data = $ql->find('h3 a')->texts();
        print_r($data->all());
    })
    ->start();
Out:
Current url:https://github.com/trending/php
Array
(
[0] => jupeter / clean-code-php
[1] => laravel / laravel
[2] => spatie / browsershot
//...
)
Current url:https://github.com/trending/go
Array
(
[0] => DarthSim / imgproxy
[1] => jaegertracing / jaeger
[2] => jdkato / prose
//...
)
- Example-3
$ql->curlMulti([
    'https://github-error-host.com/trending/php',
    'https://github.com/trending/go'
])->success(function (QueryList $ql,CurlMulti $curl,$r){
    echo "Current url:{$r['info']['url']} \r\n";
    $data = $ql->rules([
        'title' => ['h3 a','text'],
        'link' => ['h3 a','href']
    ])->query()->getData();
    print_r($data->all());
})->error(function ($errorInfo,CurlMulti $curl){
    echo "Current url:{$errorInfo['info']['url']} \r\n";
    print_r($errorInfo['error']);
})->start([
    // Maximum concurrency; this value can be changed dynamically at runtime.
    'maxThread' => 10,
    // Maximum number of retries after a curl error or user error; once exceeded, the error callback is called.
    'maxTry' => 3,
    // Global CURLOPT_* options
    'opt' => [
        CURLOPT_TIMEOUT => 10,
        CURLOPT_CONNECTTIMEOUT => 1,
        CURLOPT_RETURNTRANSFER => true
    ],
    // Cache options. Cache entries are identified by URL; when the cache is used, the library returns the cached result directly without accessing the network.
    'cache' => ['enable' => false, 'compress' => false, 'dir' => null, 'expire' => 86400, 'verifyPost' => false]
]);
Out:
Current url:https://github.com/trending/go
Array
(
[0] => Array
(
[title] => DarthSim / imgproxy
[link] => /DarthSim/imgproxy
)
[1] => Array
(
[title] => jaegertracing / jaeger
[link] => /jaegertracing/jaeger
)
[2] => Array
(
[title] => getlantern / lantern
[link] => /getlantern/lantern
)
//...
)
Current url:https://github-error-host.com/trending/php
Array
(
[0] => 28
[1] => Resolving timed out after 1000 milliseconds
)
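In the output above, $errorInfo['error'] is a two-element array holding a numeric code and a message; 28 is the value of libcurl's CURLE_OPERATION_TIMEDOUT. An error callback could turn that pair into a short log line, as in the sketch below. describeError is a hypothetical helper, not part of the plugin, and it assumes the [code, message] shape shown in this example.

```php
<?php
// Hypothetical helper (not part of QueryList or the CurlMulti plugin).
// Assumes the error has the [code, message] shape printed above.
function describeError(array $error): string
{
    // Numeric values of a few common libcurl error codes.
    $labels = [
        6  => 'could not resolve host', // CURLE_COULDNT_RESOLVE_HOST
        7  => 'could not connect',      // CURLE_COULDNT_CONNECT
        28 => 'timed out',              // CURLE_OPERATION_TIMEDOUT
    ];

    $code = (int) ($error[0] ?? -1);
    $message = (string) ($error[1] ?? '');
    $label = $labels[$code] ?? 'other error';

    return sprintf('[%d] %s: %s', $code, $label, $message);
}

echo describeError([28, 'Resolving timed out after 1000 milliseconds']), "\n";
// prints "[28] timed out: Resolving timed out after 1000 milliseconds"
```

Within the error callback this could replace the print_r call, for example echo describeError($errorInfo['error']).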
- Example-4
$ql->rules([
    'title' => ['h3 a','text'],
    'link' => ['h3 a','href']
])->curlMulti()->add('https://github.com/trending/go')
    ->success(function (QueryList $ql,CurlMulti $curl,$r){
        echo "Current url:{$r['info']['url']} \r\n";
        $data = $ql->query()->getData();
        print_r($data->all());
    })->start()
    ->add('https://github.com/trending/php')
    ->start();
Releasing memory
The multi-threaded plugin typically scrapes a large number of pages; if resources are not released properly, memory usage can easily grow too large:
$ql->rules([
    'title' => ['h3 a','text'],
    'link' => ['h3 a','href']
])->curlMulti([
    'https://github.com/trending/php',
    'https://github.com/trending/go'
])->success(function (QueryList $ql,CurlMulti $curl,$r){
    echo "Current url:{$r['info']['url']} \r\n";
    $data = $ql->query()->getData();
    print_r($data->all());
    // Release resources
    QueryList::destructDocuments();
})->start();
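One way to check whether releasing documents keeps memory usage flat is to log it from the success callback. The sketch below uses PHP's built-in memory_get_usage() and memory_get_peak_usage(); formatBytes is a hypothetical helper introduced only for readable output and is not part of QueryList.

```php
<?php
// Hypothetical helper: formats a byte count for log output.
function formatBytes(int $bytes): string
{
    $units = ['B', 'KB', 'MB', 'GB'];
    $value = (float) $bytes;
    $i = 0;

    // Divide by 1024 until the value fits the current unit.
    while ($value >= 1024 && $i < count($units) - 1) {
        $value /= 1024;
        $i++;
    }

    return sprintf('%.2f %s', $value, $units[$i]);
}

// memory_get_usage() and memory_get_peak_usage() are PHP built-ins.
echo 'Current: ', formatBytes(memory_get_usage()), "\n";
echo 'Peak: ', formatBytes(memory_get_peak_usage()), "\n";
```

Calling the two echo lines right after QueryList::destructDocuments() in the callback would show, page by page, whether usage stays roughly constant.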