Improving HTML scraping efficiency with pcntl_fork()



With the help of two previous questions, I now have a working HTML scraper that feeds product information into a database. What I am now trying to do is improve efficiency by getting my scraper working with pcntl_fork.

If I split my php5-cli script into 10 separate chunks, I improve total runtime by a large factor, so I know I am not I/O- or CPU-bound but just limited by the linear nature of my scraping functions.

Using code I've cobbled together from multiple sources, I have this working test:

<?php
libxml_use_internal_errors(true);
ini_set('max_execution_time', 0); 
ini_set('max_input_time', 0); 
set_time_limit(0);

$hrefArray = array("http://slashdot.org", "http://slashdot.org", "http://slashdot.org", "http://slashdot.org");

function doDomStuff($singleHref,$childPid) {
    $html = new DOMDocument();
    $html->loadHtmlFile($singleHref);

    $xPath = new DOMXPath($html);

    $domQuery = '//div[@id="slogan"]/h2';
    $domReturn = $xPath->query($domQuery);

    foreach($domReturn as $return) {
        $slogan = $return->nodeValue;
        echo "Child PID #" . $childPid . " says: " . $slogan . "\n";
    }
}

$pids = array();
foreach ($hrefArray as $singleHref) {
    $pid = pcntl_fork();

    if ($pid == -1) {
        die("Couldn't fork, error!");
    } elseif ($pid > 0) {
        // We are the parent
        $pids[] = $pid;
    } else {
        // We are the child
        $childPid = posix_getpid();
        doDomStuff($singleHref,$childPid);
        exit(0);
    }
}

foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}

// Clear the libxml buffer so it doesn't fill up
libxml_clear_errors();

Which raises the following questions:

1) Given my hrefArray contains 4 URLs - if the array were to contain, say, 1,000 product URLs, would this code spawn 1,000 child processes? If so, what is the best way to limit the number of processes to, say, 10, and, again using 1,000 URLs as an example, split the child workload to 100 products per child (10 x 100)?

2) I've learned that pcntl_fork creates a copy of the process and all variables, classes, etc. What I would like to do is replace my hrefArray variable with a DOMDocument query that builds the list of products to scrape, and then feed them off to child processes to do the processing - so spreading the load across 10 child workers.

My brain is telling me I need to do something like the following (obviously this doesn't work, so don't run it):

<?php
libxml_use_internal_errors(true);
ini_set('max_execution_time', 0); 
ini_set('max_input_time', 0); 
set_time_limit(0);
$maxChildWorkers = 10;

$html = new DOMDocument();
$html->loadHtmlFile('http://xxxx');
$xPath = new DOMXPath($html);

$domQuery = '//div[@id=productDetail]/a';
$domReturn = $xPath->query($domQuery);

$hrefsArray[] = $domReturn->getAttribute('href');

function doDomStuff($singleHref) {
    // Do stuff here with each product
}

// To figure out: Split href array into $maxChilderWorks # of workArray1, workArray2 ... workArray10. 
$pids = array();
foreach ($workArray(1,2,3 ... 10) as $singleHref) {
    $pid = pcntl_fork();

    if ($pid == -1) {
        die("Couldn't fork, error!");
    } elseif ($pid > 0) {
        // We are the parent
        $pids[] = $pid;
    } else {
        // We are the child
        $childPid = posix_getpid();
        doDomStuff($singleHref);
        exit(0);
    }
}


foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}

// Clear the libxml buffer so it doesn't fill up
libxml_clear_errors();

But what I can't figure out is how to build my hrefsArray[] in the master/parent process only and feed it off to the child processes. Currently, everything I've tried causes loops in the child processes, i.e. my hrefsArray gets built in the master and again in each subsequent child process.

I am sure I am going about this all totally wrong, so I would greatly appreciate just a general nudge in the right direction.
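
A rough, untested sketch of the direction I have in mind - the parent builds the full URL list, array_chunk() splits it into at most $maxChildWorkers groups, and each child loops over only its own group - would be something like:

<?php
// Sketch only: assumes $hrefsArray has already been built in the parent
// (e.g. via the DOMXPath query above) and that doDomStuff() scrapes one URL.
$maxChildWorkers = 10;

// Split the full list into at most $maxChildWorkers roughly equal chunks.
$chunkSize  = max(1, (int) ceil(count($hrefsArray) / $maxChildWorkers));
$workChunks = array_chunk($hrefsArray, $chunkSize);

$pids = array();
foreach ($workChunks as $chunk) {
    $pid = pcntl_fork();

    if ($pid == -1) {
        die("Couldn't fork, error!");
    } elseif ($pid > 0) {
        // Parent: remember the child PID and keep forking.
        $pids[] = $pid;
    } else {
        // Child: process only this chunk, then exit so it never re-enters the loop.
        foreach ($chunk as $singleHref) {
            doDomStuff($singleHref);
        }
        exit(0);
    }
}

// Parent: wait for every child to finish.
foreach ($pids as $pid) {
    pcntl_waitpid($pid, $status);
}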


2 Answers

煞人兵器 2024-09-09 05:10:40


Introduction

pcntl_fork() is not the only way to improve the performance of your HTML scraper. While it might be a good idea to use a message queue as Charles suggested, you still need a fast, effective way to pull those requests in your workers.

Solution 1

Use curl_multi_init ... cURL is actually faster, and using multi cURL gives you parallel processing.

From the PHP docs:

curl_multi_init Allows the processing of multiple cURL handles in parallel.

So instead of using $html->loadHtmlFile('http://xxxx'); to load the files one after another, you can use curl_multi_init to load multiple URLs at the same time.

Here are some Interesting Implementations
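
A bare-bones sketch of the pattern (the URLs and options here are only placeholders, and error handling is omitted):

<?php
// Sketch: fetch every URL in parallel with curl_multi, then parse each
// response with DOMDocument/DOMXPath exactly as in the question.
$urls = array("http://example.com/product1", "http://example.com/product2");

$mh      = curl_multi_init();
$handles = array();

foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers until they are complete.
$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);           // avoid busy-waiting
} while ($running > 0);

foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);

    $dom = new DOMDocument();
    @$dom->loadHTML($html);           // @ suppresses warnings from sloppy markup
    $xPath = new DOMXPath($dom);
    // ... run the same XPath queries as before and save the results ...

    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}

curl_multi_close($mh);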

Solution 2

You can use pthreads for multi-threading in PHP.

Example

// Number of threads you want
$threads = 10;

// Thread storage
$ts = array();

// Your list of URLS // range just for demo
$urls = range(1, 50);

// Group Urls
$urlsGroup = array_chunk($urls, floor(count($urls) / $threads));

printf("%s:PROCESS  #load\n", date("g:i:s"));

$name = range("A", "Z");
$i = 0;
foreach ( $urlsGroup as $group ) {
    $ts[] = new AsyncScraper($group, $name[$i ++]);
}

printf("%s:PROCESS  #join\n", date("g:i:s"));

// wait for all Threads to complete
foreach ( $ts as $t ) {
    $t->join();
}

printf("%s:PROCESS  #finish\n", date("g:i:s"));

Output

9:18:00:PROCESS  #load
9:18:00:START  #5592     A
9:18:00:START  #9620     B
9:18:00:START  #11684    C
9:18:00:START  #11156    D
9:18:00:START  #11216    E
9:18:00:START  #11568    F
9:18:00:START  #2920     G
9:18:00:START  #10296    H
9:18:00:START  #11696    I
9:18:00:PROCESS  #join
9:18:00:START  #6692     J
9:18:01:END  #9620       B
9:18:01:END  #11216      E
9:18:01:END  #10296      H
9:18:02:END  #2920       G
9:18:02:END  #11696      I
9:18:04:END  #5592       A
9:18:04:END  #11568      F
9:18:04:END  #6692       J
9:18:05:END  #11684      C
9:18:05:END  #11156      D
9:18:05:PROCESS  #finish

Class Used

class AsyncScraper extends Thread {

    public function __construct(array $urls, $name) {
        $this->urls = $urls;
        $this->name = $name;
        $this->start();
    }

    public function run() {
        printf("%s:START  #%lu \t %s \n", date("g:i:s"), $this->getThreadId(), $this->name);
        if ($this->urls) {
            // Load with CURL
            // Parse with DOM
            // Do some work

            sleep(mt_rand(1, 5));
        }
        printf("%s:END  #%lu \t %s \n", date("g:i:s"), $this->getThreadId(), $this->name);
    }
}
最近可好 2024-09-09 05:10:40


It seems like I suggest this daily, but have you looked at Gearman? There's even a well-documented PECL class.

Gearman is a work queue system. You'd create workers that connect and listen for jobs, and clients that connect and send jobs. The client can either wait for the requested job to be completed, or it can fire it and forget. At your option, workers can even send back status updates, and how far through the process they are.

In other words, you get the benefits of multiple processes or threads, without having to worry about processes and threads. The clients and workers can even be on different machines.
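
To make that concrete, here is a minimal, hypothetical sketch using the PECL Gearman classes (the server address, job name, and file split are just placeholders); you would start as many copies of the worker script as you want parallel scrapers:

<?php
// worker.php - connects to gearmand and handles one "scrape_product" job at a time.
$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);   // default gearmand host/port

$worker->addFunction('scrape_product', function (GearmanJob $job) {
    $url = $job->workload();             // the product URL sent by the client

    $dom = new DOMDocument();
    @$dom->loadHTMLFile($url);
    // ... parse with DOMXPath and write the product to the database ...

    return "done: " . $url;
});

while ($worker->work()) {
    // keep handling jobs until the process is stopped
}

<?php
// client.php - run from the parent/scheduler: queue one job per product URL.
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);

foreach ($hrefsArray as $href) {
    // doBackground() fires and forgets; use doNormal() if you need to wait for the result.
    $client->doBackground('scrape_product', $href);
}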
