
Best way to manage a long-running PHP script?

Posted 2024-08-20 07:04:35

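I have a PHP script that takes a long time (5-30 minutes) to complete. In case it matters, the script uses curl to scrape data from another server. That is why it takes so long: it has to wait for each page to load before processing it and moving on to the next.

I want to be able to start the script and leave it running until it is done, at which point it will set a flag in a database table.

What I need to know is how to end the HTTP request before the script has finished running. Also, is a PHP script the best way to do this?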


穿越时光隧道 2024-08-27 07:04:35

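Update +12 years - Security Note

Whilst this is still a good way to invoke long-running code, it is good for security to limit or even disable the ability of PHP in the web server to launch other executables. And since this decouples the behaviour of the long-running code from whatever started it, in many cases it may be more appropriate to use a daemon or a cron job.

Original Answer

Certainly it can be done with PHP, but you should NOT simply run it as a background task: the new process has to be dissociated from the process group in which it is started.

Since people keep giving the same wrong answers to this FAQ, I've written a fuller answer here:

http://symcbean.blogspot.com/2010/02/php-and-long-running-processes.html

From the comments:

The short version is shell_exec('echo /usr/bin/php -q longThing.php | at now'); but the reasons "why" are a bit long for inclusion here.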
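
A minimal sketch of how the `at now` one-liner above might be wrapped so that paths are shell-escaped. The helper name `build_at_command` is hypothetical, not something from the linked post. It only builds the command string; actually running it assumes the `at` utility is installed and that `shell_exec` is permitted on the server.

```php
<?php
// Hypothetical helper: builds the "echo <job> | at now" command line.
function build_at_command(string $phpBinary, string $script): string
{
    // The job line is later interpreted by the shell that `at` starts,
    // so quote each part for that shell first.
    $job = escapeshellarg($phpBinary) . ' -q ' . escapeshellarg($script);

    // Quote the whole job line again for the shell that runs `echo`.
    return 'echo ' . escapeshellarg($job) . ' | at now';
}

$command = build_at_command('/usr/bin/php', 'longThing.php');
echo $command, PHP_EOL;

// To actually queue the job (requires `at` and shell_exec to be available):
// shell_exec($command);
```

Because `at` hands the job to its own scheduler, the long-running process is no longer tied to the web server process that queued it.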


幻想少年梦 2024-08-27 07:04:35

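The quick and dirty way would be to use the ignore_user_abort function in PHP. This basically says: don't care what the user does, run this script until it is finished. This is somewhat dangerous on a public-facing site (because if the script is started 20 times, you may end up with 20 or more copies of it running at the same time).

The "clean" way (at least IMHO) is to set a flag (in the database, for example) when you want to start the process, and to run a cron job every hour (or so) that checks whether the flag is set. If it IS set, the long-running script starts; if it is NOT set, nothing happens.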
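
A rough sketch of the flag-plus-cron idea, assuming a flag file instead of a database column so that the example stays self-contained. The file name, both function names, and the crontab line in the comment are hypothetical.

```php
<?php
// Hypothetical flag location; a database column would work the same way.
// Hypothetical crontab entry: 0 * * * * php /path/to/check_flag.php
$flag = sys_get_temp_dir() . '/scrape_requested.flag';

// Called from the web request: only records that a run is wanted.
function request_run(string $flag): void
{
    touch($flag);
}

// Called from cron: runs the job once if the flag is set.
function run_if_flagged(string $flag, callable $job): bool
{
    // unlink() succeeds for only one caller, so overlapping cron runs
    // cannot both claim the same request.
    if (!file_exists($flag) || !@unlink($flag)) {
        return false;
    }
    $job();
    return true;
}

request_run($flag);
$ran   = run_if_flagged($flag, function () { echo "scraping...\n"; });
$again = run_if_flagged($flag, function () { echo "scraping...\n"; });
```

Clearing the flag before the job starts means that 20 requests in a row still produce a single run, which avoids the pile-up described for ignore_user_abort.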


山有枢 2024-08-27 07:04:35

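You could use exec or system to start a background job, and then do the work in that.

Also, there are better approaches to scraping the web than the one you're using. You could use a threaded approach (multiple threads each doing one page at a time), or one using an event loop (one thread doing multiple pages at a time). My personal approach, using Perl, would be AnyEvent::HTTP.

ETA: symcbean explained how to properly detach the background process.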
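
A minimal sketch of the event-loop idea in PHP rather than Perl: one thread drives several transfers at once through PHP's `curl_multi_*` functions. This swaps in curl_multi for the AnyEvent::HTTP approach mentioned above. The function name `fetch_all` and the per-request timeout are assumptions, and the curl extension is assumed to be loaded.

```php
<?php
// Fetches all URLs concurrently and returns an array of url => body.
function fetch_all(array $urls, int $timeout = 30): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive every transfer from this single loop until none are running.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh, 1.0); // wait for network activity
        }
    } while ($running && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $url => $ch) {
        $bodies[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
    }
    return $bodies; // handles are released when they go out of scope
}

// Example: php fetch_all.php http://example.com/a http://example.com/b
$urls = array_slice($_SERVER['argv'] ?? [], 1);
foreach (fetch_all($urls) as $url => $body) {
    echo $url, ': ', strlen((string) $body), " bytes\n";
}
```

The total time is then roughly that of the slowest page rather than the sum of all pages, since the waits overlap.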


书信已泛黄 2024-08-27 07:04:35

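Yes, you can do it in PHP. But in addition to PHP it would be wise to use a queue manager. Here's the strategy:

Break your large task up into smaller tasks. In your case, each task could be loading a single page.
Send each small task to the queue.
Run your queue workers somewhere.

This strategy has the following advantages:

For long-running tasks, it can recover if a fatal problem occurs midway through the run, with no need to start again from the beginning.
If your tasks do not have to run sequentially, you can run multiple workers to process tasks simultaneously.

You have multiple options (these are just a few):

RabbitMQ (https://www.rabbitmq.com/tutorials/tutorial-one-php.html)
ZeroMQ (http://zeromq.org/bindings:php)
If you're using the Laravel framework, queues are built in (https://laravel.com/docs/5.4/queues), with drivers for Amazon SQS, Redis, and Beanstalkd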
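
The strategy can be sketched without a real broker by using PHP's built-in `SplQueue` in a single process. This is only an illustration of the task-splitting and retry idea, not how RabbitMQ or Laravel queues are actually wired up. The page URLs, the `fetch_page` stand-in, and the limit of three attempts are all assumptions.

```php
<?php
// Hypothetical page list; in practice this would come from your crawl plan.
$pages = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c'];

// One small task per page.
$queue = new SplQueue();
foreach ($pages as $url) {
    $queue->enqueue(['url' => $url, 'attempts' => 0]);
}

// Stand-in for the real curl fetch: page "b" fails on its first attempt.
function fetch_page(string $url): bool
{
    static $failedOnce = false;
    if (substr($url, -2) === '/b' && !$failedOnce) {
        $failedOnce = true;
        return false;
    }
    return true;
}

// Worker loop: a failed task is re-queued instead of restarting the whole job.
$done = [];
while (!$queue->isEmpty()) {
    $task = $queue->dequeue();
    $task['attempts']++;
    if (fetch_page($task['url'])) {
        $done[] = $task['url'];
    } elseif ($task['attempts'] < 3) {
        $queue->enqueue($task);
    }
}

echo implode("\n", $done), PHP_EOL;
```

With a real queue manager the queue lives outside the PHP process, so several workers can pull tasks from it and a crashed worker loses only the task it was holding.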


扭转时空 2024-08-27 07:04:35

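No, PHP is not the best solution.

I'm not sure about Ruby or Perl, but with Python you could rewrite your page scraper to be multi-threaded, and it would probably run at least 20x faster. Writing multi-threaded apps can be somewhat of a challenge, but the very first Python app I wrote was a multi-threaded page scraper. And you could simply call the Python script from within your PHP page by using one of the shell execution functions.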


小女人ら 2024-08-27 07:04:35

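PHP may or may not be the best tool, but you know how to use it, and the rest of your application is written in it. These two qualities, combined with the fact that PHP is "good enough", make a pretty strong case for using it instead of Perl, Ruby, or Python.

If your goal is to learn another language, then pick one and use it. Any language you mentioned will do the job, no problem. I happen to like Perl, but what you like may be different.

Symcbean has some good advice about how to manage background processes at his link.

In short, write a CLI PHP script to handle the long bits. Make sure that it reports status in some way. Make a PHP page to handle status updates, using either AJAX or traditional methods. Your kickoff script will then start the process running in its own session, and return confirmation that the process is under way.

Good luck.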


喜爱皱眉﹌ 2024-08-27 07:04:35

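I agree with the answers that say this should be run in a background process. But it's also important that you report on the status so the user knows that the work is being done.

When receiving the PHP request to kick off the process, you could store a representation of the task, with a unique identifier, in a database. Then start the screen-scraping process, passing it the unique identifier. Report back to the iPhone app that the task has been started and that it should check a specified URL, containing the new task ID, to get the latest status. The iPhone application can now poll (or even "long poll") this URL. In the meantime, the background process updates the database representation of the task as it works, with a completion percentage, current step, or whatever other status indicators you'd like. When it has finished, it sets a completed flag.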
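
A small sketch of the task record described above, using an in-memory array as a stand-in for the database table. The function names and the field names (`percent`, `step`, `done`) are hypothetical; a real version would read and write table rows instead.

```php
<?php
// In-memory stand-in for the database table that holds task records.
$tasks = [];

// Called by the kickoff request: stores a task and returns its unique ID.
function create_task(array &$tasks): string
{
    $id = bin2hex(random_bytes(8));
    $tasks[$id] = ['percent' => 0, 'step' => 'queued', 'done' => false];
    return $id;
}

// Called by the background process as it works.
function update_task(array &$tasks, string $id, int $percent, string $step): void
{
    $tasks[$id]['percent'] = $percent;
    $tasks[$id]['step'] = $step;
    $tasks[$id]['done'] = ($percent >= 100);
}

// Called by the status URL that the client polls.
function task_status(array $tasks, string $id): string
{
    return json_encode($tasks[$id] ?? ['error' => 'unknown task']);
}

$id = create_task($tasks);
update_task($tasks, $id, 40, 'fetching page 4 of 10');
$midway = task_status($tasks, $id);
update_task($tasks, $id, 100, 'finished');
$final = task_status($tasks, $id);
echo $midway, PHP_EOL, $final, PHP_EOL;
```

The kickoff response only needs to return `$id`; the client then polls the status URL with that ID until `done` is true.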


笑叹一世浮沉 2024-08-27 07:04:35

You can send it as an XHR (Ajax) request. Unlike normal HTTP requests, clients don't usually apply any timeout to XHRs.


爱要勇敢去追 2024-08-27 07:04:35

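I realize this is quite an old question, but I would like to give it a shot. This script tries to address both requirements: the initial kick-off call finishes quickly, and the heavy load is chopped into smaller chunks. I haven't tested this solution.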

<?php
/**
 * crawler.php located at http://mysite.com/crawler.php
 */

// Make sure this script keeps running after the client closes the
// connection.
ignore_user_abort(TRUE);


function get_remote_sources_to_crawl() {
  // Do a database or a log file query here.

  $query_result = array (
    1 => 'http://exemple.com',
    2 => 'http://exemple1.com',
    3 => 'http://exemple2.com',
    4 => 'http://exemple3.com',
    // ... and so on.
  );

  // Return the first entry on the list as array($id, $url).
  foreach ($query_result as $id => $url) {
    return array($id, $url);
  }
  return FALSE;
}

function update_remote_sources_to_crawl($id) {
  // Update my database or log file list so the $id record won't show up
  // on my next call to get_remote_sources_to_crawl()
}

$crawling_source = get_remote_sources_to_crawl();

if ($crawling_source) {
  list($id, $url) = $crawling_source;

  // Run your scraping code on $url here, and set this flag to TRUE once it
  // has succeeded.
  $your_scraping_has_finished = FALSE;

  if ($your_scraping_has_finished) {
    // Update your database or log file.
    update_remote_sources_to_crawl($id);

    $ctx = stream_context_create(array(
      'http' => array(
        // I am not quite sure but I reckon the timeout set here actually
        // starts rolling after the connection to the remote server is made
        // limiting only how long the downloading of the remote content should take.
        // So as we are only interested to trigger this script again, 5 seconds
        // should be plenty of time.
        'timeout' => 5,
      )
    ));

    // Open a new connection to this script and close it after 5 seconds.
    file_get_contents('http://' . $_SERVER['HTTP_HOST'] . '/crawler.php', FALSE, $ctx);

    print 'The cronjob kick off has been initiated.';
  }
}
else {
  print 'Yay! The whole thing is done.';
}



胡渣熟男 2024-08-27 07:04:35

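I would like to propose a solution that is a little different from symcbean's, mainly because I have the additional requirement that the long-running process must run as another user, not as the apache / www-data user.

First solution, using cron to poll a background task table:

The PHP web page inserts a row into a background task table with state 'SUBMITTED'.
cron runs once every 3 minutes, as another user, and runs a PHP CLI script that checks the background task table for 'SUBMITTED' rows.
The PHP CLI script updates the state column of the row to 'PROCESSING' and begins processing; on completion it updates the state to 'COMPLETED'.

Second solution, using the Linux inotify facility:

The PHP web page updates a control file with the parameters set by the user, and also assigns a task id.
A shell script (running as a non-www user) runs inotifywait and waits for the control file to be written.
After the control file is written, a close_write event is raised and the shell script continues.
The shell script executes PHP CLI to do the long-running process.
PHP CLI writes its output to a log file identified by the task id, or alternatively updates progress in a status table.
The PHP web page can poll the log file (based on the task id) to show the progress of the long-running process, or it can query the status table.

Some additional information can be found in my post: http://inventorsparadox.blogspot.co.id/2016/01/long-running-process-in-linux-using-php.html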
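
The state transitions of the first solution can be sketched as one cron pass over the task table. An in-memory array stands in for the table here, and the function name `process_submitted` is hypothetical; the state names are the ones used above.

```php
<?php
// In-memory stand-in for the background task table.
$table = [
    ['id' => 1, 'state' => 'COMPLETED'],
    ['id' => 2, 'state' => 'SUBMITTED'],
    ['id' => 3, 'state' => 'SUBMITTED'],
];

// One cron pass: process every SUBMITTED row and return the IDs handled.
function process_submitted(array &$table, callable $work): array
{
    $handled = [];
    foreach ($table as &$row) {
        if ($row['state'] !== 'SUBMITTED') {
            continue; // already being processed, or finished
        }
        $row['state'] = 'PROCESSING';
        $work($row['id']);
        $row['state'] = 'COMPLETED';
        $handled[] = $row['id'];
    }
    unset($row); // break the reference left by foreach
    return $handled;
}

$handled = process_submitted($table, function (int $id) {
    echo "processing task $id\n";
});
```

With a real database, the update to 'PROCESSING' should be committed before the work starts, so that a cron run which begins 3 minutes later skips rows that are still in progress.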


软糯酥胸 2024-08-27 07:04:35

I have done similar things with Perl, using a double fork() and detaching from the parent process. All HTTP fetching work should be done in the forked process.


源来凯始玺欢你 2024-08-27 07:04:35

Use a proxy to delegate the request.


想你只要分分秒秒 2024-08-27 07:04:35

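What I ALWAYS use is one of these variants (because different flavors of Linux have different rules about handling output, and some programs produce output differently):

Variant I
@exec('./myscript.php >/dev/null 2>&1 &');

Variant II
@exec('php -f myscript.php >/dev/null 2>&1 &');

Variant III
@exec('nohup myscript.php >/dev/null 2>&1 &');

You may need to install "nohup". But, for example, when I was automating FFMPEG video conversions, the output was somehow not fully handled by redirecting output streams 1 and 2, so I used nohup AND redirected the output.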


白首有我共你 2024-08-27 07:04:35

If your script is long, split the page's work into tasks driven by an input parameter, so that each page request acts like a thread. For example, if a page loops over 1 lakh (100,000) product_keywords in one long process, write the logic for a single keyword instead of the loop, and pass that keyword in from magic or cornjobpage.php (in the example below).

For the background worker, I think you should try this technique: it lets you call as many pages as you like, and they all run at once, independently, without the caller waiting for each page's response.

cornjobpage.php //mainpage

    <?php

    post_async("http://localhost/projectname/testpage.php", "Keywordname=testValue");
    //post_async("http://localhost/projectname/testpage.php", "Keywordname=testValue2");
    //post_async("http://localhost/projectname/otherpage.php", "Keywordname=anyValue");
    // Call as many pages as you like; they all run at once, independently,
    // without waiting for each page's response.

    /*
     * Requests a PHP page asynchronously so the current page does not have
     * to wait for it to finish running.
     */
    function post_async($url, $params)
    {
        $post_string = $params;

        $parts = parse_url($url);

        $fp = fsockopen($parts['host'],
            isset($parts['port']) ? $parts['port'] : 80,
            $errno, $errstr, 30);
        if (!$fp) {
            return false; // connection failed; see $errno and $errstr
        }

        // The parameters travel in the query string, so no body headers are
        // sent. To use POST instead, send $post_string as the body together
        // with Content-Type and Content-Length headers.
        $out = "GET ".$parts['path']."?$post_string"." HTTP/1.1\r\n";
        $out.= "Host: ".$parts['host']."\r\n";
        $out.= "Connection: Close\r\n\r\n";
        fwrite($fp, $out);
        fclose($fp); // close without reading the response
        return true;
    }
    ?>

testpage.php

    <?php
    echo $_REQUEST["Keywordname"];//case1 Output > testValue
    ?>

PS: If you want to send URL parameters in a loop, follow this answer: https://stackoverflow.com/a/41225209/6295712



善良天后 2024-08-27 07:04:35

Not the best approach, as many have stated here, but this might help:

ignore_user_abort(1); // run script in background even if user closes browser
set_time_limit(1800); // run it for 30 minutes

// Long running script here
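
A hedged sketch of how the same two calls might be combined with ending the HTTP response early, which is what the question asks for. It assumes PHP-FPM, where `fastcgi_finish_request()` is available; under other SAPIs that function does not exist, and this sketch falls back to a plain `flush()`. The wrapper name `finish_response_early` is hypothetical.

```php
<?php
// Sends the response to the client now (when possible) and returns whether
// the connection was actually released.
function finish_response_early(string $message): bool
{
    ignore_user_abort(true);   // keep running if the client disconnects
    set_time_limit(1800);      // allow up to 30 minutes, as above

    echo $message;

    // fastcgi_finish_request() exists only under PHP-FPM.
    if (function_exists('fastcgi_finish_request')) {
        return fastcgi_finish_request();
    }
    flush(); // elsewhere, at least push buffered output toward the client
    return false;
}

$released = finish_response_early("Job started\n");

// Long running script here
```

When `$released` is true the browser has already received its response, and the rest of the script continues on the server.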

