Best way to manage a long-running PHP script?

Posted on 2024-08-20 07:04:35

I have a PHP script that takes a long time (5-30 minutes) to complete. Just in case it matters, the script is using curl to scrape data from another server. This is the reason it's taking so long; it has to wait for each page to load before processing it and moving to the next.

I want to be able to initiate the script and let it be until it's done, which will set a flag in a database table.

What I need to know is how to be able to end the HTTP request before the script is finished running. Also, is a PHP script the best way to do this?

Comments (16)

穿越时光隧道 2024-08-27 07:04:35

Update +12 years - Security Note

While this is still a good way to invoke a long-running bit of code, it is good for security to limit or even disable the ability of PHP in the webserver to launch other executables. And since this decouples the behaviour of the long-running thing from that which started it, in many cases it may be more appropriate to use a daemon or a cron job.

Original Answer

Certainly it can be done with PHP; however, you should NOT do this as a plain background task - the new process has to be dissociated from the process group where it is initiated.

Since people keep giving the same wrong answer to this FAQ, I've written a fuller answer here:

http://symcbean.blogspot.com/2010/02/php-and-long-running-processes.html

From the comments:

The short version is shell_exec('echo /usr/bin/php -q longThing.php | at now'); but the reasons why are a bit long for inclusion here.
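
A minimal sketch of that pattern, assuming a hypothetical longThing.php that does the curl work and, as the question describes, sets a flag in a database table when it completes:

<?php
// kickoff.php - the page the user requests; it returns immediately.
// Piping the command to "at now" hands it to atd, which runs it
// detached from the web server's process group, so this HTTP request
// can end while longThing.php keeps running to completion.
shell_exec('echo /usr/bin/php -q longThing.php | at now');
echo 'Scrape started; poll the database flag to see when it is done.';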

幻想少年梦 2024-08-27 07:04:35

The quick and dirty way would be to use the ignore_user_abort function in PHP. This basically says: don't care what the user does, run this script until it is finished. This is somewhat dangerous if it is a public-facing site (because it is possible that you end up with 20+ copies of the script running at the same time if it is initiated 20 times).

The "clean" way (at least IMHO) is to set a flag (in the DB, for example) when you want to initiate the process, and run a cronjob every hour (or so) to check whether that flag is set. If it IS set, the long-running script starts; if it is NOT set, nothing happens.
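
A rough sketch of that flag-plus-cron pattern, assuming a hypothetical jobs table with a status column (all names here are invented for illustration); the worker is run from crontab, not over HTTP:

<?php
// worker.php - run from cron, e.g.:  0 * * * * php /path/to/worker.php
set_time_limit(0); // no execution time limit for the long-running work

$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

// Claim one flagged job; if nothing is flagged, do nothing.
$job = $db->query("SELECT id FROM jobs WHERE status = 'pending' LIMIT 1")->fetch();
if ($job) {
    $update = $db->prepare('UPDATE jobs SET status = ? WHERE id = ?');
    $update->execute(['running', $job['id']]);

    // ... the long-running scraping work goes here ...

    $update->execute(['done', $job['id']]);
}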

山有枢 2024-08-27 07:04:35

You could use exec or system to start a background job, and then do the work in that.

Also, there are better approaches to scraping the web than the one you're using. You could use a threaded approach (multiple threads doing one page at a time), or one using an event loop (one thread doing multiple pages at a time). My personal approach using Perl would be AnyEvent::HTTP.

ETA: symcbean explained how to detach the background process properly in his answer above.
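
A minimal hedged example of that kickoff (the script path is a placeholder); the output redirection and the trailing & are what keep exec() from blocking until the job finishes:

<?php
// Launch the scraper in the background and return immediately.
// Without the redirects and the trailing &, exec() would wait for
// the script to finish before this HTTP request could end.
exec('php /path/to/scraper.php > /dev/null 2>&1 &');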

书信已泛黄 2024-08-27 07:04:35

Yes, you can do it in PHP. But in addition to PHP it would be wise to use a Queue Manager. Here's the strategy:

  1. Break up your large task into smaller tasks. In your case, each task could be loading a single page.

  2. Send each small task to the queue.

  3. Run your queue workers somewhere.

Using this strategy has the following advantages:

  1. For long running tasks it has the ability to recover in case a fatal problem occurs in the middle of the run -- no need to start from the beginning.

  2. If your tasks do not have to be run sequentially, you can run multiple workers to run tasks simultaneously.

You have a variety of options (these are just a few; a sketch using the first follows the list):

  1. RabbitMQ (https://www.rabbitmq.com/tutorials/tutorial-one-php.html)
  2. ZeroMQ (http://zeromq.org/bindings:php)
  3. If you're using the Laravel framework, queues are built-in (https://laravel.com/docs/5.4/queues), with drivers for AWS SQS, Redis, Beanstalkd
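
As a hedged illustration of step 2 with option 1, using the php-amqplib client from the linked RabbitMQ tutorial (the queue name and payload here are invented):

<?php
require_once __DIR__ . '/vendor/autoload.php';

use PhpAmqpLib\Connection\AMQPStreamConnection;
use PhpAmqpLib\Message\AMQPMessage;

// Connect and declare a durable queue for the per-page scrape tasks.
$connection = new AMQPStreamConnection('localhost', 5672, 'guest', 'guest');
$channel = $connection->channel();
$channel->queue_declare('scrape_tasks', false, true, false, false);

// One small task per page; workers pick these up independently.
$msg = new AMQPMessage(
    json_encode(array('url' => 'http://example.com/page/1')),
    array('delivery_mode' => AMQPMessage::DELIVERY_MODE_PERSISTENT)
);
$channel->basic_publish($msg, '', 'scrape_tasks');

$channel->close();
$connection->close();
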
扭转时空 2024-08-27 07:04:35

No, PHP is not the best solution.

I'm not sure about Ruby or Perl, but with Python you could rewrite your page scraper to be multi-threaded and it would probably run at least 20x faster. Writing multi-threaded apps can be somewhat of a challenge, but the very first Python app I wrote was a multi-threaded page scraper. And you could simply call the Python script from within your PHP page by using one of the shell execution functions.

小女人ら 2024-08-27 07:04:35

PHP may or may not be the best tool, but you know how to use it, and the rest of your application is written using it. These two qualities, combined with the fact that PHP is "good enough" make a pretty strong case for using it, instead of Perl, Ruby, or Python.

If your goal is to learn another language, then pick one and use it. Any language you mentioned will do the job, no problem. I happen to like Perl, but what you like may be different.

Symcbean has some good advice about how to manage background processes at his link.

In short, write a CLI PHP script to handle the long bits. Make sure that it reports status in some way. Make a PHP page to handle status updates, either using AJAX or traditional methods. Your kickoff script will then start the process running in its own session, and return confirmation that the process is going.
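
A hedged sketch of the status-reporting half of that (the jobs table, its columns, and the page list are placeholders):

<?php
// longtask.php - CLI script, started detached as described above.
// It takes the job id as its first argument and records progress so
// the status page has something to report.
$jobId = (int)($argv[1] ?? 0);
$db    = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

$pages = array('http://example.com/1', 'http://example.com/2'); // placeholder list
$stmt  = $db->prepare('UPDATE jobs SET progress = ? WHERE id = ?');

foreach ($pages as $i => $url) {
    // ... curl fetch and processing for $url ...
    $stmt->execute(array((int)((($i + 1) / count($pages)) * 100), $jobId));
}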

Good luck.

喜爱皱眉﹌ 2024-08-27 07:04:35

I agree with the answers that say this should be run in a background process. But it's also important that you report on the status so the user knows that the work is being done.

When receiving the PHP request to kick off the process, you could store in a database a representation of the task with a unique identifier. Then, start the screen-scraping process, passing it the unique identifier. Report back to the iPhone app that the task has been started and that it should check a specified URL, containing the new task ID, to get the latest status. The iPhone application can now poll (or even "long poll") this URL. In the meantime, the background process would update the database representation of the task as it worked with a completion percentage, current step, or whatever other status indicators you'd like. And when it has finished, it would set a completed flag.
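
A minimal sketch of the status URL the app would poll (the tasks table and its columns are invented for illustration):

<?php
// status.php?task=123 - polled by the client for the latest status.
$db   = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
$stmt = $db->prepare('SELECT progress, completed FROM tasks WHERE id = ?');
$stmt->execute(array((int)($_GET['task'] ?? 0)));

header('Content-Type: application/json');
echo json_encode($stmt->fetch(PDO::FETCH_ASSOC) ?: array('error' => 'unknown task'));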

笑叹一世浮沉 2024-08-27 07:04:35

You can send it as an XHR (Ajax) request. Clients don't usually have any timeout for XHRs, unlike normal HTTP requests.

爱要勇敢去追 2024-08-27 07:04:35

I realize this is a quite old question but would like to give it a shot. This script tries both to let the initial kick-off call finish quickly and to chop the heavy load into smaller chunks. I haven't tested this solution.

<?php
/**
 * crawler.php located at http://mysite.com/crawler.php
 */

// Make sure this script will keep on running after we close the
// connection to it.
ignore_user_abort(TRUE);


function get_remote_sources_to_crawl() {
  // Do a database or a log file query here.

  $query_result = array (
    1 => 'http://example.com',
    2 => 'http://example1.com',
    3 => 'http://example2.com',
    4 => 'http://example3.com',
    // ... and so on.
  );

  // Return the first id/url pair on the list.
  foreach ($query_result as $id => $url) {
    return array($id, $url);
  }
  return FALSE;
}

function update_remote_sources_to_crawl($id) {
  // Update my database or log file list so the $id record won't show up
  // on my next call to get_remote_sources_to_crawl().
}

$crawling_source = get_remote_sources_to_crawl();

if ($crawling_source) {
  list($id, $url) = $crawling_source;

  // Run your scraping code on $url here.

  if ($your_scraping_has_finished) {
    // Update your database or log file.
    update_remote_sources_to_crawl($id);

    $ctx = stream_context_create(array(
      'http' => array(
        // I am not quite sure, but I reckon the timeout set here only
        // starts rolling after the connection to the remote server is
        // made, limiting how long downloading the remote content may
        // take. As we only want to trigger this script again, 5 seconds
        // should be plenty of time.
        'timeout' => 5,
      )
    ));

    // Open a new connection to this script and close it 5 seconds in.
    file_get_contents('http://' . $_SERVER['HTTP_HOST'] . '/crawler.php', FALSE, $ctx);

    print 'The cronjob kick off has been initiated.';
  }
}
else {
  print 'Yay! The whole thing is done.';
}

胡渣熟男 2024-08-27 07:04:35

I would like to propose a solution that is a little different from symcbean's, mainly because I have an additional requirement: the long-running process needs to run as another user, not as the apache / www-data user.

First solution using cron to poll a background task table:

  • PHP web page inserts into a background task table, state 'SUBMITTED'
  • cron runs once every 3 minutes as another user, running a PHP CLI script that checks the background task table for 'SUBMITTED' rows
  • the PHP CLI script updates the row's state column to 'PROCESSING' and begins processing; after completion it is updated to 'COMPLETED'

Second solution using the Linux inotify facility:

  • PHP web page updates a control file with the parameters set by the user, and also assigns a task id (a sketch follows this list)
  • a shell script (running as a non-www user) runs inotifywait to wait for the control file to be written
  • after the control file is written, a close_write event is raised and the shell script continues
  • the shell script executes PHP CLI to do the long-running process
  • PHP CLI writes its output to a log file identified by the task id, or alternatively updates progress in a status table
  • the PHP web page can poll the log file (based on the task id) to show the progress of the long-running process, or it can query the status table
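
As a hedged sketch of the first step of that second solution (the spool path and payload fields are invented):

<?php
// The web page writes the control file that inotifywait is watching;
// the close_write event then wakes the waiting shell script.
$taskId = uniqid('task_', true);
file_put_contents('/var/spool/myapp/control', json_encode(array(
  'task_id' => $taskId,
  'params'  => $_POST,
)));
echo "Task $taskId submitted.";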

Some additional info can be found in my post: http://inventorsparadox.blogspot.co.id/2016/01/long-running-process-in-linux-using-php.html

软糯酥胸 2024-08-27 07:04:35

I have done similar things with Perl, double fork() and detaching from the parent process. All HTTP fetching work should be done in the forked process.

源来凯始玺欢你 2024-08-27 07:04:35

Use a proxy to delegate the request.

想你只要分分秒秒 2024-08-27 07:04:35

What I ALWAYS use is one of these variants (because different flavors of Linux have different rules about handling output, and some programs output differently):

Variant I
@exec('./myscript.php 1>/dev/null 2>/dev/null &');

Variant II
@exec('php -f myscript.php 1>/dev/null 2>/dev/null &');

Variant III
@exec('nohup myscript.php 1>/dev/null 2>/dev/null &');

You might have to install nohup. But for example, when I was automating FFMPEG video conversions, the output interface somehow wasn't 100% handled by redirecting output streams 1 and 2, so I used nohup AND redirected the output.

白首有我共你 2024-08-27 07:04:35

If you have a long script, divide the page's work into smaller tasks using an input parameter for each task (each page then acts like a thread); i.e. if a page has a long process loop over 1 lac (100,000) product keywords, write the logic for a single keyword instead, and pass each keyword in from the main page (cornjobpage.php in the following example).

For the background worker, try this technique: it lets you call as many pages as you like, and all of them will run at once, independently, without waiting for each page's response, i.e. asynchronously.

cornjobpage.php //mainpage

    <?php
    post_async("http://localhost/projectname/testpage.php", "Keywordname=testValue");
    //post_async("http://localhost/projectname/testpage.php", "Keywordname=testValue2");
    //post_async("http://localhost/projectname/otherpage.php", "Keywordname=anyValue");
    // Call as many pages as you like; all of them will run at once,
    // independently, without waiting for each page's response (asynchronous).

    /**
     * Executes a PHP page asynchronously so the current page does not
     * have to wait for it to finish running.
     */
    function post_async($url, $params)
    {
        $post_string = $params;
        $parts = parse_url($url);

        $fp = fsockopen($parts['host'],
            isset($parts['port']) ? $parts['port'] : 80,
            $errno, $errstr, 30);

        // You can use POST instead of GET if you like.
        $out  = "GET " . $parts['path'] . "?$post_string" . " HTTP/1.1\r\n";
        $out .= "Host: " . $parts['host'] . "\r\n";
        $out .= "Content-Type: application/x-www-form-urlencoded\r\n";
        $out .= "Content-Length: " . strlen($post_string) . "\r\n";
        $out .= "Connection: Close\r\n\r\n";

        // Fire and forget: send the request and close without reading
        // the response, so this page returns immediately.
        fwrite($fp, $out);
        fclose($fp);
    }
    ?>

testpage.php

    <?php
    echo $_REQUEST["Keywordname"]; // case 1 output: testValue
    ?>

PS: if you want to send URL parameters in a loop, then follow this answer: https://stackoverflow.com/a/41225209/6295712

善良天后 2024-08-27 07:04:35

Not the best approach, as many stated here, but this might help:

ignore_user_abort(1); // run script in background even if user closes browser
set_time_limit(1800); // run it for 30 minutes

// Long running script here
御守 2024-08-27 07:04:35

If the desired output of your script is some processing, not a webpage, then I believe the desired solution is to run your script from a shell, simply as

php my_script.php
