I am currently trying to implement a job queue in PHP. The queue will then be processed as a batch job and should be able to process some jobs in parallel.
I have already done some research and found several ways to implement it, but I am not really aware of their advantages and disadvantages.
For example, doing the parallel processing by calling a script several times through fsockopen, as explained here:
Easy parallel processing in PHP
Another way I found was using the curl_multi functions.
curl_multi_exec PHP docs
But I think those two approaches would add quite a lot of overhead for batch processing on a queue that should mainly run in the background?
I also read about pcntl_fork, which also seems to be a way to handle the problem. But it looks like it can get really messy if you don't really know what you are doing (like me at the moment).
I also had a look at Gearman, but there I would also need to spawn the worker threads dynamically as needed, rather than just run a few and let the Gearman job server send the jobs to the free workers. This matters especially because the threads should exit cleanly after one job has been executed, so as not to run into eventual memory leaks (the code may not be perfect in that respect).
Gearman Getting Started
So my question is: how do you handle parallel processing in PHP? Why do you choose your method, and which advantages/disadvantages do the different methods have?
I use exec(). It's easy and clean. You basically need to build a thread manager and thread scripts that will do what you need.
I don't like fsockopen() because it opens a server connection; those connections build up and may hit Apache's connection limit.
I don't like the curl functions for the same reason.
I don't like pcntl because it needs the pcntl extension to be available, and you have to keep track of parent/child relations.
Never played with Gearman...
Well, I guess we have 3 options there:
A. Multi-Thread:
PHP does not support multithreading natively.
But there is an (experimental) PHP extension called pthreads (https://github.com/krakjoe/pthreads) that allows you to do just that.
B. Multi-Process:
This can be done in 3 ways:
C. Distributed Parallel Processing:
How it works:
The client app sends data (a message, which can be JSON formatted) to the MQ engine (which can be local or an external web service).
The MQ engine stores the data (mostly in memory, optionally in a database) inside a queue (you can define the queue name).
The client app asks the MQ engine for data (messages) to process in order (FIFO or based on priority); you can also request data from a specific queue.
Some MQ engines:
a message orientated IPC Library, is a Message Queue Server in Erlang, stores jobs in memory. It is a socket library that acts as a concurrency framework. Faster than TCP for clustered products and supercomputing.
self hosted, Enterprise Message Queues, Not really a work queue - but rather a message queue that can be used as a work queue but requires additional semantics.
(Laravel built-in support, built by Facebook, for work queues) - has a "Beanstalkd console" tool which is very nice
(problem: centralized broker system for distributed processing)
the most popular open source message broker in Java (problem: lots of bugs and problems)
(Laravel built-in support, hosted, so no administration is required. Not really a work queue, so it requires extra work to handle semantics such as burying a job)
(Laravel built-in support, written in Go, available both as a cloud version and on-premise)
(Laravel built-in support, not that fast, as it's not designed for that)
(written in Ruby, based on memcache)
(written in Ruby, based on memcache, built at Twitter)
(just another QM)
(written at LinkedIn in Scala)
open source, high-performance and lightweight queue manager (written in C)
More of them can be found here: https://github.com/lukaszx0/queues.io/blob/master/projects.yml
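The three steps under "How it works" can be tried out with a toy, in-process stand-in for an MQ engine. This is only a rough sketch: the class name ToyQueueEngine and its push()/pop() methods are made up for illustration and are not the API of any engine in the list, and a real engine would run as a separate service rather than inside your script.

```php
<?php
// Toy stand-in for an MQ engine: named FIFO queues held in memory.
// ToyQueueEngine, push() and pop() are hypothetical names for illustration only.
final class ToyQueueEngine
{
    /** @var array<string, SplQueue> one FIFO queue per queue name */
    private array $queues = [];

    // Steps 1 and 2: the client sends a message, the engine stores it as JSON in a named queue.
    public function push(string $queue, array $payload): void
    {
        $this->queues[$queue] ??= new SplQueue();
        $this->queues[$queue]->enqueue(json_encode($payload));
    }

    // Step 3: a client asks for the next message of a specific queue (FIFO order).
    public function pop(string $queue): ?array
    {
        if (!isset($this->queues[$queue]) || $this->queues[$queue]->isEmpty()) {
            return null; // nothing waiting in this queue
        }
        return json_decode($this->queues[$queue]->dequeue(), true);
    }
}

$engine = new ToyQueueEngine();
$engine->push('emails', ['to' => 'a@example.com']);
$engine->push('emails', ['to' => 'b@example.com']);

// A worker drains the queue in the order the messages arrived.
while (($job = $engine->pop('emails')) !== null) {
    echo "processing mail to {$job['to']}\n";
}
```

With a real engine the push and the pop would happen in different processes (or on different machines), which is where the parallelism comes from.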
First of all, this answer is based on a Linux environment.
Yet another PECL extension is parallel. You can install it by issuing
pecl install parallel
but it has some prerequisites. Once it is installed, add extension=parallel.so to php.ini, then see the full example gist: https://gist.github.com/krakjoe/0ee02b887288720d9b785c9f947f3a0a
or the official PHP site: https://www.php.net/manual/en/book.parallel.php
Use the native PHP (7.2+) parallel extension, i.e.:
(BTW, you will need to go the hard way and install PHP with ZTS support, and then enable parallel. I recommend phpbrew for that.)
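As a rough idea of what using parallel looks like, here is a minimal sketch (not the code originally posted with this answer). It assumes a ZTS build with the extension loaded and uses only parallel\Runtime::run() and Future::value(); the $sumRange job and the chunk sizes are made up for illustration. Without the extension the script just says so and skips the parallel part.

```php
<?php
use parallel\Runtime;

// The job: a plain closure. It is copied into each runtime,
// so it must not rely on objects or state from the parent script.
$sumRange = function (int $from, int $to): int {
    $sum = 0;
    for ($i = $from; $i <= $to; $i++) {
        $sum += $i;
    }
    return $sum;
};

if (extension_loaded('parallel')) {
    // One runtime (thread) per chunk; run() returns a Future immediately.
    $runtimes = [];
    $futures  = [];
    foreach ([[1, 1000], [1001, 2000], [2001, 3000]] as $index => $range) {
        $runtimes[$index] = new Runtime(); // keep a reference so the runtime stays alive
        $futures[$index]  = $runtimes[$index]->run($sumRange, $range);
    }

    // value() blocks until the corresponding task has finished.
    $total = 0;
    foreach ($futures as $future) {
        $total += $future->value();
    }
    echo "total: $total\n";
} else {
    echo "parallel extension not loaded, nothing was run in parallel\n";
}
```

Each Runtime is a separate thread with its own interpreter state, so results only come back through the Future (or through channels, which this sketch does not use).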
Here's a summary of a few options for parallel processing in PHP.
AMP
Check out Amp (asynchronous concurrency made simple). It looks to be the most mature PHP library I've seen for parallel processing.
Peec's Process Class
This class was posted in the comments on PHP's exec() function and provides a really simple starting point for forking new processes and keeping track of them.
Example:
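The original class and its example are not reproduced here. What follows is a minimal sketch in the same spirit, assuming a Unix-like system with a POSIX shell; the class name BackgroundProcess and its methods are my own naming, not Peec's code.

```php
<?php
// Minimal background-process wrapper for Unix-like systems.
// A sketch in the spirit of the class described above, not the original code.
final class BackgroundProcess
{
    private int $pid;

    public function __construct(string $command)
    {
        // Redirect output and append "&" so exec() returns at once;
        // "echo $!" prints the PID of the process that was just put in the background.
        exec(sprintf('%s > /dev/null 2>&1 & echo $!', $command), $output);
        $this->pid = (int) $output[0];
    }

    public function pid(): int
    {
        return $this->pid;
    }

    public function isRunning(): bool
    {
        // "kill -0" sends no signal; it only reports whether the PID still exists.
        exec('kill -0 ' . $this->pid . ' 2>/dev/null', $output, $status);
        return $status === 0;
    }

    public function stop(): void
    {
        exec('kill ' . $this->pid . ' 2>/dev/null');
    }
}

$process = new BackgroundProcess('sleep 2');
echo "started PID {$process->pid()}\n";
var_dump($process->isRunning());
$process->stop();
```

A manager script can create several of these, poll isRunning() in a loop, and start the next job whenever a slot frees up.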
Other Options Compared
There's also a great article Async processing or multitasking in PHP that explains the pros and cons of various approaches:
Doorman
Then, there's also this simple tutorial which was wrapped up into a little library called Doorman.
Hope these links provide a useful starting point for more research.
If your application is going to run in a Unix/Linux environment, I would suggest you go with the forking option. It's basically child's play to get it working. I have used it for a cron manager and had code for it to revert to a Windows-friendly code path if forking was not an option.
The options that run the entire script several times do, as you state, add quite a bit of overhead. If your script is small, it might not be a problem. But you will probably get used to doing parallel processing in PHP the way you choose now, and next time, when you have a job that uses 200 MB of data, it might very well be a problem. So you'd be better off learning a way that you can stick with.
I have also tested Gearman and I like it a lot. There are a few things to think about, but as a whole it offers a very good way to distribute work to different servers running different applications written in different languages. Besides setting it up, actually using it from within PHP, or any other language for that matter, is... once again... child's play.
It could very well be overkill for what you need to do. But it will open your eyes to new possibilities when it comes to handling data and jobs, so I would recommend you try Gearman for that fact alone.
I prefer exec() and Gearman.
exec() is easy, needs no connection, and uses less memory.
Gearman needs a socket connection, and the workers take some memory.
But Gearman is more flexible and faster than exec(). Most importantly, it can deploy workers on other servers, which matters if the work is time and resource consuming.
I'm using Gearman in my current project.
I use PHP's pcntl. It is good as long as you know what you are doing. I understand your situation, but I don't think the code is difficult to understand; we just have to be a little more careful than usual when implementing a job queue or parallel processing.
I feel it works as long as you code it carefully and make sure the flow is correct. Of course, you should keep the parallel processing in mind when you implement it.
Where you could make mistakes:
Take a look at this example: https://github.com/rakesh-sankar/Tools/blob/master/PHP/fork-parallel-process.php
Hope it helps.
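For readers who cannot open the link, here is a minimal sketch of the fork-and-wait pattern (my own illustration, not the linked code). The function name runJobs and the two demo jobs are hypothetical. Each job runs in its own child, which exits right after the job, and the parent reaps every child; if pcntl is not available, the jobs simply run one after another.

```php
<?php
// Run each job in its own child process when pcntl is available,
// otherwise fall back to running the jobs one after another.
// Each job is a callable returning an integer exit code (0 = success).
function runJobs(array $jobs): array
{
    $canFork   = function_exists('pcntl_fork') && function_exists('pcntl_waitpid');
    $exitCodes = [];
    $children  = []; // pid => job name

    foreach ($jobs as $name => $job) {
        if (!$canFork) {
            $exitCodes[$name] = (int) $job(); // sequential fallback
            continue;
        }
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException("could not fork for job $name");
        }
        if ($pid === 0) {
            // Child: do exactly one job, then exit so any leaked memory dies with the process.
            exit((int) $job());
        }
        $children[$pid] = $name; // parent: remember the child and keep going
    }

    // Parent: wait for every child so none is left behind as a zombie.
    foreach ($children as $pid => $name) {
        pcntl_waitpid($pid, $status);
        $exitCodes[$name] = pcntl_wifexited($status) ? pcntl_wexitstatus($status) : -1;
    }
    return $exitCodes;
}

$codes = runJobs([
    'ok'   => function (): int { usleep(100000); return 0; },
    'fail' => function (): int { return 3; },
]);
print_r($codes);
```

The usual places to slip are forgetting the exit() in the child (the child then continues the parent's loop and forks again) and forgetting to wait for the children.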
The method described in 'Easy parallel processing in PHP' is downright scary. The principle is OK, but the implementation??? As you've already pointed out, the curl_multi_* functions provide a much better way of implementing this approach.
Yes, you probably don't need a client and server HTTP stack for handing off the job. But unless you're working for Google, your development time is much more expensive than your hardware costs, there are plenty of tools for managing HTTP and analysing performance, and there is a defined standard covering things such as status notifications and authentication.
A lot of how you implement the solution depends on the level of transactional integrity you require and whether you require in-order processing.
Out of the approaches you mention, I'd recommend focussing on the HTTP request method using curl_multi_*. But if you need good transactional control or in-order delivery, then you should definitely run a broker daemon between the source of the messages and the processing agents (there is a well written single threaded server suitable for use as a framework for the broker here). Note that the processing agents should process a single message at a time.
If you need a highly scalable solution, then take a look at a proper message queuing system such as RabbitMQ.
HTH
C.
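To make the curl_multi_* suggestion concrete, here is a minimal sketch. The function name fetchAll and the worker URLs in the note below are hypothetical; the idea is that each URL triggers one worker script on the web server, and the multi handle drives all requests at the same time.

```php
<?php
// Fetch several URLs concurrently with the curl_multi_* functions.
// Returns the response bodies, keyed like the input array.
function fetchAll(array $urls, int $timeoutSeconds = 10): array
{
    $multi   = curl_multi_init();
    $handles = [];

    foreach ($urls as $key => $url) {
        $handle = curl_init($url);
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($handle, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $handle);
        $handles[$key] = $handle;
    }

    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($multi, $stillRunning);
        if ($stillRunning > 0) {
            curl_multi_select($multi, 1.0); // wait for activity instead of busy-looping
        }
    } while ($stillRunning > 0 && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $key => $handle) {
        $bodies[$key] = curl_multi_getcontent($handle);
        curl_multi_remove_handle($multi, $handle);
    }

    return $bodies; // the handles are released when they go out of scope
}
```

It could be called as fetchAll(['job1' => 'http://localhost/worker.php?job=1', 'job2' => 'http://localhost/worker.php?job=2']), assuming a worker.php that processes exactly one job per request.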
You have to understand that parallelism is a lone noodle in a soup of concurrency (that's my own interpretation), and that the noodle had better be a thick, caloric process than a hair-like thread (that's my own preference).
Here I drew a simple illustration of this.
So concurrency is the base you have to choose first (fibers, promises, futures, etc.). I also worked on a process abstraction (for both WIN and NIX systems); parent/child and client/server give a master/slave relationship as a result. This "hard" hierarchy should allow controllable execution and cover a wide range of use cases.
The white lines connecting processes are command and event channels; they are implemented with the Sync extension (semaphores and shared memory). So consider this answer mostly theoretical.
CreateProcess is used in proc_open on WIN, and either posix_spawn (which is somehow a bit faster) or pcntl_fork (or just fork) is used on NIX systems for process startup.
Also, there's only PHP_BINARY (the PHP runtime itself) that can be started with a command line parameter to run your PHP file.
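The process startup part can be sketched with proc_open() and PHP_BINARY alone. This is a minimal illustration and not my abstraction: it assumes PHP 7.4+ (array form of the command) and the CLI runtime, and the inline "-r" snippet stands in for a real worker file; there are no Sync channels here, only each child's stdout.

```php
<?php
// Start several PHP child processes with proc_open() and collect their output.
// The inline "-r" snippet is a stand-in for a real worker script (hypothetical).
$workerCode = 'echo getmypid(), " handled job ", $argv[1], PHP_EOL;';

$children = [];
foreach ([1, 2, 3] as $jobId) {
    $pipes   = [];
    $process = proc_open(
        [PHP_BINARY, '-r', $workerCode, '--', (string) $jobId],
        [1 => ['pipe', 'w']], // capture the child's stdout
        $pipes
    );
    if (!is_resource($process)) {
        throw new RuntimeException("could not start worker for job $jobId");
    }
    $children[$jobId] = ['process' => $process, 'stdout' => $pipes[1]];
}

// All children are running by now; read each one's output and wait for it to exit.
$results = [];
foreach ($children as $jobId => $child) {
    $results[$jobId] = trim(stream_get_contents($child['stdout']));
    fclose($child['stdout']);
    proc_close($child['process']);
}
print_r($results);
```

For a real worker you would pass a script path instead of '-r' and the snippet, for example [PHP_BINARY, 'worker.php', (string) $jobId], where worker.php is whatever file does the job.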