Node.js and CPU-intensive requests
I've started tinkering with the Node.js HTTP server and I really like writing server-side JavaScript, but something is keeping me from using Node.js for my web application.
I understand the whole async I/O concept, but I'm somewhat concerned about the edge cases where procedural code is very CPU-intensive, such as image manipulation or sorting large data sets.
As I understand it, the server will be very fast for simple web page requests such as viewing a listing of users or viewing a blog post. However, if I want to write very CPU-intensive code (in the admin back end, for example) that generates graphics or resizes thousands of images, the request will be very slow (a few seconds). Since this code is not async, every request that reaches the server during those few seconds will be blocked until my slow request is done.
One suggestion was to use Web Workers for CPU-intensive tasks. However, I'm afraid web workers will make it hard to write clean code, since they work by including a separate JS file. What if the CPU-intensive code is located in an object's method? It kind of sucks to write a JS file for every method that is CPU-intensive.
Another suggestion was to spawn a child process, but that makes the code even less maintainable.
Any suggestions for overcoming this (perceived) obstacle? How do you write clean object-oriented code with Node.js while making sure CPU-heavy tasks are executed asynchronously?
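A minimal sketch of the concern, for concreteness: the busy loop below just stands in for image resizing or a big sort, and the /slow and /fast routes are invented for the illustration.

```js
const http = require('http');

http.createServer((req, res) => {
  if (req.url === '/slow') {
    // Synchronous CPU work: the single event loop is stuck here,
    // so every other request queues up until this finishes.
    let acc = 0;
    for (let i = 0; i < 2e9; i++) acc += i;
    res.end('slow done: ' + acc + '\n');
  } else {
    // Normally instant, but blocked behind any /slow request in flight.
    res.end('fast\n');
  }
}).listen(8000);
```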
This is a misunderstanding of what a web server is for -- it should only be used to "talk" with clients. Heavy-load tasks should be delegated to standalone programs (which, of course, can also be written in JS).
You'd probably say that this is dirty, but I assure you that a web server process stuck resizing images is even worse (even for, let's say, Apache, where it does not block other queries). Still, you can use a common library to avoid code redundancy.
EDIT: I have come up with an analogy; a web application should be like a restaurant. You have waiters (the web server) and cooks (the workers). Waiters are in contact with clients and do simple tasks like handing out menus or explaining whether a dish is vegetarian. On the other hand, they delegate the harder tasks to the kitchen. Because the waiters do only simple things, they respond quickly, and the cooks can concentrate on their job.
Node.js here would be a single but very talented waiter that can process many requests at a time, while Apache would be a gang of dumb waiters that each process just one request. If this one Node.js waiter began to cook, it would be an immediate catastrophe. Still, cooking could also exhaust even a large supply of Apache waiters, not to mention the chaos in the kitchen and the progressive decrease in responsiveness.
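As a sketch of that delegation, assuming ImageMagick's convert binary is installed (the file names here are hypothetical), the "waiter" process can hand the resizing to a separate "cook" process and stay responsive:

```js
const http = require('http');
const { spawn } = require('child_process');

http.createServer((req, res) => {
  // The waiter only dispatches; the cook (a standalone program) does the work.
  const cook = spawn('convert', ['in.jpg', '-resize', '800x600', 'out.jpg']);
  cook.on('close', (code) => {
    res.end(code === 0 ? 'resized\n' : 'resize failed\n');
  });
  // The event loop stays free to serve other clients while convert runs.
}).listen(8000);
```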
What you need is a task queue! Moving your long-running tasks out of the web server is a GOOD thing. Keeping each task in a "separate" JS file promotes modularity and code reuse. It forces you to think about how to structure your program in a way that will make it easier to debug and maintain in the long run. Another benefit of a task queue is that the workers can be written in a different language. Just pop a task, do the work, and write the response back.
Something like this: https://github.com/resque/resque
Here is an article from GitHub about why they built it: http://github.com/blog/542-introducing-resque
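A rough sketch of that producer/consumer split, written from memory against the node-redis v4 API (the lPush/brPop calls and the "jobs" list key are my assumptions, not anything Resque prescribes), so treat it as an outline rather than a drop-in:

```js
const { createClient } = require('redis');

// Web process: enqueue a job description and respond immediately.
async function enqueue(task) {
  const client = createClient();
  await client.connect();
  await client.lPush('jobs', JSON.stringify(task));
  await client.quit();
}

// Worker process (could just as well be Ruby or Python): pop, work, repeat.
async function workLoop() {
  const client = createClient();
  await client.connect();
  for (;;) {
    const job = await client.brPop('jobs', 0); // blocks until a job arrives
    const task = JSON.parse(job.element);
    // ... do the CPU-heavy work with `task` here, then write the result back ...
  }
}
```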
You don't want your CPU-intensive code to execute async, you want it to execute in parallel. You need to get the processing work out of the thread that's serving HTTP requests. That's the only way to solve this problem. With NodeJS the answer is the cluster module, for spawning child processes to do the heavy lifting. (AFAIK Node doesn't have any concept of threads/shared memory; it's processes or nothing.) You have two options for how you structure your application. You can get the 80/20 solution by spawning 8 HTTP servers and handling compute-intensive tasks synchronously on the child processes. Doing that is fairly simple. You could take an hour to read up on the cluster module. In fact, if you just rip off the example code at the top of its documentation, you will get yourself 95% of the way there.
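For reference, a sketch close to that canonical cluster example:

```js
const cluster = require('cluster');
const http = require('http');
const numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  // One worker per core; a request stuck in heavy compute
  // only blocks its own worker, not the others.
  for (let i = 0; i < numCPUs; i++) cluster.fork();
} else {
  // Workers share the same port; the OS spreads connections among them.
  http.createServer((req, res) => {
    res.end('handled by worker ' + process.pid + '\n');
  }).listen(8000);
}
```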
The other way to structure this is to set up a job queue and send big compute tasks over the queue. Note that there is a lot of overhead associated with the IPC for a job queue, so this is only useful when the tasks are appreciably larger than the overhead.
I'm surprised that none of these other answers even mention cluster.
Background:
Asynchronous code is code that suspends until something happens somewhere else, at which point the code wakes up and continues execution. One very common case where something slow must happen somewhere else is I/O.
Asynchronous code isn't useful if it's your processor that is responsible for doing the work. That is precisely the case with "compute intensive" tasks.
Now, it might seem that asynchronous code is niche, but in fact it's very common. It just happens not to be useful for compute intensive tasks.
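A tiny illustration of the difference: the file read suspends and costs nothing while the OS works, whereas the loop keeps the CPU itself busy and no async plumbing can help it (the /etc/hosts path assumes a Unix-like system).

```js
const fs = require('fs');

// Asynchronous: the process is free to do other work while the OS reads.
fs.readFile('/etc/hosts', 'utf8', (err, data) => {
  if (err) throw err;
  console.log('read finished:', data.length, 'bytes');
});

// Compute-intensive: there is nothing to wait on; the CPU is the worker,
// so this blocks the event loop no matter how you dress it up.
let sum = 0;
for (let i = 0; i < 1e9; i++) sum += i;
console.log('sum done:', sum);
```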
Waiting on I/O is a pattern that happens all the time in web servers, for example. Every client that connects to your server gets a socket. Most of the time the sockets are empty. You don't want to do anything until a socket receives some data, at which point you want to handle the request. Under the hood an HTTP server like Node uses an eventing library (libev) to keep track of the thousands of open sockets. The OS notifies libev, and libev notifies NodeJS when one of the sockets gets data; NodeJS then puts an event on the event queue, and your HTTP code kicks in at this point and handles the events one after the other. Events don't get put on the queue until the socket has some data, so events are never waiting on data -- it's already there for them.
A single-threaded, event-based web server makes sense as a paradigm when the bottleneck is waiting on a bunch of mostly-empty socket connections, you don't want a whole thread or process for every idle connection, and you don't want to poll your 250k sockets to find the next one that has data on it.
There are a couple of approaches you can use.
As @Tim notes, you can create an asynchronous task that sits outside of or parallel to your main serving logic. It depends on your exact requirements, but even cron can act as a queueing mechanism.
WebWorkers can work for your async processes, but they are currently not supported by node.js. There are a couple of extensions that provide support, for example: http://github.com/cramforce/node-worker
You can still reuse modules and code through the standard "require" mechanism. You just need to ensure that the initial dispatch to the worker passes all the information needed to process the results (see the sketch below).
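For example, with the built-in child_process.fork (the worker file name and job fields are made up for this sketch), the dispatch message must carry everything the worker needs, since the two processes share no memory:

```js
// main.js -- dispatch a self-contained job description to the worker.
const { fork } = require('child_process');

const worker = fork('./resize-worker.js');
worker.send({ src: 'in.jpg', dst: 'out.jpg', width: 800 });
worker.on('message', (result) => {
  console.log('worker finished:', result);
});

// resize-worker.js -- a separate process; everything arrives in the message.
process.on('message', (job) => {
  // ... CPU-heavy resize using job.src, job.dst and job.width ...
  process.send({ ok: true, dst: job.dst });
});
```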
Using child_process is one solution. But each child process spawned may consume a lot of memory compared to Go goroutines. You can also use a queue-based solution such as Kue.
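A sketch following Kue's documented producer/consumer pattern (it needs a running Redis server; the "resize" job type and its fields are invented here):

```js
const kue = require('kue');
const queue = kue.createQueue();

// Producer: the web process enqueues and responds right away.
queue.create('resize', { src: 'in.jpg', width: 800 }).save();

// Consumer: run this in a separate worker process to do the heavy lifting.
queue.process('resize', (job, done) => {
  // ... resize job.data.src to job.data.width pixels wide ...
  done();
});
```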