Parallel processing options in Python
I recently created a Python script that performed some natural language processing tasks and worked quite well in solving my problem. But it took 9 hours. I first investigated using Hadoop to break the problem down into steps, hoping to take advantage of the scalable parallel processing I'd get by using Amazon Web Services.
But a friend of mine pointed out that Hadoop is really for large amounts of data stored on disk, on which you want to perform many simple operations. In my situation I have a comparatively small initial data set (low hundreds of MBs) on which I perform many complex operations, using a lot of memory in the process and taking many hours.
What framework can I use in my script to take advantage of scalable clusters on AWS (or similar services)?
3 Answers
Parallel Python is one option for distributing work across multiple machines in a cluster.
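As a minimal sketch of that approach (the process_document function and the documents list are hypothetical stand-ins for the NLP workload):

    import pp

    def process_document(text):
        # Placeholder for the expensive NLP step (hypothetical).
        return len(text.split())

    documents = ["first document ...", "second document ..."]

    # Autodetects the number of local worker processes. To use a cluster,
    # start ppserver.py on the other nodes and pass their addresses via
    # the ppservers argument.
    job_server = pp.Server()

    # Submit one job per document; calling a job object blocks until it finishes.
    jobs = [job_server.submit(process_document, (doc,)) for doc in documents]
    results = [job() for job in jobs]
    print(results)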
This example shows how to do a MapReduce-like script using processes on a single machine. Secondly, if you can, try caching intermediate results. I did this for an NLP task and obtained a significant speedup.
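For instance, a minimal single-machine sketch combining both ideas, using the standard library's multiprocessing.Pool for the "map" phase and pickle for caching intermediate results (the process_document function and cache file name are hypothetical):

    import os
    import pickle
    from multiprocessing import Pool

    CACHE_FILE = "intermediate_results.pkl"  # hypothetical cache location

    def process_document(text):
        # Placeholder for the expensive per-document NLP step (hypothetical).
        return len(text.split())

    def main():
        documents = ["first document ...", "second document ..."]

        if os.path.exists(CACHE_FILE):
            # Reuse intermediate results cached by a previous run.
            with open(CACHE_FILE, "rb") as f:
                partials = pickle.load(f)
        else:
            # "Map" phase: one worker process per CPU core.
            with Pool() as pool:
                partials = pool.map(process_document, documents)
            with open(CACHE_FILE, "wb") as f:
                pickle.dump(partials, f)

        # Trivial "reduce" phase.
        print(sum(partials))

    if __name__ == "__main__":
        main()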
My package, jug, might be very appropriate for your needs. Without more information, I can't really say what the code would look like, but I designed it for sub-Hadoop-sized problems.
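For illustration only, a jugfile for this kind of workload might look roughly like the following (the task functions are hypothetical placeholders):

    from jug import TaskGenerator

    @TaskGenerator
    def process_document(text):
        # Placeholder for the expensive NLP step (hypothetical).
        return len(text.split())

    @TaskGenerator
    def combine(counts):
        return sum(counts)

    documents = ["first document ...", "second document ..."]
    partials = [process_document(doc) for doc in documents]
    result = combine(partials)

Running jug execute on this file from several shells, or on several machines sharing a filesystem, lets jug hand out the pending tasks among them and store each result so it is computed only once.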